Even with the breakthroughs in semiconductor technology that will enable billion transistor designs, hardwarebased architecture paradigms alone cannot substantially improve processor performance. The challenge in realizing the full potential of these future machines is to find ways to adapt program behavior to application needs and processor resources. As such, run-time optimization will have a distinct role in future high performance systems. However, as these systems are dependent on accurate, fine-grain profile information, traditional approaches to collecting profiles at run-time result in significant slowdowns during program execution.
A novel approach to low-overhead profiling is to exploit hardware Performance Monitoring Units (PMUs) present in modern microprocessors. The Itanium-2 PMU can periodically sample the last few taken branches in an executing program and this information can be used to recreate partial paths of execution. With compiler-aided analysis, the partial paths can be correlated into full paths. As statistically hot paths are most likely to occur in PMU samples, even infrequent sampling can accurately identify these paths. While traditional path profiling techniques carry a high overhead, a PMU-based path profiler represents an effective lightweight profiling alternative. This paper characterizes the PMU-based path information and demonstrates the construction of such a PMU-based path profiler for a run-time system.
Code coverage analysis, the process of finding code exercised by a particular set of test inputs, is an important component of software development and verification. Most traditional methods of implementing code coverage analysis tools are based on program instrumentation. These methods typically incur high overhead due to the insertion and execution of instrumentation code, and are not deployable in many software environments. Hardware-based sampling techniques attempt to lower overhead by leveraging existing Hardware Performance Monitoring (HPM) support for program counter (PC) sampling. While PC-sampling incurs lower levels of overhead, it does not provide complete coverage information. This paper extends the HPM approach in two ways. First, it utilizes the sampling of branch vectors which are supported on modern processors. Second, compiler analysis is performed on branch vectors to extend the amount of code coverage information derived from each sample. This paper shows that although HPM is generally used to guide performance improvement efforts, there is substantial promise in leveraging the HPM information for code debugging and verification. The combination of sampled branch vectors and compiler analysis can be used to attain upwards of 80% of the actual code coverage.
Dynamic voltage and frequency scaling (DVFS) is an effective technique for controlling microprocessor energy and performance. Existing DVFS techniques are primarily based on hardware, OS timeinterrupts, or static-compiler techniques. However, substantially greater gains can be realized when control opportunities are also explored in a dynamic compilation environment. There are several advantages to deploying DVFS and managing energy/performance tradeoffs through the use of a dynamic compiler. Most importantly, dynamic compiler driven DVFS is fine-grained, code-aware, and adaptive to the current microarchitecture environment. This paper presents a design framework of the run-time DVFS optimizer in a general dynamic compilation system. A prototype of the DVFS optimizer isimplemented and integrated into an industrialstrength dynamic compilation system. The obtained optimization system is deployed in a real hardware platform that directly measures CPU voltage and current for accurate power and energy readings. Experimental results, based on physical measurements for over 40 SPEC or Olden benchmarks, show that significant energy savings are achieved with little performance degradation. SPEC2K FP benchmarks benefit with energy savings of up to 70% (with 0.5% performance loss). In addition, SPEC2K INT show up to 44% energy savings (with 5% performance loss), SPEC95 FP save up to 64% (with 4.9% performance loss), and Olden save up to 61% (with 4.5% performance loss). On average, the technique leads to an energy delay product (EDP) improvement that is 3X-5X better than static voltage scaling, and is more than 2X (22% vs. 9%) better than the reported DVFS results of prior static compiler work. While the proposed technique is an effective method for microprocessor voltage and frequency control, the design framework and methodology described in this paper have broader potential to address other energy and power issues such as di/dt and thermal control.
Future computer systems will integrate tens of multithreaded processor cores on a single chip die, resulting in hundreds of concurrent program threads sharing system resources. These designs will be the cornerstone of improving throughput in high-performance computing and server environments. However, to date, appropriate systems software (operating system, run-time system, and compiler) technologies for these emerging machines have not been adequately explored. Future processors will require sophisticated hardware monitoring units to continuously feed back resource utilization information to allow the operating system to make optimal thread co-scheduling decisions and also to software that continuously optimizes the program itself. Nevertheless, in order to continually and automatically adapt systems resources to program behaviors and application needs, specific run-time information must be collected to adequately enable dynamic code optimization and operating system scheduling. Generally, run-time optimization is limited by the time required to collect profiles, the time required to perform optimization, and the inherent benefits of any optimization or decisions. Initial techniques for effectively utilizing runtime information for dynamic optimization and informed thread scheduling in future multithreaded architectures are presented.
Dynamic code transformation systems (DCTS) can broadly be grouped into three distinct categories: optimization, translation and instrumentation. All of these face the critical challenge of minimizing the overhead incurred during transformation since their execution is interleaved with the execution of the application itself. The common DCTS tasks incurring overhead are the identification of frequently executed code sequences, costly analysis of program information, and run-time creation (writing) of new code sequences. The cost of such work is amortized by the repeated execution of the transformed code. However, as these steps are applied to all general code regions (regardless of their execution frequency and characteristics), there is substantial overhead that impacts the application’s performance. As such, it is challenging to effectively deploy dynamic transformation under fixed performance constraints. This paper explores a technique for eliminating the overhead incurred by exploiting persistent application execution characteristics that are shared across different application invocations. This technique is implemented and evaluated in Pin, a dynamic instrumentation engine. This version of Pin is referred to as Persistent Pin (PPin). Initial PPin experimental results indicate that using information from prior runs can reduce dynamic instrumentation overhead of SPEC applications by as much as 25% and over 90% for everyday applications like web browsers, display rendering systems, and spreadsheet programs.
Robust and powerful software instrumentation tools are essential for program analysis tasks such as profiling, performance evaluation, and bug detection. To meet this need, we have developed a new instrumentation system called Pin. Our goals are to provide easy-to-use, portable, transparent, and efficient instrumentation. Instrumentation tools (called Pintools) are written in C/C++ using Pin’s rich API. Pin follows the model of ATOM, allowing the tool writer to analyze an application at the instruction level without the need for detailed knowledge of the underlying instruction set. The API is designed to be architecture independent whenever possible, making Pintools source compatible across different architectures. However, a Pintool can access architecture-specific details when necessary. Instrumentation with Pin is mostly transparent as the application and Pintool observe the application’s original, uninstrumented behavior. Pin uses dynamic compilation to instrument executables while they are running. For efficiency, Pin uses several techniques, including inlining, register re-allocation, liveness analysis, and instruction scheduling to optimize instrumentation. This fully automated approach delivers significantly better instrumentation performance than similar tools. For example, Pin is 3.3x faster than Valgrind and 2x faster than DynamoRIO for basic-block counting. To illustrate Pin’s versatility, we describe two Pintools in daily use to analyze production software. Pin is publicly available for Linux platforms on four architectures: IA32 (32-bit x86), EM64T (64-bit x86), ItaniumR , and ARM. In the ten months since Pin 2 was released in July 2004, there have been over 3000 downloads from its website.
Categories and Subject Descriptors
D.2.5 [Software Engineering]: Testing and Debugging-code inspections and walk-throughs, debugging aids, tracing; D.3.4 [Programming Languages]: Processorscompilers, incremental compilers
Languages, Performance, Experimentation
Instrumentation, program analysis tools, dynamic compilation
Hypercube structures are heavily used by parallel algorithms that require all-to-all communication. When communicating over a heterogeneous and irregular network, the performance obtained by the hypercube structure will depend on the matching of the hypercube structure to the topology of the underlying network. In this paper, we present strategies to build topology-based hypercubes structures. These strategies do not assume any kind of topology. They take into account the communication cost between pair of nodes to provide a performance-efficient hypercube structure. These enhanced hypercube structures help improve the performance of parallel applications that require all-to-all communication in heterogeneous networks by up to ~30%.