Publications

2010
V. J. Reddi, B. Lee, T. Chilimbi, and K. Vaid, “Web Search Using Mobile Cores: Quantifying and Mitigating the Price of Efficiency,” in International Symposium on Computer Architecture, 2010.

The commoditization of hardware, data center economies of scale, and Internet-scale workload growth all demand greater power efficiency to sustain scalability. Traditional enterprise workloads, which are typically memory and I/O bound, have been well served by chip multiprocessors comprising small, power-efficient cores. Recent advances in mobile computing have led to modern small cores capable of delivering even better power efficiency. While these cores can deliver performance-per-Watt efficiency for data center workloads, small cores compromise application quality-of-service, robustness, and flexibility as these workloads increasingly invoke computationally intensive kernels. These challenges constitute the price of efficiency. We quantify this price for an industry-strength online web search engine in production, at both the microarchitecture and system level, evaluating search on server- and mobile-class architectures using Xeon and Atom processors.

Categories and Subject Descriptors

C.0 [Computer Systems Organization]: General—System architectures; C.4 [Computer Systems Organization]: Performance of Systems—Design studies, Reliability, availability, and serviceability

General Terms

Measurement, Experimentation, Performance
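
Illustrative sketch (C++): the efficiency metric at the heart of the study, performance-per-Watt, set against tail latency as the quality-of-service cost. The platform figures below are hypothetical placeholders, not measurements from the paper.

    // Illustrative only: the platform numbers are hypothetical placeholders,
    // not measurements from the paper.
    #include <cstdio>
    #include <initializer_list>

    struct Platform {
        const char *name;
        double queries_per_sec;   // sustained search throughput
        double watts;             // platform power draw
        double p99_latency_ms;    // tail latency at that load
    };

    int main() {
        Platform server = {"Server-class (Xeon-like)", 100.0, 60.0, 80.0};
        Platform mobile = {"Mobile-class (Atom-like)",  40.0, 10.0, 200.0};

        for (const Platform &p : {server, mobile}) {
            // Performance-per-watt is the efficiency metric; tail latency
            // captures the quality-of-service side of the trade-off.
            std::printf("%-26s %6.2f QPS/W, p99 = %5.0f ms\n",
                        p.name, p.queries_per_sec / p.watts, p.p99_latency_ms);
        }
        return 0;
    }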

2009
M. S. Gupta, V. J. Reddi, M. D. Smith, G.-Y. Wei, and D. M. Brooks, “An Event-Guided Approach to Handling Inductive Noise in Processors,” in Design, Automation, and Test in Europe Conference (DATE-09), Nice, France, 2009.

Supply voltage fluctuations that result from inductive noise are increasingly troublesome in modern microprocessors. A voltage “emergency”, i.e., a swing beyond tolerable operating margins, jeopardizes the safe and correct operation of the processor. Techniques aimed at reducing power consumption, e.g., by clock gating or by reducing nominal supply voltage, exacerbate this noise problem, requiring ever-wider operating margins. We propose an event-guided, adaptive method for avoiding voltage emergencies, which exploits the fact that most emergencies are correlated with unique microarchitectural events, such as cache misses or the pipeline flushes that follow branch mispredictions. Using checkpoint and rollback to handle unavoidable emergencies, our method adapts dynamically by learning to trigger avoidance mechanisms when emergency-prone events recur. After tightening supply voltage margins to increase clock frequency and accounting for all costs, the net result is a performance improvement of 8% across a suite of fifteen SPEC CPU2000 benchmarks.
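
Illustrative sketch (C++) of the event-guided idea, not the paper’s implementation: the first time an emergency is repaired by checkpoint/rollback after a flagged microarchitectural event at a given program location, the (location, event) pair is recorded; when it recurs, an avoidance action such as a brief throttle is triggered instead.

    // Sketch of event-guided emergency avoidance; event/PC values and the
    // throttle action are illustrative, not the paper's mechanism verbatim.
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <unordered_set>

    enum class UarchEvent : uint8_t { CacheMiss, PipelineFlush };

    struct EventKey {
        uint64_t pc;
        UarchEvent ev;
        bool operator==(const EventKey &o) const { return pc == o.pc && ev == o.ev; }
    };
    struct EventKeyHash {
        std::size_t operator()(const EventKey &k) const {
            return std::hash<uint64_t>()(k.pc ^ (uint64_t(k.ev) << 56));
        }
    };

    class EmergencyAvoider {
        std::unordered_set<EventKey, EventKeyHash> emergency_prone_;
    public:
        // Called when checkpoint/rollback had to repair an emergency that
        // followed event `ev` at instruction address `pc`: learn the pair.
        void learn(uint64_t pc, UarchEvent ev) { emergency_prone_.insert({pc, ev}); }

        // Called whenever `ev` occurs at `pc`: returns true if an avoidance
        // action (e.g., a short-lived frequency throttle) should be triggered.
        bool should_throttle(uint64_t pc, UarchEvent ev) const {
            return emergency_prone_.count({pc, ev}) != 0;
        }
    };

    int main() {
        EmergencyAvoider avoider;
        avoider.learn(0x400a10, UarchEvent::CacheMiss);   // first occurrence caused a rollback
        bool act = avoider.should_throttle(0x400a10, UarchEvent::CacheMiss);
        std::printf("throttle on recurrence: %s\n", act ? "yes" : "no");
        return 0;
    }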

M. S. Gupta, V. J. Reddi, G. Holloway, G.-Y. Wei, and D. M. Brooks, “An Event-Guided Approach to Reducing Voltage Noise in Processors,” in Design, Automation & Test in Europe Conference & Exhibition (DATE'09), 2009, pp. 160–165.
A. Shye, J. Blomstedt, T. Moseley, V. J. Reddi, and D. A. Connors, “PLR: A Software Approach to Transient Fault Tolerance for Multicore Architectures,” IEEE Transactions on Dependable and Secure Computing, vol. 6, no. 2, pp. 135–148, 2009.

Transient faults are emerging as a critical concern in the reliability of general-purpose microprocessors. As architectural trends point towards multi-core designs, there is substantial interest in adapting such parallel hardware resources for transient fault tolerance. This paper presents process-level redundancy (PLR), a software technique for transient fault tolerance which leverages multiple cores for low overhead. PLR creates a set of redundant processes per application process, and systematically compares the processes to guarantee correct execution. Redundancy at the process level allows the operating system to freely schedule the processes across all available hardware resources. PLR uses a software-centric approach to transient fault tolerance which shifts the focus from ensuring correct hardware execution to ensuring correct software execution. As a result, many benign faults that do not propagate to affect program correctness can be safely ignored. A real prototype is presented that is designed to be transparent to the application and can run on general-purpose single-threaded programs without modifications to the program, operating system, or underlying hardware. The system is evaluated for fault coverage and performance on a 4-way SMP machine, and provides improved performance over existing software transient fault tolerance techniques with a 16.9% overhead for fault detection on a set of optimized SPEC2000 binaries.

Index Terms—fault tolerance, reliability, transient faults, soft errors, process-level redundancy
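
Illustrative sketch (C++) of the process-level redundancy idea, greatly simplified relative to PLR, which wraps an unmodified application transparently: the same deterministic computation runs in redundant child processes and their results are compared before being accepted.

    // Toy illustration of process-level redundancy: PLR itself wraps an
    // unmodified application transparently; here a function stands in for it.
    #include <cstdio>
    #include <sys/wait.h>
    #include <unistd.h>

    // Stand-in for the (deterministic) application computation.
    static long compute(long n) {
        long sum = 0;
        for (long i = 1; i <= n; ++i) sum += i * i;
        return sum;
    }

    int main() {
        const int kReplicas = 3;
        int pipes[kReplicas][2];
        pid_t kids[kReplicas];

        for (int r = 0; r < kReplicas; ++r) {
            if (pipe(pipes[r]) != 0) return 1;
            kids[r] = fork();
            if (kids[r] == 0) {                          // redundant child process
                long result = compute(100000);
                write(pipes[r][1], &result, sizeof(result));
                _exit(0);
            }
        }

        long results[kReplicas];
        for (int r = 0; r < kReplicas; ++r) {
            read(pipes[r][0], &results[r], sizeof(results[r]));
            waitpid(kids[r], nullptr, 0);
        }

        // Compare replica outputs; a transient fault in one replica shows up
        // as a mismatch (real PLR also compares system-call effects).
        bool match = results[0] == results[1] && results[1] == results[2];
        std::printf("replicas agree: %s (value %ld)\n", match ? "yes" : "no", results[0]);
        return match ? 0 : 2;
    }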

V. J. Reddi, M. S. Gupta, M. D. Smith, G.-Y. Wei, D. Brooks, and S. Campanoni, “Software-Assisted Hardware Reliability: Abstracting Circuit-Level Challenges to the Software Stack,” in Proceedings of the 46th Annual Design Automation Conference, 2009, pp. 788–793.

Power constrained designs are becoming increasingly sensitive to supply voltage noise. We propose a hardware-software collaborative approach to enable aggressive operating margins: a checkpoint-recovery mechanism corrects margin violations, while a run-time software layer reschedules the program’s instruction stream to prevent recurring margin crossings at the same program location. The run-time layer removes 60% of these events with minimal overhead, thereby significantly improving overall performance.

Categories and Subject Descriptors

C.0 [Computer Systems Organization]: General—Hardware/Software interfaces and System architectures.

General Terms

Performance, Reliability.

Keywords

Runtime Optimization, Hardware Software Co-Design.

V. J. Reddi, M. S. Gupta, G. Holloway, G.-Y. Wei, M. D. Smith, and D. Brooks, “Voltage Emergency Prediction: Using Signatures to Reduce Operating Margins,” in IEEE 15th International Symposium on High Performance Computer Architecture (HPCA 2009), 2009, pp. 18–29.

Inductive noise forces microprocessor designers to sacrifice performance in order to ensure correct and reliable operation of their designs. The possibility of wide fluctuations in supply voltage means that timing margins throughout the processor must be set pessimistically to protect against worst-case droops and surges. While sensor-based reactive schemes have been proposed to deal with voltage noise, inherent sensor delays limit their effectiveness. Instead, this paper describes a voltage emergency predictor that learns the signatures of voltage emergencies (the combinations of control flow and microarchitectural events leading up to them) and uses these signatures to prevent recurrence of the corresponding emergencies. In simulations of a representative superscalar microprocessor in which fluctuations beyond 4% of nominal voltage are treated as emergencies (an aggressive configuration), these signatures can pinpoint the likelihood of an emergency some 16 cycles ahead of time with 90% accuracy. This lead time allows machines to operate with much tighter voltage margins (4% instead of 13%) and up to 13.5% higher performance, which closely approaches the 14.2% performance improvement possible with an ideal oracle-based predictor.
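
Illustrative sketch (C++) of signature-based prediction; the event encoding, history depth, and hash are assumptions rather than the paper’s exact design: recent control-flow and microarchitectural events are kept in a fixed-length history, hashed into a signature, and looked up in a table of signatures that previously preceded emergencies.

    // Sketch of signature-based voltage-emergency prediction. Event encoding,
    // history length, and the hash are illustrative assumptions.
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <deque>
    #include <unordered_set>

    class EmergencyPredictor {
        static constexpr std::size_t kHistoryLen = 16;   // assumed history depth
        std::deque<uint64_t> history_;                   // recent (PC, event) codes
        std::unordered_set<uint64_t> emergency_signatures_;

        uint64_t signature() const {                     // simple rolling hash (FNV-style)
            uint64_t h = 1469598103934665603ULL;
            for (uint64_t e : history_) { h ^= e; h *= 1099511628211ULL; }
            return h;
        }

    public:
        // Record a control-flow or microarchitectural event, encoded by the caller.
        void record(uint64_t pc, uint32_t event_code) {
            history_.push_back((pc << 8) | event_code);
            if (history_.size() > kHistoryLen) history_.pop_front();
        }

        // True if the current history matches a signature seen before an emergency.
        bool predicts_emergency() const {
            return emergency_signatures_.count(signature()) != 0;
        }

        // Called after the fail-safe (checkpoint rollback) handled an emergency:
        // remember the signature so the next occurrence can be throttled early.
        void learn_current_signature() { emergency_signatures_.insert(signature()); }
    };

    int main() {
        EmergencyPredictor p;
        for (uint32_t i = 0; i < 16; ++i) p.record(0x400000 + i * 4, i % 3);
        p.learn_current_signature();                     // an emergency occurred here once
        std::printf("predicted on recurrence: %s\n", p.predicts_emergency() ? "yes" : "no");
        return 0;
    }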

V. J. Reddi, et al., “Voltage Noise: Why It’s Bad, and What To Do About It,” in 5th IEEE Workshop on Silicon Errors in Logic-System Effects (SELSE), Palo Alto, CA, 2009.

Power constrained designs are becoming increasingly sensitive to supply voltage noise. We propose hardware-software collaboration to enable aggressive voltage margins: a fail-safe hardware mechanism tolerates margin violations in order to train a run-time software layer that reschedules instructions to avoid recurring violations. Additionally, the software controls an emergency signature-based predictor that throttles to suppress emergencies that code rescheduling cannot eliminate.

2007
V. J. Reddi, D. Connors, R. Cohn, and M. D. Smith, “Persistent Code Caching: Exploiting Code Reuse Across Executions and Applications,” in International Symposium on Code Generation and Optimization (CGO'07), 2007, pp. 74–88.

Run-time compilation systems are challenged with the task of translating a program’s instruction stream while maintaining low overhead. While software managed code caches are utilized to amortize translation costs, they are ineffective for programs with short run times or large amounts of cold code. Such program characteristics are prevalent in real-life computing environments, ranging from Graphical User Interface (GUI) programs to large-scale applications such as database management systems. Persistent code caching addresses these issues. It is described and evaluated in an industry-strength dynamic binary instrumentation system – Pin. The proposed approach improves the intra-execution model of code reuse by storing and reusing translations across executions, thereby achieving inter-execution persistence. Dynamically linked programs leverage inter-application persistence by using persistent translations of library code generated by other programs. New translations discovered across executions are automatically accumulated into the persistent code caches, thereby improving performance over time. Inter-execution persistence improves the performance of GUI applications by nearly 90%, while inter-application persistence achieves a 59% improvement. In more specialized uses, the SPEC2K INT benchmark suite experiences a 26% improvement under dynamic binary instrumentation. Finally, a 400% speedup is achieved in translating the Oracle database in a regression testing environment.
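
Illustrative sketch (C++) of the persistence idea; the keying and on-disk format are assumptions, and Pin’s real persistent cache is far more involved: translations produced during one run are serialized keyed by the code region they came from, so a later run (or another program sharing the same library) can reload them instead of re-translating.

    // Sketch of a persistent code cache: translations are stored on disk keyed
    // by (module, offset). Keying and file format are illustrative assumptions.
    #include <cstdint>
    #include <fstream>
    #include <map>
    #include <string>
    #include <vector>

    struct RegionKey {
        std::string module;   // e.g., executable or shared-library name
        uint64_t offset;      // code region offset within the module
        bool operator<(const RegionKey &o) const {
            return module != o.module ? module < o.module : offset < o.offset;
        }
    };

    class PersistentCodeCache {
        std::map<RegionKey, std::vector<uint8_t>> translations_;
    public:
        void insert(const RegionKey &k, std::vector<uint8_t> code) {
            translations_[k] = std::move(code);
        }
        const std::vector<uint8_t> *lookup(const RegionKey &k) const {
            auto it = translations_.find(k);
            return it == translations_.end() ? nullptr : &it->second;
        }
        // Serialize all translations at process exit so the next invocation
        // (or another program using the same library) can reuse them.
        void save(const std::string &path) const {
            std::ofstream out(path, std::ios::binary);
            for (const auto &entry : translations_) {
                uint64_t name_len = entry.first.module.size();
                uint64_t code_len = entry.second.size();
                out.write(reinterpret_cast<const char *>(&name_len), sizeof(name_len));
                out.write(entry.first.module.data(), name_len);
                out.write(reinterpret_cast<const char *>(&entry.first.offset), sizeof(uint64_t));
                out.write(reinterpret_cast<const char *>(&code_len), sizeof(code_len));
                out.write(reinterpret_cast<const char *>(entry.second.data()), code_len);
            }
        }
        void load(const std::string &path) {
            std::ifstream in(path, std::ios::binary);
            uint64_t name_len;
            while (in.read(reinterpret_cast<char *>(&name_len), sizeof(name_len))) {
                RegionKey k;
                k.module.resize(name_len);
                in.read(&k.module[0], name_len);
                in.read(reinterpret_cast<char *>(&k.offset), sizeof(uint64_t));
                uint64_t code_len;
                in.read(reinterpret_cast<char *>(&code_len), sizeof(code_len));
                std::vector<uint8_t> code(code_len);
                in.read(reinterpret_cast<char *>(code.data()), code_len);
                translations_[k] = std::move(code);
            }
        }
    };

    int main() {
        PersistentCodeCache cache;
        cache.insert({"libc.so.6", 0x1a2b0}, {0x90, 0x90, 0xc3});   // fake translated bytes
        cache.save("pincache.bin");

        PersistentCodeCache next_run;                // simulates the next invocation
        next_run.load("pincache.bin");
        return next_run.lookup({"libc.so.6", 0x1a2b0}) ? 0 : 1;
    }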

T. Moseley, A. Shye, V. J. Reddi, D. Grunwald, and R. Peri, “Shadow Profiling: Hiding Instrumentation Costs with Parallelism,” in Proceedings of the International Symposium on Code Generation and Optimization, 2007, pp. 198–208.

In profiling, a tradeoff exists between information and overhead. For example, hardware-sampling profilers incur negligible overhead, but the information they collect is consequently very coarse. Other profilers use instrumentation tools to gather temporal traces such as path profiles and hot memory streams, but they have high overhead. Runtime and feedback-directed compilation systems need detailed information to aggressively optimize, but the cost of gathering profiles can outweigh the benefits. Shadow profiling is a novel method for sampling long traces of instrumented code in parallel with normal execution, taking advantage of the trend of increasing numbers of cores. Each instrumented sample can be many millions of instructions in length. The primary goal is to incur negligible overhead, yet attain profile information that is nearly as accurate as a perfect profile.

The profiler requires no modifications to the operating system or hardware, and is tunable to allow for greater coverage or lower overhead. We evaluate the performance and accuracy of this new profiling technique for two common types of instrumentation-based profiles: interprocedural path profiling and value profiling. Overall, profiles collected using the shadow profiling framework are 94% accurate versus perfect value profiles, while incurring less than 1% overhead. Consequently, this technique increases the viability of dynamic and continuous optimization systems by hiding the high overhead of instrumentation and enabling the online collection of many types of profiles that were previously too costly.
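
Illustrative sketch (C++) of the core trick: the original process keeps running uninstrumented while a periodically forked shadow copy executes an instrumented variant of the same work, here simulated by an explicit counting hook rather than a real binary translator.

    // Sketch of shadow profiling: the parent runs at full speed while a forked
    // shadow executes an instrumented variant of the same work. The "profiling"
    // here is a stand-in for real instrumentation under a binary translator.
    #include <cstdio>
    #include <sys/wait.h>
    #include <unistd.h>

    static volatile long sink = 0;

    static void do_work(long iters, bool instrumented, long *profile_counter) {
        for (long i = 0; i < iters; ++i) {
            sink += i;                                   // the "application" work
            if (instrumented) ++*profile_counter;        // e.g., value/path profiling hook
        }
    }

    int main() {
        const int kSegments = 4;                         // profile one shadow per segment
        for (int s = 0; s < kSegments; ++s) {
            pid_t shadow = fork();
            if (shadow == 0) {
                long samples = 0;
                do_work(1000000, true, &samples);        // instrumented shadow copy
                std::fprintf(stderr, "shadow %d collected %ld events\n", s, samples);
                _exit(0);
            }
            do_work(1000000, false, nullptr);            // original runs uninstrumented
            waitpid(shadow, nullptr, 0);                 // a real system would not block here
        }
        return 0;
    }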

A. Shye, T. Moseley, V. J. Reddi, J. Blomstedt, and D. A. Connors, “Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance,” in 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07), 2007, pp. 297–306.

Transient faults are emerging as a critical concern in the reliability of general-purpose microprocessors. As architectural trends point towards multi-threaded multi-core designs, there is substantial interest in adapting such parallel hardware resources for transient fault tolerance. This paper proposes a software-based multi-core alternative for transient fault tolerance using process-level redundancy (PLR). PLR creates a set of redundant processes per application process and systematically compares the processes to guarantee correct execution. Redundancy at the process level allows the operating system to freely schedule the processes across all available hardware resources. PLR’s software-centric approach to transient fault tolerance shifts the focus from ensuring correct hardware execution to ensuring correct software execution. As a result, PLR ignores many benign faults that do not propagate to affect program correctness. A real PLR prototype for running single-threaded applications is presented and evaluated for fault coverage and performance. On a 4-way SMP machine, PLR provides improved performance over existing software transient fault tolerance techniques with 16.9% overhead for fault detection on a set of optimized SPEC2000 binaries.

2006
V. J. Reddi, “Deploying Dynamic Code Transformation in Modern Computing Environments,” University of Colorado, 2006.

Dynamic code transformation systems are steadily gaining acceptance in computing environments for services such as program optimization, translation, instrumentation, and security. Code transformation systems must perform complex and time-consuming tasks, such as costly program analysis, and then apply transformations (i.e., instrumentation, translation, etc.). As these steps are applied to all code regions, regardless of their characteristics, the transformation overhead can be significant. Once transformed, the remaining overhead is determined by the performance of the translated code. Code transformation systems can become part of mainstream computing only if these overheads are eliminated. Nevertheless, certain application and computing environments exist in which code transformation systems can be effectively deployed. This thesis identifies two such environments: persistence and mixed execution. Persistence leverages previous execution characteristics to address the transformation overhead. This is accomplished by capturing the translated executions at the end of their first invocation. The captured executions are cached on disk for reuse, and all subsequent invocations of the run-time system on the same application reuse the cached executions. Since applications exhibit similar behavior across varying input data sets, this execution model successfully diminishes the transformation overhead across multiple invocations. Persistence in the domain of dynamic binary instrumentation is highlighted as an example. Mixed execution accepts that the code generated by today’s code transformation systems cannot yet compete with native execution times. This technique therefore executes a mix of the original and translated code sequences to keep the translated-code performance penalties within bounds. This execution model is a more effective alternative to pure just-in-time compiler-based code transformation systems when low overhead and minimal architectural perturbation are the critical constraints. A dynamic compilation framework for controlling microprocessor energy and performance using this model is presented in light of its effectiveness and practicality.

R. Cohn, T. Moseley, and V. J. Reddi, “System and Method to Instrument References to Shared Memory,” US Patent, 2006.
A. Shye, V. J. Reddi, T. Moseley, and D. A. Connors, “Transient Fault Tolerance via Dynamic Process-Level Redundancy,” in Proceedings of the Workshop on Binary Instrumentation and Applications, 2006.
2005
A. Shye, et al., “Analysis of Path Profiling Information Generated With Performance Monitoring Hardware,” in Workshop on Interaction between Compilers and Computer Architectures (INTERACT), 2005, pp. 34–43.

Even with the breakthroughs in semiconductor technology that will enable billion-transistor designs, hardware-based architecture paradigms alone cannot substantially improve processor performance. The challenge in realizing the full potential of these future machines is to find ways to adapt program behavior to application needs and processor resources. As such, run-time optimization will have a distinct role in future high-performance systems. However, as these systems are dependent on accurate, fine-grain profile information, traditional approaches to collecting profiles at run time result in significant slowdowns during program execution.

A novel approach to low-overhead profiling is to exploit hardware Performance Monitoring Units (PMUs) present in modern microprocessors. The Itanium-2 PMU can periodically sample the last few taken branches in an executing program and this information can be used to recreate partial paths of execution. With compiler-aided analysis, the partial paths can be correlated into full paths. As statistically hot paths are most likely to occur in PMU samples, even infrequent sampling can accurately identify these paths. While traditional path profiling techniques carry a high overhead, a PMU-based path profiler represents an effective lightweight profiling alternative. This paper characterizes the PMU-based path information and demonstrates the construction of such a PMU-based path profiler for a run-time system.
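
Illustrative sketch (C++) of the stitching step, using a toy control-flow graph and branch-record format: each PMU sample supplies the last few taken branches, and consecutive records are joined into a partial path by following fall-through successors from one branch’s target to the next branch’s source.

    // Sketch of stitching sampled taken-branch records into a partial path.
    // The basic-block layout and branch-record format are illustrative.
    #include <cstdint>
    #include <cstdio>
    #include <map>
    #include <vector>

    struct TakenBranch { uint64_t src_block; uint64_t dst_block; };

    int main() {
        // Fall-through successor of each basic block (from compiler-aided CFG info).
        std::map<uint64_t, uint64_t> fallthrough = {
            {0x10, 0x20}, {0x20, 0x30}, {0x30, 0x40}, {0x40, 0x50}, {0x50, 0x60}};

        // One PMU sample: the last few taken branches, oldest first.
        std::vector<TakenBranch> sample = {{0x20, 0x40}, {0x50, 0x10}};

        // Start at the first record's source and walk: a taken branch jumps to its
        // target; between branches, follow fall-through blocks up to the next source.
        std::vector<uint64_t> path;
        path.push_back(sample.front().src_block);
        for (std::size_t i = 0; i < sample.size(); ++i) {
            path.push_back(sample[i].dst_block);           // effect of the taken branch
            if (i + 1 < sample.size()) {
                uint64_t b = sample[i].dst_block;           // fall through until the
                while (b != sample[i + 1].src_block) {      // next branch's source block
                    b = fallthrough.at(b);
                    path.push_back(b);
                }
            }
        }
        for (uint64_t b : path) std::printf("block 0x%lx\n", (unsigned long)b);
        return 0;
    }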

A. Shye, M. Iyer, V. J. Reddi, and D. A. Connors, “Code Coverage Testing Using Hardware Performance Monitoring Support,” in Proceedings of the Sixth International Symposium on Automated Analysis-Driven Debugging, 2005, pp. 159–163.

Code coverage analysis, the process of finding code exercised by a particular set of test inputs, is an important component of software development and verification. Most traditional methods of implementing code coverage analysis tools are based on program instrumentation. These methods typically incur high overhead due to the insertion and execution of instrumentation code, and are not deployable in many software environments. Hardware-based sampling techniques attempt to lower overhead by leveraging existing Hardware Performance Monitoring (HPM) support for program counter (PC) sampling. While PC-sampling incurs lower levels of overhead, it does not provide complete coverage information. This paper extends the HPM approach in two ways. First, it utilizes the sampling of branch vectors which are supported on modern processors. Second, compiler analysis is performed on branch vectors to extend the amount of code coverage information derived from each sample. This paper shows that although HPM is generally used to guide performance improvement efforts, there is substantial promise in leveraging the HPM information for code debugging and verification. The combination of sampled branch vectors and compiler analysis can be used to attain upwards of 80% of the actual code coverage.

Q. Wu, et al., “A Dynamic Compilation Framework for Controlling Microprocessor Energy and Performance,” in Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture, 2005, pp. 271–282.

Dynamic voltage and frequency scaling (DVFS) is an effective technique for controlling microprocessor energy and performance. Existing DVFS techniques are primarily based on hardware, OS time interrupts, or static compiler techniques. However, substantially greater gains can be realized when control opportunities are also explored in a dynamic compilation environment. There are several advantages to deploying DVFS and managing energy/performance tradeoffs through the use of a dynamic compiler. Most importantly, dynamic compiler driven DVFS is fine-grained, code-aware, and adaptive to the current microarchitecture environment. This paper presents a design framework for the run-time DVFS optimizer in a general dynamic compilation system. A prototype of the DVFS optimizer is implemented and integrated into an industrial-strength dynamic compilation system. The resulting optimization system is deployed on a real hardware platform that directly measures CPU voltage and current for accurate power and energy readings. Experimental results, based on physical measurements for over 40 SPEC and Olden benchmarks, show that significant energy savings are achieved with little performance degradation. SPEC2K FP benchmarks benefit from energy savings of up to 70% (with 0.5% performance loss). In addition, SPEC2K INT benchmarks show up to 44% energy savings (with 5% performance loss), SPEC95 FP benchmarks save up to 64% (with 4.9% performance loss), and Olden benchmarks save up to 61% (with 4.5% performance loss). On average, the technique leads to an energy-delay product (EDP) improvement that is 3X–5X better than static voltage scaling, and more than 2X (22% vs. 9%) better than the reported DVFS results of prior static compiler work. While the proposed technique is an effective method for microprocessor voltage and frequency control, the design framework and methodology described in this paper have broader potential to address other energy and power issues such as di/dt and thermal control.
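
Illustrative sketch (C++) of the per-region decision a dynamic compiler could make; the threshold, frequency steps, and apply_frequency() hook are assumptions rather than the paper’s analytic model: regions that stall mostly on memory can run at a lower frequency with little performance loss.

    // Sketch of a per-region DVFS decision driven by memory-boundedness.
    // The threshold, frequency steps, and apply_frequency() hook are assumptions;
    // on Linux, applying a setting would go through the cpufreq interface.
    #include <cstdio>

    struct RegionStats {
        const char *name;
        double cycles;
        double memory_stall_cycles;   // e.g., derived from hardware counters
    };

    // Hypothetical hook: a real system would program the P-state / cpufreq driver.
    static void apply_frequency(const char *region, int mhz) {
        std::printf("region %-12s -> %d MHz\n", region, mhz);
    }

    int main() {
        const int kFreqs[] = {2000, 1600, 1200};       // available steps (MHz)
        RegionStats regions[] = {
            {"hot_loop_A", 1e9, 7.5e8},                // heavily memory-bound
            {"hot_loop_B", 1e9, 1.0e8},                // compute-bound
        };

        for (const RegionStats &r : regions) {
            double stall_frac = r.memory_stall_cycles / r.cycles;
            // More memory-bound regions tolerate a lower frequency with little slowdown.
            int freq = stall_frac > 0.6 ? kFreqs[2] : stall_frac > 0.3 ? kFreqs[1] : kFreqs[0];
            apply_frequency(r.name, freq);
        }
        return 0;
    }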

T. Moseley, et al., “Dynamic Run-time Architecture Techniques For Enabling Continuous Optimization,” in Proceedings of the 2nd Conference on Computing Frontiers, 2005, pp. 211–220.

Future computer systems will integrate tens of multithreaded processor cores on a single chip die, resulting in hundreds of concurrent program threads sharing system resources. These designs will be the cornerstone of improving throughput in high-performance computing and server environments. However, to date, appropriate systems software (operating system, run-time system, and compiler) technologies for these emerging machines have not been adequately explored. Future processors will require sophisticated hardware monitoring units that continuously feed resource utilization information back to the operating system, enabling optimal thread co-scheduling decisions, and to software that continuously optimizes the program itself. Nevertheless, in order to continually and automatically adapt system resources to program behaviors and application needs, specific run-time information must be collected to adequately enable dynamic code optimization and operating system scheduling. Generally, run-time optimization is limited by the time required to collect profiles, the time required to perform optimization, and the inherent benefits of any optimization or decisions. Initial techniques for effectively utilizing run-time information for dynamic optimization and informed thread scheduling in future multithreaded architectures are presented.

V. J. Reddi, D. Connors, and R. S. Cohn, “Persistence in Dynamic Code Transformation Systems,” ACM SIGARCH Computer Architecture News, vol. 33, no. 5, pp. 69–74, 2005.

Dynamic code transformation systems (DCTS) can broadly be grouped into three distinct categories: optimization, translation, and instrumentation. All of these face the critical challenge of minimizing the overhead incurred during transformation, since their execution is interleaved with the execution of the application itself. The common DCTS tasks incurring overhead are the identification of frequently executed code sequences, costly analysis of program information, and run-time creation (writing) of new code sequences. The cost of such work is amortized by the repeated execution of the transformed code. However, as these steps are applied to all general code regions (regardless of their execution frequency and characteristics), there is substantial overhead that impacts the application’s performance. As such, it is challenging to effectively deploy dynamic transformation under fixed performance constraints. This paper explores a technique for eliminating this overhead by exploiting persistent application execution characteristics that are shared across different application invocations. The technique is implemented and evaluated in Pin, a dynamic instrumentation engine; this version of Pin is referred to as Persistent Pin (PPin). Initial PPin experimental results indicate that using information from prior runs can reduce dynamic instrumentation overhead by as much as 25% for SPEC applications and over 90% for everyday applications like web browsers, display rendering systems, and spreadsheet programs.

C.-K. Luk, et al., “Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation,” in Programming Language Design and Implementation (PLDI), 2005.

Robust and powerful software instrumentation tools are essential for program analysis tasks such as profiling, performance evaluation, and bug detection. To meet this need, we have developed a new instrumentation system called Pin. Our goals are to provide easy-to-use, portable, transparent, and efficient instrumentation. Instrumentation tools (called Pintools) are written in C/C++ using Pin’s rich API. Pin follows the model of ATOM, allowing the tool writer to analyze an application at the instruction level without the need for detailed knowledge of the underlying instruction set. The API is designed to be architecture independent whenever possible, making Pintools source compatible across different architectures. However, a Pintool can access architecture-specific details when necessary. Instrumentation with Pin is mostly transparent as the application and Pintool observe the application’s original, uninstrumented behavior. Pin uses dynamic compilation to instrument executables while they are running. For efficiency, Pin uses several techniques, including inlining, register re-allocation, liveness analysis, and instruction scheduling to optimize instrumentation. This fully automated approach delivers significantly better instrumentation performance than similar tools. For example, Pin is 3.3x faster than Valgrind and 2x faster than DynamoRIO for basic-block counting. To illustrate Pin’s versatility, we describe two Pintools in daily use to analyze production software. Pin is publicly available for Linux platforms on four architectures: IA32 (32-bit x86), EM64T (64-bit x86), Itanium, and ARM. In the ten months since Pin 2 was released in July 2004, there have been over 3000 downloads from its website.

Categories and Subject Descriptors

D.2.5 [Software Engineering]: Testing and Debugging—code inspections and walk-throughs, debugging aids, tracing; D.3.4 [Programming Languages]: Processors—compilers, incremental compilers

General Terms

Languages, Performance, Experimentation

Keywords

Instrumentation, program analysis tools, dynamic compilation
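
For flavor, a minimal instruction-counting Pintool in the spirit of the canonical example distributed with Pin (it must be built against Pin’s headers and build system, so it is not standalone):

    // A minimal instruction-counting Pintool, in the spirit of the canonical
    // example distributed with Pin; it must be built against Pin's headers.
    #include "pin.H"
    #include <iostream>

    static UINT64 icount = 0;

    // Analysis routine: executed before every instrumented instruction.
    static VOID docount() { icount++; }

    // Instrumentation routine: called by Pin once per instruction it translates.
    static VOID Instruction(INS ins, VOID *v) {
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)docount, IARG_END);
    }

    static VOID Fini(INT32 code, VOID *v) {
        std::cerr << "Instructions executed: " << icount << std::endl;
    }

    int main(int argc, char *argv[]) {
        if (PIN_Init(argc, argv)) return 1;          // parses Pin's command line
        INS_AddInstrumentFunction(Instruction, 0);
        PIN_AddFiniFunction(Fini, 0);
        PIN_StartProgram();                          // never returns
        return 0;
    }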
