Reliability

2014
S. Chai, D. Zhang, J. Leng, and V. J. Reddi, “Lightweight Detection and Recovery Mechanisms to Extend Algorithm Resiliency in Noisy Computation,” Workshop on Near-threshold Computing (WNTC). 2014.Abstract

— The intrinsic robustness of an algorithm and architecture depends highly on the combined ability tolerate noise. In this paper, we present an alternative approach for energy reduction for near threshold computing based on a statistical modeling of computational noise induced from noisy memory and non-ideal interconnects. We present this approach as a complement to the standard approximate computing approaches. We show results of the lightweight error checks and recovery based on several design considerations on data value speculation.

Index Terms—Approximate computing, noise resiliency, computation noise, near threshold computing

Paper
2013
J. Leng, et al., “GPUWattch: Enabling Energy Optimizations in GPGPUs,” in ACM SIGARCH Computer Architecture News, 2013, vol. 41, no. 3, pp. 487–498. Publisher's VersionAbstract

General-purpose GPUs (GPGPUs) are becoming prevalent in mainstream computing, and performance per watt has emerged as a more crucial evaluation metric than peak performance. As such, GPU architects require robust tools that will enable them to quickly explore new ways to optimize GPGPUs for energy efficiency. We propose a new GPGPU power model that is configurable, capable of cycle-level calculations, and carefully validated against real hardware measurements. To achieve configurability, we use a bottom-up methodology and abstract parameters from the microarchitectural components as the model’s inputs. We developed a rigorous suite of 80 microbenchmarks that we use to bound any modeling uncertainties and inaccuracies. The power model is comprehensively validated against measurements of two commercially available GPUs, and the measured error is within 9.9% and 13.4% for the two target GPUs (GTX 480 and Quadro FX5600). The model also accurately tracks the power consumption trend over time. We integrated the power model with the cycle-level simulator GPGPU-Sim and demonstrate the energy savings by utilizing dynamic voltage and frequency scaling (DVFS) and clock gating. Traditional DVFS reduces GPU energy consumption by 14.4% by leveraging within-kernel runtime variations. More finer-grained SM cluster-level DVFS improves the energy savings from 6.6% to 13.6% for those benchmarks that show clustered execution behavior. We also show that clock gating inactive lanes during divergence reduces dynamic power by 11.2%.

Categories and Subject Descriptors

C.1.4 [Processor Architectures]: Parallel Architectures; C.4 [Performance of Systems]: Modeling techniques

General Terms

Experimentation, Measurement, Power, Performance

Keywords

Energy, CUDA, GPU architecture, Power estimation

Paper
S. Kanev, T. M. Jones, G. - Y. Wei, D. M. Brooks, and V. J. Reddi, “Measuring Code Optimization Impact on Voltage Noise,” Workshop on Silicon Errors in Logic - System Effects (SELSE). 2013.Abstract

In this paper, we characterize the impact of compiler optimizations on voltage noise. While intuition may suggest that the better processor utilization ensured by optimizing compilers results in a small amount of voltage variation, our measurements on a IntelR CoreTM2 Duo processor show the opposite – the majority of SPEC 2006 benchmarks exhibit more voltage droops when aggressively optimized. We show that this increase in noise could be sufficient for a net performance decrease in a typicalcase, resilient design.

Paper
V. J. Reddi, “Reliability-Aware Microarchitecture Design,” IEEE Micro, no. 4, pp. 4–5, 2013. Publisher's Version
V. J. Reddi and M. S. Gupta, Resilient Architecture Design for Voltage Variation, vol. 8, no. 2. Morgan & Claypool Publishers, 2013, pp. 1–138. Publisher's VersionAbstract

Shrinking feature size and diminishing supply voltage are making circuits sensitive to supply voltage fluctuations within the microprocessor, caused by normal workload activity changes. If left unattended,voltage fluctuations can lead to timing violations or even transistor lifetime issues that degrade processor robustness. Mechanisms that learn to tolerate, avoid, and eliminate voltage fluctuations based on program and microarchitectural events can help steer the processor clear of danger, thus enabling tighter voltage margins that improve performance or lower power consumption.We describe the problem of voltage variation and the factors that influence this variation during processor design and operation. We also describe a variety of runtime hardware and software mitigation techniques that either tolerate, avoid, and/or eliminate voltage violations.We hope processor architects will find the information useful since tolerance, avoidance, and elimination are generalizable constructs that can serve as a basis for addressing other reliability challenges as well.

KEYWORDS

voltage noise, voltage smoothing, di dt , inductive noise, voltage emergencies, error detection, error correction, error recovery, transient errors, power supply noise, power delivery networks

Paper
2012
V. J. Reddi, “Hardware and Software Co-Design for Robust and Resilient Execution,” in Collaboration Technologies and Systems (CTS), 2012 International Conference on, 2012, pp. 380–380.
V. J. Reddi, D. Z. Pan, S. R. Nassif, and K. A. Bowman, “Robust and Resilient Designs from the Bottom-Up: Technology, CAD, Circuit, and System Issues,” in Design Automation Conference (ASP-DAC), 2012 17th Asia and South Pacific, 2012, pp. 7–16. Publisher's VersionAbstract

The semiconductor industry is facing a critical research challenge: design future high performance and energy efficient systems while satisfying historical standards for reliability and lower costs. The primary cause of this challenge is device and circuit parameter variability, which results from the manufacturing process and system operation. As technology scales, the adverse impact of these variations on system-level metrics increases. In this paper, we describe an interdisciplinary effort toward robust and resilient designs that mitigate the effects of device and circuit parameter variations in order to enhance system performance, energy efficiency, and reliability. Collaboration between the technology, CAD, circuit, and system levels of the compute hierarchy can foster the development of cost-effective and efficient solutions.

Paper
2011
V. J. Reddi and D. Brooks, “Resilient Architectures via Collaborative Design: Maximizing Commodity Processor Performance in the Presence of Variations,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 30, no. 10, pp. 1429–1445, 2011. Publisher's VersionAbstract

Unintended variations in circuit lithography and undesirable fluctuations in circuit operating parameters such as supply voltage and temperature are threatening the continuation of technology scaling that microprocessor evolution relies on. Although circuit-level solutions for some variation problems may be possible, they are prohibitively expensive and impractical for commodity processors, on which not only the consumer market but also an increasing segment of the business market now depends. Solutions at the microarchitecture level and even the software level, on the other hand, overcome some of these circuitlevel challenges without significantly raising costs or lowering performance. Using examples drawn from our Alarms Project and related work, we illustrate how collaborative design that encompasses circuits, architecture, and chip-resident software leads to a cost-effective solution for inductive voltage noise, sometimes called the dI/dt problem.

The strategy that we use for assuring correctness while preserving performance can be extended to other variation problems. Index Terms—Dynamic variation, error correction, error detection, error recovery, error resiliency, hw/sw co-design, inductive noise, power supply noise, reliability, resilient design, resilient microprocessor, timing error, variation, voltage droop.

Paper
2010
V. J. Reddi, et al., “Eliminating Voltage Emergencies via Software-Guided Code Transformations,” ACM Transactions on Architecture and Code Optimization (TACO), vol. 7, no. 2, pp. 12, 2010. Publisher's VersionAbstract

In recent years, circuit reliability in modern high-performance processors has become increasingly important. Shrinking feature sizes and diminishing supply voltages have made circuits more sensitive to microprocessor supply voltage fluctuations. These fluctuations result from the natural variation of processor activity as workloads execute, but when left unattended, these voltage fluctuations can lead to timing violations or even transistor lifetime issues. In this paper, we present a hardware-software collaborative approach to mitigate voltage fluctuations. A checkpoint-recovery mechanism rectifies errors when voltage violates maximum tolerance settings, while a run-time software layer reschedules the program’s instruction stream to prevent recurring violations at the same program location. The run-time layer, combined with the proposed code rescheduling algorithm, removes 60% of all violations with minimal overhead, thereby significantly improving overall performance. Our solution is a radical departure from the ongoing industry standard approach to circumvent the issue altogether by optimizing for the worst case voltage flux, which compromises power and performance efficiency severely, especially looking ahead to future technology generations. Existing conservative approaches will have severe implications on the ability to deliver efficient microprocessors. The proposed technique reassembles a traditional reliability problem as a runtime performance optimization problem, thus allowing us to design processors for typical case operation by building intelligent algorithms that can prevent recurring violations.

Categories and Subject Descriptors: B.8.1 [Performance and Reliability]: Reliability, Testing, and Fault-Tolerance

General Terms: Performance, Reliability

Additional Key Words and Phrases: Voltage Noise, dI/dt, Inductive Noise, Voltage Emergencies

Paper
V. J. Reddi, “Software-Assisted Hardware Reliability: Enabling Aggressive Timing Speculation Using Run-time Feedback from Hardware and Software,” Harvard University, 2010. Publisher's VersionAbstract

In the era of nanoscale technology scaling, we are facing the limits of physics, challenging robust and reliable microprocessor design and fabrication. As these trends continue, guaranteeing correctness of execution is becoming prohibitively expensive and impractical. In this thesis, we demonstrate the benefits of abstracting circuit-level challenges to the architecture and software layers. Reliability challenges are broadly classified into process, voltage, and thermal variations. As proof of concept, we target voltage variation, which is least understood, demonstrating its growing detrimental effects on future processors: Shrinking feature size and diminishing supply voltage are making circuits more sensitive to supply voltage fluctuations within the microprocessor. If left unattended, these voltage fluctuations can lead to timing violations or even transistor lifetime issues. This problem, more commonly known as the dI/dt problem, is forcing microprocessor designers to increasingly sacrifice processor performance, as well as power efficiency, in order to guarantee correctness and robustness of operation. Industry addresses this problem by un-optimizing the processor for the worst case voltage flux. Setting such extreme operating voltage margins for those large and infrequent voltage swings is not a sustainable solution in the long term. Therefore, we depart from this traditional strategy and operate the processor under more typical case conditions. We demonstrate that a collaborative architecture between hardware and software enables aggressive operating voltage margins, and as a consequence improves processor performance and power efficiency. This co-designed architecture is built on the principles of tolerance, avoidance and elimination. Using a fail-safe hardware mechanism to tolerate voltage margin violations, we enable timing speculation, while a run-time hardware and software layer attempts to not only predict and avoid impending violations, but also reschedules instructions and co-schedules threads intelligently to eliminate voltage violations altogether. We believe tolerance, avoidance and elimination are generalizable constructs capable of acting as guidelines to address and successfully mitigate the other parameter-related reliability challenges as well.

Paper
S. Kanev, et al., “A System-Level View of Voltage Noise in Production Processors,” ACM Transactions on Architecture and Code Optimization, vol. 9, no. 4, 2010.Abstract

Parameter variations have become a dominant challenge in microprocessor design. Voltage variation is es- pecially daunting because it happens rapidly. We measure and characterize voltage variation in a running Intel⃝R CoreTM2 Duo processor. By sensing on-die voltage as the processor runs single-threaded, multi- threaded, and multi-program workloads, we determine the average supply voltage swing of the processor to be only 4%, far from the processor’s 14% worst-case operating voltage margin. While such large margins guarantee correctness, they penalize performance and power efficiency. We investigate and quantify the benefits of designing a processor for typical-case (rather than worst-case) voltage swings, assuming that a fail-safe mechanism protects it from infrequently occurring large voltage fluctuations. With the investigated processors, such resilient designs could yield 15% to 20% performance improvements. But we also show that in future systems, these gains could be lost as increasing voltage swings intensify the frequency of fail-safe recoveries. After characterizing microarchitectural activity that leads to voltage swings within multi-core systems, we show two software techniques that have the potential to mitigate such voltage emergencies. A voltage-aware compiler can choose to de-optimize for performance in favor of better noise behavior, while a thread scheduler can co-schedule phases of different programs to mitigate error recovery overheads in future resilient processor designs.

PDF
V. J. Reddi, et al., “Voltage Smoothing: Characterizing and Mitigating Voltage Noise in Production Processors via Software-Guided Thread Scheduling,” in Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, 2010, pp. 77–88. Publisher's VersionAbstract

More than 20% of the available energy is lost in “the last centimeter” from the PCB board to the microprocessor chip due to inherent inefficiencies of power delivery subsystems (PDSs) in today’s computing systems. By series-stacking multiple voltage domains to eliminate explicit voltage conversion and reduce loss along the power delivery path, voltage stacking (VS) is a novel configuration that can improve power delivery efficiency (PDE). However, VS suffers from aggravated levels of supply noise caused by current imbalance between the stacking layers, preventing its practical adoption in mainstream computing systems. Throughput-centric manycore architectures such as GPUs intrinsically exhibit more balanced workloads, yet suffer from lower PDE, making them ideal platforms to implement voltage stacking. In this paper, we present a cross-layer approach to practical voltage stacking implementation in GPUs. It combines circuit-level voltage regulation using distributed charge-recycling integrated voltage regulators (CR-IVRs) with architecture-level voltage smoothing guided by control theory. Our proposed voltage-stacked GPUs can eliminate 61.5% of total PDS energy loss and achieve 92.3% system-level power delivery efficiency, a 12.3% improvement over the conventional single-layer based PDS. Compared to the circuit-only solution, the cross-layer approach significantly reduces the implementation cost of voltage stacking (88% reduction in area overhead) without compromising supply reliability under worst-case scenarios and across a wide range of real-world benchmarks. In addition, we demonstrate that the cross-layer solution not only complements on-chip CR-IVRs to transparently manage current imbalance and restore stable layer voltages, but also serves as a seamless interface to accommodate higher-level power optimization techniques, traditionally thought to be incompatible with a VS configuration.

Paper
2009
M. S. Gupta, V. J. Reddi, M. D. Smith, G. - Y. Wei, and D. M. Brooks, “An Event-Guided Approach to Handling Inductive Noise in Processors,” in Design, Automation, and Test in Europe Conference (DATE-09), Nice, France, 2009. IEEE VersionAbstract

Supply voltage fluctuations that result from inductive noise are increasingly troublesome in modern microprocessors. A voltage “emergency”, i.e., a swing beyond tolerable operating margins, jeopardizes the safe and correct operation of the processor. Techniques aimed at reducing power consumption, e.g., by clock gating or by reducing nominal supply voltage, exacerbate this noise problem, requiring ever-wider operating margins. We propose an event-guided, adaptive method for avoiding voltage emergencies, which exploits the fact that most emergencies are correlated with unique microarchitectural events, such as cache misses or the pipeline flushes that follow branch mispredictions. Using checkpoint and rollback to handle unavoidable emergencies, our method adapts dynamically by learning to trigger avoidance mechanisms when emergency-prone events recur. After tightening supply voltage margins to increase clock frequency and accounting for all costs, the net result is a performance improvement of 8% across a suite of fifteen SPEC CPU2000 benchmarks.

PDF
M. S. Gupta, V. J. Reddi, G. Holloway, G. - Y. Wei, and D. M. Brooks, “An Event-Guided Approach to Reducing Voltage Noise in Processors,” in Design, Automation & Test in Europe Conference & Exhibition, 2009. DATE'09. 2009, pp. 160–165. Publisher's Version
V. J. Reddi, M. S. Gupta, M. D. Smith, G. - Y. Wei, D. Brooks, and S. Campanoni, “Software-Assisted Hardware Reliability: Abstracting Circuit-Level Challenges to the Software Stack,” in Proceedings of the 46th Annual Design Automation Conference, 2009, pp. 788–793. Publisher's VersionAbstract

Power constrained designs are becoming increasingly sensitive to supply voltage noise. We propose a hardware-software collaborative approach to enable aggressive operating margins: a checkpoint-recovery mechanism corrects margin violations, while a run-time software layer reschedules the program’s instruction stream to prevent recurring margin crossings at the same program location. The run-time layer removes 60% of these events with minimal overhead, thereby significantly improving overall performance.

Categories and Subject Descriptors

C.0 [Computer Systems Organization]: General— Hardware/Software interfaces and System architectures.

General Terms

Performance, Reliability.

Keywords

Runtime Optimization, Hardware Software Co-Design.

Paper
V. J. Reddi, M. S. Gupta, G. Holloway, G. - Y. Wei, M. D. Smith, and D. Brooks, “Voltage Emergency Prediction: Using Signatures to Reduce Operating Margins,” in High Performance Computer Architecture, 2009. HPCA 2009. IEEE 15th International Symposium on, 2009, pp. 18–29. Publisher's VersionAbstract

Inductive noise forces microprocessor designers to sacrifice performance in order to ensure correct and reliable operation of their designs. The possibility of wide fluctuations in supply voltage means that timing margins throughout the processor must be set pessimistically to protect against worst-case droops and surges. While sensor-based reactive schemes have been proposed to deal with voltage noise, inherent sensor delays limit their effectiveness. Instead, this paper describes a voltage emergency predictor that learns the signatures of voltage emergencies (the combinations of control flow and microarchitectural events leading up to them) and uses these signatures to prevent recurrence of the corresponding emergencies. In simulations of a representative superscalar microprocessor in which fluctuations beyond 4% of nominal voltage are treated as emergencies (an aggressive configuration), these signatures can pinpoint the likelihood of an emergency some 16 cycles ahead of time with 90% accuracy. This lead time allows machines to operate with much tighter voltage margins (4% instead of 13%) and up to 13.5% higher performance, which closely approaches the 14.2% performance improvement possible with an ideal oracle-based predictor.

Paper
V. J. Reddi, et al., “Voltage Noise: Why It’s Bad, and What To Do About It,” in 5th IEEE Workshop on Silicon Errors in Logic-System Effects (SELSE), Palo Alto, CA, 2009.Abstract

Power constrained designs are becoming increasingly sensitive to supply voltage noise. We propose hardware-software collaboration to enable aggressive voltage margins: a fail-safe hardware mechanism tolerates margin violations in order to train a run-time software layer that reschedules instructions to avoid recurring violations. Additionally, the software controls an emergency signature-based predictor that throttles to suppress emergencies that code rescheduling cannot eliminate.

Paper
2007
A. Shye, T. Moseley, V. J. Reddi, J. Blomstedt, and D. A. Connors, “Using process-level redundancy to exploit multiple cores for transient fault tolerance,” in Dependable Systems and Networks, 2007. DSN'07. 37th Annual IEEE/IFIP International Conference on, 2007, pp. 297–306.
A. Shye, et al., “Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance,” in 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2007.Abstract

Transient faults are emerging as a critical concern in the reliability of general-purpose microprocessors. As architectural trends point towards multi-threaded multi-core designs, there is substantial interest in adapting such parallel hardware resources for transient fault tolerance. This paper proposes a software-based multi-core alternative for transient fault tolerance using process-level redundancy (PLR). PLR creates a set of redundant processes per application process and systematically compares the processes to guarantee correct execution. Redundancy at the process level allows the operating system to freely schedule the processes across all available hardware resources. PLR’s softwarecentric approach to transient fault tolerance shifts the focus from ensuring correct hardware execution to ensuring correct software execution. As a result, PLR ignores many benign faults that do not propagate to affect program correctness. A real PLR prototype for running single-threaded applications is presented and evaluated for fault coverage and performance. On a 4-way SMP machine, PLR provides improved performance over existing software transient fault tolerance techniques with 16.9% overhead for fault detection on a set of optimized SPEC2000 binaries.

Paper Presentation
2006
A. Shye, V. J. Reddi, T. Moseley, and D. A. Connors, “Transient fault tolerance via dynamic process-level redundancy,” in Proc. of Workshop on Binary Instrumentation and Applications, 2006.

Pages