Accelerators are becoming popular owing to their exceptional performance and power-efficiency. However, researchers are yet to pay close attention to their reliability---a key challenge as technology scaling makes building reliable systems challenging. A straightforward solution to make accelerators reliable is to design the accelerator from the ground-up to be reliable by itself. However, such a myopic view of the system, where each accelerator is designed in isolation, is unsustainable as the number of integrated accelerators continues to rise in SoCs. To address this challenge, we propose a paradigm called "asymmetric resilience'' that avoids accelerator-specific reliability design. Instead, its core principle is to develop the reliable heterogeneous system around the CPU architecture. We explain the implications of architecting such a system and the modifications needed in a heterogeneous system to adopt such an approach. As an example, we demonstrate how to use asymmetric resilience to handle GPU execution errors using the CPU with minimal overhead. The general principles can be extended to include other accelerators.
Modern large-scale computing systems (data centers, supercomputers, cloud and edge setups and high-end cyber-physical systems) employ heterogeneous architectures that consist of multicore CPUs, general-purpose many-core GPUs, and programmable FPGAs. The effective utilization of these architectures poses several challenges, among which a primary one is power consumption. Voltage reduction is one of the most efficient methods to reduce power consumption of a chip. With the galloping adoption of hardware accelerators (i.e., GPUs and FPGAs) in large datacenters and other large-scale computing infrastructures, a comprehensive evaluation of the safe voltage reduction levels for each different chip can be employed for efficient reduction of the total power. We present a survey of recent studies in voltage margins reduction at the system level for modern CPUs, GPUs and FPGAs. The pessimistic voltage guardbands inserted by the silicon vendors can be exploited in all devices for significant power savings. Voltage reduction can reach 12% in multicore CPUs, 20% in manycore GPUs and 39% in FPGAs.
Parallel GPGPU applications rely on barrier synchronization to align thread block activity. Few prior work has studied and characterized barrier synchronization within a thread block and its impact on performance. In this paper, we find that barriers cause substantial stall cycles in barrier-intensive GPGPU applications although GPGPUs employ lightweight hardware-support barriers. To help investigate the reasons, we define the execution between two adjacent barriers of a thread block as a warp-phase. We find that the execution progress within a warp-phase varies dramatically across warps, which we call warp-phase-divergence. While warp-phasedivergence may result from execution time disparity among warps due to differences in application code or input, and/or shared resource contention, we also pinpoint that warp-phase-divergence may result from warp scheduling.
To mitigate barrier induced stall cycle inefficiency, we propose barrier-aware warp scheduling (BAWS). It combines two techniques to improve the performance of barrier-intensive GPGPU applications. The first technique, most-waiting-first (MWF), assigns a higher scheduling priority to the warps of a thread block that has a larger number of warps waiting at a barrier. The second technique, critical-fetch-first (CFF), fetches instructions from the warp to be issued by MWF in the next cycle. To evaluate the efficiency of BAWS, we consider 13 barrier-intensive GPGPU applications, and we report that BAWS speeds up performance by 17% and 9% on average (and up to 35% and 30%) over loosely-round-robin (LRR) and greedy-then-oldest (GTO) warp scheduling, respectively. We compare BAWS against recent concurrent work SAWS, finding that BAWS outperforms SAWS by 7% on average and up to 27%. For non-barrier-intensive workloads, we demonstrate that BAWS is performance-neutral compared to GTO and SAWS, while improving performance by 5.7% on average (and up to 22%) compared to LRR. BAWS’ hardware co
Energy efficiency is undoubtedly important for GPU architectures. Besides the traditionally explored energy-efficiency optimization techniques, exploiting the supply voltage guardband remains a promising yet unexplored opportunity. Our hardware measurements show that up to 23% of the nominal supply voltage can be eliminated to improve GPU energy efficiency by as much as 25%. The key obstacle for exploiting this opportunity lies in understanding the characteristics and root causes of large voltage droops in GPU architectures and subsequently smoothing them away without severe performance penalties. The GPU’s manycore nature complicates the voltage noise phenomenon, and its distinctive architecture features from the CPU necessitate a GPU-specific voltage noise analysis. In this paper, we make the following contributions. First, we provide a voltage noise categorization framework to identify, characterize, and understand voltage noise in the manycore GPU architecture. Second, we perform a microarchitecture-level voltage-droop root-cause analysis for the two major droop types we identify, namely the local first-order droop and the global second-order droop. Third, on the basis of our categorization and characterization, we propose a hierarchical voltage smoothing mechanism that mitigates each type of voltage droop. Our evaluation shows it can reduce up to 31% worst-case droop, which translates to 11.8% core-level and 7.8% processor-level energy reduction
Energy eciency of GPU architectures has emerged as an important aspect of computer system design. In this paper, we explore the energy benefits of reducing the GPU chip’s voltage to the safe limit, i.e. Vmin point. We perform such a study on several commercial o↵- the-shelf GPU cards. We find that there exists about 20% voltage guardband on those GPUs spanning two architectural generations, which, if “eliminated” completely, can result in up to 25% energy savings on one of the studied GPU cards. The exact improvement magnitude depends on the program’s available guardband, because our measurement results unveil a program dependent Vmin behavior across the studied programs. We make fundamental observations about the programdependent Vmin behavior. We experimentally determine that the voltage noise has a larger impact on Vmin compared to the process and temperature variation, and the activities during the kernel execution cause large voltage droops. From these findings, we show how to use a kernel’s microarchitectural performance counters to predict its Vmin value accurately. The average and maximum prediction errors are 0.5% and 3%, respectively. The accurate Vmin prediction opens up new possibilities of a cross-layer dynamic guardbanding scheme for GPUs, in which software predicts and manages the voltage guardband, while the functional correctness is ensured by a hardware safety net mechanism.
Energy efficiency of GPU architectures has emerged as an important design criterion for both NVIDIA and AMD. In this paper, we explore the benefits of scaling a generalpurpose GPU (GPGPU) core’s supply voltage to the near limits of execution failure. We find that as much as 21% of NVIDIA GTX 680’s core supply voltage guardband can be eliminated to achieve significant energy efficiency improvement. Measured results indicate that the energy improvements can be as high as 25% without any performance loss. The challenge, however, is to understand what impacts the minimum voltage guardband and how the guardband can be scaled without compromising correctness. We show that GPU microarchitectural activity patterns caused by different program characteristics are the root cause(s) of the large voltage guardband. We also demonstrate how microarchitecture-level parameters, such as clock frequency and the number of cores, impact the guardband. We hope our preliminary analysis lays the groundwork for future research.
Voltage noise is a major obstacle in improving processor energy eciency because it necessitates large operating voltage guardbands that increase overall power consumption and limit peak performance. Identifying the leading root causes of voltage noise is essential to minimize the unnecessary guardband and maximize the overall energy eciency. We provide the first-ever modeling and characterization of voltage noise in GPUs based on a new simulation infrastructure called GPUVolt. Using it, we identify the key intracore microarchitectural components (e.g., the register file and special functional units) that significantly impact the GPU’s voltage noise. We also demonstrate that intercore-aligned microarchitectural activity detrimentally impacts the chipwide worst-case voltage droops. On the basis of these findings, we propose a combined register-file and execution-unit throttling mechanism that smooths GPU voltage noise and reduces the guardband requirement by as much as 29%.
Categories and Subject Descriptors
C.4 [Performance of Systems]: Modeling techniques, Reliability, availability, and serviceability
— The intrinsic robustness of an algorithm and architecture depends highly on the combined ability tolerate noise. In this paper, we present an alternative approach for energy reduction for near threshold computing based on a statistical modeling of computational noise induced from noisy memory and non-ideal interconnects. We present this approach as a complement to the standard approximate computing approaches. We show results of the lightweight error checks and recovery based on several design considerations on data value speculation.
Index Terms—Approximate computing, noise resiliency, computation noise, near threshold computing
In this paper, we characterize the impact of compiler optimizations on voltage noise. While intuition may suggest that the better processor utilization ensured by optimizing compilers results in a small amount of voltage variation, our measurements on a IntelR CoreTM2 Duo processor show the opposite – the majority of SPEC 2006 benchmarks exhibit more voltage droops when aggressively optimized. We show that this increase in noise could be sufficient for a net performance decrease in a typicalcase, resilient design.
Shrinking feature size and diminishing supply voltage are making circuits sensitive to supply voltage fluctuations within the microprocessor, caused by normal workload activity changes. If left unattended,voltage fluctuations can lead to timing violations or even transistor lifetime issues that degrade processor robustness. Mechanisms that learn to tolerate, avoid, and eliminate voltage fluctuations based on program and microarchitectural events can help steer the processor clear of danger, thus enabling tighter voltage margins that improve performance or lower power consumption.We describe the problem of voltage variation and the factors that influence this variation during processor design and operation. We also describe a variety of runtime hardware and software mitigation techniques that either tolerate, avoid, and/or eliminate voltage violations.We hope processor architects will find the information useful since tolerance, avoidance, and elimination are generalizable constructs that can serve as a basis for addressing other reliability challenges as well.
voltage noise, voltage smoothing, di dt , inductive noise, voltage emergencies, error detection, error correction, error recovery, transient errors, power supply noise, power delivery networks
The semiconductor industry is facing a critical research challenge: design future high performance and energy efficient systems while satisfying historical standards for reliability and lower costs. The primary cause of this challenge is device and circuit parameter variability, which results from the manufacturing process and system operation. As technology scales, the adverse impact of these variations on system-level metrics increases. In this paper, we describe an interdisciplinary effort toward robust and resilient designs that mitigate the effects of device and circuit parameter variations in order to enhance system performance, energy efficiency, and reliability. Collaboration between the technology, CAD, circuit, and system levels of the compute hierarchy can foster the development of cost-effective and efficient solutions.
Unintended variations in circuit lithography and undesirable fluctuations in circuit operating parameters such as supply voltage and temperature are threatening the continuation of technology scaling that microprocessor evolution relies on. Although circuit-level solutions for some variation problems may be possible, they are prohibitively expensive and impractical for commodity processors, on which not only the consumer market but also an increasing segment of the business market now depends. Solutions at the microarchitecture level and even the software level, on the other hand, overcome some of these circuitlevel challenges without significantly raising costs or lowering performance. Using examples drawn from our Alarms Project and related work, we illustrate how collaborative design that encompasses circuits, architecture, and chip-resident software leads to a cost-effective solution for inductive voltage noise, sometimes called the dI/dt problem.
The strategy that we use for assuring correctness while preserving performance can be extended to other variation problems. Index Terms—Dynamic variation, error correction, error detection, error recovery, error resiliency, hw/sw co-design, inductive noise, power supply noise, reliability, resilient design, resilient microprocessor, timing error, variation, voltage droop.
In recent years, circuit reliability in modern high-performance processors has become increasingly important. Shrinking feature sizes and diminishing supply voltages have made circuits more sensitive to microprocessor supply voltage fluctuations. These fluctuations result from the natural variation of processor activity as workloads execute, but when left unattended, these voltage fluctuations can lead to timing violations or even transistor lifetime issues. In this paper, we present a hardware-software collaborative approach to mitigate voltage fluctuations. A checkpoint-recovery mechanism rectifies errors when voltage violates maximum tolerance settings, while a run-time software layer reschedules the program’s instruction stream to prevent recurring violations at the same program location. The run-time layer, combined with the proposed code rescheduling algorithm, removes 60% of all violations with minimal overhead, thereby significantly improving overall performance. Our solution is a radical departure from the ongoing industry standard approach to circumvent the issue altogether by optimizing for the worst case voltage flux, which compromises power and performance efficiency severely, especially looking ahead to future technology generations. Existing conservative approaches will have severe implications on the ability to deliver efficient microprocessors. The proposed technique reassembles a traditional reliability problem as a runtime performance optimization problem, thus allowing us to design processors for typical case operation by building intelligent algorithms that can prevent recurring violations.
Categories and Subject Descriptors: B.8.1 [Performance and Reliability]: Reliability, Testing, and Fault-Tolerance
General Terms: Performance, Reliability
Additional Key Words and Phrases: Voltage Noise, dI/dt, Inductive Noise, Voltage Emergencies
In the era of nanoscale technology scaling, we are facing the limits of physics, challenging robust and reliable microprocessor design and fabrication. As these trends continue, guaranteeing correctness of execution is becoming prohibitively expensive and impractical. In this thesis, we demonstrate the benefits of abstracting circuit-level challenges to the architecture and software layers. Reliability challenges are broadly classified into process, voltage, and thermal variations. As proof of concept, we target voltage variation, which is least understood, demonstrating its growing detrimental effects on future processors: Shrinking feature size and diminishing supply voltage are making circuits more sensitive to supply voltage fluctuations within the microprocessor. If left unattended, these voltage fluctuations can lead to timing violations or even transistor lifetime issues. This problem, more commonly known as the dI/dt problem, is forcing microprocessor designers to increasingly sacrifice processor performance, as well as power efficiency, in order to guarantee correctness and robustness of operation. Industry addresses this problem by un-optimizing the processor for the worst case voltage flux. Setting such extreme operating voltage margins for those large and infrequent voltage swings is not a sustainable solution in the long term. Therefore, we depart from this traditional strategy and operate the processor under more typical case conditions. We demonstrate that a collaborative architecture between hardware and software enables aggressive operating voltage margins, and as a consequence improves processor performance and power efficiency. This co-designed architecture is built on the principles of tolerance, avoidance and elimination. Using a fail-safe hardware mechanism to tolerate voltage margin violations, we enable timing speculation, while a run-time hardware and software layer attempts to not only predict and avoid impending violations, but also reschedules instructions and co-schedules threads intelligently to eliminate voltage violations altogether. We believe tolerance, avoidance and elimination are generalizable constructs capable of acting as guidelines to address and successfully mitigate the other parameter-related reliability challenges as well.
Parameter variations have become a dominant challenge in microprocessor design. Voltage variation is es- pecially daunting because it happens rapidly. We measure and characterize voltage variation in a running Intel⃝R CoreTM2 Duo processor. By sensing on-die voltage as the processor runs single-threaded, multi- threaded, and multi-program workloads, we determine the average supply voltage swing of the processor to be only 4%, far from the processor’s 14% worst-case operating voltage margin. While such large margins guarantee correctness, they penalize performance and power efficiency. We investigate and quantify the benefits of designing a processor for typical-case (rather than worst-case) voltage swings, assuming that a fail-safe mechanism protects it from infrequently occurring large voltage fluctuations. With the investigated processors, such resilient designs could yield 15% to 20% performance improvements. But we also show that in future systems, these gains could be lost as increasing voltage swings intensify the frequency of fail-safe recoveries. After characterizing microarchitectural activity that leads to voltage swings within multi-core systems, we show two software techniques that have the potential to mitigate such voltage emergencies. A voltage-aware compiler can choose to de-optimize for performance in favor of better noise behavior, while a thread scheduler can co-schedule phases of different programs to mitigate error recovery overheads in future resilient processor designs.
More than 20% of the available energy is lost in “the last centimeter” from the PCB board to the microprocessor chip due to inherent inefficiencies of power delivery subsystems (PDSs) in today’s computing systems. By series-stacking multiple voltage domains to eliminate explicit voltage conversion and reduce loss along the power delivery path, voltage stacking (VS) is a novel configuration that can improve power delivery efficiency (PDE). However, VS suffers from aggravated levels of supply noise caused by current imbalance between the stacking layers, preventing its practical adoption in mainstream computing systems. Throughput-centric manycore architectures such as GPUs intrinsically exhibit more balanced workloads, yet suffer from lower PDE, making them ideal platforms to implement voltage stacking. In this paper, we present a cross-layer approach to practical voltage stacking implementation in GPUs. It combines circuit-level voltage regulation using distributed charge-recycling integrated voltage regulators (CR-IVRs) with architecture-level voltage smoothing guided by control theory. Our proposed voltage-stacked GPUs can eliminate 61.5% of total PDS energy loss and achieve 92.3% system-level power delivery efficiency, a 12.3% improvement over the conventional single-layer based PDS. Compared to the circuit-only solution, the cross-layer approach significantly reduces the implementation cost of voltage stacking (88% reduction in area overhead) without compromising supply reliability under worst-case scenarios and across a wide range of real-world benchmarks. In addition, we demonstrate that the cross-layer solution not only complements on-chip CR-IVRs to transparently manage current imbalance and restore stable layer voltages, but also serves as a seamless interface to accommodate higher-level power optimization techniques, traditionally thought to be incompatible with a VS configuration.
Supply voltage fluctuations that result from inductive noise are increasingly troublesome in modern microprocessors. A voltage “emergency”, i.e., a swing beyond tolerable operating margins, jeopardizes the safe and correct operation of the processor. Techniques aimed at reducing power consumption, e.g., by clock gating or by reducing nominal supply voltage, exacerbate this noise problem, requiring ever-wider operating margins. We propose an event-guided, adaptive method for avoiding voltage emergencies, which exploits the fact that most emergencies are correlated with unique microarchitectural events, such as cache misses or the pipeline flushes that follow branch mispredictions. Using checkpoint and rollback to handle unavoidable emergencies, our method adapts dynamically by learning to trigger avoidance mechanisms when emergency-prone events recur. After tightening supply voltage margins to increase clock frequency and accounting for all costs, the net result is a performance improvement of 8% across a suite of fifteen SPEC CPU2000 benchmarks.