J. Leng, et al., “
Asymmetric Resilience: Exploiting Task-Level Idempotency for Transient Error Recovery in Accelerator-Based Systems,” in
2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2020, pp. 44-57.
AbstractAccelerators make the task of building systems that are re-silient against transient errors like voltage noise and soft errors hard. Architects integrate accelerators into the system as black box third-party IP components. So a fault in one or more accelerators may threaten the system's reliability if there are no established failure semantics for how an error propagates from the accelerator to the main CPU. Existing solutions that assure system reliability come at the cost of sacrificing accelerator generality, efficiency, and incur significant overhead, even in the absence of errors. To over-come these drawbacks, we examine reliability management of accelerator systems via hardware-software co-design, coupling an efficient architecture design with compiler and run-time support, to cope with transient errors. We introduce asymmetric resilience that architects reliability at the system level, centered around a hardened CPU, rather than at the accelerator level. At runtime, the system exploits task-level idempotency to contain accelerator errors and use memory protection instead of taking checkpoints to mitigate over-heads. We also leverage the fact that errors rarely occur in systems, and exploit the trade-off between error recovery performance and improved error-free performance to enhance system efficiency. Using GPUs, which are at the fore-front of accelerator systems, we demonstrate how our system architecture manages reliability in both integrated and discrete systems, under voltage-noise and soft-error related faults, leading to extremely low overhead (less than 1%) and substantial gains (20% energy savings on average).
G. Papadimitriou, et al., “
Exceeding Conservative Limits: A Consolidated Analysis on Modern Hardware Margins,”
IEEE Transactions on Device and Materials Reliability, vol. 20, no. 2, pp. 341-350, 2020.
AbstractModern large-scale computing systems (data centers, supercomputers, cloud and edge setups and high-end cyber-physical systems) employ heterogeneous architectures that consist of multicore CPUs, general-purpose many-core GPUs, and programmable FPGAs. The effective utilization of these architectures poses several challenges, among which a primary one is power consumption. Voltage reduction is one of the most efficient methods to reduce power consumption of a chip. With the galloping adoption of hardware accelerators (i.e., GPUs and FPGAs) in large datacenters and other large-scale computing infrastructures, a comprehensive evaluation of the safe voltage reduction levels for each different chip can be employed for efficient reduction of the total power. We present a survey of recent studies in voltage margins reduction at the system level for modern CPUs, GPUs and FPGAs. The pessimistic voltage guardbands inserted by the silicon vendors can be exploited in all devices for significant power savings. On average, voltage reduction can reach 12% in multicore CPUs, 20% in manycore GPUs and 39% in FPGAs.
J. Leng, A. Buyuktosunoglu, R. Bertran, P. Bose, Y. Zu, and V. J. Reddi, “
Predictive Guardbanding: Program-driven Timing Margin Reduction for GPUs,”
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, pp. 1-1, 2020.
AbstractEnergy efficiency of GPU architectures has emerged as an essential aspect of computer system design. In this paper, we explore the energy benefits of reducing the GPU chip’s voltage to the safe limit, i.e., Vmin point, using predictive software techniques. We perform such a study on several commercial off-the-shelf GPU cards. We find that there exists about 20% voltage guardband on those GPUs spanning two architectural generations, which, if “eliminated"entirely, can result in up to 25% energy savings on one of the studied GPU cards. Our measurement results unveil a program dependent Vmin behavior across the studied applications, and the exact improvement magnitude depends on the program’s available guardband. We make fundamental observations about the program-dependent Vmin behavior. We experimentally determine that the voltage noise has a more substantial impact on Vmin compared to the process and temperature variation, and the activities during the kernel execution cause large voltage droops. From these findings, we show how to use kernels’ microarchitectural performance counters to predict its Vmin value accurately. The average and maximum prediction errors are 0.5% and 3%, respectively. The accurate Vmin prediction opens up new possibilities of a cross-layer dynamic guardbanding scheme for GPUs, in which software predicts and manages the voltage guardband, while the functional correctness is ensured by a hardware safety net mechanism.
A. Zou, et al., “
Voltage-Stacked Power Delivery Systems: Reliability, Efficiency, and Power Management,”
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, pp. 1-1, 2020.
AbstractIn today’s manycore processors, energy loss of more than 20% may result from inherent inefficiencies of conventional power delivery system (PDS) design. By stacking multiple voltage domains in series to lower the step-down conversion ratio of the off-chip voltage regulator module (VRM) and reduce energy loss along the path of the power delivery network (PDN), voltage stacking (VS) offers a novel alternative power delivery technique to fundamentally improve power delivery efficiency (PDE). However, voltage stacking suffers from aggravated supply voltage noise from current imbalance, which hinders its adoption. In this paper, we investigate practical voltage stacking implementation in manycore processors to improve power delivery efficiency (PDE) and achieve reliable performance, while maintaining compatibility with advanced power management techniques. We first present the system configuration of a voltage-stacked manycore processor. We then systematically characterize supply voltage noise in voltage stacking, identify global and residual differential currents as its dominant contributors, and calculate the possible worst supply voltage noise. We next propose a hybrid voltage regulation solution, based on a charge-recycling off-chip voltage regulator and distributed integrated voltage regulators, to mitigate supply voltage noise effectively. We also study the compatibility of voltage stacking with higher level power management techniques. Finally, the performance of a voltage-stacked GPU system is comprehensively evaluated. Simulation results show that our approach can achieve 93.5% power delivery efficiency, reducing the power loss by 13.6% compared to conventional single-layer power delivery system.