Conventional hardware-friendly quantization methods, such as fixed-point or integer, tend to perform poorly at very low word sizes as their shrinking dynamic ranges cannot adequately capture the wide data distributions commonly seen in sequence transduction models. We present AdaptivFloat, a floating-point inspired number representation format for deep learning that dynamically maximizes and optimally clips its available dynamic range, at a layer granularity, in order to create faithful encodings of neural network parameters. AdaptivFloat consistently produces higher inference accuracies than block floating-point, uniform, IEEE-like float, or posit encodings at very low precision (≤ 8-bit) across a diverse set of state-of-the-art neural network topologies. Notably, AdaptivFloat surpasses baseline FP32 performance by up to +0.3 in BLEU score and -0.75 in word error rate at weight bit widths ≤ 8-bit. Experimental results on a deep neural network (DNN) hardware accelerator, exploiting AdaptivFloat logic in its computational datapath, demonstrate per-operation energy and area that are 0.9× and 1.14×, respectively, those of equivalent bit width integer-based accelerator variants.
Conventional hardware-friendly quantization methods, such as fixed-point or integer, tend to perform poorly at very low precision as their shrunken dynamic ranges cannot adequately capture the wide data distributions commonly seen in sequence transduction models. We present an algorithm-hardware co-design centered around a novel floating-point inspired number format, AdaptivFloat, that dynamically maximizes and optimally clips its available dynamic range, at a layer granularity, in order to create faithful encodings of neural network parameters. AdaptivFloat consistently produces higher inference accuracies than block floating-point, uniform, IEEE-like float, or posit encodings at low bit precision (≤ 8-bit) across a diverse set of state-of-the-art neural networks exhibiting narrow to wide weight distributions. Notably, at 4-bit weight precision, the AdaptivFloat-quantized Transformer network suffers only a 2.1 BLEU score degradation, compared to a total loss of accuracy when encoded in the aforementioned prominent datatypes. Furthermore, experimental results on a deep neural network (DNN) processing element (PE), exploiting AdaptivFloat logic in its computational datapath, demonstrate per-operation energy and area that are 0.9× and 1.14×, respectively, those of an equivalent bit width NVDLA-like integer-based PE.
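The core idea, per-layer adaptation of a low-bit float grid, can be sketched roughly as follows: choose the exponent range of the format from each layer's maximum absolute weight, then round and clip every value onto that grid. The NumPy sketch below is a simplified illustration under stated assumptions (no reserved zero or denormal encodings, exponent range picked to just cover max |w|); it is not the paper's reference implementation.

```python
import numpy as np

def adaptivfloat_quantize(w, n_bits=8, n_exp=4):
    """Illustrative, simplified AdaptivFloat-style quantizer (not the paper's code).

    The exponent range is shifted per tensor so that the largest
    representable magnitude just covers max(|w|); values are then
    rounded to the nearest representable low-bit float and clipped.
    Assumes w contains at least one nonzero value.
    """
    n_man = n_bits - 1 - n_exp                 # 1 sign bit, rest mantissa
    max_man = 2.0 - 2.0 ** (-n_man)            # largest normalized significand
    w = np.asarray(w, dtype=np.float64)
    w_max = np.abs(w).max()
    if w_max == 0.0:
        return np.zeros_like(w)

    exp_max = int(np.ceil(np.log2(w_max / max_man)))   # top of exponent range
    exp_min = exp_max - (2 ** n_exp - 1)                # bottom of exponent range
    min_val = 2.0 ** exp_min                            # smallest kept magnitude
    max_val = max_man * 2.0 ** exp_max                  # clipping threshold

    sign = np.sign(w)
    mag = np.clip(np.abs(w), min_val, max_val)
    exp = np.floor(np.log2(mag))                        # each value's own exponent
    lsb = 2.0 ** (exp - n_man)                          # mantissa step at that exponent
    q = np.clip(np.round(mag / lsb) * lsb, min_val, max_val)
    return sign * q

weights = np.random.randn(4, 4) * 0.1
print(adaptivfloat_quantize(weights, n_bits=8, n_exp=4))
```

Because the exponent range is re-derived per layer, layers with narrow weight distributions spend their few bits on fine mantissa steps, while layers with wide distributions retain coverage of their outliers.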
Accelerators make the task of building systems that are resilient against transient errors, such as voltage noise and soft errors, hard. Architects integrate accelerators into the system as black-box third-party IP components, so a fault in one or more accelerators may threaten the system's reliability if there are no established failure semantics for how an error propagates from the accelerator to the main CPU. Existing solutions that assure system reliability sacrifice accelerator generality and efficiency, and incur significant overhead even in the absence of errors. To overcome these drawbacks, we examine reliability management of accelerator systems via hardware-software co-design, coupling an efficient architecture design with compiler and runtime support, to cope with transient errors. We introduce asymmetric resilience, which architects reliability at the system level, centered around a hardened CPU, rather than at the accelerator level. At runtime, the system exploits task-level idempotency to contain accelerator errors and uses memory protection instead of checkpointing to mitigate overheads. We also leverage the fact that errors rarely occur in systems, and exploit the trade-off between error recovery performance and improved error-free performance to enhance system efficiency. Using GPUs, which are at the forefront of accelerator systems, we demonstrate how our system architecture manages reliability in both integrated and discrete systems, under voltage-noise- and soft-error-related faults, leading to extremely low overhead (less than 1%) and substantial gains (20% energy savings on average).
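The recovery flow implied by task-level idempotency plus memory protection can be sketched as below; every function name (protect_read_only, launch_on_accelerator, error_detected) is a hypothetical stand-in, not an interface from the paper.

```python
import random

# Hypothetical stand-ins for the mechanisms described above.
def protect_read_only(buf):
    pass                                  # e.g., write-protect task inputs instead of checkpointing

def unprotect(buf):
    pass                                  # restore write permission after the task

def launch_on_accelerator(task, inputs):
    return task(inputs)                   # stand-in for an offload call

def error_detected():
    return random.random() < 0.01         # stand-in for a transient-error / safety-net signal

def run_resilient(task, inputs, max_retries=3):
    """Contain accelerator errors without checkpoints: inputs stay
    write-protected, so a failed (idempotent) task is simply re-run."""
    protect_read_only(inputs)
    try:
        for _ in range(max_retries):
            result = launch_on_accelerator(task, inputs)
            if not error_detected():
                return result             # common case: no error, no recovery work
        raise RuntimeError("retries exhausted; fall back to the hardened CPU")
    finally:
        unprotect(inputs)

print(run_resilient(lambda xs: sum(xs), [1, 2, 3]))
```

In the error-free case the only added work is the input protection setup, which is what keeps the common path cheap.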
Machine Learning (ML) is widely used today in many mobile applications. To preserve user privacy, there is a need to perform ML inference on the mobile device itself. Given that ML inference is a computationally intensive task, the common technique used in mobile devices is offloading the task to a neural accelerator. However, the speed-up gained from offloading these tasks to accelerators is limited by the overhead of frequent host-accelerator communication. In this paper, we propose a complete end-to-end solution that uses an in-pipeline machine learning processing unit to accelerate ML workloads. First, we introduce the software infrastructure we developed to support compilation and execution of machine learning models used in the TensorFlow Lite framework. Then, we discuss the microarchitecture we plan to implement for supporting the execution of our vectorized machine learning kernels.
Modern large-scale computing systems (data centers, supercomputers, cloud and edge setups, and high-end cyber-physical systems) employ heterogeneous architectures that consist of multicore CPUs, general-purpose many-core GPUs, and programmable FPGAs. The effective utilization of these architectures poses several challenges, among which a primary one is power consumption. Voltage reduction is one of the most efficient methods to reduce the power consumption of a chip. With the galloping adoption of hardware accelerators (i.e., GPUs and FPGAs) in large datacenters and other large-scale computing infrastructures, a comprehensive evaluation of the safe voltage reduction levels for each different chip can be employed for efficient reduction of the total power. We present a survey of recent studies in voltage margin reduction at the system level for modern CPUs, GPUs, and FPGAs. The pessimistic voltage guardbands inserted by silicon vendors can be exploited in all devices for significant power savings. On average, voltage reduction can reach 12% in multicore CPUs, 20% in manycore GPUs, and 39% in FPGAs.
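As a rough sense of what those margins are worth, dynamic power scales approximately with the square of the supply voltage at a fixed frequency, so the surveyed undervolting headroom translates into the first-order savings estimated below (leakage and frequency effects ignored; the printed percentages are back-of-the-envelope estimates, not results from the survey).

```python
# First-order estimate: P_dyn is proportional to C * V^2 * f, so lowering V by a
# margin m at fixed frequency cuts dynamic power by roughly 1 - (1 - m)^2.
for device, margin in [("multicore CPU", 0.12),
                       ("manycore GPU", 0.20),
                       ("FPGA", 0.39)]:
    savings = 1.0 - (1.0 - margin) ** 2
    print(f"{device}: {margin:.0%} lower Vdd -> ~{savings:.0%} less dynamic power")
```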
Artificial intelligence and machine learning are experiencing widespread adoption in the industry, academia, and even public consciousness. This has been driven by the rapid advances in the applications and accuracy of AI through increasingly complex algorithms and models; this, in turn, has spurred research into developing specialized hardware AI accelerators. The rapid pace of these advances makes it easy to miss the forest for the trees: accelerators are often developed and evaluated in a vacuum, without considering the full application environment in which they must eventually operate. In this paper, we deploy and characterize Face Recognition, an AI-centric edge video analytics application built using open source and widely adopted infrastructure and ML tools. We evaluate its holistic, end-to-end behavior in a production-size edge data center and reveal the “AI tax” for all the processing that is involved. Even though the application is built around state-of-the-art AI and ML algorithms, it relies heavily on pre- and post-processing code which must be executed on a general-purpose CPU. As AI-centric applications start to reap the acceleration promised by so many accelerators, we find they impose stresses on the underlying software infrastructure and the data center’s capabilities: storage and network bandwidth become major bottlenecks with increasing AI acceleration. By not having to serve a wide variety of applications, we show that a purpose-built edge data center can be designed to accommodate the stresses of accelerated AI at 15% lower TCO than one derived from homogeneous servers and infrastructure. We also discuss how our conclusions generalize beyond Face Recognition as many AI-centric applications at the edge rely upon the same underlying software and hardware infrastructure.
In this article, we describe the design choices behind MLPerf, a machine learning performance benchmark that has become an industry standard. The first two rounds of the MLPerf Training benchmark helped drive improvements to software-stack performance and scalability, showing a 1.3× speedup in the top 16-chip results despite higher quality targets and a 5.5× increase in system scale. The first round of MLPerf Inference received over 500 benchmark results from 14 different organizations, showing growing adoption.
Machine-learning (ML) hardware and software system demand is burgeoning. Driven by ML applications, the number of different ML inference systems has exploded. Over 100 organizations are building ML inference chips, and the systems that incorporate existing models span at least three orders of magnitude in power consumption and five orders of magnitude in performance; they range from embedded devices to data-center solutions. Fueling the hardware are a dozen or more software frameworks and libraries. The myriad combinations of ML hardware and ML software make assessing ML-system performance in an architecture-neutral, representative, and reproducible manner challenging. There is a clear need for industry-wide standard ML benchmarking and evaluation criteria. MLPerf Inference answers that call. In this paper, we present our benchmarking method for evaluating ML inference systems. Driven by more than 30 organizations as well as more than 200 ML engineers and practitioners, MLPerf prescribes a set of rules and best practices to ensure comparability across systems with wildly differing architectures. The first call for submissions garnered more than 600 reproducible inference-performance measurements from 14 organizations, representing over 30 systems that showcase a wide range of capabilities. The submissions attest to the benchmark's flexibility and adaptability.
Energy efficiency of GPU architectures has emerged as an essential aspect of computer system design. In this paper, we explore the energy benefits of reducing the GPU chip’s voltage to the safe limit, i.e., the Vmin point, using predictive software techniques. We perform such a study on several commercial off-the-shelf GPU cards. We find that there exists about a 20% voltage guardband on those GPUs spanning two architectural generations, which, if “eliminated” entirely, can result in up to 25% energy savings on one of the studied GPU cards. Our measurement results unveil a program-dependent Vmin behavior across the studied applications, and the exact improvement magnitude depends on the program’s available guardband. We make fundamental observations about the program-dependent Vmin behavior. We experimentally determine that voltage noise has a more substantial impact on Vmin than process and temperature variation, and that the activities during kernel execution cause large voltage droops. From these findings, we show how to use a kernel’s microarchitectural performance counters to predict its Vmin value accurately. The average and maximum prediction errors are 0.5% and 3%, respectively. The accurate Vmin prediction opens up new possibilities of a cross-layer dynamic guardbanding scheme for GPUs, in which software predicts and manages the voltage guardband, while functional correctness is ensured by a hardware safety net mechanism.
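The prediction step amounts to fitting a simple model from per-kernel microarchitectural counters to measured Vmin. The sketch below fits a linear model by ordinary least squares; the feature choices and all numeric values are illustrative placeholders, not measurements or the model from the paper.

```python
import numpy as np

# Placeholder features: per-kernel counters normalized per cycle
# (e.g., issue rate, memory-transaction rate, occupancy) -- illustrative only.
X = np.array([[0.82, 0.10, 0.65],
              [0.45, 0.30, 0.40],
              [0.91, 0.05, 0.80],
              [0.30, 0.55, 0.35]])
y = np.array([0.83, 0.79, 0.85, 0.78])   # placeholder "measured" Vmin per kernel (V)

# Fit Vmin ~ w.x + b by least squares.
A = np.hstack([X, np.ones((X.shape[0], 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict_vmin(counters):
    """Predict a kernel's Vmin from its counter vector (illustrative sketch)."""
    return float(np.dot(coef[:-1], counters) + coef[-1])

print(f"predicted Vmin: {predict_vmin([0.70, 0.20, 0.55]):.3f} V")
```

A hardware safety net, as described above, would still catch the rare case where the prediction lands below the true Vmin.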
In today’s manycore processors, energy loss of more than 20% may result from inherent inefficiencies of conventional power delivery system (PDS) design. By stacking multiple voltage domains in series to lower the step-down conversion ratio of the off-chip voltage regulator module (VRM) and reduce energy loss along the path of the power delivery network (PDN), voltage stacking (VS) offers a novel alternative power delivery technique to fundamentally improve power delivery efficiency (PDE). However, voltage stacking suffers from aggravated supply voltage noise caused by current imbalance, which hinders its adoption. In this paper, we investigate a practical voltage stacking implementation in manycore processors to improve power delivery efficiency (PDE) and achieve reliable performance, while maintaining compatibility with advanced power management techniques. We first present the system configuration of a voltage-stacked manycore processor. We then systematically characterize supply voltage noise in voltage stacking, identify global and residual differential currents as its dominant contributors, and calculate the possible worst-case supply voltage noise. We next propose a hybrid voltage regulation solution, based on a charge-recycling off-chip voltage regulator and distributed integrated voltage regulators, to mitigate supply voltage noise effectively. We also study the compatibility of voltage stacking with higher-level power management techniques. Finally, the performance of a voltage-stacked GPU system is comprehensively evaluated. Simulation results show that our approach can achieve 93.5% power delivery efficiency, reducing power loss by 13.6% compared to a conventional single-layer power delivery system.
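The efficiency argument behind stacking can be illustrated with a first-order model: for a fixed delivered power, stacking N domains in series raises the voltage seen by the PDN by N and cuts the PDN current by N, so resistive loss falls by roughly N². The function and numbers below are illustrative placeholders that ignore regulator conversion loss and the voltage-noise effects discussed above.

```python
def pdn_efficiency(p_load_w, vdd_v, n_stack, r_pdn_ohm):
    """First-order power delivery efficiency considering only PDN I^2*R loss."""
    v_supply = n_stack * vdd_v           # series stack raises the delivery voltage
    i_pdn = p_load_w / v_supply          # current drawn through the PDN
    p_loss = i_pdn ** 2 * r_pdn_ohm      # resistive loss along the PDN
    return p_load_w / (p_load_w + p_loss)

# Illustrative placeholder values, not parameters from the paper.
print(f"single layer: {pdn_efficiency(200, 1.0, 1, 0.001):.1%}")
print(f"4-high stack: {pdn_efficiency(200, 1.0, 4, 0.001):.1%}")
```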