Accelerators

Processor design has evolved over the years from single-core CPUs to multicore CPUs and on to heterogeneous processors that integrate CPUs and GPUs. Despite this innovation, demand for computational horsepower continues to surge, and with it the need for performance at the system level. With the end of Dennard scaling, conventional CPU and GPU design enhancements can no longer sustain performance improvements that keep pace with that demand. As a result, specialized execution units that are orders of magnitude more efficient in power and performance for specific tasks are on the rise.

A12 Bionic chip die photo

System-on-chip (SoC) architectures integrate multiple specialized execution units into a holistic processor architecture to deliver high performance at a low energy cost. For instance, the A12 die photo above has been annotated with white boxes to show the number of hardware accelerator units in the processor that powers the iPhone XS. Outside of the modest area consumed by the CPU and GPU, the rest of the chip is dedicated to specialized fixed-function units like the Neural Engine (used for machine learning tasks). Mobile SoCs are harbingers of the future, as they already incorporate many specialized execution units.

We study how the integration of various accelerators needs to be coordinated and executed to ensure a balanced execution profile, in what we call "accelerator-level parallelism." To this end, we develop performance models for accelerator units, investigate coordination strategies across accelerators that often run concurrently and stress shared resources, and in some cases consider the design of custom accelerator solutions.

Publications

B. Boroujerdian, et al., “The Role of Compute in Autonomous Aerial Vehicles,” arXiv preprint arXiv:1906.10513, 2019.
Autonomous-mobile cyber-physical machines are part of our future. Specifically, unmanned-aerial-vehicles have seen a resurgence in activity with use-cases such as package delivery. These systems face many challenges such as their low-endurance caused by limited onboard-energy, hence, improving the mission-time and energy are of importance. Such improvements traditionally are delivered through better algorithms. But our premise is that more powerful and efficient onboard-compute should also address the problem. This paper investigates how the compute subsystem, in a cyber-physical mobile machine, such as a Micro Aerial Vehicle, impacts mission-time and energy. Specifically, we pose the question as what is the role of computing for cyber-physical mobile robots? We show that compute and motion are tightly intertwined, hence a close examination of cyber and physical processes and their impact on one another is necessary. We show different impact paths through which compute impacts mission-metrics and examine them using analytical models, simulation, and end-to-end benchmarking. To enable similar studies, we open sourced MAVBench, our tool-set consisting of a closed-loop simulator and a benchmark suite. Our investigations show cyber-physical co-design, a methodology where robot's cyber and physical processes/quantities are developed with one another consideration, similar to hardware-software co-design, is necessary for optimal robot design.
M. D. Hill and V. J. Reddi, “Accelerator-Level Parallelism,” arXiv preprint arXiv:1907.02064v4 [cs.DC], 2019.

Future applications demand more performance, but technology advances have been faltering. A promising approach to further improve computer system performance under energy constraints is to employ hardware accelerators. Already today, mobile systems concurrently employ multiple accelerators in what we call accelerator-level parallelism (ALP). To spread the benefits of ALP more broadly, we charge computer scientists to develop the science needed to best achieve the performance and cost goals of ALP hardware and software.

J. Leng, A. Buyuktosunoglu, R. Bertran, P. Bose, and V. J. Reddi, “Asymmetric Resilience for Accelerator-Rich Systems,” Computer Architecture Letters, 2019.
Accelerators are becoming popular owing to their exceptional performance and power-efficiency. However, researchers are yet to pay close attention to their reliability, a key challenge as technology scaling makes building reliable systems challenging. A straightforward solution to make accelerators reliable is to design the accelerator from the ground-up to be reliable by itself. However, such a myopic view of the system, where each accelerator is designed in isolation, is unsustainable as the number of integrated accelerators continues to rise in SoCs. To address this challenge, we propose a paradigm called "asymmetric resilience" that avoids accelerator-specific reliability design. Instead, its core principle is to develop the reliable heterogeneous system around the CPU architecture. We explain the implications of architecting such a system and the modifications needed in a heterogeneous system to adopt such an approach. As an example, we demonstrate how to use asymmetric resilience to handle GPU execution errors using the CPU with minimal overhead. The general principles can be extended to include other accelerators.
M. Hill and V. J. Reddi, “Gables: A Roofline Model for Mobile SoCs,” in Proceedings of the 25th International Symposium on High Performance Computer Architecture (HPCA), 2019.

Over a billion mobile consumer system-on-chip (SoC) chipsets ship each year. Of these, the mobile consumer market undoubtedly involving smartphones has a significant market share. Most modern smartphones comprise of advanced SoC architectures that are made up of multiple cores, GPS, and many different programmable and fixed-function accelerators connected via a complex hierarchy of interconnects with the goal of running a dozen or more critical software usecases under strict power, thermal and energy constraints. The steadily growing complexity of a modern SoC challenges hardware computer architects on how best to do early stage ideation. Late SoC design typically relies on detailed full-system simulation once the hardware is specified and accelerator software is written or ported. However, early-stage SoC design must often select accelerators before a single line of software is written. To help frame SoC thinking and guide early stage mobile SoC design, in this paper we contribute the Gables model that refines and retargets the Roofline model—designed originally for the performance and bandwidth limits of a multicore chip—to model each accelerator on a SoC, to apportion work concurrently among different accelerators (justified by our usecase analysis), and calculate a SoC performance upper bound. We evaluate the Gables model with an existing SoC and develop several extensions that allow Gables to inform early stage mobile SoC design.

Index Terms—Accelerator architectures, Mobile computing, Processor architecture, System-on-Chip
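
The abstract above describes modeling each accelerator with a Roofline-style bound, apportioning work among accelerators that run concurrently, and computing an SoC-level upper bound. The following is a minimal sketch of that idea, not the formulation published in the paper: it assumes each unit is limited by the lesser of its peak compute and its bandwidth times operational intensity, that units run fully in parallel, and that all units share one off-chip memory interface. The `Accelerator` class, the function names, and every number are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Accelerator:
    """One compute unit (IP block) on a hypothetical SoC."""
    name: str
    peak_gops: float       # peak compute throughput, Gops/s
    bandwidth_gbs: float   # bandwidth available to this unit, GB/s
    intensity: float       # operational intensity of its work, ops/byte
    work_fraction: float   # share of the total work assigned to this unit


def roofline(unit: Accelerator) -> float:
    """Classic Roofline bound for a single unit, in Gops/s."""
    return min(unit.peak_gops, unit.bandwidth_gbs * unit.intensity)


def soc_upper_bound(units: list[Accelerator], dram_gbs: float) -> float:
    """Upper bound on SoC throughput (Gops/s) with all units running concurrently."""
    active = [u for u in units if u.work_fraction > 0]
    if abs(sum(u.work_fraction for u in active) - 1.0) > 1e-9:
        raise ValueError("work fractions must sum to 1")
    # Units run in parallel, so the unit that takes longest to finish its
    # share of the work limits the throughput of the whole SoC.
    compute_bound = min(roofline(u) / u.work_fraction for u in active)
    # All units share off-chip memory: bytes moved per operation of total work.
    bytes_per_op = sum(u.work_fraction / u.intensity for u in active)
    memory_bound = dram_gbs / bytes_per_op
    return min(compute_bound, memory_bound)


# Hypothetical two-unit SoC: a CPU plus a neural accelerator.
cpu = Accelerator("cpu", peak_gops=40.0, bandwidth_gbs=10.0,
                  intensity=8.0, work_fraction=0.25)
npu = Accelerator("npu", peak_gops=200.0, bandwidth_gbs=20.0,
                  intensity=8.0, work_fraction=0.75)

print(soc_upper_bound([cpu, npu], dram_gbs=30.0))
```

In this made-up configuration the CPU's share of the work is the limiting factor. Lowering `dram_gbs` far enough makes shared memory bandwidth the limit instead, which is the kind of early-stage trade-off such a model is meant to expose.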

V. J. Reddi, H. Yoon, and A. Knies, “Two Billion Devices and Counting,” IEEE Micro, vol. 38, no. 1, pp. 6–21, 2018.

Mobile computing has grown drastically over the past decade. Despite the rapid pace of advancements, mobile device understanding, benchmarking, and evaluation are still in their infancies, both in industry and academia. This article presents an industry perspective on the challenges facing mobile computer architecture, specifically involving mobile workloads, benchmarking, and experimental methodology, with the hope of fostering new research within the community to address pending problems. These challenges pose a threat to the systematic development of future mobile systems, which, if addressed, can elevate the entire mobile ecosystem to the next level.

Mobile devices have come a long way from the first portable cellular phone developed by Motorola in 1973. Most modern smartphones are good enough to replace desktop computers. A smartphone today has enough computing power to be on par with the fastest supercomputers from the 1990s.

For instance, the Qualcomm Adreno 540 GPU found in the latest smartphones has a peak compute capability of more than 500 Gflops, putting it in competition with supercomputers that were on the TOP500 list in the early to mid-1990s. Mobile computing has experienced an unparalleled level of growth over the past decade. At the time of this writing, there are more than 2 billion mobile devices in the world. But perhaps even more importantly, mobile phones are showing no signs of slowing in uptake. In fact, smartphone adoption rates are on the rise. The number of devices is rising as mobile device penetration increases in markets like India and China. It is anticipated that the number of mobile subscribers will grow past 6 billion in the coming years. As Figure 1 shows, while the Western European and North American markets are reaching saturation, the vast majority of growth is coming from countries in Asia. Given that only 35 percent of the world’s population has thus far adopted mobile technology, there is still significant room for growth and innovation.

M. Halpern, Y. Zhu, and V. J. Reddi, “Mobile CPU's Rise to Power: Quantifying the Impact of Generational Mobile CPU Design Trends on Performance, Energy, and User Satisfaction,” in High Performance Computer Architecture (HPCA), 2016 IEEE International Symposium on, 2016, pp. 64–76.

In this paper, we assess the past, present, and future of mobile CPU design. We study how mobile CPU design trends have impacted the end-user, hardware design, and the holistic mobile device. We analyze the evolution of ten cutting-edge mobile CPU designs released over the past seven years. Specifically, we report measured performance, power, energy and user satisfaction trends across mobile CPU generations. A key contribution of our work is that we contextualize the mobile CPU’s evolution in terms of user satisfaction, which has largely been absent from prior mobile hardware studies. To bridge the gap between mobile CPU design and user satisfaction, we construct and conduct a novel crowdsourcing study that spans over 25,000 survey participants using the Amazon Mechanical Turk service. Our methodology allows us to identify what mobile CPU design techniques provide the most benefit to the end-user’s quality of user experience. Our results quantitatively demonstrate that CPUs play a crucial role in modern mobile system-on-chips (SoCs). Over the last seven years, both single- and multicore performance improvements have contributed to end-user satisfaction by reducing user-critical application response latencies. Mobile CPUs aggressively adopted many power-hungry desktop-oriented design techniques to reach these performance levels. Unlike other smartphone components (e.g. display and radio) whose peak power consumption has decreased over time, the mobile CPU’s peak power consumption has steadily increased. As the limits of technology scaling restrict the ability of desktop-like scaling to continue for mobile CPUs, specialized accelerators appear to be a promising alternative that can help sustain the power, performance, and energy improvements that mobile computing necessitates. Such a paradigm shift will redefine the role of the CPU within future SoCs, which merit several design considerations based on our findings.