J. Leng, A. Buyuktosunoglu, R. Bertran, P. Bose, and V. J. Reddi, Asymmetric Resilience: Rethinking Reliability for Accelerator-Rich Systems. IBM, 2018.
We have already entered the heterogeneous computing era, in which computing systems harness computational horsepower not only from general-purpose CPUs but also from other processors such as graphics processing units (GPUs) and hardware accelerators. Performance, power efficiency, and reliability are the three most critical aspects of processors, and there usually exists a tradeoff among them. Accelerators are heavily optimized for performance and power efficiency rather than reliability. However, it is equally important to ensure overall reliability while introducing accelerators to computing systems. In this paper, we focus on optimizing accelerators’ reliability without adopting the “whac-a-mole” paradigm, which develops accelerator-specific reliability optimizations. Instead, we advocate maintaining reliability at the system level, and propose a design paradigm called “asymmetric resilience,” whose principle is to develop the reliable heterogeneous system centering around the CPU architecture. This generic design paradigm frees accelerators from reliability optimization. We present the design principles and practices for heterogeneous systems that adopt such a design paradigm. Following the principles of asymmetric resilience, we demonstrate how to use the CPU architecture to handle GPU execution errors, which allows the GPU to focus on typical-case operation for better energy efficiency. We explore the design space and show that the average overhead is only 1% for error-free execution and that the overhead increases linearly with error probability.
T.-W. Chin, C.-L. Yu, M. Halpern, H. Genc, S.-L. Tsao, and V. J. Reddi, “Domain-Specific Approximation for Object Detection,” IEEE Micro, vol. 38, no. 1, pp. 31–40, 2018.

In summary, our contributions are as follows:
• We investigate domain-specific approximation (DSA) and characterize the effectiveness of category-awareness.
• We conduct a limit study to understand the benefit of applying approximation in a per-frame manner with category-awareness (category-aware dynamic DSA).
• We present the challenges of harnessing DSA and a proof-of-concept runtime.

A. Zou, J. Leng, X. He, Y. Zu, V. J. Reddi, and X. Zhang, “Efficient and Reliable Power Delivery in Voltage-Stacked Manycore System With Hybrid Charge-Recycling Regulators,” in 55th ACM/ESDA/IEEE Design Automation Conference (DAC), 2018, pp. 1–6.

Voltage stacking (VS) fundamentally improves power delivery efficiency (PDE) by series-stacking multiple voltage domains to eliminate explicit step-down voltage conversion and reduce energy loss along the power delivery path. However, it suffers from aggravated supply noise, preventing its adoption in mainstream computing systems. In this paper, we investigate a practical approach to enabling efficient and reliable power delivery in voltage-stacked manycore systems that can ensure worst-case supply noise reliability without excessive, costly over-design. We start by developing an analytical model to capture the essential noise behaviors in VS. It allows us to identify the dominant noise contributor and derive the worst-case conditions. With this in-depth understanding, we propose a hybrid voltage regulation solution to effectively mitigate noise with worst-case guarantees. When evaluated with real-world benchmarks, our solution can achieve 93.8% power delivery efficiency, an improvement of 13.9% over the conventional baseline.

B. Boroujerdian, H. Genc, S. Krishnan, W. Cui, A. Faust, and V. J. Reddi, “MAVBench: Micro Aerial Vehicle Benchmarking,” in Proceedings of the International Symposium on Microarchitecture (MICRO), 2018.

Unmanned Aerial Vehicles (UAVs) are getting closer to becoming ubiquitous in everyday life. Among them, Micro Aerial Vehicles (MAVs) have seen an outburst of attention recently, specifically in the area with a demand for autonomy. A key challenge standing in the way of making MAVs autonomous is that researchers lack the comprehensive understanding of how performance, power, and computational bottlenecks affect MAV applications. MAVs must operate under a stringent power budget, which severely limits their flight endurance time. As such, there is a need for new tools, benchmarks, and methodologies to foster the systematic development of autonomous MAVs. In this paper, we introduce the “MAVBench” framework which consists of a closed-loop simulator and an end-to-end application benchmark suite. A closed-loop simulation platform is needed to probe and understand the intra-system (application data flow) and inter-system (system and environment) interactions in MAV applications to pinpoint bottlenecks and identify opportunities for hardware and software co-design and optimization. In addition to the simulator, MAVBench provides a benchmark suite, the first of its kind, consisting of a variety of MAV applications designed to enable computer architects to perform characterization and develop future aerial computing systems. Using our open source, end-to-end experimental platform, we uncover a hidden, and thus far unexpected compute to total system energy relationship in MAVs. Furthermore, we explore the role of compute by presenting three case studies targeting performance, energy and reliability. These studies confirm that an efficient system design can improve MAV’s battery consumption by up to 1.8X.

V. J. Reddi, “Mobile SoCs: The Wild West of Domain Specific Architectures,” SIGARCH Computer Architecture Today, 2018.
V. J. Reddi, H. Yoon, and A. Knies, “Two Billion Devices and Counting,” IEEE Micro, vol. 38, no. 1, pp. 6–21, 2018.

Mobile computing has grown drastically over the past decade. Despite the rapid pace of advancements, mobile device understanding, benchmarking, and evaluation are still in their infancies, both in industry and academia. This article presents an industry perspective on the challenges facing mobile computer architecture, specifically involving mobile workloads, benchmarking, and experimental methodology, with the hope of fostering new research within the community to address pending problems. These challenges pose a threat to the systematic development of future mobile systems, which, if addressed, can elevate the entire mobile ecosystem to the next level.

Mobile devices have come a long way from the first portable cellular phone developed by Motorola in 1973. Most modern smartphones are good enough to replace desktop computers. A smartphone today has enough computing power to be on par with the fastest supercomputers from the 1990s.

For instance, the Qualcomm Adreno 540 GPU found in the latest smartphones has a peak compute capability of more than 500 Gflops, putting it in competition with supercomputers that were on the TOP500 list in the early to mid-1990s. Mobile computing has experienced an unparalleled level of growth over the past decade. At the time of this writing, there are more than 2 billion mobile devices in the world.1 But perhaps even more importantly, mobile phones are showing no signs of slowing in uptake. In fact, smartphone adoption rates are on the rise. The number of devices is rising as mobile device penetration increases in markets like India and China. It is anticipated that the number of mobile subscribers will grow past 6 billion in the coming years.2 As Figure 1 shows, while the Western European and North American markets are reaching saturation, the vast majority of growth is coming from countries in Asia. Given that only 35 percent of the world’s population has thus far adopted mobile technology, there is still significant room for growth and innovation.

B. Boroujerdian, H. Genc, S. Krishnan, A. Faust, and V. J. Reddi, “Why Compute Matters for UAV Energy Efficiency?” in 2nd International Symposium on Aerial Robotics, 2018, no. 6.

Unmanned Aerial Vehicles (UAVs) are getting closer to becoming ubiquitous in everyday life. Although researchers in the robotics domain have made rapid progress in recent years, hardware and software architects in the computer architecture community lack a comprehensive understanding of how performance, power, and computational bottlenecks affect UAV applications. Such an understanding enables system architects to design microchips tailored for aerial agents. This paper is an attempt by computer architects to initiate the discussion between the two academic domains by investigating the underlying compute systems’ impact on aerial robotic applications. To do so, we identify performance and energy constraints and examine the impact of various compute knobs, such as processor cores and frequency, on these constraints. Our experiments show that such knobs allow for up to a 5X speedup for a wide class of applications.

Y. Zhu, et al., “Cognitive Computing Safety: The New Horizon for Reliability/The Design and Evolution of Deep Learning Workloads,” IEEE Micro, no. 1, pp. 15–21, 2017.

Recent advances in cognitive computing have brought widespread excitement for various machine learning–based intelligent services, ranging from autonomous vehicles to smart traffic-light systems. To push such cognitive services closer to reality, recent research has focused extensively on improving the performance, energy efficiency, privacy, and security of cognitive computing platforms.

Among all the issues, a rapidly rising and critical challenge to address is the practice of safe cognitive computing, that is, how to architect machine learning–based systems to be robust against uncertainty and failure to guarantee that they perform as intended without causing harmful behavior. Addressing the safety issue will involve close collaboration among different computing communities, and we believe computer architects must play a key role. In this position paper, we first discuss the meaning of safety and the severe implications of the safety issue in cognitive computing. We then provide a framework to reason about safety, and we outline several opportunities for the architecture community to help make cognitive computing safer.

V. J. Reddi, “A Decade of Mobile Computing,” SIGARCH Computer Architecture Today Blog, 2017.
H. Genc, Y. Zu, T.-W. Chin, M. Halpern, and V. J. Reddi, “Flying IoT: Toward Low-Power Vision in the Sky,” IEEE Micro, vol. 37, no. 6, pp. 40–51, 2017.
A. Zou, et al., “Ivory: Early-Stage Design Space Exploration Tool for Integrated Voltage Regulators,” in Proceedings of the 54th Annual Design Automation Conference (DAC), 2017, pp. 1.

Despite being employed in burgeoning efforts to improve power delivery efficiency, integrated voltage regulators (IVRs) have yet to be evaluated in a rigorous, systematic, or quantitative manner. To fulfill this need, we present Ivory, a high-level design space exploration tool capable of providing accurate conversion efficiency, static performance characteristics, and dynamic transient responses of an IVR-enabled power delivery subsystem (PDS), enabling rapid trade-off exploration at early design stage, approximately 1000x faster than SPICE simulation. We demonstrate and validate Ivory with a wide spectrum of IVR topologies. In addition, we present a case study using Ivory to reveal the optimal PDS configurations, with underlying power break-downs and area overheads for the GPU manycore architecture, which has yet to embrace IVRs.


Y. Zhu and V. J. Reddi, “Optimizing General-Purpose CPUs for Energy-Efficient Mobile Web Computing,” ACM Transactions on Computer Systems (TOCS), vol. 35, no. 1, pp. 1, 2017.

Mobile applications are increasingly being built using web technologies as a common substrate to achieve portability and to improve developer productivity. Unfortunately, web applications often incur large performance overhead, directly affecting the user quality-of-service (QoS) experience. Traditional approaches to improving mobile processor performance have mostly adopted desktop-like design techniques, such as increasing single-core microarchitecture complexity and aggressively integrating more cores. However, such a desktop-oriented strategy is likely coming to an end due to the stringent energy and thermal constraints that mobile devices impose. Therefore, we must pivot away from traditional mobile processor design techniques in order to provide sustainable performance improvement while maintaining energy efficiency. In this article, we propose to combine hardware customization and specialization techniques to improve the performance and energy efficiency of mobile web applications. We first perform design-space exploration (DSE) and identify opportunities in customizing existing general-purpose mobile processors, that is, tuning microarchitecture parameters. The thorough DSE also lets us discover sources of energy inefficiency in customized general-purpose architectures. To mitigate these inefficiencies, we propose, synthesize, and evaluate two new domain-specific specializations, called the Style Resolution Unit and the Browser Engine Cache. Our optimizations boost performance and energy efficiency at the same time while maintaining general-purpose programmability. As emerging mobile workloads rely increasingly on web technologies, the type of optimizations we propose will become important in the future and are likely to have a long-lasting and widespread impact.


V. J. Reddi and Y. Zhu, “Research for Practice: Web Security and Mobile Web Computing,” Communications of the ACM (CACM), 2017.

Our third installment of Research for Practice brings readings spanning programming languages, compilers, privacy, and the mobile Web. First, Jean Yang provides an overview of how to use information flow techniques to build programs that are secure by construction. As Yang writes, information flow is a conceptually simple “clean idea”: the flow of sensitive information across program variables and control statements can be tracked to determine whether information may in fact leak. Making information flow practical is a major challenge, however. Instead of relying on programmers to track information flow, how can compilers and language runtimes be made to do the heavy lifting? How can application writers easily express their privacy policies and understand the implications of a given policy for the set of values that an application user may see? Yang’s set of papers directly addresses these questions via a clever mix of techniques from compilers, systems, and language design. This focus on theory made practical is an excellent topic for RfP.


J. Mohan, D. Purohith, M. Halpern, V. Chidambaram, and V. J. Reddi, “Storage on Your Smartphone Uses More Energy Than You Think,” USENIX HotStorage. 2017.

Energy consumption is a key concern for mobile devices. Prior research has focused on the screen and the network as the major sources of energy consumption. Through carefully designed measurement-based experiments, we show that for certain storage-intensive workloads, the storage subsystem on an Android smartphone consumes a significant amount of energy (36%), on par with screen energy consumption. We analyze the energy consumption of different storage primitives, such as sequential and random writes, on two popular mobile file systems, ext4 and F2FS. In addition, since most Android applications use SQLite for storage, we analyze the energy consumption of different SQLite operations. We present several interesting results from our analysis: for example, random writes consume 15× more energy than sequential writes, and F2FS consumes half as much energy as ext4 for most workloads. We believe our results contribute useful design guidelines for the developers of energy-efficient mobile file systems.


Y. Zu, W. Huang, I. Paul, and V. J. Reddi, “Ti-States: Power Management in Active Timing Margin Processors,” IEEE Micro, vol. 37, no. 3, pp. 106–114, 2017.


Y. Liu, et al., “Barrier-Aware Warp Scheduling for Throughput Processors,” in Proceedings of the 2016 International Conference on Supercomputing, 2016, pp. 42.

Parallel GPGPU applications rely on barrier synchronization to align thread block activity. Little prior work has studied and characterized barrier synchronization within a thread block and its impact on performance. In this paper, we find that barriers cause substantial stall cycles in barrier-intensive GPGPU applications, although GPGPUs employ lightweight hardware-supported barriers. To help investigate the reasons, we define the execution between two adjacent barriers of a thread block as a warp-phase. We find that the execution progress within a warp-phase varies dramatically across warps, which we call warp-phase-divergence. While warp-phase-divergence may result from execution time disparity among warps due to differences in application code or input, and/or shared resource contention, we also pinpoint that warp-phase-divergence may result from warp scheduling.

To mitigate barrier induced stall cycle inefficiency, we propose barrier-aware warp scheduling (BAWS). It combines two techniques to improve the performance of barrier-intensive GPGPU applications. The first technique, most-waiting-first (MWF), assigns a higher scheduling priority to the warps of a thread block that has a larger number of warps waiting at a barrier. The second technique, critical-fetch-first (CFF), fetches instructions from the warp to be issued by MWF in the next cycle. To evaluate the efficiency of BAWS, we consider 13 barrier-intensive GPGPU applications, and we report that BAWS speeds up performance by 17% and 9% on average (and up to 35% and 30%) over loosely-round-robin (LRR) and greedy-then-oldest (GTO) warp scheduling, respectively. We compare BAWS against recent concurrent work SAWS, finding that BAWS outperforms SAWS by 7% on average and up to 27%. For non-barrier-intensive workloads, we demonstrate that BAWS is performance-neutral compared to GTO and SAWS, while improving performance by 5.7% on average (and up to 22%) compared to LRR. BAWS’ hardware co
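The most-waiting-first (MWF) policy described above can be illustrated with a minimal software sketch. This is not the authors' hardware design: the `Warp` record, the `mwf_pick` function, and the tie-breaking rule are hypothetical names and choices made for illustration, and the critical-fetch-first (CFF) stage is left out.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Warp:
    warp_id: int
    block_id: int
    at_barrier: bool  # True if the warp is stalled waiting at a barrier


def mwf_pick(warps):
    """Pick the ready warp whose thread block has the most warps waiting at a barrier.

    Returns None when every warp is waiting at a barrier.
    """
    # Count, per thread block, how many warps are currently waiting at a barrier.
    waiting = Counter(w.block_id for w in warps if w.at_barrier)
    ready = [w for w in warps if not w.at_barrier]
    if not ready:
        return None
    # Larger waiting count means higher priority; ties go to the lower warp_id
    # (an arbitrary choice made for this sketch).
    return min(ready, key=lambda w: (-waiting[w.block_id], w.warp_id))


if __name__ == "__main__":
    warps = [
        Warp(0, 0, True), Warp(1, 0, True), Warp(2, 0, False),
        Warp(3, 1, True), Warp(4, 1, False), Warp(5, 1, False),
    ]
    print(mwf_pick(warps).warp_id)
```

In this example, block 0 has two warps waiting and block 1 has one, so the remaining ready warp of block 0 is prioritized, which lets that block clear its barrier sooner.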

M. Halpern, T. Mummert, M. Novak, E. Duesterwald, and V. J. Reddi, “The Case for Node Multi-Versioning in Cognitive Cloud Services: Achieving Responsiveness and Accuracy at Datacenter Scale,” Workshop on Cognitive Architectures (CogArch). 2016.

Cognitive cloud services seek to provide end-users with functionalities that have historically required human intellect to complete. End-users expect these services to be both responsive and accurate, which poses conflicting requirements for service providers. Today’s cloud services deployment schemes follow a “one size fits all” scale-out strategy, where multiple instantiations of the same version of the service are used to scale out and handle all end-users. Meanwhile, many cognitive services are of a statistical nature where deeper exploration yields more accurate results but also requires more processing time. Finding a single service configuration setting that satisfies the latency and accuracy requirements for the largest number of expected end-user requests can be a challenging task. As a result, cognitive cloud service providers are conservatively configured to maximize the number of end-user requests for which a satisfactory latency-accuracy tradeoff can be achieved. Using a production-grade Automatic Speech Recognition cloud service as a representative example to study, we demonstrate the inefficiencies of this single version approach and propose a new service node multi-versioning deployment scheme for cognitive services instead. We present an oracle-based limit study where we show that service node multi-versioning can provide a 2.5X reduction in execution time together with a 24% improvement in accuracy over a traditional single version deployment scheme. We also discuss several design considerations to address when implementing service node multi-versioning.
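The idea behind an oracle-based selection among service versions can be sketched as follows. This is an illustrative assumption of how such an oracle might choose, not the study's actual methodology: the function `oracle_pick`, the version names, and all latency and accuracy numbers are made up for the example.

```python
def oracle_pick(outcomes, min_accuracy):
    """Choose a service version for one request with perfect knowledge of outcomes.

    outcomes maps a version name to a (latency_seconds, accuracy) pair.
    Returns the fastest version that meets min_accuracy; if none does,
    falls back to the most accurate version.
    """
    acceptable = {v: o for v, o in outcomes.items() if o[1] >= min_accuracy}
    if acceptable:
        return min(acceptable, key=lambda v: acceptable[v][0])
    return max(outcomes, key=lambda v: outcomes[v][1])


if __name__ == "__main__":
    # Hypothetical per-request outcomes for three versions of the same service.
    outcomes = {
        "shallow": (0.2, 0.80),
        "medium": (0.5, 0.90),
        "deep": (1.2, 0.95),
    }
    print(oracle_pick(outcomes, min_accuracy=0.85))
```

A single-version deployment would have to run one fixed configuration for every request, whereas this per-request choice only pays for deeper exploration when the request needs it.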

M. Kazdagli, L. Huang, V. J. Reddi, and M. Tiwari, “EMMA: A New Platform to Evaluate Hardware-based Mobile Malware Analyses,” arXiv preprint arXiv:1603.03086, 2016.

Hardware-based malware detectors (HMDs) are a key emerging technology to build trustworthy computing platforms, especially mobile platforms. Quantifying the efficacy of HMDs against malicious adversaries is thus an important problem. The challenge lies in that real-world malware typically adapts to defenses, evades being run in experimental settings, and hides behind benign applications. Thus, realizing the potential of HMDs as a line of defense that has a small and battery-efficient code base requires a rigorous foundation for evaluating HMDs. To this end, we introduce EMMA, a platform to evaluate the efficacy of HMDs for mobile platforms. EMMA deconstructs malware into atomic, orthogonal actions and introduces a systematic way of pitting different HMDs against a diverse subset of malware hidden inside benign applications. EMMA drives both malware and benign programs with real user-inputs to yield an HMD’s effective operating range, i.e., the malware actions a particular HMD is capable of detecting. We show that small atomic actions, such as stealing a Contact or SMS, have surprisingly large hardware footprints, and use this insight to design HMD algorithms that are less intrusive than prior work and yet perform 24.7% better. Finally, EMMA brings up a surprising new result: obfuscation techniques used by malware to evade static analyses make it more detectable using HMDs.

Y. Zhu and V. J. Reddi, “GreenWeb: Language Extensions for Energy-Efficient Mobile Web Computing,” in Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation, 2016, vol. 51, no. 6, pp. 145-160.

Web computing is gradually shifting toward mobile devices, in which the energy budget is severely constrained. As a result, Web developers must be conscious of energy efficiency. However, current Web languages provide developers little control over energy consumption. In this paper, we take a first step toward language-level research to enable energy-efficient Web computing. Our key motivation is that mobile systems can wisely budget energy usage if informed with user quality-of-service (QoS) constraints. To do this, programmers need new abstractions. We propose two language abstractions, QoS type and QoS target, to capture two fundamental aspects of user QoS experience. We then present GreenWeb, a set of language extensions that empower developers to easily express the QoS abstractions as program annotations. As a proof of concept, we develop a GreenWeb runtime, which intelligently determines how to deliver specified user QoS expectation while minimizing energy consumption. Overall, GreenWeb shows significant energy savings (29.2% to 66.0%) over Android’s default Interactive governor with few QoS violations. Our work demonstrates a promising first step toward language innovations for energy-efficient Web computing.

Categories and Subject Descriptors: D.3.2 [Programming Language]: Language Classifications–Specialized application languages; D.3.3 [Programming Language]: Language Constructs and Features–Constraints

Keywords: Energy-efficiency, Web, Mobile computing

M. Halpern, Y. Zhu, and V. J. Reddi, “Mobile CPU's Rise to Power: Quantifying the Impact of Generational Mobile CPU Design Trends on Performance, Energy, and User Satisfaction,” in High Performance Computer Architecture (HPCA), 2016 IEEE International Symposium on, 2016, pp. 64–76.

In this paper, we assess the past, present, and future of mobile CPU design. We study how mobile CPU design trends have impacted the end-user, hardware design, and the holistic mobile device. We analyze the evolution of ten cutting-edge mobile CPU designs released over the past seven years. Specifically, we report measured performance, power, energy and user satisfaction trends across mobile CPU generations. A key contribution of our work is that we contextualize the mobile CPU’s evolution in terms of user satisfaction, which has largely been absent from prior mobile hardware studies. To bridge the gap between mobile CPU design and user satisfaction, we construct and conduct a novel crowdsourcing study that spans over 25,000 survey participants using the Amazon Mechanical Turk service. Our methodology allows us to identify what mobile CPU design techniques provide the most benefit to the end-user’s quality of user experience. Our results quantitatively demonstrate that CPUs play a crucial role in modern mobile system-on-chips (SoCs). Over the last seven years, both single- and multicore performance improvements have contributed to end-user satisfaction by reducing user-critical application response latencies. Mobile CPUs aggressively adopted many power-hungry desktop-oriented design techniques to reach these performance levels. Unlike other smartphone components (e.g. display and radio) whose peak power consumption has decreased over time, the mobile CPU’s peak power consumption has steadily increased. As the limits of technology scaling restrict the ability of desktop-like scaling to continue for mobile CPUs, specialized accelerators appear to be a promising alternative that can help sustain the power, performance, and energy improvements that mobile computing necessitates. Such a paradigm shift will redefine the role of the CPU within future SoCs, which merits several design considerations based on our findings.