We have already entered the heterogeneous computing era when computing systems harness computational horsepower from not only general purpose CPUs but also other processors such as graphics processing unit (GPU) and hardware accelerators. Performance, power-efficiency, and reliability are three most critical aspects of processors, and there usually exists a tradeoff among them. Accelerators are heavily optimized for performance and power-efficiency rather than reliability. However, it is equally important to ensure overall reliability while introducing accelerators to computing systems. In this paper, we focus on optimizing accelerator’s reliability without adopting the “whac-a-mole” paradigm which develops accelerator-specific reliability optimization. Instead, we advocate maintaining the reliability at the system level, and propose the design paradigm called “asymmetric resilience,” whose principle is to develop the reliable heterogeneous system centering around the CPU architecture. This generic design paradigm eases accelerators away from reliability optimization. We present the design principles and practices for the heterogeneous system that adopt such design paradigm. Following the principles of asymmetric resilience, we demonstrate how to use CPU architecture to handle GPU execution errors, which allows GPU focus on typical case operation for better energy efficiency. We explore the design space and show that the average overhead is only 1% for error-free execution and the overhead increases linearly with error probability.
our contributions are as follows: • We investigate DSA and characterize the effectiveness of category-awareness. • We conduct a limit study to understand the benefit of applying approximation in a perframe manner with category-awareness (category-aware dynamic DSA). • We present the challenges of harnessing DSA and a proof-of-concept runtime.
Voltage stacking (VS) fundamentally improves power delivery efficiency (PDE) by series-stacking multiple voltage domains to eliminate explicit step-down voltage conversion and reduce energy loss along the power delivery path. However, it suffers from aggravated supply noise, preventing its adoption in mainstream computing systems. In this paper, we investigate a practical approach to enabling efficient and reliable power delivery in voltage-stacked manycore systems that can ensure worst-case supply noise reliability without excessive costly over-design. We start by developing an analytical model to capture the essential noise behaviors in VS. It allows us to identify dominant noise contributor and derive the worst-case conditions. With this in-depth understanding, we propose a hybrid voltage regulation solution to effectively mitigate noise with worst-case guarantees. When evaluated with real-world benchmarks, our solution can achieve 93.8% power delivery efficiency, an improvement of 13.9% over the conventional baseline.
Unmanned Aerial Vehicles (UAVs) are getting closer to becoming ubiquitous in everyday life. Among them, Micro Aerial Vehicles (MAVs) have seen an outburst of attention recently, specifically in the area with a demand for autonomy. A key challenge standing in the way of making MAVs autonomous is that researchers lack the comprehensive understanding of how performance, power, and computational bottlenecks affect MAV applications. MAVs must operate under a stringent power budget, which severely limits their flight endurance time. As such, there is a need for new tools, benchmarks, and methodologies to foster the systematic development of autonomous MAVs. In this paper, we introduce the “MAVBench” framework which consists of a closed-loop simulator and an end-to-end application benchmark suite. A closed-loop simulation platform is needed to probe and understand the intra-system (application data flow) and inter-system (system and environment) interactions in MAV applications to pinpoint bottlenecks and identify opportunities for hardware and software co-design and optimization. In addition to the simulator, MAVBench provides a benchmark suite, the first of its kind, consisting of a variety of MAV applications designed to enable computer architects to perform characterization and develop future aerial computing systems. Using our open source, end-to-end experimental platform, we uncover a hidden, and thus far unexpected compute to total system energy relationship in MAVs. Furthermore, we explore the role of compute by presenting three case studies targeting performance, energy and reliability. These studies confirm that an efficient system design can improve MAV’s battery consumption by up to 1.8X.
Mobile computing has grown drastically over the past decade. Despite the rapid pace of advancements, mobile device understanding, benchmarking, and evaluation are still in their infancies, both in industry and academia. This article presents an industry perspective on the challenges facing mobile computer architecture, specifically involving mobile workloads, benchmarking, and experimental methodology, with the hope of fostering new research within the community to address pending problems. These challenges pose a threat to the systematic development of future mobile systems, which, if addressed, can elevate the entire mobile ecosystem to the next level.
Mobile devices have come a long way from the first portable cellular phone developed by Motorola in 1973. Most modern smartphones are good enough to replace desktop computers. A smartphone today has enough computing power to be on par with the fastest supercomputers from the 1990s.
For instance, the Qualcomm Adreno 540 GPU found in the latest smartphones has a peak compute capability of more than 500 Gflops, putting it in competition with supercomputers that were on the TOP500 list in the early to mid-1990s. Mobile computing has experienced an unparalleled level of growth over the past decade. At the time of this writing, there are more than 2 billion mobile devices in the world.1 But perhaps even more importantly, mobile phones are showing no signs of slowing in uptake. In fact, smartphone adoption rates are on the rise. The number of devices is rising as mobile device penetration increases in markets like India and China. It is anticipated that the number of mobile subscribers will grow past 6 billion in the coming years.2 As Figure 1 shows, while the Western European and North American markets are reaching saturation, the vast majority of growth is coming from countries in Asia. Given that only 35 percent of the world’s population has thus far adopted mobile technology, there is still significant room for growth and innovation.
Unmanned Aerial Vehicles (UAVs) are getting closer to becoming ubiquitous in everyday life. Although the researchers in the robotic domain have made rapid progress in recent years, hardware and software architects in the computer architecture community lack the comprehensive understanding of how performance, power, and computational bottlenecks affect UAV applications. Such an understanding enables system architects to design microchips tailored for aerial agents. This paper is an attempt by computer architects to initiate the discussion between the two academic domains by investigating the underlying compute systems’ impact on aerial robotic applications. To do so, we identify performance and energy constraints and examine the impact of various compute knobs such as processor cores and frequency on these constraints. Our experiment show that such knobs allow for up to 5X speed up for a wide class of applications.
Recent advances in cognitive computing have brought widespread excitement for various machine learning–based intelligent services, ranging from autonomous vehicles to smart traffic-light systems. To push such cognitive services closer to reality, recent research has focused extensively on improving the performance, energy efficiency, privacy, and security of cognitive computing platforms.
Among all the issues, a rapidly rising and critical challenge to address is the practice of safe cognitive computing— that is, how to architect machine learning–based systems to be robust against uncertainty and failure to guarantee that they perform as intended without causing harmful behavior. Addressing the safety issue will involve close collaboration among different computing communities, and we believe computer architects must play a key role. In this position paper, we first discuss the meaning of safety and the severe implications of the safety issue in cognitive computing. We then provide a framework to reason about safety, and we outline several opportunities for the architecture community to help make cognitive computing safer.
Despite being employed in burgeoning eforts to improve power delivery eiciency, integrated voltage regulators (IVRs) have yet to be evaluated in a rigorous, systematic, or quantitative manner. To fulill this need, we present Ivory, a high-level design space exploration tool capable of providing accurate conversion eiciency, static performance characteristics, and dynamic transient responses of an IVR-enabled power delivery subsystem (PDS), enabling rapid trade-of exploration at early design stage, approximately 1000x faster than SPICE simulation. We demonstrate and validate Ivory with a wide spectrum of IVR topologies. In addition, we present a case study using Ivory to reveal the optimal PDS conigurations, with underlying power break-downs and area overheads for the GPU manycore architecture, which has yet to embrace IVRs.
Mobile applications are increasingly being built using web technologies as a common substrate to achieve portability and to improve developer productivity. Unfortunately, web applications often incur large performance overhead, directly affecting the user quality-of-service (QoS) experience. Traditional techniques in improving mobile processor performance have mostly been adopting desktop-like design techniques such as increasing single-core microarchitecture complexity and aggressively integrating more cores. However, such a desktop-oriented strategy is likely coming to an end due to the stringent energy and thermal constraints that mobile devices impose. Therefore, we must pivot away from traditional mobile processor design techniques in order to provide sustainable performance improvement while maintaining energy efficiency. In this article, we propose to combine hardware customization and specialization techniques to improve the performance and energy efficiency of mobile web applications. We first perform design-space exploration (DSE) and identify opportunities in customizing existing general-purpose mobile processors, that is, tuning microarchitecture parameters. The thorough DSE also lets us discover sources of energy inefficiency in customized general-purpose architectures. To mitigate these inefficiencies, we propose, synthesize, and evaluate two new domain-specific specializations, called the Style Resolution Unit and the Browser Engine Cache. Our optimizations boost performance and energy efficiency at the same time while maintaining generalpurpose programmability. As emerging mobile workloads increasingly rely more on web technologies, the type of optimizations we propose will become important in the future and are likely to have a long-lasting and widespread impact.
OUR THIRD INSTALLMENT of Research for Practice brings readings spanning programming languages, compilers, privacy, and the mobile Web. First, Jean Yang provides an overview of how to use information flow techniques to build programs that are secure by construction. As Yang writes, information flow is a conceptually simple “clean idea”: the flow of sensitive information across program variables and control statements can be tracked to determine whether information may in fact leak. Making information flow practical is a major challenge, however. Instead of relying on programmers to track information flow, how can compilers and language runtimes be made to do the heavy lifting? How can application writers easily express their privacy policies and understand the implications of a given policy for the set of values that an application user may see? Yang’s set of papers directly addresses these questions via a clever mix of techniques from compilers, systems, and language design. This focus on theory made practical is an excellent topic for RfP
Energy consumption is a key concern for mobile devices. Prior research has focused on the screen and the network as the major sources of energy consumption. Through carefully designed measurement-based experiments, we show that for certain storage-intensive workloads, the storage subsystem on an Android smartphone consumes a significant amount of energy (36%), on par with screen energy consumption. We analyze the energy consumption of different storage primitives, such as sequential and random writes, on two popular mobile file systems, ext4 and F2FS. In addition, since most Android applications use SQLite for storage, we analyze the energy consumption of different SQLite operations. We present several interesting results from our analysis: for example, random writes consume 15× higher energy than sequential writes, and that F2FS consumes half the energy as ext4 for most workloads. We believe our results contribute useful design guidelines for the developers of energy-efficient mobile file systems.
TEMPERATURE INVERSION IS A TRANSISTOR-LEVEL EFFECT THAT IMPROVES PERFORMANCE WHEN TEMPERATURE INCREASES. THIS ARTICLE PRESENTS A COMPREHENSIVE MEASUREMENT-BASED ANALYSIS OF ITS IMPLICATIONS FOR ARCHITECTURE DESIGN AND POWER MANAGEMENT USING THE AMD A10-8700P PROCESSOR. THE AUTHORS PROPOSE TEMPERATURE-INVERSION STATES (TI -STATES) TO HARNESS THE OPPORTUNITIES PROMISED BY TEMPERATURE INVERSION. THEY EXPECT TI -STATES TO BE ABLE TO IMPROVE THE POWER EFFICIENCY OF MANY PROCESSORS MANUFACTURED IN FUTURE CMOS TECHNOLOGIES.
Parallel GPGPU applications rely on barrier synchronization to align thread block activity. Few prior work has studied and characterized barrier synchronization within a thread block and its impact on performance. In this paper, we find that barriers cause substantial stall cycles in barrier-intensive GPGPU applications although GPGPUs employ lightweight hardware-support barriers. To help investigate the reasons, we define the execution between two adjacent barriers of a thread block as a warp-phase. We find that the execution progress within a warp-phase varies dramatically across warps, which we call warp-phase-divergence. While warp-phasedivergence may result from execution time disparity among warps due to differences in application code or input, and/or shared resource contention, we also pinpoint that warp-phase-divergence may result from warp scheduling.
To mitigate barrier induced stall cycle inefficiency, we propose barrier-aware warp scheduling (BAWS). It combines two techniques to improve the performance of barrier-intensive GPGPU applications. The first technique, most-waiting-first (MWF), assigns a higher scheduling priority to the warps of a thread block that has a larger number of warps waiting at a barrier. The second technique, critical-fetch-first (CFF), fetches instructions from the warp to be issued by MWF in the next cycle. To evaluate the efficiency of BAWS, we consider 13 barrier-intensive GPGPU applications, and we report that BAWS speeds up performance by 17% and 9% on average (and up to 35% and 30%) over loosely-round-robin (LRR) and greedy-then-oldest (GTO) warp scheduling, respectively. We compare BAWS against recent concurrent work SAWS, finding that BAWS outperforms SAWS by 7% on average and up to 27%. For non-barrier-intensive workloads, we demonstrate that BAWS is performance-neutral compared to GTO and SAWS, while improving performance by 5.7% on average (and up to 22%) compared to LRR. BAWS’ hardware co
Cognitive cloud services seek to provide end-users with functionalities that have historically required human intellect to complete. End-users expect these services to be both responsive and accurate, which pose conflicting requirements for service providers. Today’s cloud services deployment schemes follow a “one size fits all” scale-out strategy, where multiple instantiations of the same version of the service are used to scale-out and handle all end-users. Meanwhile, many cognitive services are of a statistical nature where deeper exploration yields more accurate results but also requires more processing time. Finding a single service configuration setting that satisfies the latency and accuracy requirements for the largest number of expected end-user requests can be a challenging task. As a result, cognitive cloud service providers are conservatively configured to maximize the number of enduser requests for which a satisfactory latency-accuracy tradeoff can be achieved. Using a production-grade Automatic Speech Recognition cloud service as a representative example to study, we demonstrate the inefficiencies of this single version approach and propose a new service node multi-versioning deployment scheme for cognitive services instead. We present an oracle-based limit study where we show that service node multi-versioning can provide a 2.5X reduction in execution time together with a 24% improvement in accuracy over a traditional single version deployment scheme. We also discuss several design considerations to address when implementing service node multi-versioning.
Hardware-based malware detectors (HMDs) are a key emerging technology to build trustworthy computing platforms, especially mobile platforms. Quantifying the efficacy of HMDs against malicious adversaries is thus an important problem. The challenge lies in that real-world malware typically adapts to defenses, evades being run in experimental settings, and hides behind benign applications. Thus, realizing the potential of HMDs as a line of defense – that has a small and battery-efficient code base – requires a rigorous foundation for evaluating HMDs. To this end, we introduce EMMA—a platform to evaluate the efficacy of HMDs for mobile platforms. EMMA deconstructs malware into atomic, orthogonal actions and introduces a systematic way of pitting different HMDs against a diverse subset of malware hidden inside benign applications. EMMA drives both malware and benign programs with real user-inputs to yield an HMD’s effective operating range— i.e., the malware actions a particular HMD is capable of detecting. We show that small atomic actions, such as stealing a Contact or SMS, have surprisingly large hardware footprints, and use this insight to design HMD algorithms that are less intrusive than prior work and yet perform 24.7% better. Finally, EMMA brings up a surprising new result— obfuscation techniques used by malware to evade static analyses makes them more detectable using HMDs.
Web computing is gradually shifting toward mobile devices, in which the energy budget is severely constrained. As a result, Web developers must be conscious of energy efficiency. However, current Web languages provide developers little control over energy consumption. In this paper, we take a first step toward language-level research to enable energy-efficient Web computing. Our key motivation is that mobile systems can wisely budget energy usage if informed with user quality-of-service (QoS) constraints. To do this, programmers need new abstractions. We propose two language abstractions, QoS type and QoS target, to capture two fundamental aspects of user QoS experience. We then present GreenWeb, a set of language extensions that empower developers to easily express the QoS abstractions as program annotations. As a proof of concept, we develop a GreenWeb runtime, which intelligently determines how to deliver specified user QoS expectation while minimizing energy consumption. Overall, GreenWeb shows significant energy savings (29.2% ⇠ 66.0%) over Android’s default Interactive governor with few QoS violations. Our work demonstrates a promising first step toward language innovations for energy-efficient Web computing. Categories and Subject Descriptors D.3.2 [Programming Language]: Language Classifications–Specialized application languages; D.3.3 [Programming Language]: Language Constructs and Features–Constraints Keywords Energy-efficiency, Web, Mobile computing