Publications

2016
M. Kazdagli, L. Huang, V. J. Reddi, and M. Tiwari, “EMMA: A New Platform to Evaluate Hardware-based Mobile Malware Analyses,” arXiv preprint arXiv:1603.03086, 2016.

Hardware-based malware detectors (HMDs) are a key emerging technology to build trustworthy computing platforms, especially mobile platforms. Quantifying the efficacy of HMDs against malicious adversaries is thus an important problem. The challenge lies in that real-world malware typically adapts to defenses, evades being run in experimental settings, and hides behind benign applications. Thus, realizing the potential of HMDs as a line of defense – one with a small and battery-efficient code base – requires a rigorous foundation for evaluating HMDs. To this end, we introduce EMMA—a platform to evaluate the efficacy of HMDs for mobile platforms. EMMA deconstructs malware into atomic, orthogonal actions and introduces a systematic way of pitting different HMDs against a diverse subset of malware hidden inside benign applications. EMMA drives both malware and benign programs with real user inputs to yield an HMD’s effective operating range—i.e., the malware actions a particular HMD is capable of detecting. We show that small atomic actions, such as stealing a Contact or SMS, have surprisingly large hardware footprints, and use this insight to design HMD algorithms that are less intrusive than prior work and yet perform 24.7% better. Finally, EMMA brings up a surprising new result—obfuscation techniques used by malware to evade static analyses make the malware more detectable using HMDs.

Paper
Y. Zhu and V. J. Reddi, “GreenWeb: Language Extensions for Energy-Efficient Mobile Web Computing,” in Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation, 2016, vol. 51, no. 6, pp. 145–160. Publisher's Version

Web computing is gradually shifting toward mobile devices, in which the energy budget is severely constrained. As a result, Web developers must be conscious of energy efficiency. However, current Web languages provide developers little control over energy consumption. In this paper, we take a first step toward language-level research to enable energy-efficient Web computing. Our key motivation is that mobile systems can wisely budget energy usage if informed with user quality-of-service (QoS) constraints. To do this, programmers need new abstractions. We propose two language abstractions, QoS type and QoS target, to capture two fundamental aspects of user QoS experience. We then present GreenWeb, a set of language extensions that empower developers to easily express the QoS abstractions as program annotations. As a proof of concept, we develop a GreenWeb runtime, which intelligently determines how to deliver the specified user QoS expectation while minimizing energy consumption. Overall, GreenWeb shows significant energy savings (29.2% to 66.0%) over Android’s default Interactive governor with few QoS violations. Our work demonstrates a promising first step toward language innovations for energy-efficient Web computing.

Categories and Subject Descriptors: D.3.2 [Programming Languages]: Language Classifications–Specialized application languages; D.3.3 [Programming Languages]: Language Constructs and Features–Constraints

Keywords: Energy-efficiency, Web, Mobile computing
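
A minimal sketch of the runtime decision GreenWeb's annotations enable (hypothetical names and numbers throughout; `QoSTarget`, the frequency table, and the latency model are all invented, and the real extensions are expressed as web-language annotations rather than Python): given an annotated event's QoS type and target, pick the lowest CPU frequency whose predicted latency still meets the target.

```python
# Hypothetical sketch of a GreenWeb-style runtime decision; the real
# GreenWeb abstractions are program annotations, not this Python API.
from dataclasses import dataclass

@dataclass
class QoSTarget:
    kind: str          # "single" (responsiveness) or "continuous" (frame rate)
    latency_ms: float  # user-perceivable deadline for the annotated event

# Assumed platform table: available CPU frequencies (GHz), low to high.
FREQS_GHZ = [0.6, 1.0, 1.4, 1.9]

def predicted_latency_ms(work_cycles: float, freq_ghz: float) -> float:
    """Toy latency model: latency scales inversely with frequency."""
    return work_cycles / (freq_ghz * 1e6)

def pick_frequency(target: QoSTarget, work_cycles: float) -> float:
    """Choose the slowest (lowest-power) frequency that meets the QoS target."""
    for freq in FREQS_GHZ:
        if predicted_latency_ms(work_cycles, freq) <= target.latency_ms:
            return freq
    return FREQS_GHZ[-1]  # deadline unmeetable: run flat out

# Example: a tap handler annotated with a 100 ms responsiveness target.
print(pick_frequency(QoSTarget("single", 100.0), work_cycles=8e7))  # -> 1.0
```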

Paper
M. Halpern, Y. Zhu, and V. J. Reddi, “Mobile CPU’s Rise to Power: Quantifying the Impact of Generational Mobile CPU Design Trends on Performance, Energy, and User Satisfaction,” in High Performance Computer Architecture (HPCA), 2016 IEEE International Symposium on, 2016, pp. 64–76. Publisher's Version

In this paper, we assess the past, present, and future of mobile CPU design. We study how mobile CPU design trends have impacted the end-user, hardware design, and the holistic mobile device. We analyze the evolution of ten cutting-edge mobile CPU designs released over the past seven years. Specifically, we report measured performance, power, energy and user satisfaction trends across mobile CPU generations. A key contribution of our work is that we contextualize the mobile CPU’s evolution in terms of user satisfaction, which has largely been absent from prior mobile hardware studies. To bridge the gap between mobile CPU design and user satisfaction, we construct and conduct a novel crowdsourcing study that spans over 25,000 survey participants using the Amazon Mechanical Turk service. Our methodology allows us to identify which mobile CPU design techniques provide the most benefit to the end-user’s quality of user experience. Our results quantitatively demonstrate that CPUs play a crucial role in modern mobile system-on-chips (SoCs). Over the last seven years, both single- and multicore performance improvements have contributed to end-user satisfaction by reducing user-critical application response latencies. Mobile CPUs aggressively adopted many power-hungry desktop-oriented design techniques to reach these performance levels. Unlike other smartphone components (e.g., display and radio) whose peak power consumption has decreased over time, the mobile CPU’s peak power consumption has steadily increased. As the limits of technology scaling restrict the ability of desktop-like scaling to continue for mobile CPUs, specialized accelerators appear to be a promising alternative that can help sustain the power, performance, and energy improvements that mobile computing necessitates. Such a paradigm shift will redefine the role of the CPU within future SoCs and merits several design considerations based on our findings.

Paper
M. Kazdagli, V. J. Reddi, and M. Tiwari, “Quantifying and Improving the Efficiency of Hardware-Based Mobile Malware Detectors,” in The 49th Annual IEEE/ACM International Symposium on Microarchitecture, 2016, p. 37. Publisher's Version

Hardware-based malware detectors (HMDs) are a key emerging technology to build trustworthy systems, especially mobile platforms. Quantifying the efficacy of HMDs against malicious adversaries is thus an important problem. The challenge lies in that real-world malware adapts to defenses, evades being run in experimental settings, and hides behind benign applications. Thus, realizing the potential of HMDs as a small and battery-efficient line of defense requires a rigorous foundation for evaluating HMDs. We introduce Sherlock—a white-box methodology that quantifies an HMD’s ability to detect malware and identify the reason why. Sherlock first deconstructs malware into atomic, orthogonal actions to synthesize a diverse malware suite. Sherlock then drives both malware and benign programs with real user inputs, and compares their executions to determine an HMD’s operating range, i.e., the smallest malware actions an HMD can detect. We show three case studies using Sherlock not only to quantify HMDs’ operating ranges but also to design better detectors. First, using information about concrete malware actions, we build a discrete wavelet transform-based unsupervised HMD that outperforms prior work based on power transforms by 24.7% (AUC metric). Second, training a supervised HMD using Sherlock’s diverse malware dataset yields HMDs that are 12.5% better than past approaches that train on ad-hoc subsets of malware. Finally, Sherlock shows why a malware instance is detectable. This yields a surprising new result—obfuscation techniques used by malware to evade static analyses make the malware more detectable using HMDs.
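
To make the detector style concrete, here is a minimal sketch of a wavelet-based unsupervised HMD (an illustration, not the paper's algorithm: the one-level Haar transform, the 3-sigma threshold, and the synthetic counter traces are all assumptions):

```python
import numpy as np

def haar_dwt_energy(signal: np.ndarray) -> float:
    """One-level Haar DWT; return the energy of the detail coefficients."""
    s = signal[: len(signal) // 2 * 2]           # truncate to even length
    detail = (s[0::2] - s[1::2]) / np.sqrt(2.0)  # high-pass half of the Haar pair
    return float(np.sum(detail ** 2))

def fit_baseline(benign_windows):
    """Model benign behavior as the mean/std of detail-coefficient energy."""
    e = np.array([haar_dwt_energy(w) for w in benign_windows])
    return e.mean(), e.std() + 1e-9

def is_anomalous(window, mean, std, k=3.0):
    """Flag a window whose detail energy is k sigmas away from benign."""
    return abs(haar_dwt_energy(window) - mean) > k * std

rng = np.random.default_rng(0)
benign = [rng.normal(100, 5, 256) for _ in range(50)]  # e.g., per-interval counter samples
mean, std = fit_baseline(benign)
spiky = rng.normal(100, 5, 256)
spiky[::16] += 80                                      # bursty "malware-like" activity
print(is_anomalous(spiky, mean, std))                  # True
```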

Paper
J. Yang, V. J. Reddi, Y. Zhu, and P. Bailis, “Research for Practice: Web Security and Mobile Web Computing,” ACM Queue, vol. 14, no. 4, p. 80, 2016. Publisher's Version
N. Chachmon, D. Richins, R. Cohn, M. Christensson, W. Cui, and V. J. Reddi, “Simulation and Analysis Engine for Scale-Out Workloads,” in Proceedings of the 2016 International Conference on Supercomputing (ICS), 2016, p. 22. Publisher's Version

We introduce a system-level Simulation and Analysis Engine (SAE) framework based on dynamic binary instrumentation for fine-grained and customizable instruction-level introspection of everything that executes on the processor. SAE can instrument the BIOS, kernel, drivers, and user processes. It can also instrument multiple systems simultaneously using a single instrumentation interface, which is essential for studying scale-out applications. SAE is an x86 instruction set simulator designed specifically to enable rapid prototyping, evaluation, and validation of architectural extensions and program analysis tools using its flexible APIs. It is fast enough to execute full platform workloads—a modern operating system can boot in a few minutes—thus enabling research, evaluation, and validation of complex functionalities related to multicore configurations, virtualization, security, and more. To reach high speeds, SAE couples tightly with a virtual platform and employs both a just-in-time (JIT) compiler that helps simulate simple instructions efficiently and a fast interpreter for simulating new or complex instructions. We describe SAE’s architecture and instrumentation engine design and show the framework’s usefulness for single- and multi-system architectural and program analysis studies.
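
For flavor, the sketch below shows the kind of per-instruction callback interface such frameworks expose (entirely hypothetical: `ToyEngine`, `on_instruction`, and the trace format are invented here, and SAE's real APIs are not Python):

```python
# Hypothetical DBI-style analysis tool: count executed instructions per
# privilege level. The engine and its hook names are invented for this sketch.
from collections import Counter

class ToyEngine:
    """Stand-in for a simulation engine that streams executed instructions."""
    def __init__(self, trace):
        self.trace = trace          # [(pc, opcode, ring), ...]
        self.callbacks = []
    def on_instruction(self, fn):   # register a per-instruction callback
        self.callbacks.append(fn)
    def run(self):
        for pc, opcode, ring in self.trace:
            for fn in self.callbacks:
                fn(pc, opcode, ring)

counts = Counter()
def count_by_ring(pc, opcode, ring):
    counts["kernel" if ring == 0 else "user"] += 1

engine = ToyEngine([(0xffff0000, "mov", 0), (0x400000, "add", 3), (0x400004, "jmp", 3)])
engine.on_instruction(count_by_ring)
engine.run()
print(counts)  # Counter({'user': 2, 'kernel': 1})
```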

Paper
Y. Zu, W. Huang, I. Paul, and V. J. Reddi, “Ti-States: Processor Power Management in the Temperature Inversion Region,” in Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016, pp. 1–13. Publisher's Version

Temperature inversion is a transistor-level effect that can improve performance as temperature increases. It has largely been ignored in the past because it does not occur in the typical operating region of a processor, but temperature inversion is becoming increasingly important in current and future technologies. In this paper, we study temperature inversion’s implications for architecture design and power and performance management. We present the first public comprehensive measurement-based analysis of the effects of temperature inversion on a real processor, using the AMD A10-8700P processor as our system under test. We show that the extra timing margin introduced by temperature inversion can provide more than 5% Vdd reduction benefit, and this improvement increases to more than 8% when operating in the near-threshold, low-voltage region. To harness this opportunity, we present Ti-states, a power management technique that sets the processor’s voltage based on real-time silicon temperature to improve power efficiency. Ti-states lead to 6% to 12% measured power savings across a range of different temperatures compared to a fixed margin. As technology scales to FD-SOI and FinFET, we show there is an ideal operating temperature for various workloads that maximizes the benefits of temperature inversion. The key is to counterbalance the leakage power increase at higher temperatures with the dynamic power reduction from Ti-states. The projected optimal temperature is typically around 60°C and yields 8% to 9% chip power savings. This optimal high temperature can be exploited to reduce cooling design cost and runtime operating power. Our findings are important for power and thermal management in future chips and process technologies.

Keywords: timing margin; temperature inversion; power management; reliability; technology scaling
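
At its core, the mechanism replaces a fixed worst-case voltage with one indexed by measured temperature. A minimal sketch (the voltage/temperature table and margins are illustrative, not the A10-8700P's calibrated values):

```python
# Hypothetical Ti-states-style lookup: under temperature inversion,
# transistors speed up as silicon heats, so the same frequency can be
# held at a lower Vdd. All numbers below are illustrative, not measured.
TI_STATE_TABLE = [  # (max_temp_C, vdd_volts) for one fixed frequency
    (40, 0.980),
    (60, 0.955),
    (80, 0.935),
    (100, 0.920),
]
FIXED_MARGIN_VDD = 0.980  # conventional worst-case guardband

def ti_state_vdd(temp_c: float) -> float:
    """Pick the lowest safe Vdd for the current silicon temperature."""
    for max_t, vdd in TI_STATE_TABLE:
        if temp_c <= max_t:
            return vdd
    return TI_STATE_TABLE[-1][1]

def dynamic_power_ratio(vdd: float, ref: float = FIXED_MARGIN_VDD) -> float:
    """Dynamic power scales roughly with V^2 at a fixed frequency."""
    return (vdd / ref) ** 2

t = 75.0
print(f"Vdd at {t} C: {ti_state_vdd(t):.3f} V, "
      f"dynamic power: {dynamic_power_ratio(ti_state_vdd(t)):.1%} of fixed margin")
```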

Paper
2015
V. J. Reddi, M. S. Gupta, G. Holloway, G.-Y. Wei, M. D. Smith, and D. Brooks, “Adaptive Event-Guided System and Method for Avoiding Voltage Emergencies,” US Patent 8,949,666, 2015.
Y. Zu, C. R. Lefurgy, J. Leng, M. Halpern, M. S. Floyd, and V. J. Reddi, “Adaptive Guardband Scheduling to Improve System-Level Efficiency of the POWER7+,” in MICRO-48: The 48th Annual IEEE/ACM International Symposium on Microarchitecture, 2015, pp. 308–321. Publisher's Version

The traditional guardbanding approach to ensure processor reliability is becoming obsolete because it always over-provisions voltage and wastes energy. As a next-generation alternative, adaptive guardbanding dynamically adjusts chip clock frequency and voltage based on timing margin measured at runtime. With adaptive guardbanding, voltage guardband is only provided when needed, thereby promising significant energy efficiency improvements. In this paper, we provide the first full-system analysis of adaptive guardbanding’s implications using a POWER7+ multicore. On the basis of a broad collection of hardware measurements, we show the benefits of adaptive guardbanding in a practical setting are strongly dependent upon workload characteristics and chip-wide multicore activity. A key finding is that adaptive guardbanding’s benefits diminish as the number of active cores increases, and they are highly dependent upon the workload running. Through a series of analyses, we show these high-level system effects are the result of interactions between application characteristics, the architecture, and the underlying voltage regulator module’s loadline and IR drop effects. To that end, we introduce adaptive guardband scheduling to reclaim adaptive guardbanding’s efficiency under different enterprise scenarios. Our solution reduces processor power consumption by 6.2% over a highly optimized system, effectively doubling adaptive guardbanding’s original improvement. Our solution also avoids malicious workload mappings to guarantee application QoS in the face of adaptive guardbanding hardware’s variable performance.
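
The loadline interaction behind these findings can be sketched directly (illustrative constants, not POWER7+ values): regulator output sags linearly with total current, so each additional active core lowers the voltage all cores see and shrinks the margin adaptive guardbanding can reclaim.

```python
# Illustrative loadline model; resistance and current values are made up.
R_LOADLINE = 0.0015  # ohms, regulator loadline + IR drop, assumed
V_SET = 1.10         # volts at the regulator output, assumed
CORE_CURRENT = {"cpu-bound": 9.0, "memory-bound": 4.0}  # amps per core, assumed

def chip_voltage(active_workloads):
    """Voltage seen at the cores after loadline droop from total current."""
    total_i = sum(CORE_CURRENT[w] for w in active_workloads)
    return V_SET - total_i * R_LOADLINE

def reclaimable_margin(active_workloads, v_min=1.05):
    """Slack an adaptive guardband could convert into lower Vdd."""
    return max(0.0, chip_voltage(active_workloads) - v_min)

# Margin shrinks as more (and hungrier) cores activate:
print(reclaimable_margin(["memory-bound"]))   # one light core: large slack
print(reclaimable_margin(["cpu-bound"] * 8))  # eight heavy cores: no slack
```

A scheduler can exploit this by steering heavy, droop-prone workloads away from each other, which is the intuition behind the adaptive guardband scheduling described above.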

PDF
Y. Zhu, M. Halpern, and V. J. Reddi, “Event-Based Scheduling for Energy-Efficient QoS (EQoS) in Mobile Web Applications,” in 21st International Symposium on High Performance Computer Architecture (HPCA), 2015, pp. 137–149. Publisher's Version

Mobile Web applications have become an integral part of our society. They pose a high demand for application quality of service (QoS). However, the energy-constrained nature of mobile devices makes optimizing for QoS difficult. Prior art on energy efficiency optimizations has only focused on the trade-off between raw performance and energy consumption, ignoring the application QoS characteristics. In this paper, we propose the concept of energy-efficient QoS (eQoS) to capture the trade-off between QoS and energy consumption. Given the fundamental event-driven nature of mobile Web applications, we further propose event-based scheduling as an optimization framework for eQoS. The event-based scheduling automatically reasons about users’ QoS requirements, and accurately slacks the events’ execution time to save energy without violating end users’ experience. We demonstrate a working prototype using the Google Chromium and V8 framework on the Samsung Exynos 5410 SoC (used in the Galaxy S4 smartphone). Based on real hardware and software measurements, we achieve 41.2% energy savings with only 0.4% of QoS violations perceptible to end users.
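
A toy event loop illustrates the scheduling idea (a sketch under assumed deadlines and DVFS states, not the Chromium/V8 prototype): each event type carries a user-perceivable deadline, and the scheduler stretches the event to just fit it at the cheapest adequate frequency.

```python
# Toy eQoS-style event scheduler. The deadlines, the frequency/power
# table, and the cycle counts are illustrative assumptions.
PERCEPTION_DEADLINE_MS = {"touch": 16.7, "timer": 100.0, "network": 200.0}
DVFS = [(0.6, 0.3), (1.0, 0.6), (1.6, 1.0)]  # (GHz, relative power), low to high

def run_event(kind: str, cycles: float):
    """Stretch the event to its perception deadline at the cheapest frequency."""
    deadline = PERCEPTION_DEADLINE_MS[kind]
    for ghz, power in DVFS:
        t = cycles / (ghz * 1e6)      # execution time in ms
        if t <= deadline:
            return ghz, power * t     # (chosen frequency, energy proxy)
    ghz, power = DVFS[-1]             # deadline unmeetable: run flat out
    return ghz, power * cycles / (ghz * 1e6)

for kind, cycles in [("touch", 2e7), ("timer", 2e7), ("network", 5e7)]:
    ghz, energy = run_event(kind, cycles)
    print(f"{kind:8s} -> {ghz} GHz, energy proxy {energy:.1f}")
```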

Paper
J. Leng, Y. Zu, and V. J. Reddi, “GPU Voltage Noise: Characterization and Hierarchical Smoothing of Spatial and Temporal Voltage Noise Interference in GPU Architectures,” in 21st International Symposium on High Performance Computer Architecture (HPCA), 2015, pp. 161–173. Publisher's Version

Energy efficiency is undoubtedly important for GPU architectures. Beyond the traditionally explored energy-efficiency optimization techniques, exploiting the supply voltage guardband remains a promising yet unexplored opportunity. Our hardware measurements show that up to 23% of the nominal supply voltage can be eliminated to improve GPU energy efficiency by as much as 25%. The key obstacle to exploiting this opportunity lies in understanding the characteristics and root causes of large voltage droops in GPU architectures and subsequently smoothing them away without severe performance penalties. The GPU’s manycore nature complicates the voltage noise phenomenon, and its architectural differences from the CPU necessitate a GPU-specific voltage noise analysis. In this paper, we make the following contributions. First, we provide a voltage noise categorization framework to identify, characterize, and understand voltage noise in the manycore GPU architecture. Second, we perform a microarchitecture-level voltage-droop root-cause analysis for the two major droop types we identify, namely the local first-order droop and the global second-order droop. Third, on the basis of our categorization and characterization, we propose a hierarchical voltage smoothing mechanism that mitigates each type of voltage droop. Our evaluation shows it can reduce the worst-case droop by up to 31%, which translates to 11.8% core-level and 7.8% processor-level energy savings.

Paper
D. Richins, Y. Zhu, M. Halpern, and V. J. Reddi, “Locality Lost: Unlocking the Performance of Event-Driven Servers,” in International Symposium on Microarchitecture, 2015.

Server-side Web applications are in the midst of a software evolution. Application developers are turning away from the established thread-per-request model, where each request gets a dedicated thread on the server, and toward event-driven programming platforms, which promise improved scalability and CPU utilization [1]. In response, we perform a microarchitectural analysis of these applications in current server processors and identify several serious performance bottlenecks and optimization opportunities for future processor designs.

Paper
Y. Zhu, D. Richins, M. Halpern, and V. J. Reddi, “Microarchitectural Implications of Event-Driven Server-Side Web Applications,” in Proceedings of the 48th International Symposium on Microarchitecture, 2015, pp. 762–774. Publisher's Version

Enterprise Web applications are moving towards server-side scripting using managed languages. Within this shifting context, event-driven programming is emerging as a crucial programming model to achieve scalability. In this paper, we study the microarchitectural implications of server-side scripting, JavaScript in particular, from a unique event-driven programming model perspective. Using the Node.js framework, we come to several critical microarchitectural conclusions. First, unlike traditional server workloads such as CloudSuite and BigDataBench that are based on the conventional thread-based execution model, event-driven applications are heavily single-threaded, and as such they require significant single-thread performance. Second, the single-thread performance is severely limited by the front-end inefficiencies of today’s server processor microarchitecture, ultimately leading to overall execution inefficiencies. The front-end inefficiencies stem from the unique combination of limited intra-event code reuse and large inter-event reuse distance. Third, through a deep understanding of event-specific characteristics, architects can mitigate the front-end inefficiencies of the managed-language-based event-driven execution via a combination of instruction cache insertion policy and prefetching.
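
The front-end pathology (little code reuse within an event, long distances between reuses across events) can be made concrete with a toy reuse-distance measurement (the instruction-block stream below is fabricated to mimic two events that share only common handler code):

```python
def reuse_distances(stream):
    """For each reuse, count distinct blocks touched since the last use."""
    recency = []  # most recently used block first
    out = []
    for blk in stream:
        if blk in recency:
            out.append(recency.index(blk))  # distinct blocks in between
            recency.remove(blk)
        recency.insert(0, blk)
    return out

# Two events, each touching 64 unique instruction blocks plus one shared
# handler block: the only reuse happens across events, at a long distance.
event_a = [f"a{i}" for i in range(64)] + ["handler"]
event_b = [f"b{i}" for i in range(64)] + ["handler"]
print(reuse_distances(event_a + event_b))  # [64]: reuse only after 64 distinct blocks
```

Long inter-event distances like this defeat a small instruction cache's recency-based retention, which is why insertion-policy and prefetching changes help.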

Paper
M. Halpern, Y. Zhu, R. Peri, and V. J. Reddi, “Mosaic: Cross-Platform User-Interaction Record and Replay for the Fragmented Android Ecosystem,” in Performance Analysis of Systems and Software (ISPASS), 2015 IEEE International Symposium on, 2015, pp. 215–224. Publisher's Version

In contrast to traditional computing systems, such as desktops and servers, that are programmed to perform “compute-bound” and “run-to-completion” tasks, mobile applications are designed for user interactivity. Factoring user interactivity into computer system design and evaluation is important, yet poses many challenges. In particular, systematically studying interactive mobile applications across the diverse set of mobile devices available today is difficult due to the mobile device fragmentation problem. At the time of writing, there are 18,796 distinct Android mobile devices on the market, a number that will only continue to grow. Differences in screen sizes, resolutions, and operating systems impose different interactivity requirements, making it difficult to uniformly study these systems. We present Mosaic, a cross-platform, timing-accurate record and replay tool for Android-based mobile devices. Mosaic overcomes device fragmentation through a novel virtual screen abstraction. User interactions are translated from a physical device into a platform-agnostic intermediate representation before translation to a target system. The intermediate representation is human-readable, which allows Mosaic users to modify previously recorded traces or even synthesize their own user-interactive sessions from scratch. We demonstrate that Mosaic allows user interaction traces to be recorded on emulators, smartphones, tablets, and development boards and replayed on other devices. Using Mosaic, we were able to replay 45 different Google Play applications across multiple devices, and we also show that we can perform cross-platform performance comparisons between two different processors under identical user interactions.
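
The virtual screen abstraction is easy to sketch (hypothetical trace format; Mosaic's actual intermediate representation also carries timing and gesture information): record touches in normalized device-independent coordinates, then retarget them to any screen on replay.

```python
# Sketch of a Mosaic-style virtual screen abstraction. The event tuple
# format here is invented; real traces are richer than bare coordinates.
def record(x_px, y_px, width, height):
    """Translate a physical touch into virtual-screen coordinates in [0, 1]."""
    return (x_px / width, y_px / height)

def replay(event, width, height):
    """Translate a virtual-screen event onto a target device's screen."""
    vx, vy = event
    return (round(vx * width), round(vy * height))

# A tap recorded on a 1080x1920 phone lands correctly on an 800x1280 tablet:
e = record(540, 960, 1080, 1920)  # center of the recording device
print(replay(e, 800, 1280))       # (400, 640), center of the target device
```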

Paper
Y. Zhu, M. Halpern, and V. J. Reddi, “The Role of the CPU in Energy-Efficient Mobile Web Browsing,” IEEE Micro, vol. 35, no. 1, pp. 26–33, 2015. Publisher's Version

The mobile CPU is starting to noticeably impact Web browsing performance and energy consumption. Achieving energy-efficient mobile Web browsing requires considering both CPU and network capabilities. Researchers must leverage interactions between the CPU and network to deliver high mobile Web performance while maintaining a low energy footprint. Designing future high-performance and energy-efficient mobile Web clients implies looking beyond individual components and taking a full system perspective.

Paper
J. Leng, A. Buyuktosunoglu, R. Bertran, P. Bose, and V. J. Reddi, “Safe Limits on Voltage Reduction Efficiency in GPUs: A Direct Measurement Approach,” in Microarchitecture (MICRO), 2015 48th Annual IEEE/ACM International Symposium on, 2015, pp. 294–307. Publisher's Version

Energy efficiency of GPU architectures has emerged as an important aspect of computer system design. In this paper, we explore the energy benefits of reducing the GPU chip’s voltage to the safe limit, i.e., the Vmin point. We perform such a study on several commercial off-the-shelf GPU cards. We find that there exists about a 20% voltage guardband on those GPUs spanning two architectural generations, which, if “eliminated” completely, can result in up to 25% energy savings on one of the studied GPU cards. The exact improvement magnitude depends on the program’s available guardband, because our measurement results unveil a program-dependent Vmin behavior across the studied programs. We make fundamental observations about the program-dependent Vmin behavior. We experimentally determine that voltage noise has a larger impact on Vmin than process and temperature variation, and that the activities during kernel execution cause large voltage droops. From these findings, we show how to use a kernel’s microarchitectural performance counters to predict its Vmin value accurately. The average and maximum prediction errors are 0.5% and 3%, respectively. The accurate Vmin prediction opens up new possibilities of a cross-layer dynamic guardbanding scheme for GPUs, in which software predicts and manages the voltage guardband, while functional correctness is ensured by a hardware safety net mechanism.
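
The prediction step admits a compact illustration (a sketch: the counters, the data, and the choice of a linear model are assumptions standing in for whatever the paper fits to its hardware measurements):

```python
import numpy as np

# Fabricated training data: per-kernel performance-counter rates
# (e.g., IPC, memory intensity, occupancy) and measured Vmin in volts.
X = np.array([
    [1.8, 0.20, 0.90],
    [0.7, 0.80, 0.50],
    [2.1, 0.10, 0.95],
    [1.2, 0.50, 0.70],
])
vmin = np.array([0.92, 0.88, 0.93, 0.90])

# Least-squares fit with an intercept term.
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, vmin, rcond=None)

def predict_vmin(counters):
    """Predict a new kernel's Vmin from its counter rates."""
    return float(np.append(counters, 1.0) @ coef)

print(f"{predict_vmin([1.5, 0.3, 0.8]):.3f} V")  # unseen kernel's predicted Vmin
```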

Paper
2014
J. Leng, Y. Zu, and V. J. Reddi, “Energy Efficiency Benefits of Reducing the Voltage Guardband on the Kepler GPU Architecture,” Proc. of Silicon Errors in Logic – System Effects (SELSE), 2014.

Energy efficiency of GPU architectures has emerged as an important design criterion for both NVIDIA and AMD. In this paper, we explore the benefits of scaling a general-purpose GPU (GPGPU) core’s supply voltage to the near limits of execution failure. We find that as much as 21% of NVIDIA GTX 680’s core supply voltage guardband can be eliminated to achieve significant energy efficiency improvement. Measured results indicate that the energy improvements can be as high as 25% without any performance loss. The challenge, however, is to understand what impacts the minimum voltage guardband and how the guardband can be scaled without compromising correctness. We show that GPU microarchitectural activity patterns caused by different program characteristics are the root cause(s) of the large voltage guardband. We also demonstrate how microarchitecture-level parameters, such as clock frequency and the number of cores, impact the guardband. We hope our preliminary analysis lays the groundwork for future research.

Paper
C. Zhou, X. Wang, W. Xu, Y. Zhu, V. J. Reddi, and C. H. Kim, “Estimation of Instantaneous Frequency Fluctuation in a Fast DVFS Environment Using an Empirical BTI Stress-Relaxation Model,” in Proceedings of the International Reliability Physics Symposium (IRPS), 2014, pp. 2D–2. Publisher's Version

This work proposes an empirical Bias Temperature Instability (BTI) stress-relaxation model based on the superposition property. The model was used to study the instantaneous frequency fluctuation in a fast Dynamic Voltage and Frequency Scaling (DVFS) environment. VDD and operating frequency information for this study were collected from an ARM Cortex-A15 processor-based development board running the Android operating system. Simulation results show that the frequency peaks and dips are mainly functions of two parameters: (1) the amount of stress or recovery experienced by the circuit prior to the VDD switching and (2) the frequency sensitivity to device aging after the VDD switching.
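
The superposition model can be written down in a few lines (illustrative constants, not the paper's fitted values): each stress interval adds a power-law threshold-voltage shift, and each relaxation interval anneals part of the accumulated shift.

```python
# Illustrative BTI stress/recovery superposition; A_STRESS, N_EXP, and
# R_FRAC are made-up constants, not the paper's empirically fitted values.
A_STRESS = 2.0e-3  # V per second^n of stress, assumed
N_EXP = 0.16       # BTI power-law time exponent, assumed
R_FRAC = 0.35      # fraction of shift recovered per relaxation interval, assumed

def delta_vth(schedule):
    """schedule: ordered (duration_s, stressed: bool) intervals."""
    shift = 0.0
    for dur, stressed in schedule:
        if stressed:
            shift += A_STRESS * dur ** N_EXP  # each stress phase adds a shift
        else:
            shift *= 1.0 - R_FRAC             # relaxation anneals part of it
    return shift

# Alternating high/low Vdd phases, as under fast DVFS:
print(f"{delta_vth([(100, True), (50, False), (100, True)]) * 1e3:.2f} mV")
```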

Paper Presentation
Y. Zhu, A. Srikanth, J. Leng, and V. J. Reddi, “Exploiting Webpage Characteristics for Energy-Efficient Mobile Web Browsing,” Computer Architecture Letters (CAL), vol. 13, no. 1, pp. 33–36, 2014. Publisher's Version

Web browsing on mobile devices is undoubtedly the future. However, with the increasing complexity of webpages, the mobile device’s computation capability and energy consumption become major obstacles to a satisfactory user experience. In this paper, we propose a mechanism to effectively leverage processor frequency scaling in order to balance the performance and energy consumption of mobile web browsing. The mechanism explores the performance and energy tradeoff in webpage loading, and schedules webpage loading at different frequencies according to each webpage’s characteristics. The proposed solution achieves 20.3% energy savings compared to the performance mode, and improves webpage loading performance by 37.1% compared to the battery-saving mode.

Index Terms: Energy, EDP, Cutoff, Performance, Webpages
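
The per-webpage decision can be illustrated with an energy-delay-product comparison (a sketch: the load times and power draws are invented, and the EDP criterion echoes the index terms above rather than quoting the paper's exact cutoff mechanism): simple pages win at a low frequency, while CPU-bound pages justify a high one.

```python
# Toy per-webpage frequency choice by energy-delay product (EDP).
# Load times and power draws below are invented for illustration.
def pick_by_edp(load_time_ms, power_w):
    """Return the frequency (GHz) minimizing energy x delay for this page."""
    def edp(f):
        t = load_time_ms[f]
        return (power_w[f] * t) * t  # energy (W*ms) times delay (ms)
    return min(load_time_ms, key=edp)

power = {0.8: 0.9, 1.6: 2.2}                # W at each frequency, assumed
simple_page = {0.8: 900.0, 1.6: 700.0}      # little CPU work: frequency barely helps
complex_page = {0.8: 3200.0, 1.6: 1700.0}   # CPU-bound: frequency nearly halves load time
print(pick_by_edp(simple_page, power), pick_by_edp(complex_page, power))  # 0.8 1.6
```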

Paper Presentation (Best of CAL)
J. Leng, Y. Zu, M. Rhu, M. Gupta, and V. J. Reddi, “GPUVolt: Modeling and Characterizing Voltage Noise in GPU Architectures,” in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), 2014, pp. 141–146.

Voltage noise is a major obstacle in improving processor energy efficiency because it necessitates large operating voltage guardbands that increase overall power consumption and limit peak performance. Identifying the leading root causes of voltage noise is essential to minimize the unnecessary guardband and maximize the overall energy efficiency. We provide the first-ever modeling and characterization of voltage noise in GPUs based on a new simulation infrastructure called GPUVolt. Using it, we identify the key intracore microarchitectural components (e.g., the register file and special functional units) that significantly impact the GPU’s voltage noise. We also demonstrate that intercore-aligned microarchitectural activity detrimentally impacts the chipwide worst-case voltage droops. On the basis of these findings, we propose a combined register-file and execution-unit throttling mechanism that smooths GPU voltage noise and reduces the guardband requirement by as much as 29%.

Categories and Subject Descriptors: C.4 [Performance of Systems]: Modeling techniques; Reliability, availability, and serviceability

Keywords: di/dt, inductive noise, GPU architecture, GPU reliability
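
The smoothing idea reduces to bounding how fast microarchitectural activity may ramp, since steep di/dt is what excites droops. A toy throttle (made-up threshold and trace; the paper's mechanism gates the register file and execution units inside a simulated GPU):

```python
# Toy voltage-noise throttle: cap how fast activity (a proxy for current
# draw) may ramp cycle to cycle, since large di/dt excites supply droops.
# The threshold and the activity trace are illustrative assumptions.
MAX_STEP = 10.0  # allowed activity increase per cycle, arbitrary units

def throttle(requested_activity):
    """Limit cycle-to-cycle activity ramps to bound di/dt."""
    smoothed, prev = [], 0.0
    for req in requested_activity:
        step = min(req, prev + MAX_STEP)  # issue less work if ramp too steep
        smoothed.append(step)
        prev = step
    return smoothed

burst = [0, 0, 80, 80, 80, 0]  # aligned wakeup: a steep di/dt event
print(throttle(burst))          # [0, 0, 10, 20, 30, 0]: the ramp is flattened
```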

Paper
