Publications by Type: Miscellaneous

C. R. Banbury, et al., “Benchmarking TinyML Systems: Challenges and Direction”. 2020.
B. P. Duisterhof, et al., “Learning to Seek: Deep Reinforcement Learning for Phototaxis of a Nano Drone in an Obstacle Field”. 2020.
P. Mattson, et al., “MLPerf Training Benchmark”. 2020.
M. Lam, Z. Yedidia, C. Banbury, and V. J. Reddi, “Quantized Neural Network Inference with Precision Batching”. 2020.
E. Shaotran, J. J. Cruz, and V. J. Reddi, “GLADAS: Gesture Learning for Advanced Driver Assistance Systems”. 2019.
B. Boroujerdian, et al., “The Role of Compute in Autonomous Aerial Vehicles”. 2019.Abstract
Autonomous-mobile cyber-physical machines are part of our future. Specifically, unmanned-aerial-vehicles have seen a resurgence in activity with use-cases such as package delivery. These systems face many challenges such as their low-endurance caused by limited onboard-energy, hence, improving the mission-time and energy are of importance. Such improvements traditionally are delivered through better algorithms. But our premise is that more powerful and efficient onboard-compute should also address the problem. This paper investigates how the compute subsystem, in a cyber-physical mobile machine, such as a Micro Aerial Vehicle, impacts mission-time and energy. Specifically, we pose the question as what is the role of computing for cyber-physical mobile robots? We show that compute and motion are tightly intertwined, hence a close examination of cyber and physical processes and their impact on one another is necessary. We show different impact paths through which compute impacts mission-metrics and examine them using analytical models, simulation, and end-to-end benchmarking. To enable similar studies, we open sourced MAVBench, our tool-set consisting of a closed-loop simulator and a benchmark suite. Our investigations show cyber-physical co-design, a methodology where robot's cyber and physical processes/quantities are developed with one another consideration, similar to hardware-software co-design, is necessary for optimal robot design.
S. Krishnan, B. Boroujerdian, A. Faust, and V. J. Reddi, “Toward Exploring End-to-End Learning Algorithms for Autonomous Aerial Machines,” Workshop Algorithms And Architectures For Learning In-The-Loop Systems In Autonomous Flight with International Conference on Robotics and Automation (ICRA). 2019.Abstract

We develop AirLearning, a tool suite for endto-end closed-loop UAV analysis, equipped with a customized yet randomized environment generator in order to expose the UAV with a diverse set of challenges. We take Deep Q networks (DQN) as an example deep reinforcement learning algorithm and use curriculum learning to train a point to point obstacle avoidance policy. While we determine the best policy based on the success rate, we evaluate it under strict resource constraints on an embedded platform such as RasPi 3. Using hardware in the loop methodology, we quantify the policy’s performance with quality of flight metrics such as energy consumed, endurance and the average length of the trajectory. We find that the trajectories produced on the embedded platform are very different from those predicted on the desktop, resulting in up to 26.43% longer trajectories.

Quality of flight metrics with hardware in the loop characterizes those differences in simulation, thereby exposing how the choice of onboard compute contributes to shortening or widening of ‘Sim2Real’ gap.

M. S. Louis, et al., “Towards Deep Learning using TensorFlow Lite on RISC-V,” Third Workshop on Computer Architecture Research with RISC-V (CARRV). 2019.Abstract

Deep neural networks have been extensively adopted for a myriad of applications due to their ability to learn patterns from large amounts of data. The desire to preserve user privacy and reduce user-perceived latency has created the need to perform deep neural network inference tasks on low-power consumer edge devices. Since such tasks often tend to be computationally intensive, offloading this compute from mobile/embedded CPU to a purposedesigned "Neural Processing Engines" is a commonly adopted solution for accelerating deep learning computations. While these accelerators offer significant speed-ups for key machine learning kernels, overheads resulting from frequent host-accelerator communication often diminish the net application-level benefit of this heterogeneous system. Our solution for accelerating such workloads involves developing ISA extensions customized for machine learning kernels and designing a custom in-pipeline execution unit for these specialized instructions. We base our ISA extensions on RISC-V: an open ISA specification that lends itself to such specializations. In this paper, we present the software infrastructure for optimizing neural network execution on RISC-V with ISA extensions. Our ISA extensions are derived from the RISC-V Vector ISA proposal, and we develop optimized implementations of the critical kernels such as convolution and matrix multiplication using these instructions. These optimized functions are subsequently added to the TensorFlow Lite source code and cross-compiled for RISC-V. We find that only a small set of instruction extensions achieves coverage over a wide variety of deep neural networks designed for vision and speech-related tasks. On average, our software implementation using the extended instructions set reduces the executed instruction count by 8X in comparison to baseline implementation. In parallel, we are also working on the hardware design of the inpipeline machine learning accelerator. We plan to open-source our software modifications to TF Lite, as well as the micro-architecture design in due course.

J. Mohan, D. Purohith, M. Halpern, V. Chidambaram, and V. J. Reddi, “Storage on Your Smartphone Uses More Energy Than You Think,” USENIX HotStorage. 2017.Abstract

Energy consumption is a key concern for mobile devices. Prior research has focused on the screen and the network as the major sources of energy consumption. Through carefully designed measurement-based experiments, we show that for certain storage-intensive workloads, the storage subsystem on an Android smartphone consumes a significant amount of energy (36%), on par with screen energy consumption. We analyze the energy consumption of different storage primitives, such as sequential and random writes, on two popular mobile file systems, ext4 and F2FS. In addition, since most Android applications use SQLite for storage, we analyze the energy consumption of different SQLite operations. We present several interesting results from our analysis: for example, random writes consume 15× higher energy than sequential writes, and that F2FS consumes half the energy as ext4 for most workloads. We believe our results contribute useful design guidelines for the developers of energy-efficient mobile file systems.


Paper Presentation
M. Halpern, T. Mummert, M. Novak, E. Duesterwald, and V. J. Reddi, “The Case for Node Multi-Versioning in Cognitive Cloud Services: Achieving Responsiveness and Accuracy at Datacenter Scale,” Workshop on Cognitive Architectures (CogArch). 2016.Abstract

Cognitive cloud services seek to provide end-users with functionalities that have historically required human intellect to complete. End-users expect these services to be both responsive and accurate, which pose conflicting requirements for service providers. Today’s cloud services deployment schemes follow a “one size fits all” scale-out strategy, where multiple instantiations of the same version of the service are used to scale-out and handle all end-users. Meanwhile, many cognitive services are of a statistical nature where deeper exploration yields more accurate results but also requires more processing time. Finding a single service configuration setting that satisfies the latency and accuracy requirements for the largest number of expected end-user requests can be a challenging task. As a result, cognitive cloud service providers are conservatively configured to maximize the number of enduser requests for which a satisfactory latency-accuracy tradeoff can be achieved. Using a production-grade Automatic Speech Recognition cloud service as a representative example to study, we demonstrate the inefficiencies of this single version approach and propose a new service node multi-versioning deployment scheme for cognitive services instead. We present an oracle-based limit study where we show that service node multi-versioning can provide a 2.5X reduction in execution time together with a 24% improvement in accuracy over a traditional single version deployment scheme. We also discuss several design considerations to address when implementing service node multi-versioning.

S. Chai, D. Zhang, J. Leng, and V. J. Reddi, “Lightweight Detection and Recovery Mechanisms to Extend Algorithm Resiliency in Noisy Computation,” Workshop on Near-threshold Computing (WNTC). 2014.Abstract

— The intrinsic robustness of an algorithm and architecture depends highly on the combined ability tolerate noise. In this paper, we present an alternative approach for energy reduction for near threshold computing based on a statistical modeling of computational noise induced from noisy memory and non-ideal interconnects. We present this approach as a complement to the standard approximate computing approaches. We show results of the lightweight error checks and recovery based on several design considerations on data value speculation.

Index Terms—Approximate computing, noise resiliency, computation noise, near threshold computing

M. Kazdagli, L. Huang, V. REDDI, and M. Tiwari, “Morpheus: Benchmarking Computational Diversity in Mobile Malware,” Workshop on Hardware and Architectural Support for Security and Privacy (HASP). ACM, 2014.Abstract

Computational characteristics of a program can potentially be used to identify malicious programs from benign ones. However, systematically evaluating malware detection techniques, especially when malware samples are hard to run correctly and can adapt their computational characteristics, is a hard problem. We introduce Morpheus – a benchmarking tool that includes both real mobile malware and a synthetic malware generator that can be configured to generate a computationally diverse malware sample-set – as a tool to evaluate computational signatures based malware detection. Morpheus also includes a set of computationally diverse benign applications that can be used to repackage malware into, along with a recorded trace of over 1 hour long realistic human usage for each app that can be used to replay both benign and malicious executions.

The current Morpheus prototype targets Android applications and malware samples. Using Morpheus, we quantify the computational diversity in malware behavior and expose opportunities for dynamic analyses that can detect mobile malware. Specifically, the use of obfuscation and encryption to thwart static analyses causes the malicious execution to be more distinctive – a potential opportunity for detection. We also present potential challenges, specifically, minimizing false positives that can arise due to diversity of benign executions.

Categories and Subject Descriptors

D.4.6 [Security and Protection]: Invasive software


security, mobile malware, performance counters

S. Kanev, T. M. Jones, G. - Y. Wei, D. M. Brooks, and V. J. Reddi, “Measuring Code Optimization Impact on Voltage Noise,” Workshop on Silicon Errors in Logic - System Effects (SELSE). 2013.Abstract

In this paper, we characterize the impact of compiler optimizations on voltage noise. While intuition may suggest that the better processor utilization ensured by optimizing compilers results in a small amount of voltage variation, our measurements on a IntelR CoreTM2 Duo processor show the opposite – the majority of SPEC 2006 benchmarks exhibit more voltage droops when aggressively optimized. We show that this increase in noise could be sufficient for a net performance decrease in a typicalcase, resilient design.

A. Shye, et al., “Analysis of Path Profiling Information Generated With Performance Monitoring Hardware,” Workshop on Interaction between Compilers and Computer Architectures (INTERACT). IEEE, pp. 34–43, 2005. IEEE VersionAbstract

Even with the breakthroughs in semiconductor technology that will enable billion transistor designs, hardwarebased architecture paradigms alone cannot substantially improve processor performance. The challenge in realizing the full potential of these future machines is to find ways to adapt program behavior to application needs and processor resources. As such, run-time optimization will have a distinct role in future high performance systems. However, as these systems are dependent on accurate, fine-grain profile information, traditional approaches to collecting profiles at run-time result in significant slowdowns during program execution.

A novel approach to low-overhead profiling is to exploit hardware Performance Monitoring Units (PMUs) present in modern microprocessors. The Itanium-2 PMU can periodically sample the last few taken branches in an executing program and this information can be used to recreate partial paths of execution. With compiler-aided analysis, the partial paths can be correlated into full paths. As statistically hot paths are most likely to occur in PMU samples, even infrequent sampling can accurately identify these paths. While traditional path profiling techniques carry a high overhead, a PMU-based path profiler represents an effective lightweight profiling alternative. This paper characterizes the PMU-based path information and demonstrates the construction of such a PMU-based path profiler for a run-time system.

V. J. Reddi, D. Connors, and R. S. Cohn, “Persistence in Dynamic Code Transformation Systems,” ACM SIGARCH Computer Architecture News, vol. 33, no. 5. ACM, pp. 69–74, 2005. Publisher's VersionAbstract

Dynamic code transformation systems (DCTS) can broadly be grouped into three distinct categories: optimization, translation and instrumentation. All of these face the critical challenge of minimizing the overhead incurred during transformation since their execution is interleaved with the execution of the application itself. The common DCTS tasks incurring overhead are the identification of frequently executed code sequences, costly analysis of program information, and run-time creation (writing) of new code sequences. The cost of such work is amortized by the repeated execution of the transformed code. However, as these steps are applied to all general code regions (regardless of their execution frequency and characteristics), there is substantial overhead that impacts the application’s performance. As such, it is challenging to effectively deploy dynamic transformation under fixed performance constraints. This paper explores a technique for eliminating the overhead incurred by exploiting persistent application execution characteristics that are shared across different application invocations. This technique is implemented and evaluated in Pin, a dynamic instrumentation engine. This version of Pin is referred to as Persistent Pin (PPin). Initial PPin experimental results indicate that using information from prior runs can reduce dynamic instrumentation overhead of SPEC applications by as much as 25% and over 90% for everyday applications like web browsers, display rendering systems, and spreadsheet programs.

V. J. Reddi, A. Settle, D. A. Connors, and R. S. Cohn, “PIN: A Binary Instrumentation Tool for Computer Architecture Research and Education,” Workshop on Computer architecture education (WCAE). ACM, pp. 22, 2004. Publisher's VersionAbstract

Computer architecture embraces a tremendous number of ever-changing inter-connected concepts and information, yet computer architecture education is very often static, seemingly motionless. Computer architecture is commonly taught using simple piecewise methods of explaining how the hardware performs a given task, rather than characterizing the interaction of software and hardware. Visualization tools allow students to interactively explore basic concepts in computer architecture but are limited in their ability to engage students in research and design concepts. Likewise as the development of simulation models such as caches, branch predictors, and pipelines aid student understanding of architecture components, such models have limitations in the workloads that can be examined because of issues with execution time and environment. Overall, to effectively understand modern architectures, it is simply essential to experiment the characteristics of real application workloads. Likewise, understanding program behavior is necessary to effective programming, comprehension of architecture bottlenecks, and hardware design. Computer architecture education must include experience in analyzing program behavior and workload characteristics using effective tools. To explore workload characteristic analysis in computer architecture design, we propose using PIN, a binary instrumentation tool for computer architecture research and education projects.