Runtime Systems

A runtime system is a framework that typically monitors and orchestrates program execution. Runtime systems come in many forms. Some manage code to improve performance dynamically while the program is running; others monitor the code to understand what it is doing or how it executes on the underlying system. There are also language-level runtimes that dynamically translate high-level program syntax into executable binary code on the fly; examples include JavaScript and Python runtimes.

Our group focuses on building feedback-directed runtime optimizers that generate and optimize code to run efficiently on the underlying hardware. Here, efficiency means better performance, lower power consumption, or improved reliability on faulty hardware.
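As a rough illustration of the feedback-directed idea, the sketch below shows a toy runtime that counts calls and switches to a faster variant once a function becomes hot. It is not the group's system: `TinyRuntime`, `hot_threshold`, and the two `sum_squares_*` functions are hypothetical names invented for this example.

```python
from collections import Counter


class TinyRuntime:
    """Toy feedback-directed runtime (hypothetical): profile calls, then swap variants."""

    def __init__(self, hot_threshold=3):
        self.hot_threshold = hot_threshold
        self.call_counts = Counter()   # profile feedback: calls per function name
        self.variants = {}             # name -> (baseline, optimized)
        self.active = {}               # name -> variant currently dispatched to

    def register(self, name, baseline, optimized):
        self.variants[name] = (baseline, optimized)
        self.active[name] = baseline

    def call(self, name, *args):
        self.call_counts[name] += 1
        # Once the function is "hot", dispatch to the optimized variant instead.
        if self.call_counts[name] == self.hot_threshold:
            self.active[name] = self.variants[name][1]
        return self.active[name](*args)


def sum_squares_baseline(n):
    total = 0
    for i in range(n):
        total += i * i
    return total


def sum_squares_optimized(n):
    # Closed form for 0^2 + 1^2 + ... + (n-1)^2.
    return (n - 1) * n * (2 * n - 1) // 6


rt = TinyRuntime(hot_threshold=3)
rt.register("sum_squares", sum_squares_baseline, sum_squares_optimized)
results = [rt.call("sum_squares", 10) for _ in range(5)]
```

Both variants return the same values, so the switch is invisible to the caller; a real optimizer would base the decision on richer feedback such as hardware counters or power measurements.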

Select Publications

M. S. Louis, et al., “Towards Deep Learning using TensorFlow Lite on RISC-V,” Third Workshop on Computer Architecture Research with RISC-V (CARRV), 2019.


Deep neural networks have been extensively adopted for a myriad of applications due to their ability to learn patterns from large amounts of data. The desire to preserve user privacy and reduce user-perceived latency has created the need to perform deep neural network inference tasks on low-power consumer edge devices. Since such tasks often tend to be computationally intensive, offloading this compute from the mobile/embedded CPU to purpose-designed "Neural Processing Engines" is a commonly adopted solution for accelerating deep learning computations. While these accelerators offer significant speed-ups for key machine learning kernels, overheads resulting from frequent host-accelerator communication often diminish the net application-level benefit of this heterogeneous system. Our solution for accelerating such workloads involves developing ISA extensions customized for machine learning kernels and designing a custom in-pipeline execution unit for these specialized instructions. We base our ISA extensions on RISC-V: an open ISA specification that lends itself to such specializations. In this paper, we present the software infrastructure for optimizing neural network execution on RISC-V with ISA extensions. Our ISA extensions are derived from the RISC-V Vector ISA proposal, and we develop optimized implementations of critical kernels such as convolution and matrix multiplication using these instructions. These optimized functions are subsequently added to the TensorFlow Lite source code and cross-compiled for RISC-V. We find that only a small set of instruction extensions achieves coverage over a wide variety of deep neural networks designed for vision and speech-related tasks. On average, our software implementation using the extended instruction set reduces the executed instruction count by 8X in comparison to the baseline implementation. In parallel, we are also working on the hardware design of the in-pipeline machine learning accelerator. We plan to open-source our software modifications to TF Lite, as well as the micro-architecture design, in due course.
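To give a feel for why vector-style instructions shrink the executed instruction count, here is a deliberately simplified counting model for a dot product. The per-element and per-chunk costs, the `vector_length` parameter, and both function names are assumptions made for illustration; the model ignores loop overhead and is not meant to reproduce the paper's measured 8X figure.

```python
import math


def scalar_dot_instruction_count(n):
    # Assumed cost per element: load a[i], load b[i], multiply, accumulate.
    return 4 * n


def vector_dot_instruction_count(n, vector_length):
    # Assumed cost per chunk: vector load a, vector load b, vector multiply-accumulate.
    chunks = math.ceil(n / vector_length)
    # One extra instruction at the end to reduce the vector accumulator to a scalar.
    return 3 * chunks + 1


n = 1024
vector_length = 8  # hypothetical number of elements per vector register
scalar_count = scalar_dot_instruction_count(n)
vector_count = vector_dot_instruction_count(n, vector_length)
reduction = scalar_count / vector_count
print(f"scalar={scalar_count} vector={vector_count} reduction={reduction:.1f}x")
```

In this model the reduction grows with the vector length, which is the basic effect that kernels such as convolution and matrix multiplication can exploit.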

N. Chachmon, D. Richins, R. Cohn, M. Christensson, W. Cui, and V. J. Reddi, “Simulation and Analysis Engine for Scale-Out Workloads,” in Proceedings of the 2016 International Conference on Supercomputing (ICS), 2016, pp. 22.


We introduce a system-level Simulation and Analysis Engine (SAE) framework based on dynamic binary instrumentation for fine-grained and customizable instruction-level introspection of everything that executes on the processor. SAE can instrument the BIOS, kernel, drivers, and user processes. It can also instrument multiple systems simultaneously using a single instrumentation interface, which is essential for studying scale-out applications. SAE is an x86 instruction set simulator designed specifically to enable rapid prototyping, evaluation, and validation of architectural extensions and program analysis tools using its flexible APIs. It is fast enough to execute full platform workloads (a modern operating system can boot in a few minutes), thus enabling research, evaluation, and validation of complex functionalities related to multicore configurations, virtualization, security, and more. To reach high speeds, SAE couples tightly with a virtual platform and employs both a just-in-time (JIT) compiler that helps simulate simple instructions efficiently and a fast interpreter for simulating new or complex instructions. We describe SAE’s architecture and instrumentation engine design and show the framework’s usefulness for single- and multi-system architectural and program analysis studies.
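The split between a JIT path for simple instructions and an interpreter for new or complex ones can be sketched with a toy simulator. Everything here is hypothetical: the three-opcode instruction format, `HybridSimulator`, and the use of Python closures as stand-ins for JIT-compiled code. It does not reflect SAE's actual design or APIs.

```python
SIMPLE_OPS = {"movi", "add"}  # assumed set of instructions worth translating


def interpret(instr, regs):
    """Slow path: decode and execute the instruction every time it is seen."""
    op, dst, a, b = instr
    if op == "movi":
        regs[dst] = a
    elif op == "add":
        regs[dst] = regs[a] + regs[b]
    elif op == "popcnt":  # stand-in for a "new or complex" instruction
        regs[dst] = bin(regs[a]).count("1")
    else:
        raise ValueError(f"unknown opcode {op}")


def compile_simple(instr):
    """Fast path: decode once and return a closure that only does the work."""
    op, dst, a, b = instr
    if op == "movi":
        def run(regs):
            regs[dst] = a
    else:  # "add"
        def run(regs):
            regs[dst] = regs[a] + regs[b]
    return run


class HybridSimulator:
    def __init__(self):
        self.code_cache = {}                     # instruction -> translated closure
        self.stats = {"jit": 0, "interp": 0}

    def run(self, program, regs):
        for instr in program:
            if instr[0] in SIMPLE_OPS:
                fn = self.code_cache.get(instr)
                if fn is None:                   # translate on first use, then reuse
                    fn = self.code_cache[instr] = compile_simple(instr)
                fn(regs)
                self.stats["jit"] += 1
            else:
                interpret(instr, regs)
                self.stats["interp"] += 1
        return regs


program = [
    ("movi", "r0", 6, None),
    ("movi", "r1", 9, None),
    ("add", "r2", "r0", "r1"),
    ("popcnt", "r3", "r2", None),
]
sim = HybridSimulator()
regs = sim.run(program, {})
```

Adding a new instruction only requires extending `interpret`, which mirrors why an interpreter is convenient for prototyping architectural extensions while the translated path keeps common instructions cheap.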

C.-K. Luk, et al., “Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation,” in Programming Language Design and Implementation (PLDI), 2005, no. 6.


Robust and powerful software instrumentation tools are essential for program analysis tasks such as profiling, performance evaluation, and bug detection. To meet this need, we have developed a new instrumentation system called Pin. Our goals are to provide easy-to-use, portable, transparent, and efficient instrumentation. Instrumentation tools (called Pintools) are written in C/C++ using Pin’s rich API. Pin follows the model of ATOM, allowing the tool writer to analyze an application at the instruction level without the need for detailed knowledge of the underlying instruction set. The API is designed to be architecture independent whenever possible, making Pintools source compatible across different architectures. However, a Pintool can access architecture-specific details when necessary. Instrumentation with Pin is mostly transparent as the application and Pintool observe the application’s original, uninstrumented behavior. Pin uses dynamic compilation to instrument executables while they are running. For efficiency, Pin uses several techniques, including inlining, register re-allocation, liveness analysis, and instruction scheduling to optimize instrumentation. This fully automated approach delivers significantly better instrumentation performance than similar tools. For example, Pin is 3.3x faster than Valgrind and 2x faster than DynamoRIO for basic-block counting. To illustrate Pin’s versatility, we describe two Pintools in daily use to analyze production software. Pin is publicly available for Linux platforms on four architectures: IA32 (32-bit x86), EM64T (64-bit x86), Itanium®, and ARM. In the ten months since Pin 2 was released in July 2004, there have been over 3000 downloads from its website.
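Basic-block counting, the benchmark mentioned above, can be illustrated without Pin itself. The sketch below models a program as named basic blocks and inserts an analysis call before each one. It is a conceptual toy, not a Pintool and not Pin's API: `instrument`, `run`, and the block functions are hypothetical names.

```python
from collections import Counter


def instrument(blocks, analysis):
    """Return new blocks that call analysis(name) before the original block body."""
    def wrap(name, body):
        def instrumented(state):
            analysis(name)          # inserted analysis call
            return body(state)      # original, unmodified behavior
        return instrumented
    return {name: wrap(name, body) for name, body in blocks.items()}


def run(blocks, entry, state):
    """Execute blocks until one returns None; each block returns its successor."""
    name = entry
    while name is not None:
        name = blocks[name](state)
    return state


# Toy program: total = 0 + 1 + ... + (n - 1), split into basic blocks.
def bb_entry(s):
    s["i"], s["total"] = 0, 0
    return "loop_head"


def bb_loop_head(s):
    return "loop_body" if s["i"] < s["n"] else "exit"


def bb_loop_body(s):
    s["total"] += s["i"]
    s["i"] += 1
    return "loop_head"


def bb_exit(s):
    return None


blocks = {"entry": bb_entry, "loop_head": bb_loop_head,
          "loop_body": bb_loop_body, "exit": bb_exit}

counts = Counter()


def count_block(name):
    counts[name] += 1


plain_state = run(blocks, "entry", {"n": 4})
instrumented_state = run(instrument(blocks, count_block), "entry", {"n": 4})
```

The instrumented run produces the same program state as the plain run, which is the transparency property in miniature; the counts are collected entirely on the side.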

Categories and Subject Descriptors

D.2.5 [Software Engineering]: Testing and Debugging - code inspections and walk-throughs, debugging aids, tracing; D.3.4 [Programming Languages]: Processors - compilers, incremental compilers

General Terms

Languages, Performance, Experimentation

Keywords

Instrumentation, program analysis tools, dynamic compilation

