Dynamic Optimization

For years, the steadily growing clock speed of new processors has been relied upon to deliver increased performance for a wide range of applications consistently. However, processor frequency scaling has stalled, driving microprocessor companies to investigate adding value through alternative technologies. To continue making gains in system performance existing systems need to optimize execution dynamically.

Dynamic code optimizers are a type of runtime systems that modify an application at run-time to promote desirable execution characteristics, such as high performance, low power, or better-managed resource consumption on the target platform. Most dynamic optimizers first observe run-time features to determine frequently executed sequences of code for performing optimizations within a given instance of execution. These sequences can span procedure boundaries as well as across shared libraries and therefore have more potential for optimization. After accumulating sufficient knowledge, the optimizers regenerate/translate the code into better versions to achieve the desired program/system behavior.

Runtime optimizer

The typical architecture of a runtime system is shown above. We build systems that monitor execution using hardware and software counters, and then dynamically optimize the code by invoking optimizers that specialize in code transformations. Typically, only the hot code is optimized as the penalty paid in optimizing the hot code is quickly recouped or amortized over the improvements we observe from the optimized code execution. We apply these techniques in the context of binary translators, just-in-time compilers for managed languages like JS and Java to improve performance, power, and reliability. 

Publications

W. Cui, D. Richins, Y. Zhu, and V. J. Reddi, “Tail Latency in Node.js: Energy Efficient Turbo Boosting for Long Latency Requests in Event-Driven Web Services,” in Proceedings of the 15th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE), 2019.Abstract

Cloud-based Web services are shifting to the event-driven, scripting language-based programming model to achieve productivity, flexibility, and scalability. Implementations of this model, however, generally suffer from long tail latencies, which we measure using Node.js as a case study. Unlike in traditional thread-based systems, reducing long tails is difficult in event-driven systems due to their inherent asynchronous programming model. We propose a framework to identify and optimize tail latency sources in scripted eventdriven Web services. We introduce profiling that allows us to gain deep insights into not only how asynchronous eventdriven execution impacts application tail latency but also how the managed runtime system overhead exacerbates the tail latency issue further. Using the profiling framework, we propose an event-driven execution runtime design that orchestrates the hardware’s boosting capabilities to reduce tail latency. We achieve higher tail latency reductions with lower energy overhead than prior techniques that are unaware of the underlying event-driven program execution model. The lessons we derive from Node.js apply to other event-driven services based on scripting language frameworks.

L. Guckert, M. O’Connor, K. S. Ravindranath, Z. Zhao, and J. V. Reddi, “A Case for Persistent Caching of Compiled Javascript Code in Mobile Web Browsers,” in Workshop on Architectural and Microarchitectural Support for Binary Translation (AMAS-BT), 2013.Abstract

Over the past decade webpages have grown an order of magnitude in computational complexity. Modern webpages provide rich and complex interactive behaviors for differentiated user experiences. Many of these new capabilities are delivered via JavaScript embedded within these webpages. In this work, we evaluate the potential benefits of persistently caching compiled JavaScript code in the Mozilla JavaScript engine within the Firefox browser. We cache compiled byte codes and generated native code across browser sessions to eliminate the redundant compilation work that occurs when webpages are revisited. Current browsers maintain persistent caches of code and images received over the network. Current browsers also maintain inmemory “caches” of recently accessed webpages (WebKit’s Page Cache or Firefox’s “Back-Forward” cache) that do not persist across browser sessions. This paper assesses the performance improvement and power reduction opportunities that arise from caching compiled JavaScript across browser sessions. We show that persistent caching can achieve an average of 91% reduction in compilation time for top webpages and 78% for HTML5 webpages. It also reduces energy consumption by an average of 23% as compared to the baseline.

S. Campanoni, T. Jones, G. Holloway, V. J. Reddi, G. - Y. Wei, and D. Brooks, “HELIX: Automatic Parallelization of Irregular Programs for Chip Multiprocessing,” in Proceedings of the Tenth International Symposium on Code Generation and Optimization, 2012, pp. 84–93. Publisher's VersionAbstract

We describe and evaluate HELIX, a new technique for automatic loop parallelization that assigns successive iterations of a loop to separate threads. We show that the inter-thread communication costs forced by loop-carried data dependences can be mitigated by code optimization, by using an effective heuristic for selecting loops to parallelize, and by using helper threads to prefetch synchronization signals. We have implemented HELIX as part of an optimizing compiler framework that automatically selects and parallelizes loops from general sequential programs. The framework uses an analytical model of loop speedups, combined with profile data, to choose loops to parallelize. On a six-core Intel✌R Core❚▼ i7-980X, HELIX achieves speedups averaging 2.25✂, with a maximum of 4.12✂, for thirteen C benchmarks from SPEC CPU2000.

V. J. Reddi, D. Connors, R. Cohn, and M. D. Smith, “Persistent Code Caching: Exploiting Code Reuse Across Executions and Applications,” in Code Generation and Optimization, 2007. CGO'07. International Symposium on, 2007, pp. 74–88. Publisher's VersionAbstract

Run-time compilation systems are challenged with the task of translating a program’s instruction stream while maintaining low overhead. While software managed code caches are utilized to amortize translation costs, they are ineffective for programs with short run times or large amounts of cold code. Such program characteristics are prevalent in real-life computing environments, ranging from Graphical User Interface (GUI) programs to large-scale applications such as database management systems. Persistent code caching addresses these issues. It is described and evaluated in an industry-strength dynamic binary instrumentation system – Pin. The proposed approach improves the intra-execution model of code reuse by storing and reusing translations across executions, thereby achieving inter-execution persistence. Dynamically linked programs leverage inter-application persistence by using persistent translations of library code generated by other programs. New translations discovered across executions are automatically accumulated into the persistent code caches, thereby improving performance over time. Inter-execution persistence improves the performance of GUI applications by nearly 90%, while inter-application persistence achieves a 59% improvement. In more specialized uses, the SPEC2K INT benchmark suite experiences a 26% improvement under dynamic binary instrumentation. Finally, a 400% speedup is achieved in translating the Oracle database in a regression testing environment.

Q. Wu, et al., “A Dynamic Compilation Framework for Controlling Microprocessor Energy and Performance,” in Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture, 2005, pp. 271–282. Publisher's VersionAbstract

Dynamic voltage and frequency scaling (DVFS) is an effective technique for controlling microprocessor energy and performance. Existing DVFS techniques are primarily based on hardware, OS timeinterrupts, or static-compiler techniques. However, substantially greater gains can be realized when control opportunities are also explored in a dynamic compilation environment. There are several advantages to deploying DVFS and managing energy/performance tradeoffs through the use of a dynamic compiler. Most importantly, dynamic compiler driven DVFS is fine-grained, code-aware, and adaptive to the current microarchitecture environment. This paper presents a design framework of the run-time DVFS optimizer in a general dynamic compilation system. A prototype of the DVFS optimizer isimplemented and integrated into an industrialstrength dynamic compilation system. The obtained optimization system is deployed in a real hardware platform that directly measures CPU voltage and current for accurate power and energy readings. Experimental results, based on physical measurements for over 40 SPEC or Olden benchmarks, show that significant energy savings are achieved with little performance degradation. SPEC2K FP benchmarks benefit with energy savings of up to 70% (with 0.5% performance loss). In addition, SPEC2K INT show up to 44% energy savings (with 5% performance loss), SPEC95 FP save up to 64% (with 4.9% performance loss), and Olden save up to 61% (with 4.5% performance loss). On average, the technique leads to an energy delay product (EDP) improvement that is 3X-5X better than static voltage scaling, and is more than 2X (22% vs. 9%) better than the reported DVFS results of prior static compiler work. While the proposed technique is an effective method for microprocessor voltage and frequency control, the design framework and methodology described in this paper have broader potential to address other energy and power issues such as di/dt and thermal control.

V. J. Reddi, D. Connors, and R. S. Cohn, “Persistence in Dynamic Code Transformation Systems,” ACM SIGARCH Computer Architecture News, vol. 33, no. 5. ACM, pp. 69–74, 2005. Publisher's VersionAbstract

Dynamic code transformation systems (DCTS) can broadly be grouped into three distinct categories: optimization, translation and instrumentation. All of these face the critical challenge of minimizing the overhead incurred during transformation since their execution is interleaved with the execution of the application itself. The common DCTS tasks incurring overhead are the identification of frequently executed code sequences, costly analysis of program information, and run-time creation (writing) of new code sequences. The cost of such work is amortized by the repeated execution of the transformed code. However, as these steps are applied to all general code regions (regardless of their execution frequency and characteristics), there is substantial overhead that impacts the application’s performance. As such, it is challenging to effectively deploy dynamic transformation under fixed performance constraints. This paper explores a technique for eliminating the overhead incurred by exploiting persistent application execution characteristics that are shared across different application invocations. This technique is implemented and evaluated in Pin, a dynamic instrumentation engine. This version of Pin is referred to as Persistent Pin (PPin). Initial PPin experimental results indicate that using information from prior runs can reduce dynamic instrumentation overhead of SPEC applications by as much as 25% and over 90% for everyday applications like web browsers, display rendering systems, and spreadsheet programs.

T. Moseley, et al., “Dynamic Run-time Architecture Techniques For Enabling Continuous Optimization,” in Proceedings of the 2nd conference on Computing frontiers, 2005, pp. 211–220. Publisher's VersionAbstract

Future computer systems will integrate tens of multithreaded processor cores on a single chip die, resulting in hundreds of concurrent program threads sharing system resources. These designs will be the cornerstone of improving throughput in high-performance computing and server environments. However, to date, appropriate systems software (operating system, run-time system, and compiler) technologies for these emerging machines have not been adequately explored. Future processors will require sophisticated hardware monitoring units to continuously feed back resource utilization information to allow the operating system to make optimal thread co-scheduling decisions and also to software that continuously optimizes the program itself. Nevertheless, in order to continually and automatically adapt systems resources to program behaviors and application needs, specific run-time information must be collected to adequately enable dynamic code optimization and operating system scheduling. Generally, run-time optimization is limited by the time required to collect profiles, the time required to perform optimization, and the inherent benefits of any optimization or decisions. Initial techniques for effectively utilizing runtime information for dynamic optimization and informed thread scheduling in future multithreaded architectures are presented.