Big data, specifically data analytics, is responsible for driving many of consumers’ most common online activities, including shopping, web searches, and interactions on social media. In this paper, we present the first (micro)architectural investigation of a new industry-standard, open source benchmark suite directed at big data analytics applications—TPCx-BB (BigBench). Where previous work has usually studied benchmarks which oversimplify big data analytics, our study of BigBench reveals that there is immense diversity among applications, owing to their varied data types, computational paradigms, and analyses. In our analysis, we also make an important discovery generally restricting processor performance in big data. Contrary to conventional wisdom that big data applications lend themselves naturally to parallelism, we discover that they lack sufficient thread-level parallelism (TLP) to fully utilize all cores. In other words, they are constrained by Amdahl’s law. While TLP may be limited by various factors, ultimately we find that single-thread performance is as relevant in scale-out workloads as it is in more classical applications. To this end we present core packing: a software and hardware solution that could provide as much as 20% execution speedup for some big data analytics applications.
We have already entered the heterogeneous computing era when computing systems harness computational horsepower from not only general purpose CPUs but also other processors such as graphics processing unit (GPU) and hardware accelerators. Performance, power-efficiency, and reliability are three most critical aspects of processors, and there usually exists a tradeoff among them. Accelerators are heavily optimized for performance and power-efficiency rather than reliability. However, it is equally important to ensure overall reliability while introducing accelerators to computing systems. In this paper, we focus on optimizing accelerator’s reliability without adopting the “whac-a-mole” paradigm which develops accelerator-specific reliability optimization. Instead, we advocate maintaining the reliability at the system level, and propose the design paradigm called “asymmetric resilience,” whose principle is to develop the reliable heterogeneous system centering around the CPU architecture. This generic design paradigm eases accelerators away from reliability optimization. We present the design principles and practices for the heterogeneous system that adopt such design paradigm. Following the principles of asymmetric resilience, we demonstrate how to use CPU architecture to handle GPU execution errors, which allows GPU focus on typical case operation for better energy efficiency. We explore the design space and show that the average overhead is only 1% for error-free execution and the overhead increases linearly with error probability.
our contributions are as follows: • We investigate DSA and characterize the effectiveness of category-awareness. • We conduct a limit study to understand the benefit of applying approximation in a perframe manner with category-awareness (category-aware dynamic DSA). • We present the challenges of harnessing DSA and a proof-of-concept runtime.
Voltage stacking (VS) fundamentally improves power delivery efficiency (PDE) by series-stacking multiple voltage domains to eliminate explicit step-down voltage conversion and reduce energy loss along the power delivery path. However, it suffers from aggravated supply noise, preventing its adoption in mainstream computing systems. In this paper, we investigate a practical approach to enabling efficient and reliable power delivery in voltage-stacked manycore systems that can ensure worst-case supply noise reliability without excessive costly over-design. We start by developing an analytical model to capture the essential noise behaviors in VS. It allows us to identify dominant noise contributor and derive the worst-case conditions. With this in-depth understanding, we propose a hybrid voltage regulation solution to effectively mitigate noise with worst-case guarantees. When evaluated with real-world benchmarks, our solution can achieve 93.8% power delivery efficiency, an improvement of 13.9% over the conventional baseline.
Unmanned Aerial Vehicles (UAVs) are getting closer to becoming ubiquitous in everyday life. Among them, Micro Aerial Vehicles (MAVs) have seen an outburst of attention recently, specifically in the area with a demand for autonomy. A key challenge standing in the way of making MAVs autonomous is that researchers lack the comprehensive understanding of how performance, power, and computational bottlenecks affect MAV applications. MAVs must operate under a stringent power budget, which severely limits their flight endurance time. As such, there is a need for new tools, benchmarks, and methodologies to foster the systematic development of autonomous MAVs. In this paper, we introduce the “MAVBench” framework which consists of a closed-loop simulator and an end-to-end application benchmark suite. A closed-loop simulation platform is needed to probe and understand the intra-system (application data flow) and inter-system (system and environment) interactions in MAV applications to pinpoint bottlenecks and identify opportunities for hardware and software co-design and optimization. In addition to the simulator, MAVBench provides a benchmark suite, the first of its kind, consisting of a variety of MAV applications designed to enable computer architects to perform characterization and develop future aerial computing systems. Using our open source, end-to-end experimental platform, we uncover a hidden, and thus far unexpected compute to total system energy relationship in MAVs. Furthermore, we explore the role of compute by presenting three case studies targeting performance, energy and reliability. These studies confirm that an efficient system design can improve MAV’s battery consumption by up to 1.8X.
Mobile computing has grown drastically over the past decade. Despite the rapid pace of advancements, mobile device understanding, benchmarking, and evaluation are still in their infancies, both in industry and academia. This article presents an industry perspective on the challenges facing mobile computer architecture, specifically involving mobile workloads, benchmarking, and experimental methodology, with the hope of fostering new research within the community to address pending problems. These challenges pose a threat to the systematic development of future mobile systems, which, if addressed, can elevate the entire mobile ecosystem to the next level.
Mobile devices have come a long way from the first portable cellular phone developed by Motorola in 1973. Most modern smartphones are good enough to replace desktop computers. A smartphone today has enough computing power to be on par with the fastest supercomputers from the 1990s.
For instance, the Qualcomm Adreno 540 GPU found in the latest smartphones has a peak compute capability of more than 500 Gflops, putting it in competition with supercomputers that were on the TOP500 list in the early to mid-1990s. Mobile computing has experienced an unparalleled level of growth over the past decade. At the time of this writing, there are more than 2 billion mobile devices in the world.1 But perhaps even more importantly, mobile phones are showing no signs of slowing in uptake. In fact, smartphone adoption rates are on the rise. The number of devices is rising as mobile device penetration increases in markets like India and China. It is anticipated that the number of mobile subscribers will grow past 6 billion in the coming years.2 As Figure 1 shows, while the Western European and North American markets are reaching saturation, the vast majority of growth is coming from countries in Asia. Given that only 35 percent of the world’s population has thus far adopted mobile technology, there is still significant room for growth and innovation.
Unmanned Aerial Vehicles (UAVs) are getting closer to becoming ubiquitous in everyday life. Although the researchers in the robotic domain have made rapid progress in recent years, hardware and software architects in the computer architecture community lack the comprehensive understanding of how performance, power, and computational bottlenecks affect UAV applications. Such an understanding enables system architects to design microchips tailored for aerial agents. This paper is an attempt by computer architects to initiate the discussion between the two academic domains by investigating the underlying compute systems’ impact on aerial robotic applications. To do so, we identify performance and energy constraints and examine the impact of various compute knobs such as processor cores and frequency on these constraints. Our experiment show that such knobs allow for up to 5X speed up for a wide class of applications.