Processor design has evolved over the past decades from single-core CPUs to multicore CPUs, and more recently to heterogeneous processors that integrate CPUs and GPUs. Despite all this innovation, demand for computational horsepower continues to surge, creating a growing need for performance at the system level. With the end of Dennard scaling, however, conventional CPU and GPU design enhancements alone can no longer sustain performance improvements that match this demand. Therefore, specialized execution units that are orders of magnitude more efficient in power and performance for specific tasks are on the rise.
System-on-chip (SoC) architectures integrate multiple specialized execution units into a holistic processor architecture to deliver high performance at low energy cost. For instance, the A12 die photo above is annotated with white boxes showing the many hardware accelerator units in the iPhone XS processor. Outside the modest area consumed by the CPU and GPU, the rest of the chip is dedicated to specialized fixed-function units such as the Neural Engine (used for machine learning tasks). Mobile SoCs are harbingers of the future, as they already incorporate many specialized execution units.
We study how the execution of these various accelerators must be coordinated to ensure a balanced execution profile, in what we call "accelerator-level parallelism." To this end, we develop performance models for accelerator units, investigate coordination strategies across accelerators that often run concurrently and stress shared resources, and in some cases design custom accelerator solutions.
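To make the idea of a per-accelerator performance model concrete, the sketch below applies a simple roofline-style bound to each unit of a hypothetical SoC and estimates end-to-end time when tasks run concurrently on different units. All names and numbers (`neural-engine`, the peak-throughput and bandwidth figures) are illustrative assumptions, not measurements of any real chip, and this is not the specific model developed in our work.

```python
# Hypothetical roofline-style model for an SoC with multiple accelerators.
# Figures below are illustrative, not measured from real hardware.
from dataclasses import dataclass

@dataclass
class Accelerator:
    name: str
    peak_ops: float  # peak compute throughput, giga-ops/s
    mem_bw: float    # sustained memory bandwidth, GB/s

    def attainable(self, intensity: float) -> float:
        """Attainable throughput (giga-ops/s) at a given operational
        intensity (ops per byte): compute-bound or bandwidth-bound."""
        return min(self.peak_ops, self.mem_bw * intensity)

def soc_time(work):
    """Estimate execution time when tasks run concurrently, one per unit.
    work: list of (accelerator, giga-ops, operational intensity).
    Overall time is set by the slowest unit -- the balance point that
    accelerator-level parallelism tries to manage."""
    return max(ops / acc.attainable(oi) for acc, ops, oi in work)

npu = Accelerator("neural-engine", peak_ops=5000, mem_bw=34)
gpu = Accelerator("gpu", peak_ops=500, mem_bw=34)

# A memory-bound GPU task (2 ops/byte) dominates total time even though
# the NPU task performs 10x more work, illustrating an imbalanced profile.
t = soc_time([(npu, 1000, 100), (gpu, 100, 2)])
```

Such a model makes shared-resource pressure visible: here both units share the same memory bandwidth figure, so lowering a task's operational intensity can shift it from compute-bound to bandwidth-bound and change which accelerator limits the system.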