PrecisionBatching: Bitserial Decomposition for Efficient Neural Network Inference on GPUs


M. Lam, Z. Yedidia, C. Banbury, and V. J. Reddi, “PrecisionBatching: Bitserial Decomposition for Efficient Neural Network Inference on GPUs,” in The 30th International Conference on Parallel Architectures and Compilation Techniques, presented at PACT 2021, virtual, September 26-29, 2021.
PDF Version4.4 MB


We present PrecisionBatching, a quantized inference algorithm for speeding up neural network inference on traditional hardware platforms at low bitwidths. PrecisionBatching is based on the following insights: 1) neural network inference with low batch sizes on traditional hardware architectures (e.g: GPUs) is memory bound, 2) activation precision is critical to improving quantized model quality and 3) matrix-vector multiplication can be decomposed into binary matrix-matrix multiplications, enabling quantized inference with higher precision activations at the cost of more arithmetic operations. Combining these three insights, PrecisionBatching enables inference at extreme quantization levels (< 8 bits) by shifting a memory bound problem to a compute bound problem and achieves higher compute efficiency and runtime speedup at fixed accuracy thresholds against standard quantized inference methods. Across a variety of applications (MNIST, language modeling, natural language inference, reinforcement learning) and neural network architectures (fully connected, RNN, LSTM), PrecisionBatching yields end-to-end speedups of over 8x on a GPU within a < 1-5% error margin of the full precision baseline, outperforming traditional 8-bit quantized inference by over 1.5x-2x at the same error tolerance.

IEEE Version

Last updated on 12/16/2021