Abstract

1. Introduction

<aside> ๐Ÿ‘ฉ๐Ÿผโ€๐Ÿซ In this paper we focus on integer quantization for neural network inference, where networks are modified to use integer weights and activations so that integer math pipelines can be used for many operations

</aside>


2. Related works

3. Quantization Fundamentals

3-1. Range Mapping

Let $[\beta, \alpha]$ be the range of representable real values chosen for quantization and $b$ be the bit-width of the signed integer representation. Uniform quantization transforms the input value $x \in [\beta, \alpha]$ to lie within $[-2^{b-1}, 2^{b-1} - 1]$, where inputs outside the range are clipped to the nearest bound.
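
For example, with $b = 8$ (int8) the representable integer range is $[-2^{7}, 2^{7} - 1] = [-128, 127]$.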

3-1-1. Affine Quantization

Affine transform function: $f(x) = s \cdot x + z$

$x_q = \text{quantize}(x, b, s, z) = \text{clip}(\text{round}(s \cdot x + z), -2^{b-1}, 2^{b-1} - 1)$

$\hat{x} = \text{dequantize}(x_q, s, z) = \frac{1}{s}(x_q - z)$
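
A minimal sketch of these two mappings, assuming NumPy and a signed $b$-bit integer target; the function names and the example scale/zero-point values are illustrative, not taken from the paper:

```python
import numpy as np

def quantize(x, b, s, z):
    """Affine quantization: map real x onto a signed b-bit integer grid.

    s is the scale, z the zero-point; values outside the representable
    range [-2^(b-1), 2^(b-1) - 1] are clipped to the nearest bound.
    """
    qmin, qmax = -2 ** (b - 1), 2 ** (b - 1) - 1
    return np.clip(np.round(s * x + z), qmin, qmax).astype(np.int32)

def dequantize(x_q, s, z):
    """Approximate inverse of quantize: recover a real value from x_q."""
    return (x_q - z) / s

# Illustrative usage: quantize a few reals to int8 (b = 8).
x = np.array([-1.0, 0.0, 0.5, 1.2])
s, z = 100.0, 0.0            # assumed scale and zero-point for this example
x_q = quantize(x, b=8, s=s, z=z)
x_hat = dequantize(x_q, s, z)
print(x_q)    # [-100    0   50  120]
print(x_hat)  # [-1.   0.   0.5  1.2]
```

With this choice of $s$ and $z$ the round trip is exact; in general $\hat{x}$ only approximates $x$, and the gap is the quantization error.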