INT8 quantisation, in numbers — and why INT16 is the boring choice

What "INT8-quantised inference" actually means once you do the arithmetic, why dropping from FP32 to INT8 is a cliff and dropping to INT16 isn't, and why every interesting question about putting an ML model on real silicon ends up here.

The question — "what does INT8-quantised inference actually mean, and why is it different from INT16?" — comes up almost every time someone gets serious about putting a model onto real hardware. The answers tend to be either three lines of marketing ("it's smaller and faster") or three hours of slides on clipping distributions and KL-divergence calibration. Neither is what you need at the point you're asking.

What you need is the arithmetic. Specifically: take one weight, push it through the INT8 round-trip, see what happens to it; do the same with INT16; notice that the two answers are not from the same conversation.

Quantisation in two sentences

Quantisation, in inference, is the process of representing a tensor of floating-point numbers as a tensor of integers plus a small amount of side information — a scale, sometimes a zero-point — that tells you how to map back to the float values when you need to.

You do this for three reasons. Lower memory footprint: an INT8 tensor is a quarter the size of FP32 and half the size of INT16. Faster compute: on most modern ML accelerators, INT8 matmul throughput is a multiple of the FP16 or FP32 throughput on the same silicon. Lower energy: an integer multiply-accumulate generally costs less energy than a floating-point one of comparable width.

The encode-and-decode round-trip looks like this:

q  = clamp(round(x / s) + z, qmin, qmax)   // quantise: float x to integer q, saturating at the type's limits (-128..127 for INT8)
x' = (q - z) * s                           // dequantise: back to a float x' that approximates x

s is the scale factor; z is the zero-point (zero in symmetric quantisation; non-zero in asymmetric). Both are chosen at calibration time per tensor, or per channel of a tensor, depending on how aggressive you're being. That's the whole mechanism. The rest is engineering.
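A minimal Python sketch of that round-trip, assuming a signed integer grid and saturation at the type's limits. The names `quantise` and `dequantise` and the `bits` parameter are illustrative, not taken from any particular library, and the input values are arbitrary.

```python
def quantise(x: float, scale: float, zero_point: int = 0, bits: int = 8) -> int:
    """Map float x to a signed integer of the given width."""
    qmin, qmax = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    q = round(x / scale) + zero_point      # note: Python rounds exact halves to even
    return max(qmin, min(qmax, q))         # out-of-range values saturate


def dequantise(q: int, scale: float, zero_point: int = 0) -> float:
    """Map the stored integer back to an approximate float."""
    return (q - zero_point) * scale


# Symmetric case (z = 0): range [-0.5, 0.5] on the INT8 grid.
s_sym = 0.5 / 127
q_sym = quantise(0.3, s_sym)
x_sym = dequantise(q_sym, s_sym)

# Asymmetric case: range [0.0, 1.0] mapped onto [-128, 127].
s_asym = 1.0 / 255
z_asym = -128
q_asym = quantise(0.81, s_asym, z_asym)
x_asym = dequantise(q_asym, s_asym, z_asym)

print(q_sym, round(x_sym, 5))
print(q_asym, round(x_asym, 5))
```

In both cases the recovered value is within half a step of the input, which is the best any rounding scheme on this grid can do.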

One weight, two precisions

Take a weight value with no special significance: w = 0.0732.

Suppose calibration has told you that this tensor's per-channel maximum absolute value is 0.5, and you're using symmetric quantisation. Then:

INT8 (signed, 256 values from -128 to 127):

s = 0.5 / 127 ≈ 0.003937
q = round(0.0732 / 0.003937) = round(18.59) = 19
stored as int8: 19
dequantised at inference: 19 × 0.003937 ≈ 0.07480
error: 0.07480 − 0.0732 = 0.00160   (about 2.2% of the value)

INT16 (signed, 65,536 values from -32,768 to 32,767):

s = 0.5 / 32767 ≈ 0.00001526
q = round(0.0732 / 0.00001526) = round(4797.09) = 4797
stored as int16: 4797
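dequantised at inference: 4797 × (0.5 / 32767) ≈ 0.0731986
error: 0.0731986 − 0.0732 ≈ −0.0000014   (about 0.002% of the value)

The INT8 step is about 258 times the INT16 step (32767 / 127), and since the worst-case rounding error is half a step, the worst-case error scales by the same factor. On this particular weight the gap is wider still, roughly three orders of magnitude, because 0.0732 happens to land close to an INT16 grid point. The step ratio, roughly 256×, is the entire story.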
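The two round-trips can be reproduced at full floating-point precision with a few lines of Python. This is a sketch under the same assumptions as the worked example (symmetric quantisation, maximum absolute value 0.5); the helper name `round_trip` is illustrative.

```python
def round_trip(w: float, max_abs: float, bits: int):
    """Symmetric quantise-then-dequantise; returns (scale, q, w_hat, error)."""
    qmax = (1 << (bits - 1)) - 1           # 127 for INT8, 32767 for INT16
    scale = max_abs / qmax
    q = round(w / scale)
    w_hat = q * scale
    return scale, q, w_hat, w_hat - w


w, max_abs = 0.0732, 0.5
s8, q8, w8, e8 = round_trip(w, max_abs, 8)
s16, q16, w16, e16 = round_trip(w, max_abs, 16)

print(f"INT8 : s={s8:.6f} q={q8} w_hat={w8:.5f} err={e8:+.2e}")
print(f"INT16: s={s16:.8f} q={q16} w_hat={w16:.7f} err={e16:+.2e}")
print(f"step ratio  = {s8 / s16:.1f}")       # fixed by the two bit-widths
print(f"error ratio = {abs(e8 / e16):.0f}")  # depends on where w lands on each grid
```

The step ratio is a property of the two integer types. The error ratio for any single weight varies with how close that weight happens to sit to a grid point, so it is the step ratio that generalises.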

What 256× actually does to a model

A neural network doesn't care about any individual weight. It cares about the integrated effect of millions of weights and activations across many matmuls. So the question isn't "how big is the error on w = 0.0732?" but "how much do these errors compound when summed through the layers?"

For most networks the answer is: it depends on where the error lands.

In the bulk of a convolutional network, the natural noise floor of the activations is a few per cent. Quantisation error of around two per cent per weight, summed across thousands of MACs per output, mostly averages out — the rounding errors are roughly uncorrelated and behave like a small additive noise. INT8 typically costs 0.5–1.5% in top-1 accuracy on standard image-classification benchmarks. Not free; not catastrophic.
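The averaging-out claim can be checked on synthetic data. The sketch below assumes uniformly distributed weights and non-negative activations, which is a simplification and not a model of any real network; `N`, the seed, and the distributions are all illustrative choices.

```python
import random

random.seed(0)                      # fixed seed so the run is repeatable
N = 4096                            # MACs feeding one output (assumed)
MAX_ABS = 0.5
SCALE = MAX_ABS / 127               # symmetric INT8 step

weights = [random.uniform(-MAX_ABS, MAX_ABS) for _ in range(N)]
acts = [random.uniform(0.0, 1.0) for _ in range(N)]

# Per-weight rounding errors after the INT8 round-trip.
errors = [round(w / SCALE) * SCALE - w for w in weights]

# The error on the accumulated output is the dot product of errors and activations.
output_error = sum(e * a for e, a in zip(errors, acts))

# If every error had the same sign and the maximum size, the output error would be:
worst_case = sum(SCALE / 2 * a for a in acts)

print(f"largest single-weight error  : {max(abs(e) for e in errors):.5f}")
print(f"accumulated output error     : {abs(output_error):.5f}")
print(f"worst case (fully correlated): {worst_case:.5f}")
```

With uncorrelated errors the accumulated error is expected to grow roughly with the square root of N, not with N itself, so it should come out far below the fully correlated worst case.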

In transformer attention, the picture is worse. The softmax in attention has a tail of small values that matter: those are the long-range tokens the model is attending to. Running the softmax block in INT8 destroys the tail. Small probabilities round to zero, the distribution loses its fine structure, and the model loses its ability to attend specifically. The standard workaround is to run just the softmax block at INT16 (or FP16), while the bulk weight matmuls stay in INT8.
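A small sketch of the tail disappearing. It quantises the probabilities coming out of a softmax, since that is where the rounding to zero is easiest to see, and assumes symmetric quantisation over the range [0, 1]. The logits are made up for illustration.

```python
import math


def softmax(logits):
    m = max(logits)                               # subtract the max for stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def round_trip(p: float, bits: int) -> float:
    """Symmetric quantise/dequantise of a probability with range [0, 1]."""
    qmax = (1 << (bits - 1)) - 1
    scale = 1.0 / qmax
    return round(p / scale) * scale


logits = [8.0, 2.0, 1.0, 0.0, -1.0]   # one dominant token and a tail of weak ones
probs = softmax(logits)

probs_int8 = [round_trip(p, 8) for p in probs]
probs_int16 = [round_trip(p, 16) for p in probs]

survivors_int8 = sum(1 for p in probs_int8 if p > 0)
survivors_int16 = sum(1 for p in probs_int16 if p > 0)

print("float :", [f"{p:.6f}" for p in probs])
print("INT8  :", [f"{p:.6f}" for p in probs_int8], "non-zero:", survivors_int8)
print("INT16 :", [f"{p:.6f}" for p in probs_int16], "non-zero:", survivors_int16)
```

On this input only the dominant token survives at 8 bits, while all five entries remain non-zero at 16 bits.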

In the embedding layer of a language model, INT8 is brutal. Embeddings have long-tailed distributions; quantising them at 8 bits flattens out the rare tokens that are usually the interesting ones. Embeddings are commonly kept in INT16, or in a separate higher-precision codebook even when the rest of the model is INT8.

In the output classifier head, INT8 is usually fine; differences between top-k logits are typically large enough to survive the rounding.

INT16, by contrast, mostly behaves as if the model were still floating-point. The quantisation error is below the natural noise floor of the network at every layer. There are no published transformer benchmarks where INT16 weight quantisation has loss numbers worth a blog post about. It just works.

Why INT8 is the cliff and INT16 isn't

Here is the mental model.

Picture the dynamic range of one of your activation tensors as a number line from -A to +A. INT8 carves that line into 256 buckets. INT16 carves it into 65,536. The bucket width — the size of one quantisation step — determines whether your quantised value preserves the signal or loses it.

For a tensor with A = 5, INT8's bucket width is 5 / 127 ≈ 0.039. If the meaningful variation in your activations is finer than that (say, the difference between two attention scores is 0.01), INT8 has thrown that information away. Two such activations will usually quantise to the same bucket and are then indistinguishable.

INT16's bucket width on the same range is 5 / 32767 ≈ 0.00015. Differences down to roughly four decimal places are preserved. The natural variation in a typical model's activations is several orders of magnitude above that, so INT16 captures it transparently.
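The bucket arithmetic for A = 5 can be sketched as follows. The two scores, 2.00 and 2.01, are arbitrary values 0.01 apart chosen for illustration, and `bucket` is a hypothetical helper name.

```python
def bucket(x: float, max_abs: float, bits: int) -> int:
    """Index of the quantisation bucket x falls into (symmetric, signed)."""
    qmax = (1 << (bits - 1)) - 1
    step = max_abs / qmax
    return round(x / step)


A = 5.0
score_a, score_b = 2.00, 2.01        # two activations 0.01 apart

int8_a, int8_b = bucket(score_a, A, 8), bucket(score_b, A, 8)
int16_a, int16_b = bucket(score_a, A, 16), bucket(score_b, A, 16)

print(f"INT8 : {int8_a} vs {int8_b}")      # same bucket: the difference is gone
print(f"INT16: {int16_a} vs {int16_b}")    # dozens of buckets apart
```

Two values 0.01 apart can still straddle an INT8 bucket boundary and land in adjacent buckets, but with a step of about 0.039 that happens only around a quarter of the time.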

This is why INT8 needs calibration and INT16 doesn't. The whole INT8 deployment workflow exists because you have to be careful about where you put your 256 buckets so the meaningful range of your tensor lands inside them. Pick a range that's too wide and you waste buckets on values that almost never occur; pick a range that's too narrow and the tails clip to ±127 and you lose dynamic range entirely.
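The trade-off between a range that is too wide and one that is too narrow can be seen on a toy tensor. The data below is hypothetical: a dense bulk of values in [-0.1, 0.1] plus one outlier at 5.0. The helper names are illustrative.

```python
def round_trip(x: float, max_abs: float, bits: int = 8) -> float:
    """Symmetric quantise/dequantise with saturation at the range limits."""
    qmax = (1 << (bits - 1)) - 1
    scale = max_abs / qmax
    q = max(-qmax, min(qmax, round(x / scale)))
    return q * scale


# Hypothetical tensor: a dense bulk in [-0.1, 0.1] plus a single outlier.
bulk = [i / 1000 for i in range(-100, 101)]
outlier = 5.0


def report(max_abs: float):
    """Mean absolute error on the bulk, and absolute error on the outlier."""
    bulk_err = sum(abs(round_trip(x, max_abs) - x) for x in bulk) / len(bulk)
    outlier_err = abs(round_trip(outlier, max_abs) - outlier)
    return bulk_err, outlier_err


wide_bulk, wide_outlier = report(max_abs=5.0)       # range set by the outlier
narrow_bulk, narrow_outlier = report(max_abs=0.1)   # range set by the bulk

print(f"wide   range: bulk err {wide_bulk:.5f}, outlier err {wide_outlier:.5f}")
print(f"narrow range: bulk err {narrow_bulk:.5f}, outlier err {narrow_outlier:.5f}")
```

The wide range represents the outlier well but spends only a handful of buckets on the bulk. The narrow range resolves the bulk finely and saturates the outlier. Calibration is the business of choosing between those two failure modes.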

INT16 doesn't have this problem because it has so many more buckets than you need that calibration is essentially a no-op — even a sloppy choice of range still leaves you with thousands of buckets covering the meaningful part of the distribution.

The hardware reality

The other half of the answer to "why is INT8 different from INT16" is that the silicon doesn't treat them as the same kind of thing.

INT8 is the prestige fast-path on most modern ML accelerators. Nvidia Tensor Cores have had dedicated INT8 MMA instructions since Turing, with INT4 alongside on Turing and Ampere. Google TPUs have a native INT8 matrix unit. Apple's Neural Engine, Qualcomm's Hexagon, ARM's Ethos NPUs: all of them treat INT8 as the dominant inference type and the one the hardware engineers designed around. INT8 matmul throughput is typically 2× the FP16 throughput and at least 4× the FP32 throughput on the same silicon. On Ampere and later Tensor Cores, structured sparsity can add up to a further 2× on top, for INT8 as for the floating-point types.

INT16 is not a first-class type on most ML accelerators. It's supported, but the hardware path is usually slower than INT8 and not significantly different from FP16. Tensor Cores have no INT16 MMA, so INT16 work falls back to a slower path. TPUs accumulate INT8 matmuls into INT32, not INT16. Mobile NPUs do INT16, but most of them are tuned for INT8 first and INT16 second.

So when an engineer chooses between INT8 and INT16 they aren't choosing between two integer types with different bit-widths. They're choosing between the type the hardware was built to run and a type the hardware can run but isn't designed for. INT16 is best understood as a safety net — there when you can't make INT8 work, with a real performance cost.

Where INT16 still matters inside an INT8 pipeline

Even when you've deployed INT8 throughout, INT16 doesn't go away — it lives inside.

Accumulators. The matmul of INT8×INT8 produces values that won't fit in INT8. A 64-wide dot product of two INT8 vectors can hit 64 × 127 × 127 ≈ 1 million, far past the INT8 range and well past INT16's 32,767 too, which is why accumulators in INT8 matmul are typically INT32. The path back to INT8 (apply bias, multiply by the output scale, requantise) often goes through an INT16 stage. INT16 earns its keep as the intermediate between INT8 storage and INT32 accumulation.
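The accumulator arithmetic and the requantisation step can be sketched as below. The scales passed to `requantise` are hypothetical, and the function uses a floating-point multiplier for clarity; production kernels commonly replace it with an integer fixed-point multiply and shift.

```python
def fits(value: int, bits: int) -> bool:
    """True if value is representable as a signed integer of this width."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1


K = 64                                  # dot-product width
worst_case = K * 127 * 127              # every product at its largest magnitude

print(worst_case)                       # 1032256
print(fits(worst_case, 8), fits(worst_case, 16), fits(worst_case, 32))


def requantise(acc: int, s_in: float, s_w: float, s_out: float) -> int:
    """Rescale an INT32 accumulator to INT8 using the three tensor scales."""
    real_value = acc * s_in * s_w        # the real number the accumulator represents
    q = round(real_value / s_out)        # re-express it on the output grid
    return max(-128, min(127, q))        # saturate to the INT8 range


# Hypothetical scales, chosen only to make the example concrete.
out_q = requantise(acc=50_000, s_in=0.02, s_w=0.004, s_out=0.05)
print(out_q)
```

The worst case overflows both INT8 and INT16 and fits comfortably in INT32, which leaves headroom for much wider dot products than 64.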

Sensitive layers. The layers where INT8 doesn't survive — attention softmax, sometimes layer norm — are commonly held at INT16 (or FP16) while everything else stays INT8. The model becomes mostly INT8 with INT16 islands. This is the production reality for most deployed transformer-based models in 2026.

Calibration statistics. When you're collecting the statistics that determine your INT8 scale factors, the calibration values themselves are kept at higher precision. INT16 is a reasonable choice; FP32 is more common.

What it looks like in practice

When someone tells you they have "deployed a quantised model", three questions get you the actual picture:

Which weights and which activations are INT8, and which are INT16 or FP16? Tells you whether the model is truly INT8-throughout or a hybrid with INT16 escape hatches. Most real deployments are hybrids.
Was the quantisation post-training (PTQ) or done with quantisation-aware training (QAT)? PTQ is fast to apply but has a lower accuracy ceiling. QAT requires retraining with simulated quantisation in the forward pass and is what you reach for when the model is accuracy-sensitive.
What did you calibrate on? The choice of calibration data determines where the INT8 buckets land. A model calibrated on the training distribution will behave badly if production data drifts; this is the most common reason a quantised model that benchmarks well in the lab degrades in the field.

The honest answer to "what does INT8-quantised inference actually mean?" is the sum of those three answers plus the arithmetic of how many bits you have to represent each value. INT16 is what you'd reach for if all of that sounded like too much engineering. INT8 is what you reach for when you've decided the engineering is worth doing.

The boring choice

INT16 is boring because it works. It halves your memory, costs you almost no accuracy, and you can ship it without writing a calibration script. The reason the literature is full of INT8 papers and not INT16 papers is that INT16 doesn't generate publishable problems.

INT8 is interesting because it sits exactly at the edge of where quantisation error stops being noise and starts being signal loss. Below INT8 — INT4, INT2 — is the same edge an order of magnitude further out, and the engineering required is different again. Above INT8 — INT16 — is the boring side.

The reason to care about INT8 specifically is that it's the largest precision drop you can take where the hardware is designed to reward you for taking it. INT16 is half the memory of FP32 but barely faster. INT8 is half the memory of INT16 and typically two to four times faster on the accelerators you're likely to deploy on. The compounding of those two factors is why every interesting question about putting an ML model on real silicon ends up here.

If you've got further to go after that — INT4 weight quantisation for large language models, ternary or binary networks, mixed-precision schemes that quantise different layers to different bit-widths — you'll find the conceptual framework is the same. Pick the smallest integer type whose buckets still cover the meaningful range of your tensor's distribution; calibrate carefully; accept that the model will degrade if you go too aggressive. INT8 is the place most production deployments sit because it's the precision-and-throughput point at which most models survive the trade.
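The "smallest type whose buckets still cover the meaningful range" rule can be written down as a first-pass filter. This is a sketch that looks only at step size against a required resolution; the function name, the candidate widths, and the resolution values are all assumptions, and in practice the final choice is made by measuring accuracy on the quantised model.

```python
def smallest_bit_width(max_abs: float, resolution: float,
                       candidates=(2, 4, 8, 16)):
    """Return the narrowest signed width whose step is <= resolution, or None."""
    for bits in sorted(candidates):
        qmax = (1 << (bits - 1)) - 1
        step = max_abs / qmax            # symmetric quantisation step
        if step <= resolution:
            return bits
    return None                          # no candidate is fine-grained enough


print(smallest_bit_width(5.0, 0.05))     # 8
print(smallest_bit_width(5.0, 0.01))     # 16
print(smallest_bit_width(5.0, 1e-6))     # None
```

For a range of ±5, a required resolution of 0.05 is met by INT8, while 0.01 already pushes the choice to INT16.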