QuantizeLinear

Description

The linear quantization operator consumes a high-precision tensor, a scale, and a zero point to compute the low-precision/quantized tensor. The scale factor and zero point must have the same shape, determining the quantization granularity. The quantization formula is y = saturate((x / y_scale) + y_zero_point).

Saturation is done according to:

    • uint16: [0, 65535]
    • int16: [-32768, 32767]
    • uint8: [0, 255]
    • int8: [-128, 127]
    • uint4: [0, 15]
    • int4: [-8, 7]

 

The result of (x / y_scale) is rounded to the nearest even value. Refer to https://en.wikipedia.org/wiki/Rounding for details.
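
For illustration, here is a minimal NumPy sketch of the per-tensor case with an int8 output; the function name, scale, and chosen values are purely illustrative and not mandated by the operator:

    import numpy as np

    def quantize_linear_int8(x, y_scale, y_zero_point):
        # y = saturate(round(x / y_scale) + y_zero_point), saturating to [-128, 127]
        rounded = np.rint(x / y_scale)  # np.rint rounds halves to the nearest even value
        return np.clip(rounded + y_zero_point, -128, 127).astype(np.int8)

    x = np.array([-1.0, 0.375, 0.625, 100.0], dtype=np.float32)
    print(quantize_linear_int8(x, y_scale=0.25, y_zero_point=0))
    # [-4 2 2 127]  (1.5 and 2.5 both round to 2; 400 saturates to 127)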

y_zero_point and y must have the same type. y_zero_point is usually not used for quantization to float8 and 4-bit types, but the quantization formula remains the same for consistency, and the type of the input y_zero_point still determines the quantization type. x and y_scale are allowed to have different types. The type of y_scale determines the precision of the division operation between x and y_scale, unless the precision attribute is specified.
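
As a small sketch of this type rule, assume x is stored as float16 and y_scale as float32, with the default uint8 zero point of 0; the variable names and values are illustrative only:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0], dtype=np.float16)  # high-precision input stored as float16
    y_scale = np.float32(0.1)                        # scale stored as float32

    # The division is carried out in the scale's precision (float32 here); the result is
    # then rounded, shifted by the zero point (0 by default), and saturated to uint8.
    scaled = x.astype(y_scale.dtype) / y_scale
    y = np.clip(np.rint(scaled), 0, 255).astype(np.uint8)
    print(y)  # [10 20 30]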

There are three supported quantization granularities, determined by the shape of y_scale (see the sketch after this list). In all cases, y_zero_point must have the same shape as y_scale.

    • Per-tensor (per-layer) quantization: y_scale is a scalar.
    • Per-axis quantization: the scale must be a 1-D tensor, with a length equal to that of the quantization axis. For an input shape (D0, ..., Di, ..., Dn) and axis=i, y_scale is a 1-D tensor of length Di.
    • Blocked quantization: the scale's shape is identical to the input's shape, except for one dimension, in which blocking is performed. Given x shape (D0, ..., Di, ..., Dn), axis=i, and block size B: y_scale shape is (D0, ..., ceil(Di/B), ..., Dn).
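
The y_scale shapes implied by each granularity can be sketched as follows, for a hypothetical input of shape (4, 6, 8) with axis=1 and block size B=2 (values chosen purely for illustration):

    import math

    x_shape = (4, 6, 8)
    axis, block_size = 1, 2

    per_tensor_shape = ()                              # y_scale is a scalar
    per_axis_shape = (x_shape[axis],)                  # (6,)
    blocked_shape = tuple(
        math.ceil(d / block_size) if i == axis else d  # block only along the chosen axis
        for i, d in enumerate(x_shape)
    )                                                  # (4, 3, 8)

    print(per_tensor_shape, per_axis_shape, blocked_shape)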

 

Input parameters

 

specified_outputs_name : array, this parameter lets you manually assign custom names to the output tensors of a node.

Graphs in : cluster, ONNX model architecture.

x (heterogeneous) – T1 : object, N-D full-precision input tensor to be quantized.
y_scale (heterogeneous) – T2 : object, scale for doing quantization to get y. For per-tensor/layer quantization the scale is a scalar; for per-axis quantization it is a 1-D tensor; and for blocked quantization it has the same shape as the input, except for one dimension in which blocking is performed.
y_zero_point (optional, heterogeneous) – T3 : object, zero point for doing quantization to get y. Shape must match y_scale. Defaults to uint8 with a zero point of 0 if not specified.

Parameters : cluster,

axis : integer, the axis of the quantizing dimension of the input tensor. Used only for per-axis and blocked quantization. A negative value means counting dimensions from the back. The accepted range is [-r, r-1] where r = rank(input). When the rank of the input is 1, per-tensor quantization is applied, so the axis is unnecessary in this scenario.
Default value “0”.
saturate : boolean, defines how the conversion behaves if an input value is out of the range of the destination type. It only applies to float8 quantization (float8e4m3fn, float8e4m3fnuz, float8e5m2, float8e5m2fnuz) and is true by default. All cases are fully described in two tables in the ONNX operator description.
Default value “True”.
training? : boolean, whether the layer is in training mode (can store data for the backward pass).
Default value “True”.
lda coeff : float, the coefficient by which the loss derivative is multiplied before being sent to the previous layer (since the backward pass runs in reverse).
Default value “1”.

name (optional) : string, name of the node.

Output parameters

 

y (heterogeneous) – T3 : object, N-D quantized output tensor. It has the same shape as the input x.

Type Constraints

T1 in (tensor(bfloat16), tensor(float), tensor(float16), tensor(int32)) : The type of the input ‘x’.

T2 in (tensor(bfloat16), tensor(float), tensor(float16), tensor(float8e8m0), tensor(int32)) : The type of the input ‘y_scale’.

T3 in (tensor(float4e2m1), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int4), tensor(int8), tensor(uint16), tensor(uint4), tensor(uint8)) : The type of the input ‘y_zero_point’ and the output ‘y’.

Example

All these examples are PNG snippets: you can drop a snippet onto the block diagram and the depicted code is added to your VI (do not forget to install the Deep Learning library to run it).
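
For readers who prefer a textual reference, an equivalent per-tensor QuantizeLinear node can also be sketched in Python with the onnx package; this sketch is not one of the PNG snippets, and the tensor names, shapes, and values are purely illustrative:

    import numpy as np
    import onnx
    from onnx import TensorProto, helper, numpy_helper

    # Scale and zero point stored as initializers; the uint8 zero point selects a uint8 output.
    y_scale = numpy_helper.from_array(np.array(0.25, dtype=np.float32), name="y_scale")
    y_zero_point = numpy_helper.from_array(np.array(0, dtype=np.uint8), name="y_zero_point")

    node = helper.make_node(
        "QuantizeLinear",
        inputs=["x", "y_scale", "y_zero_point"],
        outputs=["y"],
    )

    graph = helper.make_graph(
        [node],
        "quantize_linear_example",
        inputs=[helper.make_tensor_value_info("x", TensorProto.FLOAT, [1, 4])],
        outputs=[helper.make_tensor_value_info("y", TensorProto.UINT8, [1, 4])],
        initializer=[y_scale, y_zero_point],
    )

    model = helper.make_model(graph)
    onnx.checker.check_model(model)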