QuantizeLinear

Description

The linear quantization operator consumes a high-precision tensor, a scale, and a zero point to compute the low-precision/quantized tensor. The scale factor and zero point must have the same shape, determining the quantization granularity. The quantization formula is y = saturate((x / y_scale) + y_zero_point).

Saturation is done according to:

    • uint16: [0, 65535]
    • int16: [-32768, 32767]
    • uint8: [0, 255]
    • int8: [-128, 127]
    • uint4: [0, 15]
    • int4: [-8, 7]

 

The result of (x / y_scale) is rounded to the nearest even value. Refer to https://en.wikipedia.org/wiki/Rounding for details.
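
For illustration, here is a minimal NumPy sketch of the per-tensor case with an int8 output; the function name, scale, and chosen values are purely illustrative and not mandated by the operator:

    import numpy as np

    def quantize_linear_int8(x, y_scale, y_zero_point):
        # y = saturate(round(x / y_scale) + y_zero_point), saturating to [-128, 127]
        rounded = np.rint(x / y_scale)  # np.rint rounds halves to the nearest even value
        return np.clip(rounded + y_zero_point, -128, 127).astype(np.int8)

    x = np.array([-1.0, 0.375, 0.625, 100.0], dtype=np.float32)
    print(quantize_linear_int8(x, y_scale=0.25, y_zero_point=0))
    # [-4 2 2 127]  (1.5 and 2.5 both round to 2; 400 saturates to 127)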

y_zero_point and y must have the same type. y_zero_point is usually not used for quantization to float8 and 4-bit types, but the quantization formula remains the same for consistency, and the type of the input y_zero_point still determines the quantization type. x and y_scale are allowed to have different types. The type of y_scale determines the precision of the division operation between x and y_scale, unless the precision attribute is specified.
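
As a small sketch of this type rule, assume x is stored as float16 and y_scale as float32, with the default uint8 zero point of 0; the variable names and values are illustrative only:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0], dtype=np.float16)  # high-precision input stored as float16
    y_scale = np.float32(0.1)                        # scale stored as float32

    # The division is carried out in the scale's precision (float32 here); the result is
    # then rounded, shifted by the zero point (0 by default), and saturated to uint8.
    scaled = x.astype(y_scale.dtype) / y_scale
    y = np.clip(np.rint(scaled), 0, 255).astype(np.uint8)
    print(y)  # [10 20 30]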

There are three supported quantization granularities, determined by the shape of y_scale (see the sketch after this list). In all cases, y_zero_point must have the same shape as y_scale.

    • Per-tensor (per-layer) quantization: y_scale is a scalar.
    • Per-axis quantization: the scale must be a 1-D tensor, with a length equal to that of the quantization axis. For an input shape (D0, ..., Di, ..., Dn) and axis=i, y_scale is a 1-D tensor of length Di.
    • Blocked quantization: the scale's shape is identical to the input's shape, except for one dimension, in which blocking is performed. Given x shape (D0, ..., Di, ..., Dn), axis=i, and block size B: y_scale shape is (D0, ..., ceil(Di/B), ..., Dn).
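
The y_scale shapes implied by each granularity can be sketched as follows, for a hypothetical input of shape (4, 6, 8) with axis=1 and block size B=2 (values chosen purely for illustration):

    import math

    x_shape = (4, 6, 8)
    axis, block_size = 1, 2

    per_tensor_shape = ()                              # y_scale is a scalar
    per_axis_shape = (x_shape[axis],)                  # (6,)
    blocked_shape = tuple(
        math.ceil(d / block_size) if i == axis else d  # block only along the chosen axis
        for i, d in enumerate(x_shape)
    )                                                  # (4, 3, 8)

    print(per_tensor_shape, per_axis_shape, blocked_shape)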

 

Input parameters

 

specified_outputs_name : array, this parameter lets you manually assign custom names to the output tensors of a node.

Graphs in : cluster, ONNX model architecture.

x (heterogeneous) – T1 : object, N-D full-precision input tensor to be quantized.
y_scale (heterogeneous) – T2 : object, scale for doing quantization to get y. For per-tensor/layer quantization the scale is a scalar; for per-axis quantization it is a 1-D tensor; and for blocked quantization it has the same shape as the input, except for one dimension in which blocking is performed.
y_zero_point (optional, heterogeneous) – T3 : object, zero point for doing quantization to get y. Shape must match y_scale. Defaults to uint8 with a zero point of 0 if not specified.

Parameters : cluster,

axis : integer, the axis of the quantizing dimension of the input tensor. Used only for per-axis and blocked quantization. A negative value means counting dimensions from the back. The accepted range is [-r, r-1] where r = rank(input). When the rank of the input is 1, per-tensor quantization is applied, so the axis is unnecessary in this scenario.
Default value “0”.
saturate : boolean, defines how the conversion behaves if an input value is out of the range of the destination type. It only applies to float8 quantization (float8e4m3fn, float8e4m3fnuz, float8e5m2, float8e5m2fnuz) and is true by default. All cases are fully described in two tables in the ONNX operator description.
Default value “True”.
training? : boolean, whether the layer is in training mode (can store data for the backward pass).
Default value “True”.
lda coeff : float, the coefficient by which the loss derivative is multiplied before being sent to the previous layer (since the backward pass runs in reverse).
Default value “1”.

name (optional) : string, name of the node.

Output parameters

 

y (heterogeneous) – T3 : object, N-D quantized output tensor. It has the same shape as the input x.

Type Constraints

T1 in (tensor(bfloat16), tensor(float), tensor(float16), tensor(int32)) : The type of the input ‘x’.

T2 in (tensor(bfloat16), tensor(float), tensor(float16), tensor(float8e8m0), tensor(int32)) : The type of the input ‘y_scale’.

T3 in (tensor(float4e2m1), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int4), tensor(int8), tensor(uint16), tensor(uint4), tensor(uint8)) : The type of the input ‘y_zero_point’ and the output ‘y’.

Example

All these examples are PNG snippets: you can drop a snippet onto the block diagram and the depicted code is added to your VI (do not forget to install the Deep Learning library to run it).
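
For readers who prefer a textual reference, an equivalent per-tensor QuantizeLinear node can also be sketched in Python with the onnx package; this sketch is not one of the PNG snippets, and the tensor names, shapes, and values are purely illustrative:

    import numpy as np
    import onnx
    from onnx import TensorProto, helper, numpy_helper

    # Scale and zero point stored as initializers; the uint8 zero point selects a uint8 output.
    y_scale = numpy_helper.from_array(np.array(0.25, dtype=np.float32), name="y_scale")
    y_zero_point = numpy_helper.from_array(np.array(0, dtype=np.uint8), name="y_zero_point")

    node = helper.make_node(
        "QuantizeLinear",
        inputs=["x", "y_scale", "y_zero_point"],
        outputs=["y"],
    )

    graph = helper.make_graph(
        [node],
        "quantize_linear_example",
        inputs=[helper.make_tensor_value_info("x", TensorProto.FLOAT, [1, 4])],
        outputs=[helper.make_tensor_value_info("y", TensorProto.UINT8, [1, 4])],
        initializer=[y_scale, y_zero_point],
    )

    model = helper.make_model(graph)
    onnx.checker.check_model(model)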