SparseAttention
Description
Block sparse attention as used in Phi-3-small (https://arxiv.org/pdf/2404.14219). It is inspired by Sparse Transformers (https://arxiv.org/pdf/1904.10509) and BigBird (https://arxiv.org/pdf/2007.14062).

block_mask can be used to configure a different sparse layout for each head. When the number of sparse layouts is 1, all heads share the same layout; otherwise, the layouts are assigned to heads cyclically. For example, given 4 layouts (S0, S1, S2, S3), 8 heads get the layouts (S0, S1, S2, S3, S0, S1, S2, S3), as the sketch below illustrates.
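For illustration, here is a minimal Python sketch of the cyclic layout assignment (the head and layout counts are hypothetical):

# Cyclic assignment of sparse layouts to attention heads.
num_heads, num_layout = 8, 4
layout_for_head = [h % num_layout for h in range(num_heads)]
print(layout_for_head)  # [0, 1, 2, 3, 0, 1, 2, 3], i.e. (S0, S1, S2, S3, S0, S1, S2, S3)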
block_row_indices and block_col_indices together form the CSR representation of the block mask. block_col_indices may be padded on the right when different layouts have different numbers of non-zero blocks.
An example of a block mask with 2 layouts, where each layout is 4 x 4 blocks:
[[[1, 0, 0, 0],
  [1, 1, 0, 0],
  [0, 1, 1, 0],
  [0, 1, 1, 1]],
 [[1, 0, 0, 0],
  [1, 1, 0, 0],
  [1, 1, 1, 0],
  [1, 0, 1, 1]]]
The corresponding CSR format:
block_col_indices = [[0, 0, 1, 1, 2, 1, 2, 3, -1], [0, 0, 1, 0, 1, 2, 0, 2, 3]]
block_row_indices = [[0, 1, 3, 5, 8], [0, 1, 3, 6, 9]]
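The conversion from a block mask to this padded CSR form can be sketched in plain NumPy (the helper name is hypothetical; it simply reproduces the arrays above):

import numpy as np

def block_mask_to_csr(masks, pad_value=-1):
    # masks: (num_layout, blocks, blocks) array of 0/1 entries.
    row_indices, col_indices = [], []
    for mask in masks:
        rows, cols = [0], []
        for row in mask:
            nz = np.flatnonzero(row)          # columns of non-zero blocks in this row
            cols.extend(nz.tolist())
            rows.append(rows[-1] + len(nz))   # running count of non-zero blocks
        row_indices.append(rows)
        col_indices.append(cols)
    # Right-pad the column indices so every layout has the same width.
    max_nnz = max(len(c) for c in col_indices)
    col_indices = [c + [pad_value] * (max_nnz - len(c)) for c in col_indices]
    return np.array(row_indices, dtype=np.int32), np.array(col_indices, dtype=np.int32)

masks = np.array([[[1, 0, 0, 0], [1, 1, 0, 0], [0, 1, 1, 0], [0, 1, 1, 1]],
                  [[1, 0, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0], [1, 0, 1, 1]]])
rows, cols = block_mask_to_csr(masks)
# rows -> [[0, 1, 3, 5, 8], [0, 1, 3, 6, 9]]
# cols -> [[0, 0, 1, 1, 2, 1, 2, 3, -1], [0, 0, 1, 0, 1, 2, 0, 2, 3]]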
When do_rotary is True, cos_cache and sin_cache are required. Note that the maximum sequence length supported by the cos/sin cache may differ from the maximum sequence length used by the KV cache.
The operator only supports unidirectional attention, with the past key and value cached in linear buffers. For performance, past_key and present_key share the same memory buffer, as do past_value and present_value.
Input parameters
 specified_outputs_name : array, this parameter lets you manually assign custom names to the output tensors of a node.
 Graphs in : cluster, ONNX model architecture.
 query (heterogeneous) – T : object, query with shape (batch_size, sequence_length, num_heads * head_size), or packed QKV with shape (batch_size, sequence_length, d) where d is (num_heads + 2 * kv_num_heads) * head_size.
 key (optional, heterogeneous) – T : object, key with shape (batch_size, sequence_length, kv_num_heads * head_size).
 value (optional, heterogeneous) – T : object, value with shape (batch_size, sequence_length, kv_num_heads * head_size).
 past_key (heterogeneous) – T : object, key cache with shape (batch_size, kv_num_heads, max_cache_sequence_length, head_size).
 past_value (heterogeneous) – T : object, value cache with shape (batch_size, kv_num_heads, max_cache_sequence_length, head_size).
 block_row_indices (heterogeneous) – M : object, the row indices of the CSR format of the block mask, with shape (num_layout, max_blocks + 1). num_heads is divisible by num_layout, and max_blocks is max_sequence_length / sparse_block_size.
 block_col_indices (heterogeneous) – M : object, the column indices of the CSR format of the block mask, with shape (num_layout, max_nnz_blocks). max_nnz_blocks is the maximum number of non-zero blocks per layout in the block mask.
 total_sequence_lengths (heterogeneous) – M : object, scalar tensor of the maximum total sequence length (past_sequence_length + sequence_length) among keys.
 key_total_sequence_lengths (heterogeneous) – M : object, 1D tensor with shape (batch_size), where each value is the total sequence length of the key excluding paddings.
 cos_cache (optional, heterogeneous) – T : object, cos cache of the rotary embedding with shape (max_rotary_sequence_length, head_size / 2); see the sketch after this list.
 sin_cache (optional, heterogeneous) – T : object, sin cache of the rotary embedding with shape (max_rotary_sequence_length, head_size / 2).
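
The cos/sin caches are ordinarily precomputed from the rotary frequencies. A rough sketch of how such caches are typically built (the base frequency 10000.0 and the dimension values are assumptions, not taken from this library):

import numpy as np

max_rotary_sequence_length, head_size = 4096, 64  # hypothetical sizes
base = 10000.0                                    # common rotary base (assumption)
inv_freq = 1.0 / base ** (np.arange(0, head_size, 2) / head_size)
angles = np.outer(np.arange(max_rotary_sequence_length), inv_freq)
cos_cache = np.cos(angles).astype(np.float32)     # (max_rotary_sequence_length, head_size / 2)
sin_cache = np.sin(angles).astype(np.float32)     # (max_rotary_sequence_length, head_size / 2)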
 
 Parameters : cluster,
 do rotary : boolean, whether to use rotary position embedding. Default value "False".
 kv_num_heads : integer, number of attention heads for key and value. Default value "0".
 num_heads : integer, number of attention heads for query. Default value "0".
 rotary_interleaved : boolean, whether rotary uses the interleaved pattern. Default value "False".
 scale : float, scaling factor applied prior to softmax; when left at 0, 1/sqrt(head_size) is used (see the sketch below). Default value "0".
 sparse_block_size : enum, number of tokens per sparse block. Default value "16".
 training? : boolean, whether the layer is in training mode (can store data for the backward pass). Default value "True".
 lda coeff : float, coefficient by which the loss derivative is multiplied before being sent to the previous layer (since the backward pass runs in reverse). Default value "1".
 name (optional) : string, name of the node.
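
The 0-means-default behaviour of scale can be summarized with a tiny sketch (hypothetical helper, following the rule stated above):

import math

def effective_scale(scale: float, head_size: int) -> float:
    # A scale attribute of 0 means "use the default 1/sqrt(head_size)".
    return scale if scale != 0.0 else 1.0 / math.sqrt(head_size)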
 
Output parameters
 Graphs out : cluster, ONNX model architecture.
 output (heterogeneous) – T : object, 3D output tensor with shape (batch_size, sequence_length, num_heads * head_size).
 present_key (heterogeneous) – T : object, updated key cache with shape (batch_size, kv_num_heads, max_cache_sequence_length, head_size).
 present_value (heterogeneous) – T : object, updated value cache with shape (batch_size, kv_num_heads, max_cache_sequence_length, head_size).
 
			Type Constraints
T in (tensor(float), tensor(float16), tensor(bfloat16)) : Constrain input and output to float tensors.
M in (tensor(int32)) : Constrain integer type.
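
To make the interface concrete, here is a hedged sketch of assembling a SparseAttention node with the standard onnx helper. The input/output names follow the tables above, and the com.microsoft domain follows onnxruntime's contrib-op convention, but the exact signature may differ between versions:

from onnx import helper

node = helper.make_node(
    "SparseAttention",
    inputs=["query", "key", "value", "past_key", "past_value",
            "block_row_indices", "block_col_indices",
            "total_sequence_lengths", "key_total_sequence_lengths",
            "cos_cache", "sin_cache"],
    outputs=["output", "present_key", "present_value"],
    domain="com.microsoft",   # contrib-op domain (assumption)
    num_heads=8,
    kv_num_heads=4,
    sparse_block_size=16,
    do_rotary=1,              # boolean attributes are stored as ints in ONNX
    scale=0.0,                # 0 -> 1/sqrt(head_size)
)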
