FAST: Efficient Robot Action Tokenization
-
Background/problem
- Autoregressive VLAs predict discrete tokens, but robot actions are continuous (e.g., arm joint angles).
- Current VLAs typically use a simple binning discretization scheme, where the range of values is split into, say, 256 buckets.
- The issue is at least two-fold: (a) too many tokens per action chunk (expensive), especially when the control rate is high; (b) high redundancy between consecutive tokens: the model can reach a low loss just by copying the previous token, so learning stalls.
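To make both problems concrete, here is a minimal sketch of naive per-value binning. The 50 Hz control rate, the 7 action dimensions, the [-1, 1] value range and the slow ramp trajectory are all made-up assumptions for illustration, not numbers from the paper.

```python
def bin_tokenize(chunk, low=-1.0, high=1.0, n_bins=256):
    """Map every value of a (timesteps x dims) chunk to a bin index: one token per value."""
    width = (high - low) / n_bins
    tokens = []
    for step in chunk:
        for value in step:
            clipped = min(max(value, low), high)
            tokens.append(min(int((clipped - low) / width), n_bins - 1))
    return tokens

# Hypothetical setup: a 1-second chunk at 50 Hz for a 7-dimensional action.
control_rate_hz = 50
action_dims = 7
chunk = [[0.002 * t] * action_dims for t in range(control_rate_hz)]  # slow, smooth motion

tokens = bin_tokenize(chunk)
print(len(tokens))  # 350 tokens for a single chunk

# Redundancy: how often a token equals the same dimension's token one timestep earlier.
repeats = sum(
    tokens[i] == tokens[i - action_dims] for i in range(action_dims, len(tokens))
)
repeat_fraction = repeats / (len(tokens) - action_dims)
print(repeat_fraction)
```

The token count grows linearly with the control rate, and for smooth motion most tokens simply repeat the previous timestep's value.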
-
Their solution
- FAST tokenizer steps:
-
Normalize action chunk.
This initial normalization step is useful to bring the data into a specified range and also makes tokenization of cross-embodied datasets with different action scales easier.
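A rough sketch of what such a normalization could look like. The quantile-based scaling (1st and 99th percentile mapped to -1 and 1) is an assumption made here for illustration; the notes above only say the data is brought into a specified range. The two datasets are made up.

```python
import statistics

def fit_range(values):
    """Return the 1st and 99th percentile of one action dimension (assumed choice)."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return cuts[0], cuts[-1]

def normalize(value, q_low, q_high):
    """Linearly map q_low -> -1 and q_high -> 1."""
    return 2.0 * (value - q_low) / (q_high - q_low) - 1.0

# Two hypothetical embodiments with very different action scales.
joint_angles_rad = [0.01 * i for i in range(101)]  # roughly 0.0 .. 1.0
gripper_mm = [0.8 * i for i in range(101)]         # roughly 0 .. 80

for values in (joint_angles_rad, gripper_mm):
    q_low, q_high = fit_range(values)
    normalized = [normalize(v, q_low, q_high) for v in values]
    print(min(normalized), max(normalized))
```

After this step both datasets occupy about the same range, so the same downstream quantization scale can be reused. Values outside the fitted quantiles land slightly outside [-1, 1].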
-
Apply Discrete Cosine Transform (DCT) (and compress via rounding)
- The DCT rewrites a signal as a mix of smooth cosine waves at different frequencies. Instead of storing every sample, you just store the strengths of those waves, which capture both the overall trend (low frequency) and fine details (high frequency).
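As an illustration, here is a small pure-Python orthonormal DCT-II and its inverse. This is a sketch for intuition, not the implementation used by FAST; in practice a library routine would be used. The ramp signal is a made-up stand-in for one smooth action dimension.

```python
import math

def dct(signal):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * total)
    return coeffs

def idct(coeffs):
    """Inverse of the orthonormal DCT-II above (a DCT-III)."""
    n = len(coeffs)
    signal = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
        signal.append(total)
    return signal

# Hypothetical smooth trajectory for one action dimension: a linear ramp from 0 to 1.
signal = [t / 15 for t in range(16)]
coeffs = dct(signal)

# Share of the signal energy held by the 4 lowest-frequency coefficients.
energy = sum(c * c for c in coeffs)
low_freq_share = sum(c * c for c in coeffs[:4]) / energy
print(low_freq_share)

reconstructed = idct(coeffs)
```

For a smooth signal like this, nearly all the energy sits in the first few coefficients, which is what makes the later rounding step effective.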
-
Quantize to get sparse frequency matrix.
To compress the DCT-converted signal we can simply omit insignificant coefficients, which we implement through a scale-and-round operation, where the scaling coefficient is a hyperparameter that trades off between lossiness and compression rate of the tokenization operation.
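A minimal sketch of the scale-and-round operation. The coefficient values and the two scales are invented for illustration.

```python
def quantize(coeffs, scale):
    """Scale-and-round: small coefficients collapse to 0."""
    return [round(scale * c) for c in coeffs]

def dequantize(quantized, scale):
    """Approximate inverse; the rounding error is not recoverable."""
    return [q / scale for q in quantized]

# Hypothetical DCT coefficients of one action dimension (low frequency first).
coeffs = [2.0, -0.9, 0.004, -0.1, 0.02, 0.0, 0.003, -0.001]

coarse = quantize(coeffs, scale=10)
print(coarse)  # [20, -9, 0, -1, 0, 0, 0, 0]

fine = quantize(coeffs, scale=100)
print(sum(q != 0 for q in coarse), sum(q != 0 for q in fine))

# Per-coefficient reconstruction error is bounded by 0.5 / scale.
max_error = max(abs(c - r) for c, r in zip(coeffs, dequantize(coarse, 10)))
```

A smaller scale zeroes out more coefficients (better compression, more loss); a larger scale keeps more detail at the cost of more nonzero entries.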
-
Flatten the matrix, low-frequency components first.
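A sketch of one way to do this flattening, assuming the quantized coefficients are stored as one row per action dimension and one column per frequency, and that dimensions are interleaved within each frequency. The layout and the numbers are illustrative assumptions.

```python
def flatten_low_freq_first(coeff_matrix):
    """coeff_matrix[d][k] is the coefficient for action dimension d at frequency k.

    Emits all dimensions at frequency 0, then all dimensions at frequency 1, and so on.
    """
    n_freqs = len(coeff_matrix[0])
    return [row[k] for k in range(n_freqs) for row in coeff_matrix]

# Hypothetical quantized matrix: 3 action dimensions x 4 frequencies.
coeff_matrix = [
    [20, -9, 0, 0],
    [5, 3, 0, 0],
    [-7, 0, 1, 0],
]

flat = flatten_low_freq_first(coeff_matrix)
print(flat)  # [20, 5, -7, -9, 3, 0, 0, 0, 1, 0, 0, 0]
```

With this ordering the large, informative low-frequency values come first and the zeros cluster toward the end of the sequence.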
-
Byte Pair Encoding (BPE) to get compressed action tokens.
The BPE step “squashes” the zero-valued components and merges frequently-occurring coefficient combinations across action dimensions. We choose BPE to compress the DCT matrix, since many efficient implementations exist and it can produce a fixed-size output vocabulary that can be easily integrated into the existing vocabulary of vision-language models for VLA training.
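To show the idea, here is a toy BPE over integer sequences. It is a simplified sketch, not the tokenizer used by FAST: merged symbols are represented as tuples of the original values so that decoding is just concatenation, and the training sequences are made up.

```python
from collections import Counter

def merge_pair(symbols, pair, merged):
    """Replace each non-overlapping occurrence of `pair` with `merged`."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def train_bpe(sequences, num_merges):
    """Learn up to `num_merges` merges of the most frequent adjacent pairs."""
    seqs = [[(v,) for v in seq] for seq in sequences]
    merges = []
    for _ in range(num_merges):
        counts = Counter(p for s in seqs for p in zip(s, s[1:]))
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:
            break
        merges.append(pair)
        seqs = [merge_pair(s, pair, pair[0] + pair[1]) for s in seqs]
    return merges

def encode(seq, merges):
    symbols = [(v,) for v in seq]
    for pair in merges:
        symbols = merge_pair(symbols, pair, pair[0] + pair[1])
    return symbols

def decode(symbols):
    return [v for sym in symbols for v in sym]

# Hypothetical flattened, quantized DCT sequences from two action chunks.
training = [
    [20, 5, -7, -9, 3, 0, 0, 0, 1, 0, 0, 0],
    [18, 5, -7, -8, 3, 0, 0, 0, 0, 0, 0, 0],
]
merges = train_bpe(training, num_merges=8)
encoded = encode(training[0], merges)
print(len(training[0]), len(encoded))
```

The number of merges caps the vocabulary size (base symbols plus at most `num_merges` new ones), and runs of zeros are among the first things to be merged.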
-
Empirics, briefly
- π0-FAST trains 5x faster than the original π0 model and achieves similar performance.
- Control rate = speed at which commands are issued to the robot. Hz = times per second.