## Overview

Quantization converts model weights from higher-precision formats (such as float32) to lower-precision formats, reducing memory footprint while maintaining acceptable quality.

## Configuration

Quantization is configured in the model's YAML config file. The following parameters control quantization behavior.

### Basic parameters

### Parameter descriptions
#### `quantization`

- Type: string
- Default: `''` (empty, disabled)
- Description: Specifies the quantization mode. When empty, quantization is disabled.
#### `quantization_local_shard_count`

- Type: integer
- Default: `-1` (auto-detect from the number of slices)
- Description: Shards the quantization range-finding operation across devices. By default, this is set to the number of slices in your TPU configuration.
#### `use_qwix_quantization`

- Type: boolean
- Default: `False`
- Description: Enables QWIX (Quantization with Index) optimization for quantization.
#### `compile_topology_num_slices`

- Type: integer
- Default: `-1` (auto-detect)
- Description: Number of target slices for the compilation topology. Set to a positive integer to override auto-detection.
### Example configuration

From `base_xl.yml`:
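The original snippet was lost; a representative fragment using the parameters above might look like the following (values are illustrative, not necessarily the actual defaults in `base_xl.yml`):

```yaml
# Illustrative values; check base_xl.yml for the actual defaults.
quantization: ''                    # quantization mode ('' disables quantization)
quantization_local_shard_count: -1  # -1 = auto-detect from the number of slices
compile_topology_num_slices: -1     # -1 = auto-detect
```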
## Usage with training

When training models, you can enable quantization by setting the `quantization` parameter.
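For example, a config override like the following could be used (`int8` is an assumed example mode; the modes available depend on your installation):

```yaml
# Enable quantization for training ('int8' is an assumed example mode):
quantization: 'int8'
```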
## Usage with inference

Quantization settings are loaded from the config file during inference.

## Multi-slice quantization
When running on multi-slice TPU pods, the quantization range-finding operation is automatically sharded across slices for efficiency. You can control this behavior with `quantization_local_shard_count`.
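For example, to pin the shard count instead of auto-detecting it (the value 4 is illustrative):

```yaml
# Shard range-finding across 4 slices instead of auto-detecting:
quantization_local_shard_count: 4
```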
## Performance considerations

### Memory savings
- Quantization can significantly reduce model memory footprint (for example, int8 weights use a quarter of the memory of float32 weights)
- Exact savings depend on the quantization mode and model architecture
- Particularly beneficial for large models such as SDXL and Flux
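The arithmetic behind these savings can be sketched as follows (the parameter count is hypothetical, and real savings also depend on which layers are quantized and on scale/zero-point overhead):

```python
def weight_memory_gb(num_params: int, bits_per_weight: int) -> float:
    """Raw storage for model weights at a given precision, in gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9

params = 2_600_000_000  # hypothetical ~2.6B-parameter model
fp32 = weight_memory_gb(params, 32)  # float32 baseline
int8 = weight_memory_gb(params, 8)   # int8-quantized weights
print(f"float32: {fp32:.1f} GB, int8: {int8:.1f} GB ({1 - int8 / fp32:.0%} smaller)")
# → float32: 10.4 GB, int8: 2.6 GB (75% smaller)
```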
### Quality impact
- Lower precision formats may introduce minor quality degradation
- Test different quantization modes to find the right balance for your use case
- Some quantization modes preserve quality better than others
### Inference speed
- Quantized models may have faster inference on certain hardware
- TPUs can benefit from reduced memory bandwidth requirements
- Actual speedup varies by model and quantization type
## Best practices
- Start without quantization: Establish baseline quality and performance
- Test incrementally: Try different quantization modes and compare results
- Monitor quality: Use consistent prompts to evaluate quality impact
- Profile performance: Measure actual memory and speed improvements
- Match topology: Set `compile_topology_num_slices` to match your deployment environment
## Related resources
- Profiling - Monitor performance with quantization
- Checkpointing - Save quantized model states