Overview
Z-Image-Turbo is a 6B-parameter Single-Stream DiT (S3-DiT) optimized for fast image generation with 9-step turbo inference.

Key features
- Z-Image-Turbo transformer: 6B-parameter model with Noise Refiner, Context Refiner, and 30 Joint blocks
- Qwen3-4B text encoder: Extracts layer 34 (2560-dim embeddings)
- 3-axis RoPE: Position encoding with dimensions [32, 48, 48] and theta=256
- 9-step turbo inference: Distilled for fast generation
- 4-bit quantization: Memory-efficient inference (~3GB vs ~12GB); a minimal loading sketch follows this list
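For example, the ~3GB quantized path can be loaded roughly as below. This is a sketch, not the crate's canonical entry point: the file name is hypothetical, and load_safetensors (re-exported from flux-klein-mlx) is assumed to take a path and return a HashMap<String, Array>.

```rust
use std::collections::HashMap;
use mlx_rs::{Array, error::Exception}; // assumed mlx-rs types behind Array/Exception
// ZImageConfig, ZImageTransformerQuantized, load_quantized_zimage_transformer,
// and load_safetensors are this crate's items (crate path omitted).

fn load_turbo_q4() -> Result<ZImageTransformerQuantized, Exception> {
    // Hypothetical file name: any 4-bit Z-Image checkpoint in safetensors form.
    let weights: HashMap<String, Array> =
        load_safetensors("z-image-turbo-q4.safetensors")?;
    // Default config corresponds to the standard 6B model.
    load_quantized_zimage_transformer(weights, ZImageConfig::default())
}
```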
Architecture differences from FLUX
- Uses Noise Refiner + Context Refiner + Joint blocks (vs Double + Single)
- 3-axis RoPE [32, 48, 48] with theta=256 (vs 4-axis)
- Per-block AdaLN with tanh gates
- Qwen3-4B layer 34 extraction (vs concat layers 8, 17, 26)
Core types
ZImageTransformer
The main transformer model for Z-Image-Turbo, holding the model configuration parameters.
Methods
Create a new Z-Image transformer.
Model configuration. Use ZImageConfig::default() for the standard 6B model.

forward
fn(&mut self, x: &Array, t: &Array, cap_feats: &Array, x_pos: &Array, cap_pos: &Array, x_mask: Option<&Array>, cap_mask: Option<&Array>) -> Result<Array, Exception>
Run a forward pass through the transformer.
- x: Image latents, [batch, img_seq, in_channels * patch^2] where in_channels=16
- t: Timesteps, [batch], scaled by t_scale=1000.0
- cap_feats: Caption features from Qwen3 layer 34, [batch, cap_seq, 2560]
- x_pos: Image position coordinates, [batch, img_seq, 3] for (h, w, t)
- cap_pos: Caption position coordinates, [batch, cap_seq, 3]
- x_mask: Optional image mask for padding
- cap_mask: Optional caption mask for padding

Returns the predicted velocity, [batch, img_seq, in_channels * patch^2].
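As a shape sanity check, a single step might look like the following sketch. The token counts are assumptions (with 8x VAE downsampling and patch=2, a 1024x1024 image gives 64 tokens per side, so img_seq = 4096 and a 16 * 2^2 = 64 feature dim), zeros stand in for real latents, caption features, and coordinates, and mlx-rs-style constructors (ops::zeros, Array::from_slice) are assumed.

```rust
use mlx_rs::{ops, Array, error::Exception};

fn one_step(model: &mut ZImageTransformer) -> Result<Array, Exception> {
    let x = ops::zeros::<f32>(&[1, 4096, 64])?;     // latents: [batch, img_seq, 16 * 2^2]
    let t = Array::from_slice(&[1000.0f32], &[1]);  // timestep already scaled by t_scale
    let cap = ops::zeros::<f32>(&[1, 77, 2560])?;   // caption features (77 tokens assumed)
    // Placeholder coordinates; real code would use create_coordinate_grid (see below).
    let x_pos = ops::zeros::<f32>(&[1, 4096, 3])?;  // (h, w, t) per image token
    let cap_pos = ops::zeros::<f32>(&[1, 77, 3])?;
    model.forward(&x, &t, &cap, &x_pos, &cap_pos, None, None)
}
```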
forward_with_rope
fn(&mut self, x: &Array, t: &Array, cap_feats: &Array, x_pos: &Array, cap_pos: &Array, cos: &Array, sin: &Array, x_mask: Option<&Array>, cap_mask: Option<&Array>) -> Result<Array, Exception>
Like forward, but takes precomputed RoPE cos and sin tables.
ZImageConfig
Configuration for Z-Image-Turbo. Fields:
- Model dimension (hidden size)
- Number of attention heads
- Number of joint transformer blocks
- Number of noise refiner and context refiner blocks (2 each)
- Caption feature dimension from Qwen3 layer 34 (2560)
- 3-axis RoPE dimensions for the (h, w, t) axes: [32, 48, 48]
- RoPE base frequency: theta=256 (vs FLUX's 2000.0)
- Timestep scaling factor: t_scale=1000.0
ZImageTransformerBlock
Single transformer block with optional AdaLN modulation.
A flag controls whether the block uses AdaLN modulation: true for noise refiner and joint blocks, false for the context refiner.
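The tanh-gated AdaLN update can be illustrated per channel. This is a generic DiT-style sketch of the pattern, not the crate's internal code:

```rust
// Generic tanh-gated AdaLN residual update for a single channel.
// shift/scale/gate come from the timestep embedding; `sublayer` stands for
// the block's attention or MLP applied to the modulated activation.
fn adaln_residual(
    x: f32,      // residual stream value
    normed: f32, // layer-normalized activation
    shift: f32,
    scale: f32,
    gate: f32,
    sublayer: impl Fn(f32) -> f32,
) -> f32 {
    let modulated = normed * (1.0 + scale) + shift; // AdaLN modulation
    x + gate.tanh() * sublayer(modulated)           // tanh-bounded residual gate
}
```

The tanh keeps each block's residual contribution bounded, which matters for stability when the same modulation is applied per block.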
3-axis RoPE utilities
create_coordinate_grid
fn(size: (i32, i32, i32), start: (i32, i32, i32)) -> Result<Array, Exception>
compute_rope_3axis
fn(positions: &Array, axes_dims: &[i32; 3], theta: f32) -> Result<(Array, Array), Exception>
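A sketch of precomputing the tables for forward_with_rope. The (h, w, t) argument order and the 64x64-token grid are assumptions; the axis dims and theta are the documented defaults.

```rust
use mlx_rs::{Array, error::Exception}; // assumed mlx-rs types

fn precompute_rope() -> Result<(Array, Array), Exception> {
    // 64x64 spatial tokens, one temporal position; order assumed (h, w, t).
    let positions = create_coordinate_grid((64, 64, 1), (0, 0, 0))?;
    // Axis dims [32, 48, 48] and theta = 256.0 match the documented defaults.
    compute_rope_3axis(&positions, &[32, 48, 48], 256.0)
}
```

The returned (cos, sin) pair can then be reused across denoising steps via forward_with_rope.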
Quantization
ZImageTransformerQuantized
4-bit quantized Z-Image transformer for reduced memory.

load_quantized_zimage_transformer
fn(weights: HashMap<String, Array>, config: ZImageConfig) -> Result<ZImageTransformerQuantized, Exception>
QuantizedQwen3TextEncoder
4-bit quantized Qwen3 text encoder.

load_quantized_qwen3_encoder
fn(weights: HashMap<String, Array>, config: Qwen3Config) -> Result<QuantizedQwen3TextEncoder, Exception>
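Loading the quantized text encoder mirrors the transformer path. In this sketch the file name is hypothetical and Qwen3Config::default() is assumed to describe Qwen3-4B:

```rust
use mlx_rs::error::Exception; // assumed mlx-rs error type

fn load_text_encoder() -> Result<QuantizedQwen3TextEncoder, Exception> {
    let weights = load_safetensors("qwen3-4b-q4.safetensors")?;   // hypothetical file
    load_quantized_qwen3_encoder(weights, Qwen3Config::default()) // default assumed = Qwen3-4B
}
```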
Weight utilities
Sanitize Z-Image weights from PyTorch diffusers format.
Re-exported from flux-klein-mlx
Z-Image shares several components with FLUX.2-klein:
- Qwen3Config, Qwen3TextEncoder, sanitize_qwen3_weights
- Decoder, AutoEncoderConfig
- load_safetensors, sanitize_vae_weights
- FluxSampler, FluxSamplerConfig
Performance comparison
| Mode | Memory | Speed |
|---|---|---|
| Dequantized (f32) | ~12GB | ~1.87s/step |
| Quantized (4-bit) | ~3GB | ~2.08s/step |
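In practice the 4-bit path trades roughly 10% per-step speed for about 4x lower memory, making it the better fit for memory-constrained machines.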