FLUX.2-klein is a compact diffusion model optimized for fast generation. With only 4 denoising steps and optional INT8 quantization, it provides excellent speed-quality balance for Apple Silicon.
## Features
- **4B parameter transformer**: 5 double blocks + 20 single blocks
- **Qwen3-4B text encoder**: 36 layers, 2560 hidden dimensions
- **4-step generation**: rectified flow with an SNR-shifted schedule
- **INT8 quantization**: optional memory reduction (13GB → 8GB)
- **AutoencoderKL VAE**: 32 latent channels for high-quality decoding
## Installation
The model downloads automatically from HuggingFace. It is a gated model, so authentication is required.
### Login to HuggingFace
```shell
# Login to HuggingFace
huggingface-cli login

# Or set a token via environment variable
export HF_TOKEN=your_token_here
```
### Run generation
The model will download automatically on first run:

```shell
cargo run --example generate_klein --release -- "a beautiful sunset"
```
### Manual download
For offline use or custom model paths:
```shell
# Download with huggingface-cli
huggingface-cli download black-forest-labs/FLUX.2-klein-4B --local-dir ./models/flux

# Or use git lfs
git lfs install
git clone https://huggingface.co/black-forest-labs/FLUX.2-klein-4B ./models/flux

# Set a custom path
export FLUX_MODEL_DIR=./models/flux
```
## Usage
### Command line
```shell
# Basic generation (default: 512x512, 4 steps, FP32)
cargo run --example generate_klein --release -- "a cat sitting on a windowsill"

# With INT8 quantization
cargo run --example generate_klein --release -- --quantize "a cat sitting on a windowsill"

# Custom step count
cargo run --example generate_klein --release -- --steps 8 "a cat sitting on a windowsill"
```
### Library usage
```rust
use flux_klein_mlx::{
    FluxKlein, FluxKleinParams,
    Qwen3TextEncoder, Qwen3Config,
    Decoder, AutoEncoderConfig,
    load_safetensors,
};
use mlx_rs::Array;
use std::collections::HashMap;

// Load text encoder
let qwen3_config = Qwen3Config {
    hidden_size: 2560,
    num_hidden_layers: 36,
    intermediate_size: 9728,
    num_attention_heads: 32,
    num_key_value_heads: 8,
    rms_norm_eps: 1e-6,
    vocab_size: 151936,
    max_position_embeddings: 40960,
    rope_theta: 1000000.0,
    head_dim: 128,
};
let mut text_encoder = Qwen3TextEncoder::new(qwen3_config)?;

// Load transformer
let params = FluxKleinParams::default();
let mut transformer = FluxKlein::new(params)?;

// Load VAE decoder
let vae_config = AutoEncoderConfig::flux2();
let mut vae = Decoder::new(vae_config)?;

// Load weights from safetensors
// ... (see the full example for weight loading)

// Encode text prompt
let txt_embed = text_encoder.encode(&input_ids, Some(&attention_mask))?;

// Generate latents with denoising
let (rope_cos, rope_sin) = FluxKlein::compute_rope(&txt_ids, &img_ids)?;
let latent = transformer.forward_with_rope(
    &noise, &txt_embed, &timestep, &rope_cos, &rope_sin,
)?;

// Decode to image
let image = vae.forward(&latent)?;
```
## Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                   FLUX.2-klein Pipeline                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────┐   ┌──────────────────┐   ┌───────────┐     │
│  │  Qwen3-4B   │   │ FLUX Transformer │   │    VAE    │     │
│  │  Encoder    │──▶│  5 double blocks │──▶│  Decoder  │     │
│  │  36 layers  │   │ 20 single blocks │   │   32 ch   │     │
│  └─────────────┘   └──────────────────┘   └───────────┘     │
│        │                    │                   │           │
│  [B,512,2560]         [B,1024,128]         [B,H,W,3]        │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
### Key components
#### Text encoder (Qwen3-4B)

- 36 transformer layers with RMSNorm
- 2560 hidden dimensions
- Grouped-query attention (32 query heads, 8 KV heads)
- RoPE position embeddings (θ = 1,000,000)
- Output shape: [batch, 512, 2560]
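In grouped-query attention, each KV head is shared by a fixed-size group of query heads. A minimal sketch of the head-to-group mapping (the helper name is illustrative, not part of `flux_klein_mlx`):

```rust
// With 32 query heads and 8 KV heads, every 4 consecutive query heads
// share one KV head. This helper is illustrative only.
fn kv_head_for(query_head: usize, num_heads: usize, num_kv_heads: usize) -> usize {
    let group_size = num_heads / num_kv_heads; // 32 / 8 = 4
    query_head / group_size
}

fn main() {
    assert_eq!(kv_head_for(0, 32, 8), 0);  // heads 0..3 share KV head 0
    assert_eq!(kv_head_for(5, 32, 8), 1);  // heads 4..7 share KV head 1
    assert_eq!(kv_head_for(31, 32, 8), 7); // last group
    println!("GQA mapping ok");
}
```

Sharing KV heads this way shrinks the KV cache by 4× relative to full multi-head attention, which matters for unified-memory budgets.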
#### Transformer (FLUX.2-klein)

- 5 double blocks: joint image-text attention
- 20 single blocks: image-only processing
- 4-axis RoPE: [32, 32, 32, 32] for (T, H1, H2, W)
- Patch size: 2×2 on the latent space
- Input channels: 128 (32 VAE channels × 2×2 patch)
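The 2×2 patchify determines both the transformer's token count and its input width. A small sketch of the shape arithmetic (the helper name is ours, for illustration):

```rust
// For a 512x512 image the VAE downsamples 8x to a 64x64 latent with 32
// channels; 2x2 patchify then yields (64/2)*(64/2) = 1024 tokens of
// 32 * 2 * 2 = 128 channels each.
fn patchify_shape(
    latent_h: usize,
    latent_w: usize,
    latent_c: usize,
    patch: usize,
) -> (usize, usize) {
    let seq_len = (latent_h / patch) * (latent_w / patch);
    let channels = latent_c * patch * patch;
    (seq_len, channels)
}

fn main() {
    let (seq_len, channels) = patchify_shape(64, 64, 32, 2);
    assert_eq!((seq_len, channels), (1024, 128));
    println!("tokens: {seq_len}, channels: {channels}");
}
```

This is where the `[B,1024,128]` shape in the pipeline diagram comes from.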
#### VAE decoder

- 32 latent channels (more than FLUX.1)
- 8× upsampling: 64×64 → 512×512
- Outputs RGB images in the [-1, 1] range
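Because the decoder emits values in [-1, 1], saving an image requires mapping each channel to 8 bits. A hedged sketch of one common linear mapping (the example code may round or clamp slightly differently):

```rust
// Map a decoded value in [-1, 1] to an 8-bit channel value via
// clamp-and-scale. This is a common choice, not necessarily the
// exact conversion the generate_klein example uses.
fn to_u8(x: f32) -> u8 {
    ((x.clamp(-1.0, 1.0) + 1.0) * 127.5).round() as u8
}

fn main() {
    assert_eq!(to_u8(-1.0), 0);
    assert_eq!(to_u8(1.0), 255);
    assert_eq!(to_u8(-1.5), 0); // out-of-range values are clamped
    println!("ok");
}
```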
### Denoising schedule
FLUX.2-klein uses an SNR-shifted rectified-flow schedule:
```rust
// Compute empirical mu based on sequence length
let mu = compute_empirical_mu(image_seq_len, num_steps);

// Apply SNR shift
fn time_shift(t: f32, mu: f32, sigma: f32) -> f32 {
    mu.exp() / (mu.exp() + (1.0 / t - 1.0).powf(sigma))
}

// Generate timesteps
for i in 0..=num_steps {
    let t_linear = 1.0 - (i as f32) / (num_steps as f32);
    let t_shifted = time_shift(t_linear, mu, 1.0);
    timesteps.push(t_shifted);
}

// Euler integration
for step in 0..num_steps {
    let v_pred = transformer.forward(/* ... */);
    let dt = timesteps[step + 1] - timesteps[step];
    latent = latent + dt * v_pred; // Euler step
}
```
## Performance

On an Apple M3 Max (128GB):

| Mode | Memory | Time (512×512) | Quality   |
|------|--------|----------------|-----------|
| FP32 | ~13GB  | ~5s            | Excellent |
| INT8 | ~8GB   | ~6s            | Very good |
INT8 quantization reduces memory by ~40% with minimal quality degradation.
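For intuition, here is a minimal sketch of per-tensor symmetric INT8 quantization, the general technique behind `--quantize` (the crate's actual scheme, e.g. group sizes or per-channel scales, may differ):

```rust
// Quantize f32 weights to i8 with a single shared scale. Each value drops
// from 4 bytes to 1 byte, i.e. ~4x smaller per tensor (plus one f32 scale).
fn quantize_int8(weights: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = weights.iter().fold(0.0_f32, |m, &w| m.max(w.abs()));
    let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
    let q = weights.iter().map(|&w| (w / scale).round() as i8).collect();
    (q, scale)
}

fn dequantize_int8(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| f32::from(v) * scale).collect()
}

fn main() {
    let w = vec![0.5_f32, -1.0, 0.25, 0.75];
    let (q, scale) = quantize_int8(&w);
    let back = dequantize_int8(&q, scale);
    // Round-trip error is bounded by the quantization step (one scale unit).
    assert!(w.iter().zip(&back).all(|(a, b)| (a - b).abs() < scale));
    println!("scale = {scale}");
}
```

The overall saving in the table above is less than 4× because activations, the VAE, and some layers stay in FP32.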
Tips:

- Use `--quantize` for lower memory usage
- Increase steps to 8 for higher quality
- The default 4 steps works well for most use cases
- Quantization adds ~1s of overhead but saves ~5GB of RAM
## Configuration
### Environment variables
```shell
# Custom model directory
export FLUX_MODEL_DIR=/path/to/flux-model

# HuggingFace token (for the gated model)
export HF_TOKEN=your_token_here

# Run with a custom path
FLUX_MODEL_DIR=./models/flux cargo run --example generate_klein --release
```
### Model files structure
```
models/flux/
├── transformer/
│   └── diffusion_pytorch_model.safetensors   # ~8GB
├── text_encoder/
│   ├── model-00001-of-00002.safetensors      # ~5GB
│   └── model-00002-of-00002.safetensors      # ~5GB
├── vae/
│   └── diffusion_pytorch_model.safetensors   # ~160MB
└── tokenizer/
    └── tokenizer.json
```
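A program consuming this layout typically resolves the root directory from `FLUX_MODEL_DIR` with a default fallback. A hedged sketch (the helper is hypothetical; the example's actual lookup logic may differ):

```rust
use std::path::PathBuf;

// Resolve the model root from an optional FLUX_MODEL_DIR value, falling
// back to the default download location. Taking the value as a parameter
// keeps the function pure; a caller would pass
// std::env::var("FLUX_MODEL_DIR").ok().
fn resolve_model_dir(env_value: Option<String>) -> PathBuf {
    env_value
        .map(PathBuf::from)
        .unwrap_or_else(|| PathBuf::from("./models/flux"))
}

fn main() {
    let dir = resolve_model_dir(std::env::var("FLUX_MODEL_DIR").ok());
    println!("model dir: {}", dir.display());
}
```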
## Examples
### Basic generation

```shell
cargo run --example generate_klein --release -- \
    "a beautiful sunset over the ocean"
```
### High-quality generation

```shell
cargo run --example generate_klein --release -- \
    --steps 8 \
    "detailed portrait of a knight in shining armor"
```
### Memory-efficient generation

```shell
cargo run --example generate_klein --release -- \
    --quantize \
    "a cat sitting on a windowsill, warm lighting"
```
## Troubleshooting
### Authentication errors

FLUX.2-klein is a gated model. Make sure you:

1. Accept the license on HuggingFace
2. Login with `huggingface-cli login`, or
3. Set the `HF_TOKEN` environment variable
### Out of memory or slow generation

Try these solutions:

- Use the `--quantize` flag to enable INT8 quantization
- Close other applications
- Ensure you have at least 16GB of unified memory
- Use the `--release` flag when building
- Reduce steps to 4 (the default)
- Note that the first run includes model download and compilation
## Resources