Features
- Partial RoPE: Rotary position embedding on half of head dimensions
- Fused MLP: Combined gate_up_proj for better efficiency
- Extra LayerNorms: Post-attention and post-MLP normalization layers
- 4-bit quantization: Required for consumer hardware (6 GB vs 18 GB)
- Step-based KV cache: Memory-efficient generation
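The step-based KV cache in the list above can be sketched as follows. This is a minimal illustration of the idea (append one key/value row per decode step and reuse earlier rows), not this crate's actual cache API; all names and shapes here are my own:

```rust
// Minimal sketch of a step-based KV cache: each decode step appends the
// new token's key/value row, so attention can reuse all earlier rows
// instead of recomputing them. Names and shapes are illustrative only.
struct KvCache {
    keys: Vec<Vec<f32>>,   // one entry per cached position
    values: Vec<Vec<f32>>,
}

impl KvCache {
    fn new() -> Self {
        Self { keys: Vec::new(), values: Vec::new() }
    }

    // Called once per decode step with the new token's K/V projections.
    fn append(&mut self, k: Vec<f32>, v: Vec<f32>) {
        self.keys.push(k);
        self.values.push(v);
    }

    fn seq_len(&self) -> usize {
        self.keys.len()
    }
}
```

Because the cache grows by one row per step, memory scales with context length rather than with the number of forward passes.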
Installation
Add to your Cargo.toml:
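For example (the crate name and version below are placeholders, since this README does not pin them; substitute the actual published crate):

```toml
[dependencies]
# Placeholder name/version - replace with the actual crate and version.
your-glm4-crate = "0.1"
```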
Quick start
Download model
Download the 4-bit quantized model (recommended):
Or the full precision model (requires 18GB+ memory):
Architecture details
GLM-4 uses several unique architectural features:
Partial RoPE
Unlike standard transformers that apply rotary position embedding to all head dimensions, GLM-4 only applies RoPE to the first half (partial_rotary_factor = 0.5).
This reduces computation while maintaining positional awareness:
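A minimal sketch of partial RoPE on a single head vector, not a specific crate's API. With head_dim = 128 and partial_rotary_factor = 0.5, the rotation covers dimensions 0..64 and dimensions 64..128 pass through unchanged; the pairing and frequency scheme below follow the standard RoPE formulation:

```rust
// Sketch of partial RoPE applied in place to one head vector.
// Only the first rot_dim dimensions are rotated; the rest pass through.
fn apply_partial_rope(x: &mut [f32], pos: usize, partial_rotary_factor: f32) {
    let head_dim = x.len();
    let rot_dim = (head_dim as f32 * partial_rotary_factor) as usize;
    let half = rot_dim / 2;
    for i in 0..half {
        // Standard RoPE frequencies, computed only over the rotary slice.
        let theta = pos as f32 / 10000f32.powf(2.0 * i as f32 / rot_dim as f32);
        let (sin, cos) = theta.sin_cos();
        let (a, b) = (x[i], x[i + half]);
        x[i] = a * cos - b * sin;
        x[i + half] = a * sin + b * cos;
    }
    // x[rot_dim..] is deliberately left untouched.
}
```

Skipping half the dimensions halves the rotation work per head while positions are still encoded in the rotated slice.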
Fused gate_up_proj
The MLP layer uses a single projection to 2 × intermediate_dim, then splits the result into the gate and up paths:
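A scalar sketch of the split. The single matmul and split-in-half structure is what "fused" means here; the SwiGLU-style combine (silu(gate) * up) is an assumption on my part, as is every name below:

```rust
// Sketch of a fused gate_up projection: one matrix multiply produces
// 2 * intermediate values, which are then split into gate and up halves.
fn fused_gate_up(x: &[f32], w: &[Vec<f32>], intermediate: usize) -> Vec<f32> {
    // w has 2 * intermediate rows: rows 0..intermediate form the gate
    // projection, rows intermediate.. form the up projection.
    assert_eq!(w.len(), 2 * intermediate);
    let proj: Vec<f32> = w
        .iter()
        .map(|row| row.iter().zip(x).map(|(wi, xi)| wi * xi).sum())
        .collect();
    let (gate, up) = proj.split_at(intermediate);
    gate.iter()
        .zip(up)
        .map(|(g, u)| (g / (1.0 + (-g).exp())) * u) // silu(gate) * up
        .collect()
}
```

Fusing the two projections into one matmul improves efficiency because a single large GEMM uses the hardware better than two half-sized ones.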
Extra LayerNorms
Each decoder layer has 4 LayerNorm operations:
- input_layernorm - Before attention
- post_self_attn_layernorm - After attention, before residual
- post_attention_layernorm - Before MLP
- post_mlp_layernorm - After MLP, before residual
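The ordering can be sketched with scalar stand-ins. norm, attn, and mlp below are placeholders (the real model uses actual normalization and attention/MLP sublayers); only the placement of the four norms around the two residual additions is the point:

```rust
// Placeholder ops that only illustrate where the four norms sit.
fn norm(x: f32) -> f32 { x }         // stands in for the LayerNorm
fn attn(x: f32) -> f32 { x * 2.0 }   // stands in for self-attention
fn mlp(x: f32) -> f32 { x + 1.0 }    // stands in for the MLP

fn decoder_layer(x: f32) -> f32 {
    // x -> input_layernorm -> attn -> post_self_attn_layernorm -> residual add
    let h = x + norm(attn(norm(x)));
    // h -> post_attention_layernorm -> mlp -> post_mlp_layernorm -> residual add
    h + norm(mlp(norm(h)))
}
```

Note that the two "post" norms sit inside the residual branches, so the skip connections themselves remain un-normalized.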
Code example
From examples/generate_glm4.rs:
Supported models
GLM-4-9B (bf16)
Size: 18 GB
Precision: bfloat16
Use case: Maximum quality (requires 32GB+ RAM)
Download:
GLM-4-9B (4-bit)
Size: 6 GB
Precision: 4-bit quantized
Use case: Recommended for consumer hardware
Download:
Converting models
Convert from HuggingFace with 4-bit quantization:
Model configuration
GLM-4-9B configuration:
- Grouped Query Attention: 32 query heads, 2 KV heads (16:1 ratio)
- Partial RoPE: 0.5 factor means RoPE applied to 64 of 128 head dimensions
- Large intermediate size: 13696 dims (3.34× hidden size)
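The configuration above can be captured in a plain struct. The field names are my own, not a specific crate's API; hidden_size = 4096 is inferred from the stated 128 head dimensions across 32 heads and the 3.34× intermediate ratio, and num_hidden_layers = 40 comes from the troubleshooting section below:

```rust
// Illustrative GLM-4-9B hyperparameters. Field names are hypothetical;
// hidden_size is inferred (32 heads * 128 head_dim = 4096).
struct Glm4Config {
    hidden_size: usize,
    num_attention_heads: usize,
    num_key_value_heads: usize,
    intermediate_size: usize,
    num_hidden_layers: usize,
    partial_rotary_factor: f32,
}

const GLM4_9B: Glm4Config = Glm4Config {
    hidden_size: 4096,
    num_attention_heads: 32,
    num_key_value_heads: 2,
    intermediate_size: 13696,
    num_hidden_layers: 40,
    partial_rotary_factor: 0.5,
};
```

The 16:1 query-to-KV head ratio is what keeps the KV cache small despite the 9B parameter count.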
Performance considerations
Memory requirements
| Model | Weights | KV Cache (2K ctx) | Total |
|---|---|---|---|
| GLM-4-9B (bf16) | 18 GB | ~1 GB | ~19 GB |
| GLM-4-9B (4-bit) | 6 GB | ~1 GB | ~7 GB |
Inference speed
On Apple M3 Max (estimated based on the architecture):
- Prompt processing: ~200 tok/s (4-bit)
- Token generation: ~50 tok/s (4-bit)
Chinese language support
GLM-4 is optimized for Chinese language understanding with:
- Extended Chinese vocabulary (151K tokens)
- Training on large Chinese corpora
- Better tokenization efficiency for Chinese text
API reference
Loading functions
Generation
Set temperature = 0.0 for greedy decoding.
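At temperature zero, sampling collapses to argmax over the logits. A sketch of that degenerate case (not this crate's actual generation API; the function name is my own):

```rust
// When temperature == 0.0, sampling degenerates to greedy decoding:
// always pick the token with the highest logit. (A real sampler's
// temperature > 0 branch - scale, softmax, sample - is omitted here.)
fn greedy_pick(logits: &[f32]) -> usize {
    logits
        .iter()
        .enumerate()
        .max_by(|(_, a), (_, b)| a.partial_cmp(b).unwrap())
        .map(|(i, _)| i)
        .expect("logits must be non-empty")
}
```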
Troubleshooting
Model loads slowly
GLM-4-9B has 40 layers, which take time to load. Convert the bf16 weights to 4-bit quantization to reduce load time:
Out of memory
GLM-4-9B (bf16) requires 20GB+ memory. Solutions:
- Use 4-bit quantized model instead
- Close other applications
- Reduce generation length
Unexpected Chinese output
GLM-4 is trained primarily on Chinese text. For English prompts, you may get mixed-language responses. This is expected behavior.
Related models
- Qwen3 - Alternative 9B model with similar performance