GLM-4 is a 9B parameter language model from Tsinghua University with a distinctive architecture featuring partial RoPE, fused MLP projections, and extra layer normalization for improved stability.

Features

  • Partial RoPE: Rotary position embedding on half of head dimensions
  • Fused MLP: Combined gate_up_proj for better efficiency
  • Extra LayerNorms: Post-attention and post-MLP normalization layers
  • 4-bit quantization: Recommended for consumer hardware (6 GB vs 18 GB)
  • Step-based KV cache: Memory-efficient generation

Installation

Add to your Cargo.toml:
[dependencies]
glm4-mlx = { path = "../glm4-mlx" }
mlx-rs = "0.18"

Quick start

Step 1: Download model

Download the 4-bit quantized model (recommended):
huggingface-cli download mlx-community/glm-4-9b-chat-4bit \
    --local-dir ./models/GLM-4-9B-4bit
Or the full precision model (requires 18GB+ memory):
huggingface-cli download mlx-community/glm-4-9b-chat-bf16 \
    --local-dir ./models/GLM-4-9B
Step 2: Run generation example

cargo run --release --example generate_glm4 -- \
    ./models/GLM-4-9B-4bit "你好"
Step 3: Use in your code

use glm4_mlx::{load_model, load_tokenizer, Generate, KVCache};
use mlx_rs::ops::indexing::NewAxis;

let tokenizer = load_tokenizer("./models/GLM-4-9B-4bit")?;
let mut model = load_model("./models/GLM-4-9B-4bit")?;

let encoding = tokenizer.encode("你好,", true)?;
let prompt = mlx_rs::Array::from(encoding.get_ids()).index(NewAxis);

let mut cache = Vec::new();
let generator = Generate::<KVCache>::new(&mut model, &mut cache, 0.7, &prompt);

for token in generator.take(100) {
    let token = token?;
    print!("{}", tokenizer.decode(&[token.item::<u32>()], true)?);
}

Architecture details

GLM-4 uses several unique architectural features:

Partial RoPE

Unlike standard transformers that apply rotary position embedding to all head dimensions, GLM-4 only applies RoPE to the first half (partial_rotary_factor = 0.5). This reduces computation while maintaining positional awareness:
// Standard RoPE: applied to all dims
let rope_dims = head_dim;  // e.g., 128

// GLM-4 partial RoPE: applied to first half
let rope_dims = head_dim / 2;  // e.g., 64
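
To make the idea concrete, here is a minimal pure-Rust sketch of partial RoPE (no mlx-rs; the function name and pair layout are illustrative, not the crate's actual API). Only the first `rope_dims` entries of a head vector are rotated; the rest pass through unchanged:

```rust
// Illustrative sketch: rotate consecutive pairs within the first
// `rope_dims` entries only; the remaining dims are left untouched.
fn partial_rope(head: &mut [f32], pos: usize, rope_dims: usize, theta: f32) {
    for i in 0..rope_dims / 2 {
        let freq = 1.0 / theta.powf(2.0 * i as f32 / rope_dims as f32);
        let angle = pos as f32 * freq;
        let (sin, cos) = angle.sin_cos();
        let (a, b) = (head[2 * i], head[2 * i + 1]);
        head[2 * i] = a * cos - b * sin;
        head[2 * i + 1] = a * sin + b * cos;
    }
    // head[rope_dims..] is never touched: the "partial" in partial RoPE.
}

fn main() {
    let mut head = vec![1.0_f32; 128];
    partial_rope(&mut head, 5, 64, 10_000.0); // RoPE on 64 of 128 dims
    assert_eq!(&head[64..], &vec![1.0_f32; 64][..]); // second half unchanged
    assert!((head[0] - 1.0).abs() > 0.1);            // first half rotated
}
```

The inner loop runs over half as many pairs as full RoPE would, which is where the compute saving comes from.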

Fused gate_up_proj

The MLP layer uses a single projection to 2×hidden_dim, then splits for gate and up paths:
x → gate_up_proj(hidden → 2×intermediate) → split → [gate, up]
                                                       ↓      ↓
                                                    silu()    |
                                                       ↓      |
                                                    multiply ←┘
                                                       ↓
                                          down_proj(intermediate → hidden)
This is more efficient than separate projections:
// Traditional approach (2 matrix multiplies)
let gate = gate_proj.forward(&x)?;
let up = up_proj.forward(&x)?;

// GLM-4 fused approach (1 matrix multiply + split)
let gate_up = gate_up_proj.forward(&x)?;
let parts = gate_up.split(2, -1)?;        // two halves along the last axis
let (gate, up) = (&parts[0], &parts[1]);
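
The reason the fusion is safe: stacking the gate and up weight matrices row-wise and doing one matrix multiply produces exactly the concatenation of the two separate results. A tiny pure-Rust check (illustrative `matvec` helper, not the crate's code):

```rust
// Demonstrates that one fused matmul equals two separate ones:
// matvec([W_gate; W_up], x) == concat(matvec(W_gate, x), matvec(W_up, x)).
fn matvec(w: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    w.iter()
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum())
        .collect()
}

fn main() {
    let x = vec![1.0, 2.0];
    let w_gate = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    let w_up = vec![vec![2.0, 0.0], vec![0.0, 2.0]];

    // Fused: stack both weight matrices into one (2*out x in) matrix.
    let mut w_fused = w_gate.clone();
    w_fused.extend(w_up.clone());
    let fused = matvec(&w_fused, &x);

    // Splitting the fused output in half recovers the separate results.
    let (gate, up) = fused.split_at(fused.len() / 2);
    assert_eq!(gate, matvec(&w_gate, &x).as_slice());
    assert_eq!(up, matvec(&w_up, &x).as_slice());
}
```

One larger matmul also keeps the GPU better utilized than two half-sized ones, which is the practical win on Apple silicon.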

Extra LayerNorms

Each decoder layer has 4 LayerNorm operations:
  1. input_layernorm - Before attention
  2. post_self_attn_layernorm - After attention, before residual
  3. post_attention_layernorm - Before MLP
  4. post_mlp_layernorm - After MLP, before residual
This provides better gradient flow and training stability compared to standard transformers with only 2 LayerNorms per block.
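
The ordering can be sketched in pure Rust with identity stubs standing in for the attention and MLP sub-blocks (this mirrors the four-norm layout described above, not the crate's actual implementation; `rms_norm_eps` is taken from the model config):

```rust
// Sketch of one GLM-4 decoder layer's norm/residual ordering.
fn rms_norm(x: &[f32], eps: f32) -> Vec<f32> {
    let ms = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let scale = 1.0 / (ms + eps).sqrt();
    x.iter().map(|v| v * scale).collect()
}

fn add(a: &[f32], b: &[f32]) -> Vec<f32> {
    a.iter().zip(b).map(|(x, y)| x + y).collect()
}

fn decoder_layer(x: &[f32], eps: f32) -> Vec<f32> {
    let attn = |h: &[f32]| h.to_vec(); // stub: real layer runs attention here
    let mlp = |h: &[f32]| h.to_vec();  // stub: real layer runs the MLP here

    // 1. input_layernorm, then attention
    let h = attn(&rms_norm(x, eps));
    // 2. post_self_attn_layernorm on the attention output, then residual
    let x = add(x, &rms_norm(&h, eps));
    // 3. post_attention_layernorm, then MLP
    let h = mlp(&rms_norm(&x, eps));
    // 4. post_mlp_layernorm on the MLP output, then residual
    add(&x, &rms_norm(&h, eps))
}

fn main() {
    let out = decoder_layer(&[1.0, 2.0, 3.0], 1.5625e-7);
    assert_eq!(out.len(), 3);
    assert!(out.iter().all(|v| v.is_finite()));
}
```

Normalizing each sub-block's output *before* it re-enters the residual stream keeps the residual's magnitude bounded, which is the stability argument for norms 2 and 4.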

Code example

From examples/generate_glm4.rs:
use glm4_mlx::{load_model, load_tokenizer, Generate, KVCache};
use mlx_rs::ops::indexing::NewAxis;
use mlx_rs::transforms::eval;
use std::time::Instant;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let model_dir = "./models/GLM-4-9B-4bit";
    let prompt = "你好,请介绍一下自己。";

    println!("Loading model from: {}", model_dir);
    let start = Instant::now();

    let tokenizer = load_tokenizer(model_dir)?;
    let mut model = load_model(model_dir)?;

    println!("Model loaded in {:.2}s", start.elapsed().as_secs_f32());

    // Tokenize
    let encoding = tokenizer.encode(prompt, true)?;
    let prompt_tokens = mlx_rs::Array::from(encoding.get_ids()).index(NewAxis);

    println!("Prompt ({} tokens): {}", encoding.get_ids().len(), prompt);
    println!("---");

    // Generate
    let mut cache = Vec::new();
    let generator = Generate::<KVCache>::new(&mut model, &mut cache, 0.7, &prompt_tokens);

    let mut tokens = Vec::new();
    for (i, token) in generator.enumerate() {
        let token = token?;
        tokens.push(token.clone());

        // Decode in batches
        if tokens.len() % 10 == 0 {
            eval(&tokens)?;
            let slice: Vec<u32> = tokens.drain(..).map(|t| t.item::<u32>()).collect();
            let text = tokenizer.decode(&slice, true)?;
            print!("{}", text);
        }

        if i >= 100 { break; }
    }

    // Flush remaining
    if !tokens.is_empty() {
        eval(&tokens)?;
        let slice: Vec<u32> = tokens.drain(..).map(|t| t.item::<u32>()).collect();
        print!("{}", tokenizer.decode(&slice, true)?);
    }
    println!();

    Ok(())
}

Supported models

GLM-4-9B (bf16)

Size: 18 GB
Precision: bfloat16
Use case: Maximum quality (requires 32GB+ RAM)
Download:
huggingface-cli download mlx-community/glm-4-9b-chat-bf16 \
    --local-dir ./models/GLM-4-9B

GLM-4-9B (4-bit)

Size: 6 GB
Precision: 4-bit quantized
Use case: Recommended for consumer hardware
Download:
huggingface-cli download mlx-community/glm-4-9b-chat-4bit \
    --local-dir ./models/GLM-4-9B-4bit

Converting models

Convert from HuggingFace with 4-bit quantization:
pip install mlx-lm
mlx_lm.convert --hf-path THUDM/glm-4-9b-chat -q
Without quantization:
mlx_lm.convert --hf-path THUDM/glm-4-9b-chat

Model configuration

GLM-4-9B configuration:
{
  "hidden_size": 4096,
  "num_hidden_layers": 40,
  "num_attention_heads": 32,
  "num_key_value_heads": 2,
  "intermediate_size": 13696,
  "partial_rotary_factor": 0.5,
  "vocab_size": 151552,
  "rope_theta": 10000.0,
  "rms_norm_eps": 1.5625e-07
}
Key parameters:
  • Grouped Query Attention: 32 query heads, 2 KV heads (16:1 ratio)
  • Partial RoPE: 0.5 factor means RoPE applied to 64 of 128 head dimensions
  • Large intermediate size: 13696 dims (3.34× hidden size)
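
The derived quantities above follow directly from the config; a quick arithmetic check in Rust (constants copied from the JSON, nothing else assumed):

```rust
// Quantities implied by the GLM-4-9B config above.
const HIDDEN_SIZE: usize = 4096;
const NUM_ATTENTION_HEADS: usize = 32;
const NUM_KEY_VALUE_HEADS: usize = 2;
const PARTIAL_ROTARY_FACTOR: f64 = 0.5;

fn main() {
    let head_dim = HIDDEN_SIZE / NUM_ATTENTION_HEADS;
    assert_eq!(head_dim, 128);

    let rope_dims = (head_dim as f64 * PARTIAL_ROTARY_FACTOR) as usize;
    assert_eq!(rope_dims, 64); // RoPE on 64 of 128 head dimensions

    let gqa_group = NUM_ATTENTION_HEADS / NUM_KEY_VALUE_HEADS;
    assert_eq!(gqa_group, 16); // 16 query heads share each KV head
}
```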

Performance considerations

Memory requirements

Model              Weights   KV Cache (2K ctx)   Total
GLM-4-9B (bf16)    18 GB     ~1 GB               ~19 GB
GLM-4-9B (4-bit)   6 GB      ~1 GB               ~7 GB
The 4-bit model fits comfortably on M1/M2/M3 devices with 16GB+ unified memory.
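
The weight figures in the table are back-of-envelope: bf16 stores 2 bytes per parameter, 4-bit stores 0.5 bytes plus per-group scale/bias overhead (the exact overhead depends on the quantization group size, which is why the 4-bit model lands nearer 6 GB than 4.5 GB):

```rust
// Back-of-envelope weight sizes for a ~9B-parameter model.
fn weights_gb(params: f64, bytes_per_param: f64) -> f64 {
    params * bytes_per_param / 1e9
}

fn main() {
    let params = 9.0e9;
    let bf16 = weights_gb(params, 2.0);
    assert!((bf16 - 18.0).abs() < 0.1); // matches the 18 GB figure

    let q4 = weights_gb(params, 0.5); // 4.5 GB raw; ~6 GB with
    assert!(q4 < 6.0);                // quantization metadata overhead
}
```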

Inference speed

On Apple M3 Max (estimated based on architecture):
  • Prompt processing: ~200 tok/s (4-bit)
  • Token generation: ~50 tok/s (4-bit)
Similar to Qwen3-8B given comparable parameter count and architecture complexity.

Chinese language support

GLM-4 is optimized for Chinese language understanding with:
  • Extended Chinese vocabulary (151K tokens)
  • Training on large Chinese corpora
  • Better tokenization efficiency for Chinese text
Use Chinese prompts for best results:
let prompt = "你好,请介绍一下自己。";  // "Hello, please introduce yourself."

API reference

Loading functions

pub fn load_model(model_dir: impl AsRef<Path>) -> Result<Model, Error>
pub fn load_tokenizer(model_dir: impl AsRef<Path>) -> Result<Tokenizer, Error>

Generation

pub struct Generate<C: KeyValueCache> {
    // fields omitted
}

impl<C: KeyValueCache> Generate<C> {
    pub fn new(
        model: &mut Model,
        cache: &mut Vec<C>,
        temperature: f32,
        prompt: &Array,
    ) -> Self
}
Iterator yielding generated tokens. Use temperature = 0.0 for greedy decoding.
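
A minimal sketch of the sampling rule the temperature parameter implies (this mirrors the documented behaviour, not the crate's internals): temperature 0.0 picks the argmax, otherwise logits are divided by the temperature before softmax sampling, so lower temperatures sharpen the distribution.

```rust
// Greedy pick: index of the largest logit (temperature == 0.0 case).
fn greedy(logits: &[f32]) -> usize {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .map(|(i, _)| i)
        .unwrap()
}

// Temperature-scaled softmax (temperature > 0.0 case), max-subtracted
// for numerical stability.
fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    let scaled: Vec<f32> = logits.iter().map(|l| l / temperature).collect();
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

fn main() {
    let logits = [1.0, 3.0, 2.0];
    assert_eq!(greedy(&logits), 1);
    // Lower temperature concentrates mass on the argmax.
    let p_low = softmax_with_temperature(&logits, 0.5);
    let p_high = softmax_with_temperature(&logits, 2.0);
    assert!(p_low[1] > p_high[1]);
}
```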

Troubleshooting

Model loads slowly

GLM-4-9B has 40 layers of weights, which take time to load. Quantizing the bf16 model to 4-bit reduces both file size and load time:
# Quantize existing bf16 model
mlx_lm.convert --hf-path ./models/GLM-4-9B -q

Out of memory

GLM-4-9B (bf16) requires roughly 19-20 GB of memory. Solutions:
  1. Use 4-bit quantized model instead
  2. Close other applications
  3. Reduce generation length

Unexpected Chinese output

GLM-4 is trained primarily on Chinese text. For English prompts, you may get mixed-language responses. This is expected behavior.

See also

  • Qwen3 - Alternative model of similar size and performance
