
What is unified memory?

Unified memory is a hardware architecture where the CPU and GPU share the same physical memory (DRAM) rather than having separate memory pools. On Apple Silicon, both processing units can directly access any memory address without copying data between different memory spaces.

Traditional GPU architecture (NVIDIA/AMD)

┌──────────────┐         ┌──────────────┐
│     CPU      │         │     GPU      │
│              │         │              │
│  ┌────────┐  │         │  ┌────────┐  │
│  │  Core  │  │         │  │  CUDA  │  │
│  │  Core  │  │         │  │  Core  │  │
│  │  Core  │  │         │  │  Core  │  │
│  └────────┘  │         │  └────────┘  │
└──────┬───────┘         └──────┬───────┘
       │                        │
       │                        │
   ┌───▼────┐              ┌───▼────┐
   │  RAM   │              │  VRAM  │
   │ (DDR4) │              │ (GDDR6)│
   │ 16 GB  │              │  8 GB  │
   └────────┘              └────────┘
        │                        │
        └────────────┬───────────┘

              PCIe Bus (copy required)
Data must be explicitly copied between CPU RAM and GPU VRAM:
  • High latency (microseconds to milliseconds)
  • Limited bandwidth (PCIe 4.0: ~32 GB/s)
  • Double memory usage (same data in both memories)
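The copy cost scales with tensor size. As a back-of-envelope sketch (figures and the `transfer_ms` helper are illustrative, not measured):

```rust
/// Time in ms to move `bytes` over a link with `bandwidth_gb_s` GB/s of
/// throughput plus a fixed per-transfer latency in microseconds.
fn transfer_ms(bytes: u64, bandwidth_gb_s: f64, latency_us: f64) -> f64 {
    latency_us / 1000.0 + bytes as f64 / (bandwidth_gb_s * 1e9) * 1000.0
}

fn main() {
    // 1 GiB of activations over PCIe 4.0 x16 (~32 GB/s, ~10 us latency): ~33.6 ms
    let pcie = transfer_ms(1 << 30, 32.0, 10.0);
    println!("PCIe copy: {pcie:.2} ms; unified memory: 0 ms (no copy)");
}
```

On unified memory this entire cost disappears, because there is no second memory pool to copy into.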

Apple Silicon unified memory

┌─────────────────────────────────────────────┐
│              Apple Silicon SoC              │
│                                             │
│  ┌──────────┐    ┌──────────┐     ┌──────┐  │
│  │   CPU    │    │   GPU    │     │Neural│  │
│  │          │    │          │     │Engine│  │
│  │ P-cores  │    │  Metal   │     │      │  │
│  │ E-cores  │    │  Cores   │     │      │  │
│  └────┬─────┘    └────┬─────┘     └──┬───┘  │
│       │               │              │      │
│       └───────────────┼──────────────┘      │
│                       │                     │
│              ┌────────▼────────┐            │
│              │ Memory Fabric   │            │
│              │ (800 GB/s)      │            │
│              └────────┬────────┘            │
└───────────────────────┼─────────────────────┘

                  ┌─────▼──────┐
                  │   LPDDR5   │
                  │   (Unified)│
                  │  16-128 GB │
                  └────────────┘
Key advantages:
  • Zero-copy access: CPU and GPU read/write the same memory addresses
  • Ultra-high bandwidth: 400-800 GB/s (vs. 32 GB/s PCIe)
  • Low latency: Direct memory access (nanoseconds)
  • Larger effective memory: No duplication needed
On an M3 Max with 128GB, essentially the full pool is available to both CPU and GPU (macOS reserves a portion of unified memory for the system), unlike a traditional setup where you might have 64GB RAM + 24GB VRAM with data duplication.

How MLX leverages unified memory

No explicit device transfers

With MLX, arrays live in unified memory and can be accessed by any device without copying:
use mlx_rs::{array, ops, StreamOrDevice};

// Create array - lives in unified memory
let x = array!([[1.0, 2.0], [3.0, 4.0]]);

// CPU operation - direct access, no copy
let y = ops::add(&x, &x, StreamOrDevice::cpu())?;

// GPU operation - same array, no copy needed
let z = ops::matmul(&x, &y, StreamOrDevice::gpu())?;

// Both CPU and GPU can access z without copying
let cpu_sum = ops::sum(&z, None, StreamOrDevice::cpu())?;
let gpu_norm = ops::sqrt(&ops::sum(&ops::square(&z)?, None, StreamOrDevice::gpu())?)?;
Contrast with PyTorch (CUDA):
import torch

x = torch.tensor([[1.0, 2.0], [3.0, 4.0]])  # CPU memory
x_gpu = x.cuda()                             # Copy to GPU (explicit)
y = x_gpu @ x_gpu.T                          # GPU operation
y_cpu = y.cpu()                              # Copy back to CPU (explicit)

Device as execution hint

In MLX, the device parameter tells MLX where to execute the operation, not where the data lives:
use mlx_rs::{random, ops, StreamOrDevice};

let a = random::normal(&[1000, 1000], None, None, None)?;

// Same array, different execution devices
let b = ops::matmul(&a, &a, StreamOrDevice::cpu())?;  // Execute on CPU
let c = ops::matmul(&a, &a, StreamOrDevice::gpu())?;  // Execute on GPU

// No data movement between these operations
See mlx-rs/src/lib.rs:202 for the unified memory documentation.

Benefits for ML inference

1. Larger models fit in memory

Without memory duplication, you can load larger models.

Traditional GPU (48GB RAM + 24GB VRAM):
  • Model weights: 12GB (must reside in VRAM)
  • Activations: ~8GB (in VRAM during inference)
  • Input data: 4GB (in RAM, copied to VRAM)
  • Total: ~20GB VRAM used, 4GB RAM used
  • Limit: weights plus activations cannot exceed 24GB
Apple Silicon (64GB unified):
  • Model weights: 12GB (accessible by both CPU/GPU)
  • Activations: ~8GB (computed on GPU, accessible to CPU)
  • Input data: 4GB (no copy needed)
  • Total: 24GB unified memory used
  • Available: 40GB for larger models or batches
On M3 Max 128GB, you can run quantized 70B parameter models (~35GB in 4-bit) with room for large context windows.
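The sizing arithmetic behind that claim can be sketched (the `weights_gb` helper is hypothetical, and quantization metadata such as scales, which adds a few percent, is ignored):

```rust
/// Approximate weight footprint in GB for `params` parameters at `bits` per weight.
fn weights_gb(params: f64, bits: f64) -> f64 {
    params * bits / 8.0 / 1e9
}

fn main() {
    // 70B parameters, 4-bit quantized: ~35 GB of weights
    println!("70B @ 4-bit: {:.0} GB", weights_gb(70e9, 4.0));
    // The same model at fp16 needs ~140 GB, beyond even a 128 GB machine
    println!("70B @ fp16:  {:.0} GB", weights_gb(70e9, 16.0));
}
```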

2. Faster mixed CPU/GPU workloads

Some ML pipelines naturally split across CPU and GPU.

LLM generation with CPU sampling:
use mlx_rs::{ops, random, StreamOrDevice};
use qwen3_mlx::Model;

// GPU: Compute logits (expensive)
let logits = model.forward(&input_ids, &cache, StreamOrDevice::gpu())?;

// CPU: Sample next token (cheap, easier on CPU)
let next_token = sample_token(&logits, temperature, StreamOrDevice::cpu())?;

// No GPU→CPU copy needed for logits array
ASR with CPU audio preprocessing:
use mlx_rs_core::audio::{load_wav, resample};
use funasr_mlx::transcribe;

// CPU: Load and preprocess audio (I/O bound)
let (samples, rate) = load_wav("audio.wav")?;
let samples_16k = resample(&samples, rate, 16000);

// GPU: Run encoder/decoder (compute bound)
let transcript = transcribe(&model, &samples_16k, StreamOrDevice::gpu())?;

// No CPU→GPU copy for audio samples

3. Efficient attention with large contexts

Attention mechanisms require accessing large key-value caches:
use mlx_rs::StreamOrDevice;
use mlx_rs_core::{ConcatKeyValueCache, SdpaMask, scaled_dot_product_attention};

// KV cache grows with context length (e.g., 10K tokens)
let mut cache: Vec<Option<ConcatKeyValueCache>> = vec![None; num_layers];

// GPU computes attention
let kv = cache[layer].as_mut().expect("cache initialized for this layer");
let (keys, values) = kv.update_and_fetch(new_k, new_v)?;
let attention_out = scaled_dot_product_attention(
    queries, keys, values,
    None, scale, Some(SdpaMask::Causal),
    StreamOrDevice::gpu()
)?;

// No memory copy - cache accessible to both CPU (for updates) and GPU (for attention)
With separate memories, the KV cache would need to:
  1. Be stored in VRAM (limited)
  2. Be copied from RAM if it exceeds VRAM
  3. Limit maximum context length
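The growth of the cache is easy to quantify. A sketch with a hypothetical `kv_cache_bytes` helper; the config below is a made-up 7B-class model with grouped-query attention, not taken from the source:

```rust
/// Bytes for a KV cache: 2 tensors (K and V) per layer, each
/// [n_kv_heads, seq_len, head_dim], at `dtype_bytes` per element.
fn kv_cache_bytes(layers: u64, n_kv_heads: u64, head_dim: u64, seq_len: u64, dtype_bytes: u64) -> u64 {
    2 * layers * n_kv_heads * head_dim * seq_len * dtype_bytes
}

fn main() {
    // 32 layers, 8 KV heads, head_dim 128, fp16, 10K-token context: ~1.3 GB
    let bytes = kv_cache_bytes(32, 8, 128, 10_000, 2);
    println!("KV cache at 10K tokens: {:.2} GB", bytes as f64 / 1e9);
}
```

In unified memory this entire cache lives in one pool, so context length is bounded by total RAM rather than by a separate VRAM budget.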

4. Reduced latency for small batches

ML inference often uses small batch sizes (1-4). With traditional GPUs, the PCIe copy overhead dominates.

Traditional GPU (batch_size=1):
  • Input copy: 0.5ms
  • GPU computation: 2ms
  • Output copy: 0.3ms
  • Total: 2.8ms (28% overhead)
Apple Silicon (batch_size=1):
  • Input copy: 0ms (unified memory)
  • GPU computation: 2ms
  • Output copy: 0ms (unified memory)
  • Total: 2ms (0% overhead)
This ~40% speedup (2.8ms → 2.0ms, i.e. 1.4×) compounds across many inference calls.
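The timings above are illustrative rather than measured; the overhead and speedup arithmetic can be checked with a hypothetical helper:

```rust
/// Returns (copy-overhead fraction, total latency in ms) for one inference call.
fn call_latency(copy_in_ms: f64, compute_ms: f64, copy_out_ms: f64) -> (f64, f64) {
    let total = copy_in_ms + compute_ms + copy_out_ms;
    ((copy_in_ms + copy_out_ms) / total, total)
}

fn main() {
    let (overhead, pcie_total) = call_latency(0.5, 2.0, 0.3);
    let (_, unified_total) = call_latency(0.0, 2.0, 0.0);
    println!("PCIe: {pcie_total} ms ({:.0}% copy overhead)", overhead * 100.0);
    println!("Speedup from removing copies: {:.1}x", pcie_total / unified_total);
}
```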

Memory bandwidth

Unified memory provides massive bandwidth between processors:
Device            Memory Bandwidth   Notes
M1 Max            400 GB/s           512-bit LPDDR5
M2 Max            400 GB/s           512-bit LPDDR5
M3 Max            400 GB/s           512-bit LPDDR5
M3 Ultra          800 GB/s           Dual M3 Max connected
M4 Max            546 GB/s           512-bit LPDDR5X
vs. PCIe 4.0      32 GB/s            CPU ↔ GPU transfer
vs. NVIDIA H100   3,350 GB/s         HBM3 (on-chip only)
While NVIDIA’s HBM3 has higher bandwidth, it’s only between GPU and VRAM. CPU-GPU transfers still go through PCIe (32 GB/s).
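One practical consequence of these numbers: single-stream LLM decoding is usually memory-bandwidth-bound, because every generated token must stream all weights through the processor once. A rough roofline sketch (hypothetical helper, batch size 1, KV-cache reads and compute ignored):

```rust
/// Bandwidth-imposed ceiling on decode speed: each token reads all weights once,
/// so tokens/sec cannot exceed bandwidth / weight size.
fn max_tokens_per_sec(bandwidth_gb_s: f64, weights_gb: f64) -> f64 {
    bandwidth_gb_s / weights_gb
}

fn main() {
    // ~35 GB of 4-bit 70B weights
    println!("400 GB/s (M3 Max):   {:.1} tok/s ceiling", max_tokens_per_sec(400.0, 35.0));
    println!("800 GB/s (M3 Ultra): {:.1} tok/s ceiling", max_tokens_per_sec(800.0, 35.0));
}
```

This is why the 400-800 GB/s figures matter more for local inference than peak FLOPS.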

Memory coherency

MLX handles memory coherency automatically:
use mlx_rs::{array, ops, StreamOrDevice};

let x = array!([1.0, 2.0, 3.0]);

// GPU modifies x
let x = ops::multiply(&x, 2.0, StreamOrDevice::gpu())?;
x.eval()?;  // GPU writes: [2.0, 4.0, 6.0]

// CPU reads updated value
let sum = ops::sum(&x, None, StreamOrDevice::cpu())?;
assert_eq!(sum.item::<f32>(), 12.0);  // Sees GPU's update
How it works:
  1. GPU writes data to unified memory
  2. Memory controller ensures CPU caches are invalidated
  3. CPU reads see the updated data (cache miss → fetch from RAM)
  4. No explicit synchronization needed in user code

Limitations and tradeoffs

Memory bandwidth vs. GPU-only

While 400-800 GB/s is fast, dedicated GPU memory (HBM) can be faster:
  • NVIDIA H100: 3,350 GB/s (HBM3)
  • AMD MI300X: 5,300 GB/s (HBM3)
However, this only matters for:
  • Very large models where most time is memory-bound
  • Operations that shuffle large amounts of data
  • Training with large batch sizes
For inference with small-medium models, Apple Silicon’s unified memory wins due to:
  • Zero PCIe overhead
  • Larger effective memory capacity
  • Lower latency

Memory contention

CPU and GPU share the same memory bandwidth:
// If CPU is heavily using memory while GPU runs inference,
// both may experience reduced bandwidth
let cpu_task = std::thread::spawn(|| {
    // Heavy CPU memory access
    process_large_dataset();
});

// GPU inference runs concurrently
let output = model.forward(&input, StreamOrDevice::gpu())?;

cpu_task.join().unwrap();
Mitigation:
  • Use CPU’s efficiency cores for background tasks
  • MLX prioritizes GPU memory access
  • OS scheduler manages contention

No memory overcommit for GPU

Unlike CPU allocations, which can swap to disk, GPU operations fail if unified memory is exhausted:
use mlx_rs::random;

// This will fail if not enough unified memory available
let huge_array = random::normal(&[100000, 100000], None, None, None)?;
Monitor memory usage:
# macOS Activity Monitor shows unified memory usage
# GPU memory is part of "Memory Used"

Best practices

1. Avoid unnecessary evaluations

Lazy evaluation minimizes memory allocations:
// Good: Single evaluation
let result = ops::add(&a, &b, StreamOrDevice::gpu())?;
let output = ops::multiply(&result, &c, StreamOrDevice::gpu())?;
output.eval()?;  // Materializes output; intermediate result may never be allocated

// Bad: Multiple evaluations
let result = ops::add(&a, &b, StreamOrDevice::gpu())?;
result.eval()?;  // Allocates result
let output = ops::multiply(&result, &c, StreamOrDevice::gpu())?;
output.eval()?;  // Allocates output (result still in memory)

2. Prefer GPU for large operations

GPU is faster for large arrays, CPU for small ones:
// Large matmul: Use GPU
let large = ops::matmul(&x, &w, StreamOrDevice::gpu())?;  // x: [1024, 4096]

// Small operations: CPU can be faster due to GPU dispatch overhead
let small = ops::add(&a, &b, StreamOrDevice::cpu())?;     // a: [4]

3. Batch operations when possible

Reduce overhead by processing multiple items:
// Good: Batch processing
let batch = Array::from_slice(&all_inputs, &[batch_size, seq_len])?;
let outputs = model.forward(&batch, StreamOrDevice::gpu())?;

// Bad: One at a time
for input in inputs {
    let output = model.forward(&input, StreamOrDevice::gpu())?;
    // More overhead from kernel launches
}
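Why batching helps can be seen from a simple cost model (the `total_ms` helper and its overhead numbers are made up for illustration):

```rust
/// Total time for `n` items processed in `calls` kernel launches:
/// fixed dispatch overhead per launch plus per-item compute.
fn total_ms(n: u64, calls: u64, dispatch_ms: f64, per_item_ms: f64) -> f64 {
    calls as f64 * dispatch_ms + n as f64 * per_item_ms
}

fn main() {
    // Illustrative: 0.05 ms launch overhead, 0.1 ms compute per item, 64 items
    let unbatched = total_ms(64, 64, 0.05, 0.1); // 64 launches
    let batched = total_ms(64, 1, 0.05, 0.1);    // 1 launch
    println!("unbatched: {unbatched:.2} ms, batched: {batched:.2} ms");
}
```

The per-item compute is identical; batching only amortizes the fixed launch cost, which is why it matters most for many small calls.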

4. Monitor memory usage

Check unified memory consumption:
# Terminal: Monitor memory in real-time
while true; do
    vm_stat | grep "Pages free"
    sleep 1
done

# Or use Activity Monitor GUI
# Look at "Memory" tab, "Memory Used"
In code:
// Check array size
let size_bytes = array.size() * array.dtype().size();
println!("Array memory: {} MB", size_bytes / 1_000_000);

Unified memory vs. CUDA Unified Memory

NVIDIA also has a “Unified Memory” feature, but it works differently:
Feature               Apple Unified Memory          CUDA Unified Memory
Hardware              True shared physical memory   Separate CPU/GPU memory
Data transfer         None (same memory)            Automatic copying via PCIe
Overhead              Zero                          PCIe latency + bandwidth
Explicit cudaMemcpy   Not needed                    Optional (automatic or manual)
Performance           Consistent                    Depends on access pattern
CUDA Unified Memory is a software abstraction that automatically copies data, while Apple’s is true hardware-level sharing.

Additional resources

  • MLX framework: How MLX leverages unified memory
  • Lazy evaluation: Memory optimization through lazy evaluation
  • Architecture: System architecture and data flow
  • Performance tips: Optimize inference performance
