Lazy evaluation is a strategy where operations are not executed immediately when called. Instead, MLX builds a computation graph and only evaluates it when the results are explicitly needed.
```rust
use mlx_rs::{array, ops};

// These operations don't execute yet - they build a graph
let a = array!([1.0, 2.0, 3.0]);
let b = array!([4.0, 5.0, 6.0]);
let c = ops::add(&a, &b)?;  // Graph: c = add(a, b)
let d = ops::mul(&c, 2.0)?; // Graph: d = mul(add(a, b), 2.0)

// Evaluation happens here
d.eval()?; // Now MLX executes the optimized graph
println!("{:?}", d); // [10.0, 14.0, 18.0]
```
This is similar to how TensorFlow 1.x worked with graph mode, but MLX constructs graphs dynamically like PyTorch, combining the benefits of both approaches.
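To make the deferred-execution idea concrete, here is a minimal toy sketch in plain Rust (no MLX involved): operations only record nodes in a graph, and no arithmetic happens until an explicit `eval`. The `Node` enum and `eval` function are illustrative inventions, not part of mlx-rs.

```rust
// Toy lazy-evaluation sketch: constructing nodes records work; eval() performs it.
#[derive(Clone)]
enum Node {
    Const(Vec<f64>),
    Add(Box<Node>, Box<Node>),
    MulScalar(Box<Node>, f64),
}

fn eval(node: &Node) -> Vec<f64> {
    match node {
        Node::Const(v) => v.clone(),
        Node::Add(l, r) => {
            let (l, r) = (eval(l), eval(r));
            l.iter().zip(&r).map(|(a, b)| a + b).collect()
        }
        Node::MulScalar(inner, s) => eval(inner).iter().map(|a| a * s).collect(),
    }
}

fn main() {
    let a = Node::Const(vec![1.0, 2.0, 3.0]);
    let b = Node::Const(vec![4.0, 5.0, 6.0]);
    // No arithmetic has happened yet - only graph construction.
    let c = Node::Add(Box::new(a), Box::new(b));
    let d = Node::MulScalar(Box::new(c), 2.0);
    // Evaluation happens only here.
    println!("{:?}", eval(&d)); // [10.0, 14.0, 18.0]
}
```

A real runtime would also deduplicate shared subgraphs and cache results, but the control flow is the same: building is cheap, and all the work happens at `eval`.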
Multiple operations can be combined into a single GPU kernel, reducing overhead.

Without fusion (eager execution):
```rust
// Each operation launches a separate kernel
let x = ops::add(&a, &b)?;  // Kernel 1: x = a + b
x.eval()?;                  // GPU wait
let y = ops::mul(&x, &c)?;  // Kernel 2: y = x * c
y.eval()?;                  // GPU wait
let z = ops::relu(&y)?;     // Kernel 3: z = relu(y)
z.eval()?;                  // GPU wait
// Total: 3 kernel launches + 3 GPU waits
```
With fusion (lazy evaluation):
```rust
// Build the graph without executing anything
let x = ops::add(&a, &b)?;
let y = ops::mul(&x, &c)?;
let z = ops::relu(&y)?;

// Single fused kernel: z = relu((a + b) * c)
z.eval()?; // Total: 1 kernel launch + 1 GPU wait
```
Kernel fusion reduces:
- Kernel launch overhead: ~10-50 microseconds per launch
- Memory bandwidth: intermediate results stay in GPU registers/cache
- Global memory writes: `x` and `y` are never written to memory
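The memory-bandwidth effect can be sketched on the CPU in plain Rust: the unfused version makes three passes over the data and allocates two intermediate vectors, while the fused version makes one pass and allocates nothing extra. This is an analogy for what kernel fusion does on the GPU, not mlx-rs code.

```rust
fn relu(v: f64) -> f64 { v.max(0.0) }

// Unfused: three passes over the data, two intermediate allocations.
fn unfused(a: &[f64], b: &[f64], c: &[f64]) -> Vec<f64> {
    let x: Vec<f64> = a.iter().zip(b).map(|(a, b)| a + b).collect(); // pass 1
    let y: Vec<f64> = x.iter().zip(c).map(|(x, c)| x * c).collect(); // pass 2
    y.iter().map(|&y| relu(y)).collect()                             // pass 3
}

// Fused: one pass; x and y exist only in registers, never in memory.
fn fused(a: &[f64], b: &[f64], c: &[f64]) -> Vec<f64> {
    a.iter()
        .zip(b)
        .zip(c)
        .map(|((a, b), c)| relu((a + b) * c))
        .collect()
}

fn main() {
    let (a, b, c) = ([1.0, -2.0, 3.0], [4.0, -5.0, 6.0], [1.0, 1.0, -1.0]);
    assert_eq!(unfused(&a, &b, &c), fused(&a, &b, &c));
    println!("{:?}", fused(&a, &b, &c)); // [5.0, 0.0, 0.0]
}
```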
Lazy evaluation also means unused results are never computed:

```rust
let (a, b) = heavy_computation()?; // Both results are graph nodes

// Only `a` is used; `b` is ignored
let c = ops::add(&a, 1.0)?;
c.eval()?; // b is never materialized
// Memory saved: size of b
```
Real-world example from transformer models:
```rust
// Attention computation
let scores = ops::matmul(&q, &k)?;      // [batch, heads, seq, seq]
let scaled = ops::div(&scores, scale)?;
let masked = ops::add(&scaled, &mask)?;
let probs = ops::softmax(&masked, -1)?;
let out = ops::matmul(&probs, &v)?;

out.eval()?; // Only `out` is materialized; the intermediates
             // (scores, scaled, masked, probs) may be fused or computed on the fly
```
```rust
fn process(x: &Array, use_branch_a: bool) -> Result<Array, Exception> {
    let branch_a = expensive_computation_a(x)?; // Graph built, not evaluated
    let branch_b = expensive_computation_b(x)?; // Graph built, not evaluated
    if use_branch_a {
        Ok(branch_a) // branch_b is never evaluated
    } else {
        Ok(branch_b) // branch_a is never evaluated
    }
}
```
See mlx-rs/src/lib.rs:60 for detailed lazy evaluation documentation.
Certain operations automatically trigger evaluation.

Accessing array data:
```rust
let x = ops::add(&a, &b)?;

// These all trigger evaluation:
let slice: &[f32] = x.as_slice(); // Access raw data
let value: f32 = x.item();        // Get scalar value
println!("{:?}", x);              // Display array (calls eval internally)
```
Saving to disk:
```rust
let weights = model.parameters();

// Saving triggers evaluation of all parameters
Array::save_safetensors("model.safetensors", &weights)?;
```
Control flow based on array values:
```rust
let condition = ops::greater(&x, &threshold)?;

// Using the result as a boolean triggers evaluation
if condition.item::<bool>() { // eval() called here
    println!("Threshold exceeded");
}
```
Using array values for control flow can be inefficient if done frequently: the graph must be evaluated at each branch point, which prevents larger graph optimizations.
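The cost can be sketched with a toy sync counter in plain Rust (not mlx-rs): every host-side branch on a lazily computed value forces a synchronization, so branching per element forces one sync per element, while expressing the branch as a data-parallel select syncs only once for the whole result. `SYNCS` and `force` are illustrative stand-ins for the runtime's evaluation machinery.

```rust
use std::cell::Cell;

// Toy stand-in for a lazy runtime that counts forced evaluations (sync points).
thread_local! {
    static SYNCS: Cell<usize> = Cell::new(0);
}

fn force(v: f64) -> f64 {
    SYNCS.with(|s| s.set(s.get() + 1)); // each host-side read is a sync
    v
}

fn main() {
    let data = [0.5, 1.5, -2.0, 3.0];

    // Bad: branch on each value on the host -> one sync per element.
    let mut clipped = Vec::new();
    for &x in &data {
        if force(x) > 1.0 { clipped.push(1.0) } else { clipped.push(x) }
    }
    let per_element = SYNCS.with(|s| s.replace(0));

    // Good: keep the branch inside the graph as a select, sync once at the end.
    let clipped2: Vec<f64> =
        data.iter().map(|&x| if x > 1.0 { 1.0 } else { x }).collect();
    force(clipped2[0]); // single materialization of the whole result
    let batched = SYNCS.with(|s| s.get());

    assert_eq!(clipped, clipped2);
    println!("syncs: {} vs {}", per_element, batched); // 4 vs 1
}
```

In MLX the equivalent of the second version is an elementwise op such as a `where`-style select, which keeps the branch on the device.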
```rust
use mlx_rs::transforms::eval;

let a = ops::add(&x, &y)?;
let b = ops::mul(&x, &z)?;
let c = ops::sub(&y, &z)?;

// Evaluate all three in a single call
eval(&[&a, &b, &c])?;
// Now all three are materialized
```
```rust
use mlx_rs::transforms::{grad, eval};

for (step, batch) in dataset.enumerate() {
    // Build the computation graph (no evaluation)
    let loss = model.forward(&batch)?;
    let grads = grad(loss_fn, &[0])(&[&batch])?;
    optimizer.update(&mut model, grads)?;

    // Evaluate once per step
    eval(&[&loss])?; // Also evaluates the updated parameters

    if step % 100 == 0 {
        println!("Step {}: loss = {:.4}", step, loss.item::<f32>());
    }
}
```
Evaluate at the natural iteration boundary (batch/epoch) rather than after every operation.
❌ After every operation: Defeats the purpose of lazy evaluation
```rust
// Bad: too many evaluations
let a = ops::add(&x, &y)?;
a.eval()?;
let b = ops::mul(&a, &z)?;
b.eval()?;
let c = ops::relu(&b)?;
c.eval()?;

// Good: single evaluation
let c = ops::relu(&ops::mul(&ops::add(&x, &y)?, &z)?)?;
c.eval()?;
```
❌ Inside tight loops: Accumulates overhead
```rust
// Bad
for i in 0..1000 {
    let x = ops::add(&a, Array::from_int(i))?;
    x.eval()?; // 1000 kernel launches
}

// Good: vectorize
let indices = Array::from_slice(&(0..1000).collect::<Vec<_>>(), &[1000])?;
let results = ops::add(&a, &indices)?;
results.eval()?; // 1 kernel launch
```
❌ For intermediate debugging: Use logging instead
```rust
// Bad
let a = ops::add(&x, &y)?;
a.eval()?; // Just to debug
println!("DEBUG: a = {:?}", a);
let b = ops::mul(&a, &z)?;

// Good: defer debugging
let a = ops::add(&x, &y)?;
let b = ops::mul(&a, &z)?;
b.eval()?;
if log::log_enabled!(log::Level::Debug) {
    println!("DEBUG: a = {:?}", a); // Already evaluated
}
```
There’s a trade-off between graph size and evaluation overhead.

Very frequent evaluation (many small graphs):
❌ High kernel launch overhead
✅ Low graph construction overhead
✅ Low memory usage
Very infrequent evaluation (few large graphs):
✅ Amortized kernel launch overhead
❌ High graph construction overhead
❌ Higher memory usage
Sweet spot: 10s to 1000s of operations per eval()
```rust
// Good: ~100 ops per eval
for batch in dataset { // 1000 batches
    let out = model.forward(&batch)?; // ~100 ops
    out.eval()?;
}

// Bad: 1 op per eval
for batch in dataset {
    let mut out = batch;
    for layer in model.layers() {
        out = layer.forward(&out)?; // 1 op
        out.eval()?; // Too frequent
    }
}

// Bad: 100,000 ops per eval
let mut acc = initial_value();
for _ in 0..100_000 {
    acc = ops::add(&acc, &step)?; // Graph grows huge
}
acc.eval()?; // Graph construction overhead too high
```
A common pitfall is triggering evaluation accidentally, partway through building a graph:

```rust
let x = ops::div(&a, &b)?;
println!("Result: {}", x); // This evaluates x!
let y = ops::add(&x, 1)?;  // x already evaluated
y.eval()?; // Only evaluates y; can't fuse with x
```
Fix: avoid printing or otherwise accessing values until the full graph is built:
```rust
let x = ops::div(&a, &b)?;
let y = ops::add(&x, 1)?;
y.eval()?; // Fuses div and add
println!("Result: {}", y); // Access after eval
```