
Overview

The scheduler is responsible for converting the high-level UOp graph into a linear sequence of executable kernels. It performs critical optimizations including kernel fusion, memory planning, and dependency tracking.
The scheduler is implemented in tinygrad/engine/schedule.py.

Scheduler Responsibilities

The scheduler performs several key tasks:

  • Graph partitioning - Breaking large computation graphs into executable kernels
  • Kernel fusion - Combining operations into efficient fused kernels
  • Memory planning - Optimizing buffer allocation and reuse
  • Dependency tracking - Ensuring correct execution order

Scheduling Pipeline

┌─────────────────────────────────────────────┐
│            UOp Graph                         │
│  - High-level tensor operations              │
└─────────────────┬───────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────┐
│            Build Dependency Graph            │
│  - Extract kernel operations                 │
│  - Build producer-consumer relationships     │
└─────────────────┬───────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────┐
│            Topological Sort                  │
│  - Order kernels respecting dependencies     │
│  - Handle multi-device operations            │
└─────────────────┬───────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────┐
│            Linearization                     │
│  - Create linear schedule of ExecItems       │
│  - Assign buffers to operations              │
└─────────────────┬───────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────┐
│            Memory Planning                   │
│  - Allocate and reuse buffers                │
│  - Minimize memory footprint                 │
└─────────────────┬───────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────┐
│            Schedule Output                   │
│  - List of ExecItems ready for execution     │
└─────────────────────────────────────────────┘

ExecItem

The scheduler outputs a list of ExecItem objects, where each represents one kernel to execute:
class ExecItem:
  ast: UOp            # Computation to perform
  bufs: list[Buffer]  # Input/output buffers
  metadata: dict      # Additional execution metadata

ExecItem Creation

An ExecItem is created for each kernel in the schedule:
from tinygrad.engine.schedule import linear_to_schedule

# Convert LINEAR UOp to schedule
schedule = linear_to_schedule(linear_uop)

# Each item in schedule is an ExecItem
for item in schedule:
  item.run()  # Execute the kernel

Kernel Fusion

Kernel fusion is one of tinygrad’s most powerful optimizations, combining multiple operations into single kernels.

Fusion Benefits

  • Reduced memory traffic - Intermediate results stay in registers
  • Fewer kernel launches - Lower overhead
  • Better cache utilization - Improved memory locality
  • More optimization opportunities - Larger scope for compiler

Fusion Example

Consider this sequence of operations:
from tinygrad import Tensor

x = Tensor.rand(1024, 1024)
y = (x + 1) * 2  # These operations fuse
z = y.relu()     # This also fuses
Without fusion, this would require 3 kernel launches. With fusion, it’s a single kernel:
// Fused kernel (pseudocode)
for (int i = 0; i < N; i++) {
  float val = x[i];
  val = val + 1.0f;      // Add
  val = val * 2.0f;      // Mul
  val = max(val, 0.0f);  // ReLU
  output[i] = val;
}

Fusion Constraints

Not all operations can be fused:
  • Reduction boundaries - Reductions often require separate kernels
  • Memory dependencies - Can’t fuse operations that need intermediate materialization
  • Device limitations - Kernels must fit in device resources
  • Multi-device ops - Cross-device operations need separate kernels
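The first of these constraints can be sketched as a simple grouping pass: walk the ops in order and cut a kernel boundary at each reduction. The op names below are illustrative, not tinygrad's actual UOp set:

```python
# Sketch: split a linear op sequence into fused kernels, cutting at reductions.
# Op names are placeholders for illustration, not tinygrad's real ops.
ELEMENTWISE = {"ADD", "MUL", "RELU"}
REDUCE = {"SUM", "MAX_REDUCE"}

def group_into_kernels(ops):
  kernels, current = [], []
  for op in ops:
    current.append(op)
    if op in REDUCE:  # a reduction ends the current fused kernel
      kernels.append(current)
      current = []
  if current:
    kernels.append(current)
  return kernels
```

With this rule, `["ADD", "MUL", "SUM", "RELU"]` splits into two kernels: the elementwise ops fuse with the reduction that consumes them, and the `RELU` after the reduction starts a new kernel.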

Dependency Tracking

The scheduler builds a dependency graph to ensure correct execution order.

Dependency Types

  • RAW (Read After Write) - A read must wait for the previous write to complete
  • WAR (Write After Read) - A write must wait for the previous read to complete
  • WAW (Write After Write) - Writes to the same buffer must happen in program order

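These three hazards reduce to a tiny classification rule over pairs of accesses to the same buffer (illustrative Python, not tinygrad code):

```python
# Classify the hazard between two accesses ("read" or "write") to the
# same buffer, given in program order.
def hazard(first, second):
  if first == "write" and second == "read":
    return "RAW"  # consumer must wait for the producer's write
  if first == "read" and second == "write":
    return "WAR"  # overwrite must wait until the read has happened
  if first == "write" and second == "write":
    return "WAW"  # writes must land in program order
  return None     # read-after-read imposes no ordering
```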
Dependency Graph Construction

The scheduler:
  1. Identifies all kernel operations (CALL, END UOps)
  2. Extracts buffer dependencies from each kernel
  3. Builds producer-consumer edges
  4. Computes in-degree for topological sort
children: dict[UOp, list[UOp]] = {}  # Producer -> consumers
in_degree: dict[UOp, int] = {}       # Consumer -> dependency count
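The construction steps can be sketched in plain Python, with `(name, reads, writes)` tuples standing in for kernel UOps; this representation is an illustration, not tinygrad's actual data structures, and for simplicity it only adds RAW (producer to consumer) edges:

```python
def build_deps(kernels):
  """kernels: list of (name, reads, writes) tuples in program order.
  Returns the children and in_degree maps used by the topological sort."""
  children = {name: [] for name, _, _ in kernels}
  in_degree = {name: 0 for name, _, _ in kernels}
  last_writer = {}  # buffer -> name of the kernel that last wrote it
  for name, reads, writes in kernels:
    for buf in reads:
      if buf in last_writer:  # RAW edge: producer -> consumer
        children[last_writer[buf]].append(name)
        in_degree[name] += 1
    for buf in writes:
      last_writer[buf] = name
  return children, in_degree
```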

Topological Sort

The scheduler performs a topological sort to linearize the schedule:
from collections import deque

# Start with kernels that have no dependencies
queue = deque(k for k, v in in_degree.items() if v == 0)
linearized = []

while queue:
  kernel = queue.popleft()
  linearized.append(kernel)
  
  # Update dependent kernels
  for child in children.get(kernel, []):
    in_degree[child] -= 1
    if in_degree[child] == 0:
      queue.append(child)
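Running this loop on a toy graph (hypothetical kernel names) shows the dependency-respecting order it produces:

```python
from collections import deque

# Toy dependency graph: k0 feeds k1 and k2; k2 also depends on k1.
children = {"k0": ["k1", "k2"], "k1": ["k2"], "k2": []}
in_degree = {"k0": 0, "k1": 1, "k2": 2}

queue = deque(k for k, v in in_degree.items() if v == 0)
linearized = []
while queue:
  kernel = queue.popleft()
  linearized.append(kernel)
  for child in children.get(kernel, []):
    in_degree[child] -= 1
    if in_degree[child] == 0:
      queue.append(child)

print(linearized)  # ['k0', 'k1', 'k2']
```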

Memory Planning

After scheduling, memory planning optimizes buffer allocation:
from tinygrad.engine.memory import memory_planner

# Plan memory allocations
schedule = memory_planner(schedule)

Memory Planning Goals

  • Buffer reuse - Reuse allocations when buffers are no longer needed
  • Memory minimization - Reduce peak memory usage
  • Allocation efficiency - Batch allocations when possible

Memory Planning Algorithm

The planner:
  1. Tracks buffer lifetimes through the schedule
  2. Identifies opportunities for reuse
  3. Assigns physical allocations to logical buffers
  4. Inserts allocation/deallocation operations
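A greedy sketch of the first three steps, tracking lifetimes as `(first_use, last_use)` step indices (a simplification of what the real planner does):

```python
def plan(lifetimes):
  """lifetimes: dict of buffer name -> (first_use, last_use) step indices.
  Assigns each logical buffer a physical slot, reusing a slot once its
  previous occupant's lifetime has ended."""
  slots = []       # slot id -> step at which the slot becomes free
  assignment = {}  # buffer name -> slot id
  for name, (start, end) in sorted(lifetimes.items(), key=lambda kv: kv[1][0]):
    for i, free_at in enumerate(slots):
      if free_at < start:  # previous occupant is dead before we start
        slots[i] = end
        assignment[name] = i
        break
    else:
      slots.append(end)    # no reusable slot: allocate a new one
      assignment[name] = len(slots) - 1
  return assignment
```

For example, with lifetimes `{"a": (0, 1), "b": (1, 3), "c": (2, 4)}`, buffer `c` reuses `a`'s slot, so peak allocation is two physical buffers instead of three.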

Multi-Device Scheduling

For multi-GPU operations, the scheduler handles device coordination:
from tinygrad import Tensor

# Multi-device tensor
x = Tensor.empty(1024, 1024).shard(['GPU:0', 'GPU:1'])
y = x * 2  # Scheduled on both devices

Multi-Device UOps

  • MSELECT - Select buffer based on device ID
  • MSTACK - Stack of multi-device buffers

Device Synchronization

The scheduler inserts synchronization when needed:
  • All-reduce - Collective operations across devices
  • Device barriers - Ensure operation completion
  • Data transfers - Move data between devices
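As a simplified illustration of the collective pattern, an all-reduce is a reduction followed by a broadcast (real backends use ring or tree algorithms to overlap transfers):

```python
def all_reduce(device_values):
  """device_values: one partial value per device.
  Returns the list where every device holds the full sum."""
  total = sum(device_values)        # reduce: combine all partials
  return [total] * len(device_values)  # broadcast: every device gets the result
```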

Schedule Visualization

Visualize the schedule:
DEBUG=3 python script.py
This shows:
  • Kernel operations
  • Buffer dependencies
  • Execution order
  • Fusion decisions

Rangeify

The rangeify pass in tinygrad/schedule/rangeify.py handles:
  • Loop construction - Creating iteration ranges
  • Index calculation - Computing buffer indices
  • WAR dependency insertion - Adding write-after-read dependencies
from tinygrad.schedule.rangeify import get_kernel_graph

kernel_graph = get_kernel_graph(uop)

Debugging the Scheduler

View Schedule

DEBUG=3 python script.py

Trace Fusion Decisions

DEBUG=3 python script.py 2>&1 | grep "fused"

Verify Memory Planning

DEBUG=3 python script.py 2>&1 | grep "alloc"

Check Dependencies

Use process replay to verify correctness:
CAPTURE_PROCESS_REPLAY=1 python script.py

Optimization Tips

  • Keep operations lazy - Avoid calling .realize() prematurely so the scheduler can fuse more operations.
  • Minimize cross-device operations - They prevent fusion and add synchronization overhead.
  • Use contiguous memory - Contiguous buffers enable better fusion.

Note that excessive fusion can increase register pressure and reduce occupancy; the scheduler balances fusion against device resource constraints.

Schedule Caching

tinygrad caches schedules to avoid recomputation:
# Illustrative pseudocode: schedules are keyed by a hash of the UOp graph,
# so re-running an identical graph skips rescheduling entirely.
if schedule_cache_hit:
  return cached_schedule
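The pattern amounts to memoizing on a stable key for the graph. A minimal stand-in using a plain dict, where `uop_key` and `build` are hypothetical placeholders rather than tinygrad's API:

```python
_schedule_cache = {}  # graph hash -> previously computed schedule

def schedule_with_cache(uop_key, build):
  """uop_key stands in for a hash of the UOp graph; build() recomputes
  the schedule when there is no cache hit."""
  if uop_key in _schedule_cache:
    return _schedule_cache[uop_key]
  sched = build()
  _schedule_cache[uop_key] = sched
  return sched
```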

Advanced Topics

Custom Scheduling

For advanced use cases, you can customize scheduling behavior with environment variables.

JIT Integration

The scheduler integrates with TinyJit to capture and replay schedules:
from tinygrad import TinyJit

@TinyJit
def forward(x):
  return x * 2 + 1

# First call: schedule and execute
forward(x)

# Subsequent calls: replay cached schedule
forward(x)  # Much faster
