
Overview

The scheduler is responsible for converting the high-level UOp graph into a linear sequence of executable kernels. It performs critical optimizations including kernel fusion, memory planning, and dependency tracking.
The scheduler is implemented in tinygrad/engine/schedule.py.

Scheduler Responsibilities

The scheduler performs several key tasks:

  • Graph partitioning - Breaking large computation graphs into executable kernels
  • Kernel fusion - Combining operations into efficient fused kernels
  • Memory planning - Optimizing buffer allocation and reuse
  • Dependency tracking - Ensuring correct execution order

Scheduling Pipeline

┌─────────────────────────────────────────────┐
│            UOp Graph                         │
│  - High-level tensor operations              │
└─────────────────┬───────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────┐
│            Build Dependency Graph            │
│  - Extract kernel operations                 │
│  - Build producer-consumer relationships     │
└─────────────────┬───────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────┐
│            Topological Sort                  │
│  - Order kernels respecting dependencies     │
│  - Handle multi-device operations            │
└─────────────────┬───────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────┐
│            Linearization                     │
│  - Create linear schedule of ExecItems       │
│  - Assign buffers to operations              │
└─────────────────┬───────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────┐
│            Memory Planning                   │
│  - Allocate and reuse buffers                │
│  - Minimize memory footprint                 │
└─────────────────┬───────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────┐
│            Schedule Output                   │
│  - List of ExecItems ready for execution     │
└─────────────────────────────────────────────┘

ExecItem

The scheduler outputs a list of ExecItem objects, where each represents one kernel to execute:
class ExecItem:
  ast: UOp            # Computation to perform
  bufs: list[Buffer]  # Input/output buffers
  metadata: dict      # Additional execution metadata

ExecItem Creation

An ExecItem is created for each kernel in the schedule:
from tinygrad.engine.schedule import linear_to_schedule

# Convert LINEAR UOp to schedule
schedule = linear_to_schedule(linear_uop)

# Each item in schedule is an ExecItem
for item in schedule:
  item.run()  # Execute the kernel

Kernel Fusion

Kernel fusion is one of tinygrad’s most powerful optimizations, combining multiple operations into single kernels.

Fusion Benefits

  • Reduced memory traffic - Intermediate results stay in registers
  • Fewer kernel launches - Lower overhead
  • Better cache utilization - Improved memory locality
  • More optimization opportunities - Larger scope for compiler

Fusion Example

Consider this sequence of operations:
from tinygrad import Tensor

x = Tensor.rand(1024, 1024)
y = (x + 1) * 2  # These operations fuse
z = y.relu()     # This also fuses
Without fusion, this would require 3 kernel launches. With fusion, it’s a single kernel:
// Fused kernel (pseudocode)
for (int i = 0; i < N; i++) {
  float val = x[i];
  val = val + 1.0f;      // Add
  val = val * 2.0f;      // Mul
  val = max(val, 0.0f);  // ReLU
  output[i] = val;
}

Fusion Constraints

Not all operations can be fused:
  • Reduction boundaries - Reductions often require separate kernels
  • Memory dependencies - Can’t fuse operations that need intermediate materialization
  • Device limitations - Kernels must fit in device resources
  • Multi-device ops - Cross-device operations need separate kernels
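The first of these constraints can be sketched as a simple grouping pass: walk the ops in order and cut a kernel boundary at each reduction. The op names below are illustrative, not tinygrad's actual UOp set:

```python
# Sketch: split a linear op sequence into fused kernels, cutting at reductions.
# Op names are placeholders for illustration, not tinygrad's real ops.
ELEMENTWISE = {"ADD", "MUL", "RELU"}
REDUCE = {"SUM", "MAX_REDUCE"}

def group_into_kernels(ops):
  kernels, current = [], []
  for op in ops:
    current.append(op)
    if op in REDUCE:  # a reduction ends the current fused kernel
      kernels.append(current)
      current = []
  if current:
    kernels.append(current)
  return kernels
```

With this rule, `["ADD", "MUL", "SUM", "RELU"]` splits into two kernels: the elementwise ops fuse with the reduction that consumes them, and the `RELU` after the reduction starts a new kernel.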

Dependency Tracking

The scheduler builds a dependency graph to ensure correct execution order.

Dependency Types

  • RAW (Read After Write) - A read must wait for the previous write to complete
  • WAR (Write After Read) - A write must wait for the previous read to complete
  • WAW (Write After Write) - Writes to the same buffer must happen in program order

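These three hazards reduce to a tiny classification rule over pairs of accesses to the same buffer (illustrative Python, not tinygrad code):

```python
# Classify the hazard between two accesses ("read" or "write") to the
# same buffer, given in program order.
def hazard(first, second):
  if first == "write" and second == "read":
    return "RAW"  # consumer must wait for the producer's write
  if first == "read" and second == "write":
    return "WAR"  # overwrite must wait until the read has happened
  if first == "write" and second == "write":
    return "WAW"  # writes must land in program order
  return None     # read-after-read imposes no ordering
```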
Dependency Graph Construction

The scheduler:
  1. Identifies all kernel operations (CALL, END UOps)
  2. Extracts buffer dependencies from each kernel
  3. Builds producer-consumer edges
  4. Computes in-degree for topological sort
children: dict[UOp, list[UOp]] = {}  # Producer -> consumers
in_degree: dict[UOp, int] = {}       # Consumer -> dependency count
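The construction steps can be sketched in plain Python, with `(name, reads, writes)` tuples standing in for kernel UOps; this representation is an illustration, not tinygrad's actual data structures, and for simplicity it only adds RAW (producer to consumer) edges:

```python
def build_deps(kernels):
  """kernels: list of (name, reads, writes) tuples in program order.
  Returns the children and in_degree maps used by the topological sort."""
  children = {name: [] for name, _, _ in kernels}
  in_degree = {name: 0 for name, _, _ in kernels}
  last_writer = {}  # buffer -> name of the kernel that last wrote it
  for name, reads, writes in kernels:
    for buf in reads:
      if buf in last_writer:  # RAW edge: producer -> consumer
        children[last_writer[buf]].append(name)
        in_degree[name] += 1
    for buf in writes:
      last_writer[buf] = name
  return children, in_degree
```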

Topological Sort

The scheduler performs a topological sort to linearize the schedule:
from collections import deque

# Start with kernels that have no dependencies
queue = deque(k for k, v in in_degree.items() if v == 0)
linearized = []

while queue:
  kernel = queue.popleft()
  linearized.append(kernel)
  
  # Update dependent kernels
  for child in children.get(kernel, []):
    in_degree[child] -= 1
    if in_degree[child] == 0:
      queue.append(child)
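Running this loop on a toy graph (hypothetical kernel names) shows the dependency-respecting order it produces:

```python
from collections import deque

# Toy dependency graph: k0 feeds k1 and k2; k2 also depends on k1.
children = {"k0": ["k1", "k2"], "k1": ["k2"], "k2": []}
in_degree = {"k0": 0, "k1": 1, "k2": 2}

queue = deque(k for k, v in in_degree.items() if v == 0)
linearized = []
while queue:
  kernel = queue.popleft()
  linearized.append(kernel)
  for child in children.get(kernel, []):
    in_degree[child] -= 1
    if in_degree[child] == 0:
      queue.append(child)

print(linearized)  # ['k0', 'k1', 'k2']
```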

Memory Planning

After scheduling, memory planning optimizes buffer allocation:
from tinygrad.engine.memory import memory_planner

# Plan memory allocations
schedule = memory_planner(schedule)

Memory Planning Goals

  • Buffer reuse - Reuse allocations when buffers are no longer needed
  • Memory minimization - Reduce peak memory usage
  • Allocation efficiency - Batch allocations when possible

Memory Planning Algorithm

The planner:
  1. Tracks buffer lifetimes through the schedule
  2. Identifies opportunities for reuse
  3. Assigns physical allocations to logical buffers
  4. Inserts allocation/deallocation operations
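A greedy sketch of the first three steps, tracking lifetimes as `(first_use, last_use)` step indices (a simplification of what the real planner does):

```python
def plan(lifetimes):
  """lifetimes: dict of buffer name -> (first_use, last_use) step indices.
  Assigns each logical buffer a physical slot, reusing a slot once its
  previous occupant's lifetime has ended."""
  slots = []       # slot id -> step at which the slot becomes free
  assignment = {}  # buffer name -> slot id
  for name, (start, end) in sorted(lifetimes.items(), key=lambda kv: kv[1][0]):
    for i, free_at in enumerate(slots):
      if free_at < start:  # previous occupant is dead before we start
        slots[i] = end
        assignment[name] = i
        break
    else:
      slots.append(end)    # no reusable slot: allocate a new one
      assignment[name] = len(slots) - 1
  return assignment
```

For example, with lifetimes `{"a": (0, 1), "b": (1, 3), "c": (2, 4)}`, buffer `c` reuses `a`'s slot, so peak allocation is two physical buffers instead of three.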

Multi-Device Scheduling

For multi-GPU operations, the scheduler handles device coordination:
from tinygrad import Tensor

# Multi-device tensor
x = Tensor.empty(1024, 1024).shard(['GPU:0', 'GPU:1'])
y = x * 2  # Scheduled on both devices

Multi-Device UOps

  • MSELECT - Select buffer based on device ID
  • MSTACK - Stack of multi-device buffers

Device Synchronization

The scheduler inserts synchronization when needed:
  • All-reduce - Collective operations across devices
  • Device barriers - Ensure operation completion
  • Data transfers - Move data between devices
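As a simplified illustration of the collective pattern, an all-reduce is a reduction followed by a broadcast (real backends use ring or tree algorithms to overlap transfers):

```python
def all_reduce(device_values):
  """device_values: one partial value per device.
  Returns the list where every device holds the full sum."""
  total = sum(device_values)        # reduce: combine all partials
  return [total] * len(device_values)  # broadcast: every device gets the result
```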

Schedule Visualization

Visualize the schedule:
DEBUG=3 python script.py
This shows:
  • Kernel operations
  • Buffer dependencies
  • Execution order
  • Fusion decisions

Rangeify

The rangeify pass in tinygrad/schedule/rangeify.py handles:
  • Loop construction - Creating iteration ranges
  • Index calculation - Computing buffer indices
  • WAR dependency insertion - Adding write-after-read dependencies
from tinygrad.schedule.rangeify import get_kernel_graph

kernel_graph = get_kernel_graph(uop)

Debugging the Scheduler

View Schedule

DEBUG=3 python script.py

Trace Fusion Decisions

DEBUG=3 python script.py 2>&1 | grep "fused"

Verify Memory Planning

DEBUG=3 python script.py 2>&1 | grep "alloc"

Check Dependencies

Use process replay to verify correctness:
CAPTURE_PROCESS_REPLAY=1 python script.py

Optimization Tips

  • Keep operations lazy - Avoid calling .realize() prematurely so the scheduler can fuse more operations.
  • Minimize cross-device operations - They prevent fusion and add synchronization overhead.
  • Use contiguous memory - Contiguous buffers enable better fusion.

Note that excessive fusion can increase register pressure and reduce occupancy; the scheduler balances fusion against device resource constraints.

Schedule Caching

tinygrad caches schedules to avoid recomputation:
# Illustrative pseudocode: schedules are keyed by a hash of the UOp graph,
# so re-running an identical graph skips rescheduling entirely.
if schedule_cache_hit:
  return cached_schedule
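The pattern amounts to memoizing on a stable key for the graph. A minimal stand-in using a plain dict, where `uop_key` and `build` are hypothetical placeholders rather than tinygrad's API:

```python
_schedule_cache = {}  # graph hash -> previously computed schedule

def schedule_with_cache(uop_key, build):
  """uop_key stands in for a hash of the UOp graph; build() recomputes
  the schedule when there is no cache hit."""
  if uop_key in _schedule_cache:
    return _schedule_cache[uop_key]
  sched = build()
  _schedule_cache[uop_key] = sched
  return sched
```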

Advanced Topics

Custom Scheduling

For advanced use cases, you can customize scheduling behavior with environment variables.

JIT Integration

The scheduler integrates with TinyJit to capture and replay schedules:
from tinygrad import TinyJit

@TinyJit
def forward(x):
  return x * 2 + 1

# First call: schedule and execute
forward(x)

# Subsequent calls: replay cached schedule
forward(x)  # Much faster
