The Engine class is the core component that manages model execution, KV cache, attention backends, and CUDA graph optimization. It handles the low-level details of GPU inference.

Constructor

from minisgl.engine import Engine, EngineConfig

engine = Engine(config: EngineConfig)
config
EngineConfig
required
Engine configuration object containing:
  • Model configuration (architecture, vocab size, etc.)
  • Tensor parallelism settings
  • Memory and performance tuning
  • Backend selection (attention, MoE)
  • CUDA graph settings

Initialization Process

The engine initialization performs several critical steps:
  1. Communication Setup: Initializes distributed communication (NCCL/Gloo) for tensor parallelism
  2. Model Loading: Loads model weights from disk or initializes dummy weights
  3. KV Cache Allocation: Allocates page-based KV cache in GPU memory
  4. Page Table Creation: Sets up page table for efficient memory management
  5. Backend Initialization: Configures attention backend (FlashAttention, FlashInfer, TRT-LLM) and MoE backend if needed
  6. Sampler Setup: Initializes token sampling logic
  7. CUDA Graph Capture: Pre-records frequently used batch sizes for faster execution
from minisgl.engine import Engine, EngineConfig
from minisgl.distributed import DistributedInfo
import torch

config = EngineConfig(
    model_path="meta-llama/Llama-3.2-1B-Instruct",
    dtype=torch.bfloat16,
    tp_info=DistributedInfo(rank=0, size=1),
    page_size=16,
    max_running_req=128,
)

engine = Engine(config)

Key Methods

forward_batch()

Executes a forward pass for a batch of requests.
batch
Batch
required
Batch object containing:
  • reqs: List of request objects
  • phase: Either "prefill" or "decode"
  • input_ids, positions, out_loc: Prepared tensors
args
BatchSamplingArgs
required
Sampling arguments for the batch, including temperature, top-k, top-p per request.
Returns: ForwardOutput, a named tuple containing:
  • next_tokens_gpu: Sampled tokens on GPU
  • next_tokens_cpu: Sampled tokens copied to CPU
  • copy_done_event: CUDA event for synchronization
from minisgl.core import Batch
from minisgl.engine.sample import BatchSamplingArgs

# Prepared by scheduler
batch = Batch(reqs=requests, phase="prefill")
sampling_args = BatchSamplingArgs(...)

output = engine.forward_batch(batch, sampling_args)

# Wait for CPU copy to complete
output.copy_done_event.synchronize()
next_tokens = output.next_tokens_cpu
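The sampling transform that BatchSamplingArgs configures (temperature, top-k, top-p per request) can be sketched in plain Python. This is an illustrative simplification, not the minisgl sampler, which operates on batched GPU tensors; the function name and signature here are hypothetical.

```python
import math

def sample_probs(logits, temperature=1.0, top_k=0):
    # Temperature scaling: values < 1 sharpen the distribution, > 1 flatten it.
    scaled = [l / temperature for l in logits]
    # Top-k filtering: keep only the k largest logits, mask out the rest.
    if top_k > 0:
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [l if l >= cutoff else float("-inf") for l in scaled]
    # Softmax over the surviving logits (max-subtraction for stability).
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    z = sum(exps)
    return [e / z for e in exps]

# With top_k=2, only the two highest-logit tokens keep nonzero probability.
probs = sample_probs([2.0, 1.0, 0.5, -1.0], temperature=0.7, top_k=2)
```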

shutdown()

Cleanly shuts down the engine and releases resources.
engine.shutdown()
This method:
  • Destroys CUDA graphs
  • Cleans up distributed process groups
  • Releases GPU memory

Key Attributes

model
nn.Module
The loaded language model
kv_cache
BaseKVCachePool
KV cache pool managing paged memory
page_table
torch.Tensor
Page table mapping logical to physical KV cache locations. Shape: (max_running_req + 1, aligned_max_seq_len)
attn_backend
BaseAttnBackend
Attention backend implementation (FlashAttention, FlashInfer, or TRT-LLM)
moe_backend
BaseMoeBackend
MoE backend for mixture-of-experts models (if applicable)
sampler
Sampler
Token sampling module
graph_runner
GraphRunner
CUDA graph manager for optimized execution
ctx
Context
Global context object containing shared state
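The role of the page table can be sketched with a toy lookup: each request's row maps logical token positions to physical slots in the KV pool. This is a deliberately simplified, hypothetical model (a per-page dict instead of the real per-token tensor layout); it only illustrates the logical-to-physical translation.

```python
PAGE_SIZE = 16  # tokens per KV cache page

# Toy page table: request id -> list of physical page indices in the pool.
# Request 0's tokens live on physical pages 7, 3, and 12, in that order.
page_table = {0: [7, 3, 12]}

def kv_location(req: int, token_pos: int) -> tuple:
    """Return (physical_page, offset_within_page) for a token position."""
    logical_page, offset = divmod(token_pos, PAGE_SIZE)
    return page_table[req][logical_page], offset

# Token 35 of request 0: logical page 2, offset 3 -> physical page 12.
loc = kv_location(0, 35)
```

Because pages are allocated independently, a request's KV cache need not be contiguous in GPU memory, which is what makes paged allocation efficient.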

Architecture

The engine coordinates several subsystems:
┌─────────────────────────────────────┐
│           Engine                    │
├─────────────────────────────────────┤
│  Model (Transformer)                │
│  ├─ Loaded from disk                │
│  └─ Distributed across TP ranks     │
├─────────────────────────────────────┤
│  KV Cache Pool                      │
│  ├─ Paged memory management         │
│  └─ Page table for address mapping  │
├─────────────────────────────────────┤
│  Attention Backend                  │
│  ├─ FlashAttention (sm90+)          │
│  ├─ FlashInfer (fallback)           │
│  └─ TRT-LLM (sm100+)                │
├─────────────────────────────────────┤
│  Sampler                            │
│  └─ Top-k, top-p, temperature       │
├─────────────────────────────────────┤
│  CUDA Graph Runner                  │
│  └─ Captures common batch sizes     │
└─────────────────────────────────────┘

Memory Management

The engine automatically determines the number of KV cache pages based on available GPU memory:
num_pages = (available_memory * memory_ratio - model_size) / bytes_per_page
You can override this with num_page_override in EngineConfig.
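The page-count formula above can be sketched in plain Python. All numbers below are made-up examples and the function is illustrative, not the engine's actual implementation:

```python
def estimate_num_pages(available_memory: int,
                       memory_ratio: float,
                       model_size: int,
                       bytes_per_page: int) -> int:
    # Budget = usable fraction of GPU memory minus the model weights;
    # everything left is carved into fixed-size KV cache pages.
    budget = available_memory * memory_ratio - model_size
    return max(int(budget // bytes_per_page), 0)

# Example: 24 GiB GPU, 90% usable, 2.5 GiB of weights, ~0.5 MiB per page.
pages = estimate_num_pages(24 << 30, 0.9, int(2.5 * (1 << 30)), 512 * 1024)
```

If the computed budget is negative (weights alone exceed the memory budget), the sketch clamps to zero pages; the real engine would fail to initialize in that situation.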

CUDA Graph Optimization

The engine captures CUDA graphs for frequently used batch sizes (specified in cuda_graph_bs). This reduces kernel launch overhead:
config = EngineConfig(
    ...,
    cuda_graph_bs=[1, 2, 4, 8, 16, 32],  # Capture these batch sizes
    cuda_graph_max_bs=64,  # Maximum batch size for graph
)
During execution:
  • If batch size matches a captured graph → replay graph (faster)
  • Otherwise → execute model normally
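The dispatch rule above reduces to a simple membership check. The names below are illustrative, not the minisgl API:

```python
CAPTURED_BS = [1, 2, 4, 8, 16, 32]  # the cuda_graph_bs list from the config

def execution_path(batch_size: int) -> str:
    # Replay the pre-recorded CUDA graph when this batch size was captured;
    # otherwise fall back to normal eager execution.
    return "graph-replay" if batch_size in CAPTURED_BS else "eager"
```

Replaying a captured graph skips per-kernel launch overhead, which matters most for small decode batches where launches dominate runtime.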

Distributed Execution

The engine supports tensor parallelism for large models:
# Rank 0
config = EngineConfig(
    ...,
    tp_info=DistributedInfo(rank=0, size=4),
    distributed_addr="tcp://localhost:12345"
)

# Rank 1, 2, 3 similar with different ranks
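The per-rank setup can be sketched as a small loop: every rank shares the same configuration except its rank id, and all ranks rendezvous at the same address. Plain dicts are used here for illustration; a real launcher would build EngineConfig objects and spawn one process per rank.

```python
TP_SIZE = 4
ADDR = "tcp://localhost:12345"

def rank_config(rank: int) -> dict:
    # Identical across ranks except for the rank id itself.
    return {
        "model_path": "meta-llama/Llama-3.2-1B-Instruct",
        "tp_rank": rank,
        "tp_size": TP_SIZE,
        "distributed_addr": ADDR,  # shared rendezvous point for all ranks
    }

configs = [rank_config(r) for r in range(TP_SIZE)]
```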

Notes

  • Initialize the engine before any other code in the process initializes CUDA
  • Only one engine instance should exist per process
  • The engine is not thread-safe; use separate processes for parallelism
  • Call shutdown() to properly clean up resources before process exit
  • The dummy request (table index = max_running_req) is used for CUDA graph capture
