The Engine class is the core component that manages model execution, the KV cache, attention backends, and CUDA graph optimization. It handles the low-level details of GPU inference.
Constructor
Engine configuration object containing:
- Model configuration (architecture, vocab size, etc.)
- Tensor parallelism settings
- Memory and performance tuning
- Backend selection (attention, MoE)
- CUDA graph settings
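The configuration groups above might map onto a config object like the following sketch. Apart from num_page_override and cuda_graph_bs, which appear later in this document, the field names are assumptions, not the real EngineConfig API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical sketch of an engine configuration; field names other than
# num_page_override and cuda_graph_bs are illustrative assumptions.
@dataclass
class EngineConfig:
    # Model configuration
    model_path: str = ""
    vocab_size: int = 32000
    # Tensor parallelism settings
    tp_size: int = 1
    # Memory and performance tuning
    num_page_override: Optional[int] = None  # None -> size KV cache automatically
    # Backend selection
    attention_backend: str = "flashattention"  # or "flashinfer", "trtllm"
    # CUDA graph settings
    cuda_graph_bs: Tuple[int, ...] = (1, 2, 4, 8)

cfg = EngineConfig(tp_size=2)
print(cfg.attention_backend)  # → flashattention
```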
Initialization Process
The engine initialization performs several critical steps:
- Communication Setup: Initializes distributed communication (NCCL/Gloo) for tensor parallelism
- Model Loading: Loads model weights from disk or initializes dummy weights
- KV Cache Allocation: Allocates page-based KV cache in GPU memory
- Page Table Creation: Sets up page table for efficient memory management
- Backend Initialization: Configures attention backend (FlashAttention, FlashInfer, TRT-LLM) and MoE backend if needed
- Sampler Setup: Initializes token sampling logic
- CUDA Graph Capture: Pre-records frequently used batch sizes for faster execution
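The step ordering above can be sketched as stubs that record what ran; the private method names are placeholders, not the engine's real internals.

```python
# Stub sketch of the initialization order; method names are placeholders.
class Engine:
    def __init__(self, config):
        self.config = config
        self.init_log = []
        self._init_communication()   # NCCL/Gloo process groups for tensor parallelism
        self._load_model()           # weights from disk, or dummy weights
        self._allocate_kv_cache()    # paged KV pool in GPU memory
        self._create_page_table()    # logical -> physical page mapping
        self._init_backends()        # attention backend, plus MoE backend if needed
        self._init_sampler()         # token sampling logic
        self._capture_cuda_graphs()  # pre-record frequently used batch sizes

    # Each stub just records that its step ran, in order.
    def _init_communication(self): self.init_log.append("communication")
    def _load_model(self): self.init_log.append("model")
    def _allocate_kv_cache(self): self.init_log.append("kv_cache")
    def _create_page_table(self): self.init_log.append("page_table")
    def _init_backends(self): self.init_log.append("backends")
    def _init_sampler(self): self.init_log.append("sampler")
    def _capture_cuda_graphs(self): self.init_log.append("cuda_graphs")

engine = Engine(config=None)
print(engine.init_log[0], engine.init_log[-1])  # → communication cuda_graphs
```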
Key Methods
forward_batch()
Executes a forward pass for a batch of requests.
Batch object containing:
- reqs: List of request objects
- phase: Either "prefill" or "decode"
- input_ids, positions, out_loc: Prepared tensors
Sampling arguments for the batch, including temperature, top-k, top-p per request.
ForwardOutput
A named tuple containing:
- next_tokens_gpu: Sampled tokens on GPU
- next_tokens_cpu: Sampled tokens copied to CPU
- copy_done_event: CUDA event for synchronization
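The three fields come from the description above; modeling ForwardOutput as a namedtuple is an assumption, as is the synchronization pattern shown in the comment.

```python
from collections import namedtuple

# Illustrative shape of the forward-pass result; field names follow the doc.
ForwardOutput = namedtuple(
    "ForwardOutput",
    ["next_tokens_gpu", "next_tokens_cpu", "copy_done_event"],
)

out = ForwardOutput(next_tokens_gpu=None, next_tokens_cpu=[17, 923], copy_done_event=None)
# In real code, wait for the device-to-host copy before reading CPU tokens:
# out.copy_done_event.synchronize()
print(out.next_tokens_cpu)  # → [17, 923]
```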
shutdown()
Cleanly shuts down the engine and releases resources.
- Destroys CUDA graphs
- Cleans up distributed process groups
- Releases GPU memory
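The teardown steps above can be sketched as follows; the implementation is a stand-in, and the rationale in the comment is a plausible reading of the ordering, not a documented guarantee.

```python
# Stand-in sketch mirroring the shutdown bullets above.
class Engine:
    def __init__(self):
        self.released = []

    def shutdown(self):
        # Destroying graphs first plausibly avoids dangling references to
        # GPU buffers that are freed in the last step.
        self.released.append("cuda_graphs")     # destroy CUDA graphs
        self.released.append("process_groups")  # clean up distributed groups
        self.released.append("gpu_memory")      # release GPU memory

engine = Engine()
engine.shutdown()
print(engine.released)  # → ['cuda_graphs', 'process_groups', 'gpu_memory']
```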
Key Attributes
The loaded language model
KV cache pool managing paged memory
Page table mapping logical to physical KV cache locations. Shape: (max_running_req + 1, aligned_max_seq_len)
Attention backend implementation (FlashAttention, FlashInfer, or TRT-LLM)
MoE backend for mixture-of-experts models (if applicable)
Token sampling module
CUDA graph manager for optimized execution
Global context object containing shared state
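The page table shape given above can be illustrated with a toy example. A real page table is a GPU tensor, and the per-entry semantics here (one physical location per token position, -1 for unmapped) are assumptions.

```python
# Toy illustration of the page-table layout; semantics are assumed.
max_running_req = 3
aligned_max_seq_len = 8

# One row per request slot, plus one extra row for the dummy request used
# during CUDA graph capture; -1 marks an unmapped position.
page_table = [[-1] * aligned_max_seq_len for _ in range(max_running_req + 1)]

# Map the first four token positions of request slot 0 to physical KV
# locations 100..103.
for pos, phys in enumerate(range(100, 104)):
    page_table[0][pos] = phys

print(len(page_table), page_table[0][:4])  # → 4 [100, 101, 102, 103]
```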
Architecture
The engine coordinates several subsystems.
Memory Management
The engine automatically determines the number of KV cache pages based on available GPU memory; this can be overridden via num_page_override in EngineConfig.
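As a back-of-envelope sketch of this sizing: divide the memory left after weights by the bytes one page occupies. The formula and every number below are illustrative assumptions, not the engine's actual rule.

```python
# Hypothetical page-count calculation: pages = free_bytes // bytes_per_page.
def num_kv_pages(free_bytes, page_size_tokens, num_layers, num_kv_heads,
                 head_dim, dtype_bytes, num_page_override=None):
    if num_page_override is not None:
        return num_page_override  # explicit override wins
    # One page stores K and V (factor 2) for page_size_tokens tokens
    # across all layers and KV heads.
    bytes_per_page = (2 * page_size_tokens * num_layers
                      * num_kv_heads * head_dim * dtype_bytes)
    return free_bytes // bytes_per_page

pages = num_kv_pages(
    free_bytes=8 * 1024**3,   # 8 GiB left after loading weights
    page_size_tokens=16,
    num_layers=32,
    num_kv_heads=8,
    head_dim=128,
    dtype_bytes=2,            # fp16
)
print(pages)  # → 4096
```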
CUDA Graph Optimization
The engine captures CUDA graphs for frequently used batch sizes (specified in cuda_graph_bs). This reduces kernel launch overhead:
- If batch size matches a captured graph → replay graph (faster)
- Otherwise → execute model normally
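The dispatch rule above amounts to a set-membership check; this is a sketch, not the engine's actual code.

```python
# Sketch of graph dispatch: replay a captured graph when the batch size
# matches, otherwise fall back to a normal (eager) forward pass.
captured = {1, 2, 4, 8}  # stand-in for cuda_graph_bs

def run(batch_size):
    if batch_size in captured:
        return "replay_graph"   # low launch overhead
    return "eager_forward"      # normal kernel launches

print(run(4), run(5))  # → replay_graph eager_forward
```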
Distributed Execution
The engine supports tensor parallelism for large models.
Notes
- The engine must be initialized before CUDA is initialized elsewhere
- Only one engine instance should exist per process
- The engine is not thread-safe; use separate processes for parallelism
- Call shutdown() to properly clean up resources before process exit
- The dummy request (table index = max_running_req) is used for CUDA graph capture
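One way to honor the shutdown note above is a try/finally around the serving loop, so cleanup runs even if serving raises. Engine here is a stub standing in for the real class.

```python
# Lifecycle pattern: guarantee shutdown() runs on every exit path.
class Engine:
    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True  # stand-in for graph/process-group/memory cleanup

engine = Engine()
try:
    pass  # serve requests, e.g. engine.forward_batch(batch, sampling_args)
finally:
    engine.shutdown()  # always release resources before process exit

print(engine.closed)  # → True
```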