The Engine class provides efficient inference for NanoChat models using KV caching, and supports advanced features such as tool use and multi-sample generation.
Overview
The engine is designed for maximum efficiency:
- KV Cache: Stores key-value pairs from previous tokens to avoid recomputation
- Streaming Generation: Yields tokens one at a time for real-time output
- Batch Generation: Generate multiple samples in parallel
- Tool Use: Built-in calculator tool with automatic result injection
Basic Usage
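Since nanochat itself is not vendored in this page, the snippet below uses a stub class with the documented `generate_batch` interface purely to illustrate the call shape; the real `Engine` is constructed from a model and tokenizer in `nanochat/engine.py`, and the stub's return values are placeholders.

```python
# Illustrative only: a stub standing in for nanochat's Engine so the
# documented call shape can be shown without real model weights.
class StubEngine:
    def generate_batch(self, tokens, num_samples=1, max_tokens=None,
                       temperature=1.0, top_k=None, seed=42):
        # Mimic the documented return shape: (results, masks),
        # one token list and one mask list per sample.
        n = max_tokens or 4
        results = [list(tokens) + [0] * n for _ in range(num_samples)]
        masks = [[1] * (len(tokens) + n) for _ in range(num_samples)]
        return results, masks

engine = StubEngine()
prompt_tokens = [1, 5, 9]  # would come from the tokenizer in real use
results, masks = engine.generate_batch(prompt_tokens, num_samples=2, max_tokens=3)
print(len(results), len(results[0]))  # 2 samples, prompt + 3 generated tokens
```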
Generation Methods
Streaming Generation
generate(tokens, num_samples=1, max_tokens=None, temperature=1.0, top_k=None, seed=42)
Streaming generator that yields tokens one at a time.
Parameters:
- tokens (list[int]): Input token sequence
- num_samples (int): Number of parallel samples to generate (default: 1)
- max_tokens (int): Maximum tokens to generate (default: None = unlimited)
- temperature (float): Sampling temperature, 0.0 = greedy (default: 1.0)
- top_k (int): Top-k sampling parameter (default: None)
- seed (int): Random seed (default: 42)
Yields:
- token_column (list[int]): Next token for each sample (length = num_samples)
- token_masks (list[int]): 1 if sampled, 0 if forced by tool (length = num_samples)
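The yield contract above can be sketched with a toy generator; the dummy sampler below is a stand-in for the model forward pass, not nanochat code.

```python
import random

def toy_generate(tokens, num_samples=1, max_tokens=8, seed=42):
    # Mimics the documented streaming contract: each step yields
    # (token_column, token_masks), one entry per parallel sample.
    rng = random.Random(seed)
    for _ in range(max_tokens):
        token_column = [rng.randrange(100) for _ in range(num_samples)]
        token_masks = [1] * num_samples  # 1 = sampled (0 would mean tool-forced)
        yield token_column, token_masks

# Consume the stream one column at a time, as a chat UI would.
columns = []
for token_column, token_masks in toy_generate([1, 2, 3], num_samples=2, max_tokens=4):
    assert len(token_column) == 2 and len(token_masks) == 2
    columns.append(token_column)
print(len(columns))  # one column per generated step
```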
Batch Generation
generate_batch(tokens, num_samples=1, **kwargs)
Non-streaming batch generation that returns complete token sequences.
Returns:
- results (list[list[int]]): Token sequences for each sample
- masks (list[list[int]]): Mask sequences (1=sampled, 0=forced)
KV Cache
The KV cache stores key-value pairs from attention layers to avoid recomputing them for previous tokens.
Architecture
From nanochat/engine.py:83-133:
Key Methods
- reset(): Reset cache to empty state
- get_pos(): Get current position (assumes all batch elements at same position)
- get_layer_cache(layer_idx): Return (k_cache, v_cache) views for a specific layer
- advance(num_tokens): Advance the cache position by num_tokens
- prefill(other): Copy cached KV from another cache (used for multi-sample generation)
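A minimal, torch-free sketch of this interface is shown below. The real cache holds attention tensors; plain Python lists are used here only to make the method contracts concrete, and the class name is illustrative.

```python
# Toy sketch of the KV-cache interface described above (not nanochat's code).
class ToyKVCache:
    def __init__(self, num_layers):
        self.num_layers = num_layers
        self.reset()

    def reset(self):
        # Empty per-layer key/value stores and rewind the position.
        self.k = [[] for _ in range(self.num_layers)]
        self.v = [[] for _ in range(self.num_layers)]
        self.pos = 0

    def get_pos(self):
        return self.pos  # all batch rows are assumed to share one position

    def get_layer_cache(self, layer_idx):
        return self.k[layer_idx], self.v[layer_idx]

    def advance(self, num_tokens):
        self.pos += num_tokens

    def prefill(self, other):
        # Copy another cache's contents (used to fan out to multiple samples).
        self.k = [list(layer) for layer in other.k]
        self.v = [list(layer) for layer in other.v]
        self.pos = other.pos

cache = ToyKVCache(num_layers=2)
cache.k[0].extend(["k0", "k1"]); cache.v[0].extend(["v0", "v1"])
cache.advance(2)
clone = ToyKVCache(num_layers=2)
clone.prefill(cache)
print(clone.get_pos(), clone.get_layer_cache(0)[0])
```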
Prefill-then-Decode Pattern
The engine uses an efficient two-phase approach:
- Prefill: Process the entire prompt in batch=1
- Decode: Clone the KV cache for each sample and generate in parallel
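The two phases above can be sketched as follows; a list of prompt tokens stands in for the shared KV cache, and the dummy per-step token is a placeholder for the model's sampled output.

```python
# Toy sketch of the prefill-then-decode pattern (not nanochat's code).
def prefill_then_decode(prompt, num_samples, max_tokens):
    # Phase 1 ("prefill"): process the whole prompt once, at batch size 1.
    shared_state = list(prompt)              # stands in for the prompt's KV cache
    # Phase 2 ("decode"): clone the state per sample, then step in parallel.
    samples = [list(shared_state) for _ in range(num_samples)]
    for step in range(max_tokens):
        for row in samples:
            row.append(1000 + step)          # dummy "next token" for each sample
    return samples

out = prefill_then_decode([1, 2, 3], num_samples=3, max_tokens=2)
print(len(out), out[0])
```

The point of the pattern is that the prompt is processed only once, no matter how many samples are requested; only the per-step decoding work scales with `num_samples`.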
nanochat/engine.py:194-218:
Token Sampling
The engine uses a custom sampling function that supports temperature and top-k sampling. From nanochat/engine.py:135-152:
- temperature=0.0: Greedy decoding (always pick the most likely token)
- temperature=1.0: Standard sampling from the full distribution
- temperature>1.0: More random/creative (flattens the distribution)
- temperature<1.0: More focused/deterministic (sharpens the distribution)
- top_k: Only sample from the top-k most likely tokens
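These semantics can be implemented in a few lines of plain Python; the function below is a self-contained sketch of temperature plus top-k sampling over a logit list, not the tensor-based version in the engine.

```python
import math, random

def sample_token(logits, temperature=1.0, top_k=None, rng=random):
    # Greedy decoding when temperature is 0: just take the argmax.
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # Optionally restrict to the top-k highest logits.
    idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k is not None:
        idx = idx[:top_k]
    # Temperature-scaled softmax over the kept logits (max-subtracted for stability).
    scaled = [logits[i] / temperature for i in idx]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    # Sample an index proportionally to the softmax weights.
    r, acc = rng.random() * sum(exps), 0.0
    for i, e in zip(idx, exps):
        acc += e
        if r <= acc:
            return i
    return idx[-1]

logits = [0.1, 2.0, -1.0, 0.5]
print(sample_token(logits, temperature=0.0))                 # greedy -> index 1
print(sample_token(logits, temperature=1.0, top_k=2) in (1, 3))
```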
Tool Use: Calculator
The engine includes built-in support for a calculator tool. When the model generates special tokens, the engine automatically evaluates expressions and injects results.
How It Works
1. Model generates the <|python_start|> token
2. Engine enters "python block" mode and accumulates tokens
3. Model generates the <|python_end|> token
4. Engine evaluates the expression using use_calculator()
5. If successful, the engine forces <|output_start|> + result + <|output_end|> tokens
6. Model continues generation with the tool result in context
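The steps above can be sketched as a small state machine. Token strings stand in for token ids, and `stub_calculator` is a stand-in for the engine's sandboxed `use_calculator()` helper; none of this is nanochat's actual code.

```python
# Toy sketch of the tool-use loop described above.
PY_START, PY_END = "<|python_start|>", "<|python_end|>"
OUT_START, OUT_END = "<|output_start|>", "<|output_end|>"

def stub_calculator(expr):
    # Stand-in for the engine's sandboxed use_calculator() helper.
    return str(eval(expr, {"__builtins__": {}}))

def run_with_tool(model_tokens):
    history, expr, in_block = [], [], False
    for tok in model_tokens:
        history.append(tok)
        if tok == PY_START:
            in_block, expr = True, []       # enter "python block" mode
        elif tok == PY_END and in_block:
            in_block = False
            result = stub_calculator("".join(expr))
            # Force the result tokens into the stream (their mask would be 0).
            history += [OUT_START, result, OUT_END]
        elif in_block:
            expr.append(tok)                # accumulate the expression
    return history

stream = ["The", "answer:", PY_START, "2", "+", "2", PY_END, "done"]
print(run_with_tool(stream))
```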
From nanochat/engine.py:251-267:
Supported Expressions
The calculator supports:
- Math expressions: 2+2, 3.14*10, 100/5
- String operations: "hello".count("l"), "world".count("o")
Safety restrictions:
- Timeout after 3 seconds
- No access to builtins or dangerous operations
- Disallows the power operator **
- Sanitizes input to prevent code injection
From nanochat/engine.py:47-80:
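A simplified sketch of these guards is shown below; the checks are illustrative rather than a copy of the referenced code, and the 3-second timeout mentioned above is omitted here for portability.

```python
# Simplified sketch of the calculator guards (not nanochat's implementation).
def safe_eval(expr):
    if "**" in expr:
        raise ValueError("power operator is disallowed")
    if "__" in expr or "import" in expr:
        raise ValueError("suspicious input")
    # Empty builtins: no open(), no __import__(), etc.
    return eval(expr, {"__builtins__": {}}, {})

print(safe_eval("2+2"))                    # math expression
print(safe_eval('"hello".count("l")'))     # string operation
try:
    safe_eval("2**10")
except ValueError as e:
    print("rejected:", e)
```

Note that evaluating with empty `__builtins__` blocks calls like `open()` but still allows method calls on literals, which is exactly what the string-operation examples rely on.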
Row State Tracking
When generating multiple samples in parallel, the engine maintains per-row state to track tool use independently. From nanochat/engine.py:155-162:
- current_tokens: Full token history
- forced_tokens: Queue of tokens to inject (from tool results)
- in_python_block: Whether currently inside <|python_start|>…<|python_end|>
- python_expr_tokens: Accumulated expression tokens
- completed: Whether generation has ended for this sample
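The per-row bookkeeping above can be sketched as a small dataclass; the field names follow this page's list, but the class itself is an illustration, not the engine's definition.

```python
from dataclasses import dataclass, field
from collections import deque

# Sketch of per-row state for parallel sampling (field names follow the docs).
@dataclass
class RowState:
    current_tokens: list = field(default_factory=list)   # full token history
    forced_tokens: deque = field(default_factory=deque)  # queued tool-result tokens
    in_python_block: bool = False                        # inside <|python_start|>…?
    python_expr_tokens: list = field(default_factory=list)
    completed: bool = False                              # row finished generating?

rows = [RowState() for _ in range(3)]  # one independent state per sample
rows[0].forced_tokens.extend([11, 12])  # e.g. a tool result queued for row 0
print(rows[0].forced_tokens.popleft(), len(rows[1].forced_tokens))
```

Keeping one `RowState` per sample is what lets one row sit inside a python block while its siblings keep sampling normally.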
Performance Testing
The engine includes a built-in test to verify correctness and benchmark performance. From nanochat/engine.py:302-357:
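The shape of such a benchmark can be sketched as below; `dummy_decode_step` is a placeholder for one forward pass, and the tokens/sec arithmetic is the part that mirrors what a throughput test reports.

```python
import time

# Minimal sketch of a tokens/sec benchmark (placeholder work, not a real model).
def dummy_decode_step():
    return sum(i * i for i in range(1000))  # stand-in for one decode forward pass

num_tokens = 200
t0 = time.perf_counter()
for _ in range(num_tokens):
    dummy_decode_step()
elapsed = time.perf_counter() - t0
tps = num_tokens / elapsed
print(f"{num_tokens} tokens in {elapsed:.3f}s -> {tps:.0f} tok/s")
```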