Why TOON?
When working with large language models, token efficiency directly impacts:

- Cost: Fewer tokens = lower API costs
- Context window: More data fits within model limits
- Processing speed: Smaller payloads parse faster
Format structure
TOON v2.0 is currently used in production. The format supports metadata blocks, schema definitions, sample rows, and arrays of structured objects.
- Headers: Define field names in curly braces: {field1,field2,field3}
- Optional array size: [N] indicates array length (e.g., notebooks[5])
- Data rows: Comma-separated values matching header fields
- Blocks: Named sections for complex nested data
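Putting these rules together, a minimal block might look like this (hypothetical data, shown only to illustrate the header, size annotation, and row syntax described above):

```
notebooks[2]{id,title,public}
1,EDA baseline,true
2,Feature engineering,false
```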
Basic example
Compare this data in JSON vs TOON:

Real-world example
Here’s how KaggleIngest encodes competition data:

Data types and encoding
TOON supports all common data types with automatic type inference:

| Type | Example | Encoding |
|---|---|---|
| Null | null | null |
| Boolean | true, false | true, false |
| Integer | 42 | 42 |
| Float | 3.14 | 3.14 |
| String | Hello | Hello (unquoted if safe) |
| String (special) | true, 2024-01-01 | "true", "2024-01-01" (quoted when needed) |
| Array | [1, 2, 3] | [1, 2, 3] |
| Object | {a: 1, b: 2} | {a: 1, b: 2} |
Implementation reference
TOON encoding and decoding are handled by the ToonEncoder and ToonDecoder classes.
Encoding Python data
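As a rough illustration of the encoding rules above, here is a toy, self-contained encoder sketch. It is not the project's actual ToonEncoder API (that lives in backend/core/toon_encoder.py); the function names and quoting behavior here are assumptions based solely on the format description in this document.

```python
# Toy TOON encoder sketch; illustrative only, not the real ToonEncoder API.

def _looks_numeric(s):
    """True if the string would parse as a number."""
    try:
        float(s)
        return True
    except ValueError:
        return False

def _encode_value(v):
    """Render one cell, quoting strings that could be misread as another type."""
    if v is None:
        return "null"
    if isinstance(v, bool):
        return "true" if v else "false"
    if isinstance(v, (int, float)):
        return str(v)
    s = str(v)
    # Quote strings that clash with literals, contain commas, or look numeric.
    if s in ("true", "false", "null") or "," in s or _looks_numeric(s):
        return f'"{s}"'
    return s

def encode_block(name, records):
    """Encode a list of dicts as one TOON block with a size annotation."""
    fields = list(records[0].keys())
    header = f"{name}[{len(records)}]{{{','.join(fields)}}}"
    rows = [",".join(_encode_value(r[f]) for f in fields) for r in records]
    return "\n".join([header, *rows])

print(encode_block("notebooks", [
    {"id": 1, "title": "EDA baseline", "public": True},
    {"id": 2, "title": "null", "public": False},
]))
```

Note how the string `"null"` in the second record is quoted so a decoder will not confuse it with the null literal.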
Decoding TOON to Python
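A matching toy decoder sketch, again based only on the format rules described in this document rather than the real ToonDecoder. For brevity it splits rows naively on commas, so it does not handle quoted cells that themselves contain commas.

```python
import re

# Toy TOON decoder sketch; illustrative only, not the real ToonDecoder API.

HEADER_RE = re.compile(r"^(?P<name>\w+)(?:\[(?P<size>\d+)\])?\{(?P<fields>[^}]*)\}$")

def _decode_value(cell):
    """Infer the type of a single cell."""
    if cell.startswith('"') and cell.endswith('"'):
        return cell[1:-1]          # explicitly quoted string
    if cell == "null":
        return None
    if cell == "true":
        return True
    if cell == "false":
        return False
    try:
        return int(cell)
    except ValueError:
        pass
    try:
        return float(cell)
    except ValueError:
        return cell                # plain unquoted string

def decode_block(text):
    """Decode one TOON block into (block_name, list_of_dicts)."""
    header, *rows = text.strip().splitlines()
    m = HEADER_RE.match(header)
    if not m:
        raise ValueError(f"bad TOON header: {header!r}")
    fields = m["fields"].split(",")
    # Naive comma split: ignores commas inside quoted cells for brevity.
    records = [dict(zip(fields, map(_decode_value, row.split(","))))
               for row in rows]
    if m["size"] and int(m["size"]) != len(records):
        raise ValueError("row count does not match size annotation")
    return m["name"], records
```

The size annotation, when present, doubles as an integrity check: a row count that disagrees with `[N]` is rejected.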
Converting between formats
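A format conversion can be sketched with the standard-library `json` module feeding a minimal TOON emitter. This is a hypothetical helper, not part of the project; cell quoting is simplified to keep the example short.

```python
import json

# Hypothetical JSON-to-TOON conversion sketch with simplified cell encoding.

def _cell(v):
    if v is None:
        return "null"
    if isinstance(v, bool):
        return "true" if v else "false"
    return str(v)

def json_to_toon(name, json_text):
    """Convert a JSON array of flat objects into one TOON block."""
    records = json.loads(json_text)
    fields = list(records[0])
    lines = [f"{name}[{len(records)}]{{{','.join(fields)}}}"]
    lines += [",".join(_cell(r[f]) for f in fields) for r in records]
    return "\n".join(lines)

print(json_to_toon("users", '[{"id": 1, "name": "Ada"}, {"id": 2, "name": "Alan"}]'))
```

The header line alone replaces the repeated key names that JSON spells out in every object, which is where the token savings come from.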
Validation and error handling
The TOON encoder includes built-in validation:

- Headers follow the section{key,key} format
- Data rows match header column counts
- Structural integrity of blocks
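The first two checks can be sketched as follows. This is a simplified stand-in for the project's validator, not its actual code; the regex and function name are assumptions.

```python
import re

# Hypothetical validation sketch: header shape and per-row column counts.

HEADER_RE = re.compile(r"^\w+(\[\d+\])?\{\w+(,\w+)*\}$")

def validate_block(text):
    """Return a list of problems found in one TOON block (empty if valid)."""
    header, *rows = text.strip().splitlines()
    problems = []
    if not HEADER_RE.match(header):
        problems.append(f"malformed header: {header!r}")
        return problems
    ncols = header[header.index("{") + 1:-1].count(",") + 1
    for i, row in enumerate(rows, start=1):
        if row.count(",") + 1 != ncols:   # naive: ignores quoted commas
            problems.append(f"row {i} has wrong column count")
    return problems
```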
See backend/core/toon_encoder.py:345 for the complete validation implementation.

Advanced features
Array size annotations
TOON v2.0 includes optional array size hints for efficient memory allocation. A suffix such as [10] indicates that a block contains 10 entries, allowing parsers to pre-allocate memory.
Nested structures
TOON supports nested objects and arrays within cells:

Multi-block documents
Large datasets are organized into logical blocks:

Command-line tools
The TOON encoder includes CLI utilities for validation and conversion:

Performance characteristics
| Operation | Performance |
|---|---|
| Encoding | O(n) where n = data size |
| Decoding | O(n) with single-pass parsing |
| Token reduction | 30-60% vs JSON |
| Memory overhead | Minimal (streaming capable) |
Why not use Protocol Buffers or MessagePack?
Binary formats like Protobuf and MessagePack are more compact but not human-readable or LLM-friendly. LLMs work with text tokens, so a text-based format optimized for tokenization provides the best balance of:
- Token efficiency for LLM context windows
- Human readability for debugging
- Zero external dependencies (pure Python)
- Direct integration with text-based AI workflows
Related resources
- Implementation: backend/core/toon_encoder.py
- Usage in API responses: backend/services/notebook_service.py:501
- Format tests: backend/tests/test_toon_encoder.py