LLM Accuracy Report
Comprehensive analysis of how different serialization formats affect LLM accuracy across retrieval, generation, and embedding tasks.
Executive Summary
- Retrieval: 100% accuracy - GLYPH matches JSON on large models
- Generation: 11% valid rate - JSON dominates at 100%
- Embeddings: 0% penalty with semantic projection
Key Findings
| Metric | Winner | GLYPH Result | Notes |
|---|---|---|---|
| Retrieval Accuracy | JSON/GLYPH | 100% (tied) | Large models handle all formats well |
| Generation Quality | JSON | 11% valid | LLMs trained primarily on JSON |
| Embedding Similarity | ALL EQUAL | 0.54 (same) | Format irrelevant with semantic projection |
| Token Efficiency | GLYPH | -48% tokens | Significant cost savings |
| Best Balance | GLYPH | - | 100% accuracy, 48% smaller, no RAG penalty |
Retrieval Accuracy
By Model Size
- llama3.2:3b
- qwen3:8b
- mistral-small:24b
| Codec | Correct | Total | Accuracy |
|---|---|---|---|
| JSON | 19 | 20 | 95.0% |
| GLYPH | 19 | 20 | 95.0% |
| GLYPH+Pool | 19 | 20 | 95.0% |
| ZON | 18 | 20 | 90.0% |
| TOON | 19 | 20 | 95.0% |
On large models (24B+ parameters), GLYPH reaches 100% accuracy, matching JSON while using 48% fewer tokens.
By Question Type
| Question Type | JSON | GLYPH | ZON | TOON |
|---|---|---|---|---|
| Direct lookup | 100% | 100% | 100% | 100% |
| Nested access | 100% | 100% | 100% | 100% |
| Boolean values | 100% | 100% | 95%* | 100% |
| Counting | 90% | 95% | 85% | 90% |
| Aggregation | 100% | 100% | 100% | 100% |
*ZON uses `T/F` for booleans, which smaller models sometimes misinterpret. GLYPH uses `t/f` with better results.
Key Observations
- Larger models handle all formats well - mistral-small:24b achieves 100% on JSON, TOON, and GLYPH
- Counting is the hardest task - All models struggle with “how many X have Y” questions
- GLYPH's tabular format helps - The `@tab` format makes array data clearer to LLMs
- Boolean syntax matters - `t/f` (GLYPH) outperforms `T/F` (ZON) for smaller models
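For concreteness, the same two fields in each syntax (these spellings are inferred from this report's descriptions and are illustrative, not normative):

```text
JSON:  {"verified": true, "admin": false}
GLYPH: verified=t admin=f
ZON:   verified=T admin=F
```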
Generation Quality
Results Across All Models
| Codec | Parsed | Valid | Success Rate | Notes |
|---|---|---|---|---|
| JSON | 100% | 100% | 100% | Native format, always validates |
| ZON | 100% | 0% | 0% | Parses but fails schema validation |
| TOON | 67% | 33% | 33% | YAML-like confusion |
| GLYPH | 78% | 11% | 11% | Parses but often wrong types |
| GLYPH+Pool | 56% | 0% | 0% | Pool syntax confuses models |
Analysis
Why JSON dominates generation
- Training data bias - LLMs trained extensively on JSON
- Syntax familiarity - `{"key": "value"}` is deeply ingrained
- Tooling integration - JSON validators built into prompts
- Error patterns - Models “know” valid JSON structure
Why GLYPH struggles (but still works sometimes)
- Simple syntax helps - `key=value` is learnable from examples
- Nested structures fail - Array/object nesting is unreliable
- No training data - Models never saw GLYPH during training
- Pool references break - LLMs don't understand `^S1:3` syntax
Recommendations
For LLM-Generated Output
✅ Use JSON for reliability.
⚠️ Use GLYPH only with:
- Clear examples in prompt
- Error handling/retry logic
- Simple flat structures
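The error-handling/retry pattern above can be sketched generically. Here `generate` is a hypothetical stand-in for the LLM call, and plain `json.loads` serves as the validator for illustration (a GLYPH parser would slot in the same way):

```python
import json

def generate_with_retry(generate, parse, max_attempts=3):
    """Call an LLM, validate its output, and retry on failure.

    generate: callable(attempt) -> str   (the LLM call; hypothetical)
    parse:    callable(str) -> object    (raises on invalid output)
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        text = generate(attempt)
        try:
            return parse(text)  # e.g. json.loads, or a GLYPH parser
        except Exception as err:
            last_error = err    # a real system would feed this back into the prompt
    raise ValueError(f"no valid output after {max_attempts} attempts") from last_error

# Illustration with a flaky "model" that succeeds on the second try:
outputs = iter(['{"broken', '{"name": "ada"}'])
result = generate_with_retry(lambda _: next(outputs), json.loads)
```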
For LLM-Consumed Input
✅ Use GLYPH for efficiency. Benefits:
- 48% fewer tokens
- 100% retrieval accuracy
- Better for context windows
- Human-readable logs
Embedding Similarity (RAG)
Critical Insight: Never embed wire format directly. Always use semantic projection.
Wire vs Semantic Comparison
| Codec | Wire (naive) | Semantic (correct) | Difference |
|---|---|---|---|
| JSON | 0.5835 | 0.5407 | -7.3% |
| GLYPH | 0.5320 | 0.5407 | +1.6% |
| ZON | 0.5511 | 0.5407 | -1.9% |
| TOON | 0.5835 | 0.5407 | -7.3% |
With semantic projection, ALL codecs achieve identical 0.5407 similarity.
The Bug
Embedding the wire format directly penalizes compact codecs: GLYPH's raw tokens score 0.5320 versus JSON's 0.5835 on the same data, even though the underlying meaning is identical.
Semantic Projection
Wire formats tokenize differently (GLYPH vs JSON), but both decode to the same data, so both produce the same semantic projection and therefore identical embeddings (0.5407).
Correct RAG Architecture
Result: GLYPH compression with ZERO RAG accuracy loss.
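A minimal sketch of the store/index split described above, with `semantic_view` and the in-memory dicts standing in for a real embedder and vector store (all names here are illustrative, not the project's actual API):

```python
import hashlib

def cid(blob: bytes) -> str:
    """Content ID: address the canonical GLYPH wire bytes by hash."""
    return hashlib.sha256(blob).hexdigest()

def semantic_view(record: dict) -> str:
    """Format-independent projection: sorted `key: value` lines."""
    return "\n".join(f"{k}: {record[k]}" for k in sorted(record))

# Store: compact wire bytes keyed by CID (GLYPH encoding assumed upstream).
blob_store = {}
# Index: semantic text keyed by the same CID (a real system embeds this text).
semantic_index = {}

def ingest(wire: bytes, record: dict) -> str:
    key = cid(wire)
    blob_store[key] = wire                        # what the LLM reads (compact)
    semantic_index[key] = semantic_view(record)   # what gets embedded
    return key

key = ingest(b"name=ada verified=t", {"name": "ada", "verified": True})
```

A query then embeds the user's question, searches the semantic index, and fetches the compact GLYPH blob via its CID.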
Size Comparison
Benefits of GLYPH for LLM Context
| Dataset | JSON tokens | GLYPH tokens | Savings |
|---|---|---|---|
| Simple | ~30 | ~20 | -33% |
| Nested | ~90 | ~50 | -44% |
| Tabular | ~200 | ~70 | -65% |
| Complex | ~190 | ~95 | -50% |
| Average | - | - | -48% |
Context Window Math
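As an illustrative calculation using the ~48% average savings and the "Complex" row from the table above (the 128k window size is an arbitrary assumption):

```python
context_tokens = 128_000          # example window size (assumption)
tokens_per_record_json = 190      # "Complex" row from the table above
savings = 0.48                    # average savings from the table

# How many records fit in one context window:
fits_json = context_tokens // tokens_per_record_json
fits_glyph = context_tokens // int(tokens_per_record_json * (1 - savings))

# At 48% savings, each window holds roughly 1.9x as much data:
multiplier = 1 / (1 - savings)
```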
Recommendations
Architecture Pattern (Hybrid)
- Store: GLYPH wire → CID (compact, canonical)
- Index: Semantic view → embeddings (format-independent)
- Query: Embed query → vector search → fetch GLYPH via CID
- Generate: Ask LLM for JSON (reliable output)
Use GLYPH Wire Format When
- ✅ Token budget is constrained (long conversations)
- ✅ Data is tabular/repetitive (logs, events)
- ✅ LLM will read but not generate
- ✅ Storage efficiency matters (48% smaller)
- ✅ You need canonical CID-addressable blobs
- ✅ Context window optimization critical
Use JSON When
- ✅ LLM needs to generate structured output
- ✅ Interoperability with external systems
- ✅ You don’t control the embedding pipeline
- ✅ Maximum compatibility required
Use Semantic Projection When
- ✅ Building RAG / vector search indexes
- ✅ Want format-independent embeddings
- ✅ Storing GLYPH but need good retrieval
- ✅ Multiple wire formats in same system
Implementation Example
Semantic Projection Function
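A minimal sketch of a semantic projection, consistent with the canonical `key: value` lines described in this report (the function name and flattening rules are assumptions, not the project's actual implementation):

```python
def semantic_projection(value, prefix=""):
    """Flatten nested data into canonical, sorted `key: value` lines.

    The same lines come out regardless of which wire format
    (JSON, GLYPH, ...) the data was decoded from, so embeddings
    built on this text are format-independent.
    """
    lines = []
    if isinstance(value, dict):
        for key in sorted(value):                 # sorted => canonical order
            lines.extend(semantic_projection(value[key], f"{prefix}{key}."))
    elif isinstance(value, list):
        for i, item in enumerate(value):          # index becomes a path segment
            lines.extend(semantic_projection(item, f"{prefix}{i}."))
    else:
        lines.append(f"{prefix[:-1]}: {value}")   # leaf: drop trailing dot
    return lines

doc = {"user": {"name": "ada", "verified": True}, "tags": ["a", "b"]}
text = "\n".join(semantic_projection(doc))
```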
Test Methodology
Models Tested
| Model | Parameters | Type |
|---|---|---|
| llama3.2:3b | 3B | Small, general purpose |
| qwen3:8b | 8B | Medium, instruction-tuned |
| mistral-small:24b | 24B | Large, high capability |
Embedding Model
`nomic-embed-text` - 768-dim embeddings for semantic similarity
Test Categories
- Direct lookup - “What is the user’s name?”
- Nested access - “What is the user’s email?”
- Boolean values - “Is the account verified?”
- Counting - “How many users have admin role?”
- Aggregation - “What is the average salary?”
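The categories above imply a simple scoring loop; `ask_model` is a hypothetical stand-in for the actual model call used in the benchmark:

```python
def evaluate(ask_model, questions):
    """Score a model on (question, expected_answer) pairs.

    ask_model: callable(question) -> str  (hypothetical LLM call)
    Returns accuracy in [0, 1] using case-insensitive exact match.
    """
    correct = 0
    for question, expected in questions:
        answer = ask_model(question).strip().lower()
        correct += answer == expected.strip().lower()
    return correct / len(questions)

# Illustration with a toy lookup "model":
facts = {"What is the user's name?": "Ada", "Is the account verified?": "t"}
questions = [("What is the user's name?", "ada"),
             ("Is the account verified?", "t")]
score = evaluate(lambda q: facts.get(q, ""), questions)
```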
Reproduction
The Key Insight
Compression and Embedding Quality Are Orthogonal
When you:
- Store/transport the compact wire format (GLYPH)
- Embed a canonical semantic projection (key: value lines)
- Link them via CID
You get:
- 48% size reduction (tokens, bytes)
- Identical RAG accuracy (embeddings)
- 100% retrieval accuracy (LLM understanding)
Related Documentation
- Benchmark Results - Full codec comparison across all metrics
- Performance Report - Parser speed and optimization details