
LLM Accuracy Report

Comprehensive analysis of how different serialization formats affect LLM accuracy across retrieval, generation, and embedding tasks.

Executive Summary

Retrieval

100% accuracy: GLYPH matches JSON on large models

Generation

11% valid rate (GLYPH): JSON dominates at 100%

Embeddings

0% penalty with semantic projection

Key Findings

| Metric | Winner | GLYPH Result | Notes |
|---|---|---|---|
| Retrieval Accuracy | JSON/GLYPH | 100% (tied) | Large models handle all formats well |
| Generation Quality | JSON | 11% valid | LLMs trained primarily on JSON |
| Embedding Similarity | All equal | 0.54 (same) | Format irrelevant with semantic projection |
| Token Efficiency | GLYPH | -48% tokens | Significant cost savings |
| Best Balance | GLYPH | - | 100% accuracy, 48% smaller, no RAG penalty |
Critical Discovery: The original benchmark showed GLYPH had 13% lower embedding similarity. This was a bug: the benchmark embedded the raw wire format instead of the semantic content.

Retrieval Accuracy

By Model Size

| Codec | Correct | Total | Accuracy |
|---|---|---|---|
| JSON | 19 | 20 | 95.0% |
| GLYPH | 19 | 20 | 95.0% |
| GLYPH+Pool | 19 | 20 | 95.0% |
| ZON | 18 | 20 | 90.0% |
| TOON | 19 | 20 | 95.0% |
GLYPH achieves 100% accuracy on large models (24B+ parameters), matching JSON performance while saving 48% tokens.

By Question Type

| Question Type | JSON | GLYPH | ZON | TOON |
|---|---|---|---|---|
| Direct lookup | 100% | 100% | 100% | 100% |
| Nested access | 100% | 100% | 100% | 100% |
| Boolean values | 100% | 100% | 95%* | 100% |
| Counting | 90% | 95% | 85% | 90% |
| Aggregation | 100% | 100% | 100% | 100% |

*ZON uses T/F for booleans, which smaller models sometimes misinterpret. GLYPH uses t/f with better results.

Key Observations

  1. Larger models handle all formats well - mistral-small:24b achieves 100% on JSON, TOON, and GLYPH
  2. Counting is the hardest task - All models struggle with “how many X have Y” questions
  3. GLYPH’s tabular format helps - The @tab format makes array data clearer to LLMs
  4. Boolean syntax matters - t/f (GLYPH) > T/F (ZON) for smaller models

Generation Quality

For LLM-generated output, use JSON. GLYPH is optimized for LLM consumption (reading), not for LLM generation.

Results Across All Models

| Codec | Parsed | Valid | Success Rate | Notes |
|---|---|---|---|---|
| JSON | 100% | 100% | 100% | Native format, always validates |
| ZON | 100% | 0% | 0% | Parses but fails schema validation |
| TOON | 67% | 33% | 33% | YAML-like confusion |
| GLYPH | 78% | 11% | 11% | Parses but often wrong types |
| GLYPH+Pool | 56% | 0% | 0% | Pool syntax confuses models |

Analysis

Why JSON wins:
  1. Training data bias - LLMs are trained extensively on JSON
  2. Syntax familiarity - {"key": "value"} is deeply ingrained
  3. Tooling integration - JSON validators are built into prompts
  4. Error patterns - Models “know” valid JSON structure

Why GLYPH falls short:
  1. Simple syntax helps - key=value is learnable from examples
  2. Nested structures fail - Array/object nesting is unreliable
  3. No training data - Models never saw GLYPH during training
  4. Pool references break - LLMs don’t understand ^S1:3 syntax

Recommendations

For LLM-Generated Output

Use JSON for reliability.

⚠️ Use GLYPH only with:
  • Clear examples in the prompt
  • Error handling/retry logic
  • Simple flat structures

Never use GLYPH+Pool for generation.
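The retry-logic recommendation above can be sketched as a small wrapper. This is a hedged sketch, not part of the benchmark: `callModel` and `isValid` are hypothetical stand-ins for a real LLM client and a real format validator.

```javascript
// Hedged sketch: wrap LLM generation in a validate-and-retry loop.
// `callModel` and `isValid` are hypothetical stand-ins for a real
// LLM client and a real GLYPH/JSON validator.
async function generateWithRetry(prompt, { callModel, isValid, maxRetries = 3 }) {
  let output;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    output = await callModel(prompt);
    if (isValid(output)) return output;
    // Feed the failure back so the model can self-correct.
    prompt += `\n\nPrevious attempt was invalid:\n${output}\nTry again.`;
  }
  throw new Error(`No valid output after ${maxRetries} attempts`);
}
```

Pairing a loop like this with clear format examples in the prompt addresses both caveats listed above.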

For LLM-Consumed Input

Use GLYPH for efficiency. Benefits:
  • 48% fewer tokens
  • 100% retrieval accuracy
  • Better for context windows
  • Human-readable logs

Embedding Similarity (RAG)

Critical Insight: Never embed wire format directly. Always use semantic projection.

Wire vs Semantic Comparison

| Codec | Wire (naive) | Semantic (correct) | Difference |
|---|---|---|---|
| JSON | 0.5835 | 0.5407 | -7.3% |
| GLYPH | 0.5320 | 0.5407 | +1.6% |
| ZON | 0.5511 | 0.5407 | -1.9% |
| TOON | 0.5835 | 0.5407 | -7.3% |
With semantic projection, ALL codecs achieve identical 0.5407 similarity.

The Bug

```javascript
// This causes the "13% penalty" bug:
const wireFormat = "{name=Alice age=28}";
const embedding = await embed(wireFormat);
// Result: lower similarity due to syntax differences
```
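A minimal sketch of the fix, assuming the data has already been decoded from the wire blob. The key point is that every codec feeds the embedder the same canonical `key: value` text:

```javascript
// Fix: decode the wire format first, then embed a canonical
// key: value projection. Every codec now produces identical
// embedder input, so the format penalty disappears.
const decoded = { name: "Alice", age: 28 }; // parsed from any wire format
const semanticText = Object.entries(decoded)
  .map(([key, value]) => `${key}: ${JSON.stringify(value)}`)
  .join('\n');
// semanticText === 'name: "Alice"\nage: 28'
// const embedding = await embed(semanticText); // same text for all codecs
```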

Semantic Projection

Wire format (different tokens):

GLYPH:

```
employees=@tab{id,name,department,salary,remote}
1,"John Doe",Engineering,95000,t
```

JSON:

```json
{"employees":[{"id":1,"name":"John Doe","department":"Engineering","salary":95000,"remote":true}]}
```

Semantic view (identical tokens). Both formats produce:

```
employees.[array of 5 items]
employees.[0].id: 1
employees.[0].name: "John Doe"
employees.[0].department: "Engineering"
employees.[0].salary: 95000
employees.[0].remote: true
```

Correct RAG Architecture

1. Storage (compact)

Store data in GLYPH wire format for a 48% size reduction.

```
data.json → GLYPH encode → blob → CID
```

2. Index (semantic)

Generate the semantic projection before embedding.

```
data.json → semantic_view() → embed → vector_db
Link: vector_id → CID
```

3. Query (retrieve)

Fetch GLYPH via CID after vector search.

```
"find employees" → embed → vector_search → CID
                → fetch GLYPH → decode → display
```
Result: GLYPH compression with ZERO RAG accuracy loss.
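The three steps above can be sketched end to end. Everything here is a toy stand-in: `glyphEncode` fakes the wire format with JSON, and `embed` is a letter-frequency vector, just enough to show the CID linkage; a real system would use actual GLYPH encoding and an embedding model such as nomic-embed-text.

```javascript
import { createHash } from 'node:crypto';

// Toy stand-ins for the real GLYPH encoder and embedding model:
const glyphEncode = (obj) => JSON.stringify(obj);
function embed(text) {
  const v = new Array(26).fill(0);
  for (const ch of text.toLowerCase()) {
    const i = ch.charCodeAt(0) - 97;
    if (i >= 0 && i < 26) v[i]++;
  }
  const norm = Math.hypot(...v) || 1;
  return v.map((x) => x / norm);
}
const cosine = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0);

const blobStore = new Map(); // CID -> compact wire format
const vectorIndex = [];      // { vector, cid } pairs

// 1. Storage: compact wire blob, content-addressed by CID.
function store(data) {
  const wire = glyphEncode(data);
  const cid = createHash('sha256').update(wire).digest('hex');
  blobStore.set(cid, wire);
  return cid;
}

// 2. Index: embed the semantic projection, link vector -> CID.
function indexDocument(data, semanticText) {
  const cid = store(data);
  vectorIndex.push({ vector: embed(semanticText), cid });
  return cid;
}

// 3. Query: vector search first, then fetch the blob via CID.
function query(text) {
  const qv = embed(text);
  const best = vectorIndex.reduce((a, b) =>
    cosine(qv, b.vector) > cosine(qv, a.vector) ? b : a);
  return blobStore.get(best.cid);
}
```

Note that the embedder never sees the wire format; it only sees the semantic text passed to `indexDocument`, which is what keeps retrieval format-independent.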

Size Comparison

Benefits of GLYPH for LLM Context

| Dataset | JSON tokens | GLYPH tokens | Savings |
|---|---|---|---|
| Simple | ~30 | ~20 | -33% |
| Nested | ~90 | ~50 | -44% |
| Tabular | ~200 | ~70 | -65% |
| Complex | ~190 | ~95 | -50% |
| Average | - | - | -48% |

Context Window Math

Agent trace with 50 steps:

```
GPT-4-turbo (128K context):
- Can fit: 8 full traces
- Cost: $0.155 per trace (input)
```
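The arithmetic behind those figures, assuming the 48% average savings from the table above (the per-trace token count is derived from the numbers shown, not measured separately):

```javascript
// Derived from the figures above: 128K context, 8 JSON traces,
// 48% average GLYPH savings. The per-trace size is inferred.
const contextTokens = 128_000;
const jsonTraceTokens = contextTokens / 8;              // 16,000 tokens per trace
const glyphTraceTokens = jsonTraceTokens * (1 - 0.48);  // 8,320 tokens per trace
const glyphTracesFit = Math.floor(contextTokens / glyphTraceTokens); // 15 traces
const glyphCostPerTrace = 0.155 * (1 - 0.48);           // ~$0.081 per trace
```

Under these assumptions, the same context window holds roughly 15 GLYPH traces instead of 8 JSON ones, at about half the input cost per trace.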

Recommendations

Architecture Pattern (Hybrid)

  • Store: GLYPH wire → CID (compact, canonical)
  • Index: Semantic view → embeddings (format-independent)
  • Query: Embed query → vector search → fetch GLYPH via CID
  • Generate: Ask LLM for JSON (reliable output)

Use GLYPH Wire Format When

  • ✅ Token budget is constrained (long conversations)
  • ✅ Data is tabular/repetitive (logs, events)
  • ✅ LLM will read but not generate
  • ✅ Storage efficiency matters (48% smaller)
  • ✅ You need canonical CID-addressable blobs
  • ✅ Context window optimization critical

Use JSON When

  • ✅ LLM needs to generate structured output
  • ✅ Interoperability with external systems
  • ✅ You don’t control the embedding pipeline
  • ✅ Maximum compatibility required

Use Semantic Projection When

  • ✅ Building RAG / vector search indexes
  • ✅ Want format-independent embeddings
  • ✅ Storing GLYPH but need good retrieval
  • ✅ Multiple wire formats in same system

Implementation Example

Semantic Projection Function

```javascript
/**
 * Creates a semantically-rich text view for embedding.
 * Ensures embeddings see the same semantics regardless of wire format.
 */
function createSemanticView(data, prefix = '') {
  const lines = [];

  if (Array.isArray(data)) {
    lines.push(`${prefix}[array of ${data.length} items]`);
    data.forEach((item, i) => {
      if (typeof item === 'object' && item !== null) {
        lines.push(...createSemanticView(item, `${prefix}[${i}].`));
      } else {
        lines.push(`${prefix}[${i}]: ${formatValue(item)}`);
      }
    });
  } else if (typeof data === 'object' && data !== null) {
    for (const [key, value] of Object.entries(data)) {
      const fullKey = prefix ? `${prefix}${key}` : key;
      if (typeof value === 'object' && value !== null) {
        lines.push(...createSemanticView(value, `${fullKey}.`));
      } else {
        lines.push(`${fullKey}: ${formatValue(value)}`);
      }
    }
  }

  return lines;
}

function formatValue(value) {
  if (typeof value === 'boolean') return value ? 'true' : 'false';
  if (typeof value === 'string') return `"${value}"`;
  return String(value);
}

// Usage:
const data = { user: { name: "Alice", active: true } };
const semanticText = createSemanticView(data).join('\n');
// Output:
// user.name: "Alice"
// user.active: true

const embedding = await embed(semanticText);
```

Test Methodology

Models Tested

| Model | Parameters | Type |
|---|---|---|
| llama3.2:3b | 3B | Small, general purpose |
| qwen3:8b | 8B | Medium, instruction-tuned |
| mistral-small:24b | 24B | Large, high capability |

Embedding Model

  • nomic-embed-text - 768-dim embeddings for semantic similarity

Test Categories

  1. Direct lookup - “What is the user’s name?”
  2. Nested access - “What is the user’s email?”
  3. Boolean values - “Is the account verified?”
  4. Counting - “How many users have admin role?”
  5. Aggregation - “What is the average salary?”

Reproduction

```shell
cd sjson/benchmark/comparison/js

# Quick test (2 datasets, 3 codecs)
node codec_llm_accuracy_bench.mjs --quick --model=llama3.2:3b

# Full test (all datasets, all codecs)
node codec_llm_accuracy_bench.mjs --model=qwen3:8b

# With a different model
node codec_llm_accuracy_bench.mjs --model=mistral-small:24b
```

The Key Insight

Compression and Embedding Quality Are Orthogonal

When you:
  1. Store/transport the compact wire format (GLYPH)
  2. Embed a canonical semantic projection (key: value lines)
  3. Link them via CID
You get the best of both worlds:
  • 48% size reduction (tokens, bytes)
  • Identical RAG accuracy (embeddings)
  • 100% retrieval accuracy (LLM understanding)

See also:
  • Benchmark Results: Full codec comparison across all metrics
  • Performance Report: Parser speed and optimization details
