Overview

Fingerprinting is the process of computing a deterministic SHA-256 hash of a GLYPH value’s canonical form. This hash serves as a cryptographic fingerprint of the state.

What is a Fingerprint?

fingerprint = sha256( canonicalize(value) )
Properties:
  • Deterministic: Same data → same hash (across all languages)
  • Collision-resistant: Different data → different hash (with overwhelming probability)
  • Compact: 64 hex characters (256 bits)
  • Cross-language: Go, Python, JS, Rust produce identical hashes

Why Fingerprinting?

State Verification

Detect state divergence and corruption

Optimistic Concurrency

Prevent lost updates in distributed systems

Cache Keys

Stable keys for LLM response caching

Deduplication

Identify duplicate documents or messages

Computing Fingerprints

Basic Usage

import glyph

data = {"user": "alice", "count": 42}
fp = glyph.fingerprint_loose(data)
print(fp)
# sha256:a1b2c3d4e5f67890abcdef1234567890abcdef1234567890abcdef1234567890

# Short form (first 16 hex chars)
short_fp = fp[7:23]  # Skip the 'sha256:' prefix
print(short_fp)
# a1b2c3d4e5f67890

How It Works

  1. Canonicalize the value using Loose mode rules:
    • Sort map keys bytewise
    • Use deterministic float formatting
    • Apply bare-string rules
    • Normalize whitespace
  2. Hash the canonical UTF-8 bytes using SHA-256
  3. Format as sha256:<64 hex chars>
Example:
Input:  {"b":1,"a":2}
Canonical: {a=2 b=1}
SHA-256: a1b2c3d4e5f67890abcdef1234567890abcdef1234567890abcdef1234567890
Result: sha256:a1b2c3d4e5f67890abcdef1234567890abcdef1234567890abcdef1234567890
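The three steps above can be sketched in plain Python. This is an illustrative stand-in only: JSON with sorted keys substitutes for GLYPH's actual Loose canonical form (which uses its own syntax, e.g. {a=2 b=1}), so the digests below will not match real GLYPH fingerprints.

```python
import hashlib
import json

def fingerprint_sketch(value) -> str:
    """Stand-in for the pipeline: canonicalize (here: JSON, sorted keys,
    compact separators), SHA-256 the UTF-8 bytes, format as sha256:<hex>."""
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"sha256:{digest}"

# Key order does not matter: both inputs canonicalize to the same bytes
a = fingerprint_sketch({"b": 1, "a": 2})
b = fingerprint_sketch({"a": 2, "b": 1})
assert a == b
assert a.startswith("sha256:") and len(a) == 7 + 64
```

The same structure applies regardless of the canonicalizer: determinism comes entirely from step 1, since SHA-256 is deterministic over bytes.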

Use Cases

State Verification

Detect when state has changed unexpectedly.
import glyph

# Save checkpoint
state = {"count": 5, "status": "active"}
checkpoint = {
    "state": state,
    "hash": glyph.fingerprint_loose(state),
    "timestamp": now(),
}
save_to_disk(checkpoint)

# Load and verify
loaded = load_from_disk()
expected_hash = loaded["hash"]
actual_hash = glyph.fingerprint_loose(loaded["state"])

if actual_hash == expected_hash:
    print("✓ Checkpoint integrity verified")
    resume(loaded["state"])
else:
    print("✗ Checkpoint corrupted!")
    raise CorruptionError()

Optimistic Concurrency Control

Prevent lost updates when multiple agents modify shared state.
import glyph
from glyph import stream

# Agent A reads state
state = {"count": 5}
base_hash = glyph.fingerprint_loose(state)
# base_hash: sha256:abc123...

# Agent A creates update
patch = glyph.patch([("~", "count", 1)])

# Agent A sends patch with base hash
writer.write_frame(
    sid=1,
    seq=5,
    kind="patch",
    payload=patch,
    base=base_hash[:16],  # 16-char prefix of the full hash (includes the 'sha256:' tag)
)

# Server receives patch
@handler.on_patch
def handle_patch(sid, seq, payload, state, base):
    # Verify base hash
    current_hash = glyph.fingerprint_loose(state.value)
    if not current_hash.startswith(base):
        # State changed since Agent A read it
        raise BaseMismatchError("State diverged")
    
    # Safe to apply
    patch = glyph.parse_patch(payload)
    new_state = glyph.apply_patch(state.value, patch)
    return new_state
Why this prevents lost updates:
Scenario: Two agents update same state concurrently

1. Agent A reads: {count=5}  hash=abc123...
2. Agent B reads: {count=5}  hash=abc123...
3. Agent A sends: base=abc123... ✓ Applied → {count=6} hash=def456...
4. Agent B sends: base=abc123... ✗ Rejected (base != def456...)
5. Agent B retries: reads {count=6} hash=def456...
6. Agent B sends: base=def456... ✓ Applied → {count=7}
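The compare-and-swap flow in this scenario can be simulated with a minimal in-memory store (a sketch with hypothetical helper names; the JSON-based fp() stands in for GLYPH Loose canonicalization, and the real server-side check is the handle_patch hook shown above):

```python
import hashlib
import json

def fp(value) -> str:
    # JSON stand-in for GLYPH Loose canonicalization
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()

class Store:
    """In-memory state with optimistic concurrency via a base-hash check."""
    def __init__(self, state):
        self.state = state

    def apply(self, base_prefix: str, changes: dict) -> bool:
        # Reject the update if the caller's snapshot is stale
        if not fp(self.state).startswith(base_prefix):
            return False
        self.state = {**self.state, **changes}
        return True

store = Store({"count": 5})
base = fp(store.state)[:16]            # Agents A and B read the same snapshot

assert store.apply(base, {"count": 6}) is True    # Agent A: applied
assert store.apply(base, {"count": 7}) is False   # Agent B: rejected (stale base)
retry_base = fp(store.state)[:16]                 # Agent B re-reads
assert store.apply(retry_base, {"count": 7}) is True
assert store.state == {"count": 7}
```

The second write fails precisely because applying Agent A's change altered the state's fingerprint, so Agent B's recorded base no longer matches.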

Cache Keys for LLM Responses

Generate stable cache keys for LLM prompts and responses.
import glyph
import redis

redis_client = redis.Redis()

def cached_llm_call(prompt: dict, model: str) -> str:
    """Call LLM with caching."""
    
    # Create cache key from prompt fingerprint
    cache_key_data = {
        "model": model,
        "prompt": prompt,
        "version": "v2",
    }
    cache_key = glyph.fingerprint_loose(cache_key_data)
    
    # Check cache
    cached = redis_client.get(cache_key)
    if cached:
        print("Cache hit!")
        return cached.decode()
    
    # Call LLM
    print("Cache miss, calling LLM...")
    response = llm.generate(prompt)
    
    # Store in cache (24h TTL)
    redis_client.setex(cache_key, 86400, response)
    
    return response

# Usage
prompt = {
    "system": "You are a helpful assistant.",
    "user": "What is the capital of France?",
}

response = cached_llm_call(prompt, model="gpt-4")
# First call: Cache miss, calling LLM...

response = cached_llm_call(prompt, model="gpt-4")
# Second call: Cache hit!

Document Deduplication

Identify duplicate documents in a corpus.
import glyph
from collections import defaultdict

def find_duplicates(documents: list[dict]) -> dict[str, list[int]]:
    """Find duplicate documents by fingerprint."""
    
    fingerprints = defaultdict(list)
    
    for i, doc in enumerate(documents):
        fp = glyph.fingerprint_loose(doc)
        fingerprints[fp].append(i)
    
    # Return only duplicates (fingerprints with 2+ docs)
    duplicates = {fp: indices for fp, indices in fingerprints.items() if len(indices) > 1}
    
    return duplicates

# Usage
docs = [
    {"title": "Intro", "content": "Hello world"},
    {"title": "Guide", "content": "How to use"},
    {"title": "Intro", "content": "Hello world"},  # Duplicate of doc 0
    {"title": "FAQ", "content": "Questions"},
]

dupes = find_duplicates(docs)
print(f"Found {len(dupes)} set(s) of duplicates")
for fp, indices in dupes.items():
    print(f"  Documents {indices} are identical (hash: {fp[:16]}...)")

# Output:
# Found 1 set(s) of duplicates
#   Documents [0, 2] are identical (hash: sha256:a1b2c3d4...)

Agent State Sync

Sync state across distributed agent processes.
import glyph
from glyph import stream

class AgentStateSyncer:
    def __init__(self, writer, reader):
        self.writer = writer
        self.reader = reader
        self.local_state = {}
        self.local_hash = None
    
    def update_local(self, changes: dict):
        """Update local state and broadcast patch."""
        
        # Compute patch
        patch_ops = [("=", k, v) for k, v in changes.items()]
        patch = glyph.patch(patch_ops)
        
        # Send with current state hash
        self.writer.write_frame(
            sid=1,
            seq=self.next_seq(),
            kind="patch",
            payload=patch,
            base=self.local_hash[:16] if self.local_hash else None,
        )
        
        # Apply locally
        self.local_state.update(changes)
        self.local_hash = glyph.fingerprint_loose(self.local_state)
    
    def sync_from_remote(self, frame):
        """Sync state from remote agent."""
        
        if frame.kind == "patch":
            # Verify base hash
            if frame.base and self.local_hash:
                if not self.local_hash.startswith(frame.base):
                    # State diverged, request full sync
                    self.request_full_state()
                    return
            
            # Apply patch
            patch = glyph.parse_patch(frame.payload)
            self.local_state = glyph.apply_patch(self.local_state, patch)
            self.local_hash = glyph.fingerprint_loose(self.local_state)
        
        elif frame.kind == "doc":
            # Full state sync
            self.local_state = glyph.parse(frame.payload)
            self.local_hash = glyph.fingerprint_loose(self.local_state)

Implementation Details

Canonicalization Mode

Sender and receiver MUST agree on canonicalization mode (Strict vs Loose).
  • Loose mode (most common): Schema-optional, JSON-compatible
  • Strict mode: Schema-required, packed encoding
Mixing modes produces different hashes for the same logical data.
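The effect is easy to demonstrate with any two canonicalization rules: even a whitespace difference changes every bit of the digest. A sketch, with JSON separator settings standing in for the Strict/Loose distinction:

```python
import hashlib
import json

data = {"user": "alice", "count": 42}

# Two different "canonical" encodings of the same logical data
compact = json.dumps(data, sort_keys=True, separators=(",", ":"))
spaced = json.dumps(data, sort_keys=True)  # default ", " and ": " separators

h1 = hashlib.sha256(compact.encode("utf-8")).hexdigest()
h2 = hashlib.sha256(spaced.encode("utf-8")).hexdigest()
assert h1 != h2  # same data, different canonical form -> different hash
```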

Short Hashes

For space efficiency, use the first 16 hex characters (64 bits) of the hash:
full_hash = "sha256:a1b2c3d4e5f67890abcdef1234567890abcdef1234567890abcdef1234567890"
short_hash = full_hash[7:23]  # "a1b2c3d4e5f67890"
Collision probability with 64-bit prefix (birthday bound):
  • 1 million hashes: ~0.000003% chance of collision
  • 1 billion hashes: ~3% chance of collision
Use full 256-bit hash for high-security applications.

Cross-Language Consistency

GLYPH guarantees byte-identical canonical forms across Go, Python, JavaScript, and Rust. Test case:
# Python
data = {"user": "alice", "count": 42}
hash_py = glyph.fingerprint_loose(data)
# sha256:a1b2c3d4...

// Go
data := map[string]interface{}{"user": "alice", "count": 42}
hashGo := glyph.FingerprintLoose(glyph.FromJSONLoose(data))
// sha256:a1b2c3d4...

// TypeScript
const data = {user: 'alice', count: 42};
const hashTS = fingerprintLoose(data);
// sha256:a1b2c3d4...

Result: hash_py, hashGo, and hashTS are byte-identical.

Best Practices

For checkpoints, distributed state, or financial data, always verify fingerprints before applying changes.
For logs, debug output, or low-risk scenarios, use 16-char short hashes to save space.
Include a version field in cache key data to invalidate caches after schema changes.
cache_key_data = {
    "prompt": prompt,
    "model": model,
    "version": "v2",  # Bump to invalidate old caches
}
Include state hashes in logs to trace state evolution:
logger.info(f"State updated: {short_hash} -> {new_short_hash}")

Next Steps

Patches

Use fingerprints for safe patch application

GS1 Streaming

Leverage base hashes in streaming protocol

Loose Mode

Understand canonical form rules

Agent Patterns

Apply fingerprinting in agent systems