Levenshtein distance is the core scoring mechanism in Proteus. It measures how many single-character edits (insertions, deletions, or substitutions) are needed to transform one string into another.

What is Levenshtein Distance?

The Levenshtein distance between two strings is the minimum number of edit operations required to transform one into the other.

Example Calculation

Let's walk through a simple example:
String A: "Starship confirmed for March"
String B: "Starship is GO for March"
Both strings share the prefix "Starship " and the suffix " for March", so the distance reduces to transforming "confirmed" into "is GO". Keeping the shared "i", that takes four deletions ("conf") and four substitutions ("rmed" → "s GO"), for a Levenshtein distance of 8.
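That arithmetic is easy to check off-chain. Here is a minimal Python sketch of the same two-row Wagner-Fischer algorithm the contract uses (Python purely for illustration; the production implementation is Solidity):

```python
def levenshtein(a: str, b: str) -> int:
    """Two-row Wagner-Fischer, mirroring the on-chain algorithm."""
    if not a:
        return len(b)
    if not b:
        return len(a)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i] + [0] * len(b)
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution
        prev = curr
    return prev[-1]

a = "Starship confirmed for March"
b = "Starship is GO for March"
print(levenshtein(a, b))  # 8: "confirmed" must be rewritten into "is GO"
```

Note that this operates character by character; the on-chain version operates on the underlying bytes, which is equivalent for ASCII text like this.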

Real Market Example

From the Satya Nadella market (Example 5 in the whitepaper):

1. Actual text: "Copilot is now generating 46% of all new code at GitHub-connected enterprises. The AI transformation of software is just beginning."

2. Claude prediction: "Copilot is now generating 45% of all new code at GitHub-connected enterprises. The AI transformation of software is just beginning."
   Distance: 1 (a single character differs: "45" vs "46")

3. GPT prediction: "Copilot is now generating 43% of all new code at GitHub-connected enterprises. The AI transformation of software has just begun."
   Distance: 8 (the number differs, plus a phrasing change)

Winner: Claude at distance 1. The 7-edit gap between Claude and GPT determines the entire pool. In a binary market, both would "predict correctly" and split the pool.
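Winner selection is a straight argmin over distances. The example above can be sketched off-chain like this (Python for illustration; the `predictions` mapping is a hypothetical shape, not the contract's data model):

```python
def levenshtein(a: str, b: str) -> int:
    # Two-row Wagner-Fischer, as in the on-chain implementation.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

actual = ("Copilot is now generating 46% of all new code at GitHub-connected "
          "enterprises. The AI transformation of software is just beginning.")
predictions = {
    "Claude": ("Copilot is now generating 45% of all new code at GitHub-connected "
               "enterprises. The AI transformation of software is just beginning."),
    "GPT": ("Copilot is now generating 43% of all new code at GitHub-connected "
            "enterprises. The AI transformation of software has just begun."),
}

# argmin over submissions: smallest distance to the actual text wins
winner = min(predictions, key=lambda who: levenshtein(predictions[who], actual))
print(winner)  # Claude (distance 1 vs distance 8)
```

A single wrong digit costs Claude one edit; GPT's paraphrase of the final clause costs seven more, and that gap decides the whole pool.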

Mathematical Properties

Levenshtein distance is a proper metric on the space of strings. This mathematical property is essential for fair market design.

Metric Properties

Identity

d(a, b) = 0 if and only if a = b
Perfect predictions are unambiguously identified.

Symmetry

d(a, b) = d(b, a)
Order doesn't matter: prediction vs. actual gives the same distance.

Triangle Inequality

d(a, c) ≤ d(a, b) + d(b, c)
Ensures coherent distance relationships.

Non-Negativity

d(a, b) ≥ 0 for all strings
Distances are never negative.

Why This Matters

These properties guarantee:
  • No gaming: You can’t manipulate distances through intermediate strings
  • Deterministic: Same inputs always produce same distance
  • Fair: The closest prediction always wins, no ambiguity
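The four axioms can be spot-checked by brute force. This Python property test (illustrative only, not part of the protocol) verifies all of them over random short strings:

```python
import itertools
import random
import string

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

random.seed(0)
alphabet = string.printable[:95]  # the 95 printable ASCII characters
samples = ["".join(random.choices(alphabet, k=random.randint(0, 10)))
           for _ in range(12)]

for a, b, c in itertools.product(samples, repeat=3):
    assert (levenshtein(a, b) == 0) == (a == b)                        # identity
    assert levenshtein(a, b) == levenshtein(b, a)                      # symmetry
    assert levenshtein(a, b) >= 0                                      # non-negativity
    assert levenshtein(a, c) <= levenshtein(a, b) + levenshtein(b, c)  # triangle
print("all four metric properties hold on the sample")
```

A passing brute-force sample is not a proof (those are classical results for Levenshtein distance), but it is a useful regression check for any reimplementation.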

On-Chain Implementation

Proteus computes Levenshtein distance entirely on-chain using the Wagner-Fischer dynamic programming algorithm.

Algorithm Overview

PredictionMarketV2.sol:460-503
function levenshteinDistance(
    string memory a,
    string memory b
) public pure returns (uint256) {
    bytes memory bytesA = bytes(a);
    bytes memory bytesB = bytes(b);
    
    uint256 lenA = bytesA.length;
    uint256 lenB = bytesB.length;
    
    // Handle empty string cases
    if (lenA == 0) return lenB;
    if (lenB == 0) return lenA;
    
    // Create distance matrix (two-row optimization)
    uint256[] memory prevRow = new uint256[](lenB + 1);
    uint256[] memory currRow = new uint256[](lenB + 1);
    
    // Initialize first row
    for (uint256 j = 0; j <= lenB; j++) {
        prevRow[j] = j;
    }
    
    // Fill in the matrix
    for (uint256 i = 1; i <= lenA; i++) {
        currRow[0] = i;
        
        for (uint256 j = 1; j <= lenB; j++) {
            uint256 cost = (bytesA[i - 1] == bytesB[j - 1]) ? 0 : 1;
            
            // Minimum of: deletion, insertion, substitution
            uint256 deletion = prevRow[j] + 1;
            uint256 insertion = currRow[j - 1] + 1;
            uint256 substitution = prevRow[j - 1] + cost;
            
            currRow[j] = _min3(deletion, insertion, substitution); // _min3: internal helper defined elsewhere in the contract; returns the smallest of three values
        }
        
        // Swap rows for next iteration
        (prevRow, currRow) = (currRow, prevRow);
    }
    
    return prevRow[lenB];
}

Complexity & Gas Costs

The algorithm runs in O(m × n) time, where m and n are the lengths of the two strings. For tweet-length text (280 characters each), this is approximately 78,400 operations.
The two-row optimization reduces memory usage: only two rows of the matrix are stored at a time instead of the full m × n matrix.
Approximate gas costs from benchmarking:
String Length (each) | Gas Cost
50 characters        | ~400,000
100 characters       | ~1,500,000
280 characters       | ~9,000,000
The contract enforces MAX_TEXT_LENGTH = 280 to prevent block gas limit DoS attacks.
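As a back-of-envelope check on the numbers above (the per-cell gas figure below is derived from the table, not an independently benchmarked constant):

```python
# DP cells touched for two tweet-length strings
m = n = 280
cells = m * n
print(cells)  # 78400

# implied cost per cell from the 280-character benchmark (~9,000,000 gas)
gas_per_cell = 9_000_000 / cells
print(round(gas_per_cell))  # roughly 115 gas per cell
```

Quadratic growth is why the 280-character cap matters: doubling string length roughly quadruples the gas bill.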

Continuous Gradient Payoff

Unlike binary markets (correct/incorrect), Levenshtein creates a continuous payoff surface.

Distance vs Payout

The winner is determined by:
winner = argmin_{s ∈ submissions} d_L(s.predictedText, actualText)
Every character of precision matters: the submission with the minimum distance takes the entire pool (93% after fees). A distance of 0 is an exact match, for example a __NULL__ prediction when the target doesn't post at all.

Lipschitz Continuity

The expected payoff is Lipschitz-continuous with respect to prediction quality:
|E[payout | d] − E[payout | d+1]| ≤ 0.93 × pool × |F(d) − F(d+1)|
This means:
  • Marginal improvements always translate to marginal payout improvements
  • No “close enough” threshold
  • Every edit counts

Anti-Bot Properties

Levenshtein distance naturally filters spam and random submissions.

Expected Distance for Random Strings

For random strings over a large alphabet (like printable ASCII with 95 characters):
E[d_L(random_a, random_b)] → max(len(a), len(b))
Why? With 95 characters, the probability of any two random characters matching is only ~1.05%. Random strings achieve near-maximal distance.
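A small Monte Carlo experiment (Python, illustrative; sample sizes chosen arbitrarily) shows random strings landing at a large fraction of the maximum possible distance:

```python
import random
import string

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

random.seed(42)
alphabet = string.printable[:95]  # the 95 printable ASCII characters
n, trials = 60, 50
dists = [levenshtein("".join(random.choices(alphabet, k=n)),
                     "".join(random.choices(alphabet, k=n)))
         for _ in range(trials)]
avg = sum(dists) / trials
print(avg / n)  # ratio of mean distance to the maximum possible (n)
```

At these modest lengths the mean lands around three-quarters of the maximum or more, and the ratio trends toward 1 as length and alphabet size grow; a coherent prediction like the AI Roleplay entry below scores far lower.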

Real Example: Bot Entropy

From Tim Cook market (Example 6):
Submitter        | Prediction                              | Distance
AI Roleplay      | Coherent Apple Intelligence message     | 28
Human (thematic) | "Tim will say something about privacy…" | 53
Random bot       | x7g APPLE j2m PHONE kq9 BUY...          | 65
Degenerate bot   | aaaaaaaaaaaaaaaaaaaaaa...               | 73
The metric itself is the spam filter. In a character-level outcome space with a large alphabet, there is no shortcut for random or adversarial submissions. Bots waste their money.

Outcome Space Analysis

Levenshtein distance operates over a combinatorially explosive space.

Binary Market

  • Outcome space: 2 (Yes or No)
  • Information content: 1 bit

Text Prediction Market

  • Alphabet: 95 printable ASCII characters
  • Max length: 280 characters (tweet length)
  • Outcome space: ≈ 95^280 ≈ 10^554 possibilities
  • Information content: ≈ 1,840 bits
Information density ratio: 1,840:1
Each text prediction market encodes approximately 1,840 times more information than a binary market.
This is a conservative estimate. Even the "likely" region of natural language is astronomically larger than the two outcomes of a binary market.
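The information figures follow directly from the alphabet size and maximum length; a quick check in Python:

```python
import math

# bits carried by one character drawn from the 95-symbol printable-ASCII alphabet
bits_per_char = math.log2(95)      # ≈ 6.57 bits

# information content of a 280-character outcome space
total_bits = 280 * bits_per_char
print(round(total_bits))  # 1840

# a binary market resolves exactly 1 bit, hence the ~1,840:1 ratio
```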

Comparison: Levenshtein vs Binary

Binary Markets

  • Payoff: Cliff function (right/wrong)
  • Space: 2 outcomes
  • AI Resistance: Collapses as models converge
  • Example: "Will Nadella post about Copilot? Yes/No" — Both Claude and GPT say "Yes" → split pool

Levenshtein Markets

  • Payoff: Continuous gradient
  • Space: ≈ 10^554 outcomes
  • AI Resistance: Deepens as models improve
  • Example: "What will Nadella post about Copilot?" — Claude at d=1, GPT at d=8 → the 7-edit gap decides the entire pool

Key Takeaways

Proper Metric

Mathematical guarantees ensure fair, deterministic scoring

Continuous Payoff

Every character of precision is rewarded

Natural Spam Filter

Random strings achieve maximum distance automatically

On-Chain Computation

Wagner-Fischer algorithm runs entirely on BASE

1,840x Information

Massive outcome space vs binary markets

AI-Resistant

Markets deepen as models improve, not commoditize

Further Reading

Market Lifecycle

How markets progress from creation to resolution

Fee Structure

Platform fees and payout distribution

Academic References:
Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8), 707-710.
Wagner, R. A., & Fischer, M. J. (1974). The string-to-string correction problem. Journal of the ACM, 21(1), 168-173.
