Levenshtein distance is the core scoring mechanism in Proteus. It measures how many single-character edits (insertions, deletions, or substitutions) are needed to transform one string into another.

What is Levenshtein Distance?

The Levenshtein distance between two strings is the minimum number of edit operations required to transform one into the other.

Example Calculation

Let's walk through a simple example:
String A: "Starship confirmed for March"
String B: "Starship is GO for March"
Both strings share the prefix "Starship " and the suffix " for March", so the distance reduces to transforming "confirmed" into "is GO". Keeping the shared "i", that takes four deletions ("conf") and four substitutions ("rmed" → "s GO"), for a Levenshtein distance of 8.
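That arithmetic is easy to check off-chain. Here is a minimal Python sketch of the same two-row Wagner-Fischer algorithm the contract uses (Python purely for illustration; the production implementation is Solidity):

```python
def levenshtein(a: str, b: str) -> int:
    """Two-row Wagner-Fischer, mirroring the on-chain algorithm."""
    if not a:
        return len(b)
    if not b:
        return len(a)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i] + [0] * len(b)
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution
        prev = curr
    return prev[-1]

a = "Starship confirmed for March"
b = "Starship is GO for March"
print(levenshtein(a, b))  # 8: "confirmed" must be rewritten into "is GO"
```

Note that this operates character by character; the on-chain version operates on the underlying bytes, which is equivalent for ASCII text like this.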

Real Market Example

From the Satya Nadella market (Example 5 in the whitepaper):

1. Actual text: "Copilot is now generating 46% of all new code at GitHub-connected enterprises. The AI transformation of software is just beginning."

2. Claude prediction: "Copilot is now generating 45% of all new code at GitHub-connected enterprises. The AI transformation of software is just beginning."
   Distance: 1 (a single character differs: "45" vs "46")

3. GPT prediction: "Copilot is now generating 43% of all new code at GitHub-connected enterprises. The AI transformation of software has just begun."
   Distance: 8 (the number differs, plus a phrasing change)

Winner: Claude at distance 1. The 7-edit gap between Claude and GPT determines the entire pool. In a binary market, both would "predict correctly" and split the pool.
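Winner selection is a straight argmin over distances. The example above can be sketched off-chain like this (Python for illustration; the `predictions` mapping is a hypothetical shape, not the contract's data model):

```python
def levenshtein(a: str, b: str) -> int:
    # Two-row Wagner-Fischer, as in the on-chain implementation.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

actual = ("Copilot is now generating 46% of all new code at GitHub-connected "
          "enterprises. The AI transformation of software is just beginning.")
predictions = {
    "Claude": ("Copilot is now generating 45% of all new code at GitHub-connected "
               "enterprises. The AI transformation of software is just beginning."),
    "GPT": ("Copilot is now generating 43% of all new code at GitHub-connected "
            "enterprises. The AI transformation of software has just begun."),
}

# argmin over submissions: smallest distance to the actual text wins
winner = min(predictions, key=lambda who: levenshtein(predictions[who], actual))
print(winner)  # Claude (distance 1 vs distance 8)
```

A single wrong digit costs Claude one edit; GPT's paraphrase of the final clause costs seven more, and that gap decides the whole pool.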

Mathematical Properties

Levenshtein distance is a proper metric on the space of strings. This mathematical property is essential for fair market design.

Metric Properties

Identity

d(a, b) = 0 if and only if a = b
Perfect predictions are unambiguously identified.

Symmetry

d(a, b) = d(b, a)
Order doesn't matter: prediction vs. actual gives the same distance.

Triangle Inequality

d(a, c) ≤ d(a, b) + d(b, c)
Ensures coherent distance relationships.

Non-Negativity

d(a, b) ≥ 0 for all strings
Distances are never negative.

Why This Matters

These properties guarantee:
  • No gaming: You can’t manipulate distances through intermediate strings
  • Deterministic: Same inputs always produce same distance
  • Fair: The closest prediction always wins, no ambiguity
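The four axioms can be spot-checked by brute force. This Python property test (illustrative only, not part of the protocol) verifies all of them over random short strings:

```python
import itertools
import random
import string

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

random.seed(0)
alphabet = string.printable[:95]  # the 95 printable ASCII characters
samples = ["".join(random.choices(alphabet, k=random.randint(0, 10)))
           for _ in range(12)]

for a, b, c in itertools.product(samples, repeat=3):
    assert (levenshtein(a, b) == 0) == (a == b)                        # identity
    assert levenshtein(a, b) == levenshtein(b, a)                      # symmetry
    assert levenshtein(a, b) >= 0                                      # non-negativity
    assert levenshtein(a, c) <= levenshtein(a, b) + levenshtein(b, c)  # triangle
print("all four metric properties hold on the sample")
```

A passing brute-force sample is not a proof (those are classical results for Levenshtein distance), but it is a useful regression check for any reimplementation.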

On-Chain Implementation

Proteus computes Levenshtein distance entirely on-chain using the Wagner-Fischer dynamic programming algorithm.

Algorithm Overview

PredictionMarketV2.sol:460-503
function levenshteinDistance(
    string memory a,
    string memory b
) public pure returns (uint256) {
    bytes memory bytesA = bytes(a);
    bytes memory bytesB = bytes(b);
    
    uint256 lenA = bytesA.length;
    uint256 lenB = bytesB.length;
    
    // Handle empty string cases
    if (lenA == 0) return lenB;
    if (lenB == 0) return lenA;
    
    // Create distance matrix (two-row optimization)
    uint256[] memory prevRow = new uint256[](lenB + 1);
    uint256[] memory currRow = new uint256[](lenB + 1);
    
    // Initialize first row
    for (uint256 j = 0; j <= lenB; j++) {
        prevRow[j] = j;
    }
    
    // Fill in the matrix
    for (uint256 i = 1; i <= lenA; i++) {
        currRow[0] = i;
        
        for (uint256 j = 1; j <= lenB; j++) {
            uint256 cost = (bytesA[i - 1] == bytesB[j - 1]) ? 0 : 1;
            
            // Minimum of: deletion, insertion, substitution
            uint256 deletion = prevRow[j] + 1;
            uint256 insertion = currRow[j - 1] + 1;
            uint256 substitution = prevRow[j - 1] + cost;
            
            currRow[j] = _min3(deletion, insertion, substitution); // _min3: internal helper defined elsewhere in the contract; returns the smallest of three values
        }
        
        // Swap rows for next iteration
        (prevRow, currRow) = (currRow, prevRow);
    }
    
    return prevRow[lenB];
}

Complexity & Gas Costs

The algorithm runs in O(m × n) time, where m and n are the lengths of the two strings. For tweet-length text (280 characters each), this is approximately 78,400 operations.
The two-row optimization reduces memory usage: only two rows of the matrix are stored at a time instead of the full m × n matrix.
Approximate gas costs from benchmarking:
String Length (each) | Gas Cost
50 characters        | ~400,000
100 characters       | ~1,500,000
280 characters       | ~9,000,000
The contract enforces MAX_TEXT_LENGTH = 280 to prevent block gas limit DoS attacks.
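As a back-of-envelope check on the numbers above (the per-cell gas figure below is derived from the table, not an independently benchmarked constant):

```python
# DP cells touched for two tweet-length strings
m = n = 280
cells = m * n
print(cells)  # 78400

# implied cost per cell from the 280-character benchmark (~9,000,000 gas)
gas_per_cell = 9_000_000 / cells
print(round(gas_per_cell))  # roughly 115 gas per cell
```

Quadratic growth is why the 280-character cap matters: doubling string length roughly quadruples the gas bill.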

Continuous Gradient Payoff

Unlike binary markets (correct/incorrect), Levenshtein creates a continuous payoff surface.

Distance vs Payout

The winner is determined by:
winner = argmin_{s ∈ submissions} d_L(s.predictedText, actualText)
Every character of precision matters: the submission with the minimum distance takes the entire pool (93% after fees). A distance of 0 is an exact match, for example a __NULL__ prediction when the target doesn't post at all.

Lipschitz Continuity

The expected payoff is Lipschitz-continuous with respect to prediction quality:
|E[payout | d] − E[payout | d+1]| ≤ 0.93 × pool × |F(d) − F(d+1)|
This means:
  • Marginal improvements always translate to marginal payout improvements
  • No “close enough” threshold
  • Every edit counts

Anti-Bot Properties

Levenshtein distance naturally filters spam and random submissions.

Expected Distance for Random Strings

For random strings over a large alphabet (like printable ASCII with 95 characters):
E[d_L(random_a, random_b)] → max(len(a), len(b))
Why? With 95 characters, the probability of any two random characters matching is only ~1.05%. Random strings achieve near-maximal distance.
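A small Monte Carlo experiment (Python, illustrative; sample sizes chosen arbitrarily) shows random strings landing at a large fraction of the maximum possible distance:

```python
import random
import string

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

random.seed(42)
alphabet = string.printable[:95]  # the 95 printable ASCII characters
n, trials = 60, 50
dists = [levenshtein("".join(random.choices(alphabet, k=n)),
                     "".join(random.choices(alphabet, k=n)))
         for _ in range(trials)]
avg = sum(dists) / trials
print(avg / n)  # ratio of mean distance to the maximum possible (n)
```

At these modest lengths the mean lands around three-quarters of the maximum or more, and the ratio trends toward 1 as length and alphabet size grow; a coherent prediction like the AI Roleplay entry below scores far lower.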

Real Example: Bot Entropy

From Tim Cook market (Example 6):
Submitter        | Prediction                              | Distance
AI Roleplay      | Coherent Apple Intelligence message     | 28
Human (thematic) | "Tim will say something about privacy…" | 53
Random bot       | x7g APPLE j2m PHONE kq9 BUY...          | 65
Degenerate bot   | aaaaaaaaaaaaaaaaaaaaaa...               | 73
The metric itself is the spam filter. In a character-level outcome space with a large alphabet, there is no shortcut for random or adversarial submissions. Bots waste their money.

Outcome Space Analysis

Levenshtein distance operates over a combinatorially explosive space.

Binary Market

  • Outcome space: 2 (Yes or No)
  • Information content: 1 bit

Text Prediction Market

  • Alphabet: 95 printable ASCII characters
  • Max length: 280 characters (tweet length)
  • Outcome space: ≈ 95^280 ≈ 10^554 possibilities
  • Information content: ≈ 1,840 bits
Information density ratio: 1,840:1
Each text prediction market encodes approximately 1,840 times more information than a binary market.
This is a conservative estimate. Even the "likely" region of natural language is astronomically larger than the two outcomes of a binary market.
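The information figures follow directly from the alphabet size and maximum length; a quick check in Python:

```python
import math

# bits carried by one character drawn from the 95-symbol printable-ASCII alphabet
bits_per_char = math.log2(95)      # ≈ 6.57 bits

# information content of a 280-character outcome space
total_bits = 280 * bits_per_char
print(round(total_bits))  # 1840

# a binary market resolves exactly 1 bit, hence the ~1,840:1 ratio
```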

Comparison: Levenshtein vs Binary

Binary Markets

  • Payoff: Cliff function (right/wrong)
  • Space: 2 outcomes
  • AI Resistance: Collapses as models converge
  • Example: "Will Nadella post about Copilot? Yes/No" — Both Claude and GPT say "Yes" → split pool

Levenshtein Markets

  • Payoff: Continuous gradient
  • Space: ≈ 10^554 outcomes
  • AI Resistance: Deepens as models improve
  • Example: "What will Nadella post about Copilot?" — Claude at d=1, GPT at d=8 → the 7-edit gap decides the entire pool

Key Takeaways

Proper Metric

Mathematical guarantees ensure fair, deterministic scoring

Continuous Payoff

Every character of precision is rewarded

Natural Spam Filter

Random strings achieve maximum distance automatically

On-Chain Computation

Wagner-Fischer algorithm runs entirely on BASE

1,840x Information

Massive outcome space vs binary markets

AI-Resistant

Markets deepen as models improve, not commoditize

Further Reading

Market Lifecycle

How markets progress from creation to resolution

Fee Structure

Platform fees and payout distribution

Academic References:
Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8), 707-710.
Wagner, R. A., & Fischer, M. J. (1974). The string-to-string correction problem. Journal of the ACM, 21(1), 168-173.
