Fuzzy Matching

Overview

Fuzzy matching in Hedis detects modified or obfuscated versions of vulnerable functions that would not match via exact hash comparison. It uses Levenshtein distance to measure string similarity between IR representations.

Why Fuzzy Matching?

Exact hash matching (SHA256) requires identical IR strings. Real-world scenarios that break exact matching:

Minification — Variable names shortened or removed
Bundling optimizations — Dead code elimination, function inlining
Compiler differences — Different Hermes versions may generate slightly different bytecode
Intentional obfuscation — String splitting, encoding, or identifier mangling
Partial patches — Developers manually fix part of a vulnerability without updating the package

Fuzzy matching catches these cases by measuring how similar two IR strings are, not just whether they’re identical.

Fuzzy matching is only performed when exact matching fails. Exact matches are always preferred for performance and precision.

Levenshtein Distance

The Levenshtein distance is the minimum number of single-character edits (insertions, deletions, substitutions) needed to transform one string into another.

Implementation

From pkg/cmd/analyze.go:260:

func levenshteinDistance(s1, s2 string) int {
    if len(s1) == 0 { return len(s2) }
    if len(s2) == 0 { return len(s1) }
    
    // Create dynamic programming matrix
    matrix := make([][]int, len(s1)+1)
    for i := range matrix {
        matrix[i] = make([]int, len(s2)+1)
    }
    
    // Initialize first row and column
    for i := 0; i <= len(s1); i++ { matrix[i][0] = i }
    for j := 0; j <= len(s2); j++ { matrix[0][j] = j }
    
    // Fill matrix using dynamic programming
    for i := 1; i <= len(s1); i++ {
        for j := 1; j <= len(s2); j++ {
            cost := 0
            if s1[i-1] != s2[j-1] { cost = 1 }
            
            matrix[i][j] = min(
                matrix[i-1][j]+1,      // deletion
                matrix[i][j-1]+1,      // insertion
                matrix[i-1][j-1]+cost, // substitution
            )
        }
    }
    
    return matrix[len(s1)][len(s2)]
}

Example

Comparing two structural IR strings:

IR1: "pc=2|LoadParam|GetById|Ret|"
IR2: "pc=2|LoadParam|GetById|JStrictNotEqual|Ret|"

Levenshtein distance = 21 (insertion of “JStrictNotEqual|“)

Similarity Score

The raw distance is normalized to a similarity score between 0.0 and 1.0:

func levenshteinSimilarity(s1, s2 string) float64 {
    if len(s1) == 0 && len(s2) == 0 { return 1.0 }
    if len(s1) == 0 || len(s2) == 0 { return 0.0 }
    
    distance := levenshteinDistance(s1, s2)
    maxLen := max(len(s1), len(s2))
    return 1.0 - (float64(distance) / float64(maxLen))
}

Interpretation:

1.0 = Identical strings (0 edits needed)
0.8 = 80% similar (20% of characters need to change)
0.5 = 50% similar
0.0 = Completely different strings

Confidence Thresholds

Hedis uses a configurable confidence threshold to determine when a fuzzy match is significant:

hedis analyze --hbc-file app.hbc \
  --fuzzy-matching \
  --confidence-threshold 0.8

Default: 0.8 (80% similarity)

This threshold balances precision and recall:

Higher threshold (0.9-0.95) — Fewer false positives, may miss heavily modified code
Lower threshold (0.6-0.7) — More detections, increased false positive rate
Recommended: 0.8 — Good balance for most real-world scenarios

For security-critical analysis, start with 0.8 and manually review fuzzy matches. For rapid screening, use 0.7 to cast a wider net.

Performance Optimizations

1. Length-Based Pre-Filtering

Comparing strings with drastically different lengths is computationally expensive and unlikely to yield matches. Hedis uses a length tolerance filter:

func isLengthCompatible(s1, s2 string, tolerance float64) bool {
    len1, len2 := float64(len(s1)), float64(len(s2))
    minLen, maxLen := len1, len2
    if len2 < len1 { minLen, maxLen = len2, len1 }
    
    // Check if difference is within tolerance
    return (maxLen - minLen) / maxLen <= tolerance
}

Default tolerance: 0.2 (±20%) Example:

IR string of length 100 → Only compare against strings in range [80, 125]
Strings with length < 30 → Skip entirely (too short for reliable matching)

2. Document Limits

To prevent exhaustive database scans, fuzzy matching limits the number of candidate documents:

--max-fuzzy-docs 5000  # Default: check up to 5000 documents per input IR

From pkg/cmd/analyze.go:87:

maxFuzzyDocs    int     = 5000  // Maximum documents to check
lengthTolerance float64 = 0.2   // ±20% length tolerance

3. Parallelized Comparison

When comparing package contents, fuzzy matching runs in parallel across the three hash types:

var wg sync.WaitGroup
wg.Add(3)

go fuzzyMatch(packageContent.StructuralIRs, 
              appBundleContent.StructuralIRs, 
              structuralExactIndices, 
              results.Structural, &wg)

go fuzzyMatch(packageContent.ContentIR1s, 
              appBundleContent.ContentIR1s, 
              contentIR1ExactIndices, 
              results.ContentIR1, &wg)

go fuzzyMatch(packageContent.ContentIR2s, 
              appBundleContent.ContentIR2s, 
              contentIR2ExactIndices, 
              results.ContentIR2, &wg)

wg.Wait()

Match Results

Fuzzy matches include confidence scores and matched strings:

type SimilarityMatch struct {
    HashMatch
    Confidence    float64 `json:"confidence"`     // 0.0 to 1.0
    MatchType     string  `json:"match_type"`     // "exact" or "fuzzy"
    MatchedString string  `json:"matched_string"` // Database IR string
    InputString   string  `json:"input_string"`   // Input IR string
}

Example Output

Fuzzy IR matches (≥0.80 similarity):
  • Structural : 142/385 (36.88%)
  • Content IR1: 28/156 (17.95%)
  • Content IR2: 89/201 (44.28%)
  ⇒ Overall    : 259/742 (34.91%)

Each match shows:

Number of fuzzy matches above threshold
Number of eligible functions (> 30 chars, not exact matched)
Percentage of eligible functions matched

Exact matches are excluded from fuzzy matching to avoid double-counting. Only functions that failed exact matching are considered for fuzzy comparison.

Trigram Shingling (Advanced)

For Content IR1 and IR2, Hedis also generates trigram shingles (3-character substrings) during tokenization:

if len(richDataValue) >= 3 {
    for i := 0; i < len(richDataValue)-3; i++ {
        tokens = append(tokens, richDataValue[i:i+3])
    }
}

Example: "document_picker" → ["doc", "ocu", "cum", "ume", "men", "ent", "nt_", "t_p", "_pi", "pic", "ick", "cke", "ker"] This enables partial string matching when only fragments of identifiers or literals are preserved.

Fingerprinting — The three hash types used in matching
Hermes Bytecode — Understanding the bytecode structure
Database Schema — MongoDB collections storing fingerprints

Get Started

Core Concepts

CLI Commands

Guides

Architecture

Fuzzy Matching

Overview

Why Fuzzy Matching?

Levenshtein Distance

Implementation

Example

Similarity Score

Confidence Thresholds

Default: 0.8 (80% similarity)

Performance Optimizations

1. Length-Based Pre-Filtering

2. Document Limits

3. Parallelized Comparison

Match Results

Example Output

Trigram Shingling (Advanced)

Build docs developers (and LLMs) love

Get Started

Core Concepts

CLI Commands

Guides

Architecture

​Overview

​Why Fuzzy Matching?

​Levenshtein Distance

​Implementation

​Example

​Similarity Score

​Confidence Thresholds

​Default: 0.8 (80% similarity)

​Performance Optimizations

​1. Length-Based Pre-Filtering

​2. Document Limits

​3. Parallelized Comparison

​Match Results

​Example Output

​Trigram Shingling (Advanced)

​Related Sections

Build docs developers (and LLMs) love

Overview

Why Fuzzy Matching?

Levenshtein Distance

Implementation

Example

Similarity Score

Confidence Thresholds

Default: 0.8 (80% similarity)

Performance Optimizations

1. Length-Based Pre-Filtering

2. Document Limits

3. Parallelized Comparison

Match Results

Example Output

Trigram Shingling (Advanced)

Related Sections