Overview
Fuzzy matching in Hedis detects modified or obfuscated versions of vulnerable functions that would not match via exact hash comparison. It uses Levenshtein distance to measure string similarity between IR representations.
Why Fuzzy Matching?
Exact hash matching (SHA256) requires identical IR strings. Real-world scenarios that break exact matching:
- Minification: Variable names shortened or removed
- Bundling optimizations: Dead code elimination, function inlining
- Compiler differences: Different Hermes versions may generate slightly different bytecode
- Intentional obfuscation: String splitting, encoding, or identifier mangling
- Partial patches: Developers manually fix part of a vulnerability without updating the package
Fuzzy matching is only performed when exact matching fails. Exact matches are always preferred for performance and precision.
Levenshtein Distance
The Levenshtein distance is the minimum number of single-character edits (insertions, deletions, substitutions) needed to transform one string into another.
Implementation
The implementation is at pkg/cmd/analyze.go:260.
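As an independent illustration of the definition above (not the code from analyze.go), here is a minimal sketch of the standard two-row dynamic-programming algorithm. The function name `levenshtein` and the choice to compare runes rather than bytes are assumptions made for this example.

```go
package main

import "fmt"

// levenshtein returns the minimum number of single-character insertions,
// deletions, and substitutions needed to turn a into b.
// Hypothetical helper for illustration; Hedis's own function may differ.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)

	// prev holds the distances for the previous row of the DP table,
	// curr the row being filled in.
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // turning "" into rb[:j] takes j insertions
	}

	for i := 1; i <= len(ra); i++ {
		curr[0] = i // turning ra[:i] into "" takes i deletions
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			best := prev[j] + 1 // deletion
			if v := curr[j-1] + 1; v < best { // insertion
				best = v
			}
			if v := prev[j-1] + cost; v < best { // substitution or match
				best = v
			}
			curr[j] = best
		}
		prev, curr = curr, prev
	}
	// After the final swap, prev holds the last completed row.
	return prev[len(rb)]
}

func main() {
	fmt.Println(levenshtein("kitten", "sitting")) // prints 3
}
```

The two-row form keeps memory proportional to the shorter dimension of the table, which matters when IR strings are long.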
Example
Comparing two structural IR strings:
Similarity Score
The raw distance is normalized to a similarity score between 0.0 and 1.0:
- 1.0 = Identical strings (0 edits needed)
- 0.8 = 80% similar (20% of characters need to change)
- 0.5 = 50% similar
- 0.0 = Completely different strings
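The mapping from distance to score can be sketched as follows. This assumes the distance is divided by the length of the longer string, which is a common normalization; the text does not say which length Hedis divides by, and the function name `similarity` is hypothetical.

```go
package main

import "fmt"

// similarity converts an edit distance into a score in [0.0, 1.0].
// Assumption: the distance is divided by the longer string's length;
// the documentation does not state which length Hedis normalizes by.
func similarity(distance, lenA, lenB int) float64 {
	longest := lenA
	if lenB > longest {
		longest = lenB
	}
	if longest == 0 {
		return 1.0 // two empty strings are identical
	}
	return 1.0 - float64(distance)/float64(longest)
}

func main() {
	fmt.Printf("%.2f\n", similarity(0, 100, 100))   // identical strings: 1.00
	fmt.Printf("%.2f\n", similarity(20, 100, 100))  // 20 edits over 100 characters: 0.80
	fmt.Printf("%.2f\n", similarity(100, 100, 100)) // every character differs: 0.00
}
```

Because the Levenshtein distance never exceeds the length of the longer string, this score cannot fall below 0.0.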
Confidence Thresholds
Hedis uses a configurable confidence threshold to determine when a fuzzy match is significant.
Default: 0.8 (80% similarity)
This threshold balances precision and recall:
- Higher threshold (0.9-0.95): Fewer false positives, may miss heavily modified code
- Lower threshold (0.6-0.7): More detections, increased false positive rate
- Recommended: 0.8: Good balance for most real-world scenarios
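The trade-off can be seen by filtering the same set of scores at different thresholds. The sketch below is illustrative only: the `candidate` type, the sample scores, and the use of an inclusive comparison (`>=`) are assumptions, not details taken from Hedis.

```go
package main

import "fmt"

// defaultThreshold mirrors the documented default of 0.8.
const defaultThreshold = 0.8

// candidate is a hypothetical pairing of a function name with its
// similarity score against a known-vulnerable function.
type candidate struct {
	name  string
	score float64
}

// filterMatches keeps candidates whose score is at or above threshold.
// Whether Hedis treats the boundary as inclusive is an assumption.
func filterMatches(cands []candidate, threshold float64) []candidate {
	var kept []candidate
	for _, c := range cands {
		if c.score >= threshold {
			kept = append(kept, c)
		}
	}
	return kept
}

func main() {
	cands := []candidate{{"a", 0.95}, {"b", 0.82}, {"c", 0.65}}
	for _, t := range []float64{0.9, defaultThreshold, 0.6} {
		fmt.Printf("threshold %.2f: %d match(es)\n", t, len(filterMatches(cands, t)))
	}
}
```

With these sample scores, raising the threshold from 0.6 to 0.9 reduces the reported matches from three to one.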
Performance Optimizations
1. Length-Based Pre-Filtering
Comparing strings with drastically different lengths is computationally expensive and unlikely to yield matches. Hedis uses a length tolerance filter:
- IR string of length 100 → Only compare against strings in range [80, 125]
- Strings with length < 30 → Skip entirely (too short for reliable matching)
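A length pre-filter consistent with the two rules above might look like the following sketch. The 80% tolerance is inferred from the [80, 125] window quoted for a length-100 string (100 × 0.8 = 80, 100 ÷ 0.8 = 125); the constant and function names are hypothetical, and integer arithmetic is used here only to avoid floating-point rounding at the window edges.

```go
package main

import "fmt"

const (
	minIRLength      = 30 // strings shorter than this are skipped entirely
	tolerancePercent = 80 // assumed: inferred from the [80, 125] example
)

// lengthsCompatible reports whether two IR strings are close enough in
// length to be worth a full Levenshtein comparison.
func lengthsCompatible(queryLen, candLen int) bool {
	if queryLen < minIRLength || candLen < minIRLength {
		return false
	}
	shorter, longer := queryLen, candLen
	if shorter > longer {
		shorter, longer = longer, shorter
	}
	// The shorter string must be at least 80% as long as the longer one.
	return shorter*100 >= longer*tolerancePercent
}

func main() {
	for _, n := range []int{79, 80, 125, 126} {
		fmt.Printf("100 vs %d: %v\n", n, lengthsCompatible(100, n))
	}
}
```

Running the check before computing the distance means the quadratic-time comparison is only paid for candidates that could plausibly clear the similarity threshold.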
2. Document Limits
To prevent exhaustive database scans, fuzzy matching limits the number of candidate documents (see pkg/cmd/analyze.go:87).
3. Parallelized Comparison
When comparing package contents, fuzzy matching runs in parallel across the three hash types.
Match Results
Fuzzy matches include confidence scores and matched strings.
Example Output
The summary reports:
- Number of fuzzy matches above threshold
- Number of eligible functions (> 30 chars, not exact matched)
- Percentage of eligible functions matched
Exact matches are excluded from fuzzy matching to avoid double-counting. Only functions that failed exact matching are considered for fuzzy comparison.
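The three summary figures and the exclusion rule can be sketched as below. The `function` record, its field names, and `summarize` are hypothetical. The eligibility cutoff uses "length at least 30", following the pre-filter rule that skips strings shorter than 30; that boundary is an assumption.

```go
package main

import "fmt"

// function is a hypothetical per-function record for this illustration.
type function struct {
	irLen        int     // length of the function's IR string
	exactMatched bool    // true if an exact hash match was already found
	bestScore    float64 // best fuzzy similarity score, if any
}

// summarize returns the number of fuzzy matches at or above threshold,
// the number of eligible functions, and the percentage of eligible
// functions that matched.
func summarize(funcs []function, threshold float64) (matches, eligible int, percent float64) {
	for _, f := range funcs {
		// Exact matches and too-short functions are not eligible.
		if f.exactMatched || f.irLen < 30 {
			continue
		}
		eligible++
		if f.bestScore >= threshold {
			matches++
		}
	}
	if eligible > 0 {
		percent = 100 * float64(matches) / float64(eligible)
	}
	return matches, eligible, percent
}

func main() {
	funcs := []function{
		{irLen: 40, bestScore: 0.85},
		{irLen: 50, bestScore: 0.60},
		{irLen: 20, bestScore: 0.90},         // too short, not eligible
		{irLen: 60, exactMatched: true},      // exact match, not eligible
	}
	m, e, p := summarize(funcs, 0.8)
	fmt.Printf("%d of %d eligible functions matched (%.1f%%)\n", m, e, p)
}
```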
Trigram Shingling (Advanced)
For Content IR1 and IR2, Hedis also generates trigram shingles (3-character substrings) during tokenization:
"document_picker" → ["doc", "ocu", "cum", "ume", "men", "ent", "nt_", "t_p", "_pi", "pic", "ick", "cke", "ker"]
This enables partial string matching when only fragments of identifiers or literals are preserved.
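A sliding-window sketch that reproduces the "document_picker" example is shown below. The function name `trigrams` is hypothetical, and returning no shingles for inputs shorter than three characters is an assumption; the text does not describe how Hedis handles short tokens or duplicate shingles.

```go
package main

import "fmt"

// trigrams returns every 3-character substring of s, in order.
// Inputs shorter than three characters yield no shingles (assumed).
func trigrams(s string) []string {
	r := []rune(s)
	if len(r) < 3 {
		return nil
	}
	out := make([]string, 0, len(r)-2)
	for i := 0; i+3 <= len(r); i++ {
		out = append(out, string(r[i:i+3]))
	}
	return out
}

func main() {
	fmt.Println(trigrams("document_picker"))
}
```

A string of n characters produces n - 2 shingles, so the 15-character example yields 13.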
Related Sections
- Fingerprinting — The three hash types used in matching
- Hermes Bytecode — Understanding the bytecode structure
- Database Schema — MongoDB collections storing fingerprints