What is MinHash?
MinHash (Minwise Hashing) is a probabilistic algorithm for estimating the similarity between sets. In MCRIT, these sets are collections of shingles extracted from disassembled functions. The key insight: if two functions have similar MinHash signatures, they likely contain similar code patterns, even across different compilers or optimization levels.MinHash in MCRIT
MCRIT uses MinHash to create compact fingerprints of binary functions that can be compared efficiently.Core Components
MinHash Class
Data Transfer Object storing MinHash signatures
MinHasher Class
Generates MinHash signatures from functions
MinHashIndex
Band-based index for fast candidate retrieval
Scoring
Comparing signatures and determining matches
MinHash Class
TheMinHash class is a DTO (Data Transfer Object) that stores a MinHash signature.
The MinHash signature can be stored as either a byte array (
minhash) or integer list (minhash_int). MCRIT uses 32-bit or 8-bit signatures depending on configuration.Key Methods
calculateMinHashScore(first, second)- Computes similarity between two signatures (0-100%)hashData(data, seed)- Hashes data using MurmurHash3 (mmh3)scoreAgainst(other)- Scores this MinHash against another
mcrit/minhash/MinHash.py:13
MinHasher Class
TheMinHasher generates MinHash signatures from disassembled functions. It orchestrates multiple shinglers and applies one of three strategies.
MinHash Generation Strategies
MCRIT supports three strategies for generating MinHash signatures:- HASH_ALL
- XOR_ALL
- SEGMENTED
Strategy 1: Hash All SeedsFor each hash seed:
- Generate shingles using all configured shinglers
- Take the minimum shingle from each shingler
- Select the global minimum across all shinglers
mcrit/minhash/MinHasher.py:126Configuration
Band-Based Indexing
MCRIT uses Locality-Sensitive Hashing (LSH) with band-based indexing to avoid comparing every function against every other function.How It Works
Divide Signature into Bands
A MinHash signature of length 256 might be divided into 64 bands of 4 hashes each.
The number of bands and rows per band determines the threshold at which similar functions are likely to be found. More bands increase recall but require more storage.
mcrit/index/MinHashIndex.py:61
Scoring and Thresholds
MinHash Score Calculation
The similarity score between two MinHash signatures is the percentage of matching hash values:mcrit/minhash/MinHash.py:84
Matching Threshold
By default, MCRIT considers functions to match if their MinHash score exceeds the configured threshold (typically 50%).Shingler Composition Tracking
MCRIT can track which shinglers contributed to each position in the signature:mcrit/minhash/MinHasher.py:159
Trade-offs: Speed vs Accuracy
Signature Length
Signature Length
Longer signatures (512+ values):
- ✅ More accurate similarity estimates
- ✅ Better discrimination between similar functions
- ❌ Slower to compute and compare
- ❌ More storage required
- ✅ Faster computation and comparison
- ✅ Less storage
- ❌ Less accurate
- ❌ More false positives
Signature Bits
Signature Bits
32-bit hashes:
- ✅ Standard approach, well-tested
- ✅ Extremely low collision rate
- ❌ 4 bytes per hash value
- ✅ 4x less storage
- ✅ Faster to compare
- ❌ Higher collision rate
- ❌ Less accurate for small signatures
Band Configuration
Band Configuration
More bands (128 bands × 2 rows):
- ✅ Better recall (finds more matches)
- ✅ Can find very similar functions reliably
- ❌ More index storage
- ❌ More candidate functions to verify
- ✅ Less storage overhead
- ✅ Fewer candidates to verify
- ❌ May miss marginally similar functions
- ❌ Requires higher similarity to match
MinHash Strategy
MinHash Strategy
HASH_ALL:
- ✅ Most thorough
- ❌ Slowest (10-100x slower)
- ✅ Good balance
- ✅ Recommended for most use cases
- ✅ Fastest
- ✅ Guarantees shingler representation
- ❌ May be less discriminative
Minimum Function Requirements
Not all functions are suitable for MinHashing. MCRIT requires functions to meet minimum thresholds:mcrit/minhash/MinHasher.py:50
Related Concepts
Shinglers
Learn how code features are extracted into shingles
PicHash
Exact matching using position-independent hashing
Architecture
See how MinHash fits into MCRIT’s overall design