What is a Shingler?
A shingler extracts specific features from a disassembled function and converts them into hashable byte sequences. Each shingler focuses on different aspects of code:
- Instruction patterns (escaped opcodes and operands)
- Control flow characteristics (block counts, call patterns)
- Statistical properties (instruction type distributions)
Shingler Architecture
AbstractShingler Base Class
All shinglers inherit from AbstractShingler, which defines the interface and common functionality.
- Define the _generateByteSequences() method to extract features
- Automatically hash byte sequences using MurmurHash3
- Support weighted contributions via XOR variants
- Provide helper methods like _logbucket() for fuzzy bucketing
mcrit/shinglers/AbstractShingler.py:11
Shingler Weights
Each shingler has a weight determining how many signature positions it can influence:
- Weight of 3 = shingler contributes to more of the signature
- Weight of 0 = shingler is disabled
- Higher weight = more influence on similarity scores
Weights are implemented through XOR variants: a shingler with weight 3 generates its base shingles, then creates 2 additional variants by XORing with random values.
mcrit/minhash/ShingleLoader.py:38
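The XOR-variant idea can be illustrated with a minimal sketch. This is not the code from ShingleLoader.py: the function name `expand_with_xor_variants`, the fixed seed, and the 32-bit width of the XOR values are all assumptions made for illustration.

```python
import random


def expand_with_xor_variants(base_shingles, weight, seed=0):
    """Return the base shingles plus (weight - 1) XOR-derived copies of them."""
    if weight <= 0:
        return []  # a weight of 0 disables the shingler entirely
    # Fixed seed (assumption): every function must be expanded with the same
    # XOR values, otherwise variants of identical shingles would never match.
    rng = random.Random(seed)
    xor_values = [rng.getrandbits(32) for _ in range(weight - 1)]
    expanded = list(base_shingles)
    for xor_value in xor_values:
        expanded.extend(shingle ^ xor_value for shingle in base_shingles)
    return expanded


base = [0x1234ABCD, 0x0BADF00D]
weighted = expand_with_xor_variants(base, weight=3)
```

With a weight of 3, each base shingle appears three times in the pool (once unchanged, twice XORed), so this shingler has proportionally more candidates competing for each signature position.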
ShingleLoader
The ShingleLoader dynamically loads and initializes shinglers based on configuration.
mcrit/minhash/ShingleLoader.py:12
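As a rough sketch of configuration-driven loading, the example below instantiates only the shinglers whose configured weight is positive. The placeholder classes, the `REGISTRY` dictionary, and `load_shinglers` are hypothetical; the actual loader discovers shingler classes dynamically rather than through a hard-coded table.

```python
class EscapedBlockSketch:
    """Hypothetical placeholder for an instruction-based shingler."""

    def __init__(self, weight):
        self.weight = weight


class FuzzyStatPairSketch:
    """Hypothetical placeholder for a statistics-based shingler."""

    def __init__(self, weight):
        self.weight = weight


# Hypothetical name-to-class table standing in for dynamic discovery.
REGISTRY = {
    "EscapedBlockShingler": EscapedBlockSketch,
    "FuzzyStatPairShingler": FuzzyStatPairSketch,
}


def load_shinglers(configured_weights):
    """Instantiate every configured shingler whose weight is greater than 0."""
    active = []
    for name, weight in configured_weights.items():
        if weight <= 0:
            continue  # weight 0 means the shingler is disabled
        active.append(REGISTRY[name](weight))
    return active


active = load_shinglers({"EscapedBlockShingler": 3, "FuzzyStatPairShingler": 0})
```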
Built-in Shinglers
MCRIT includes two primary shinglers in active use:
EscapedBlockShingler
Purpose: Captures instruction sequences with normalized operands
The EscapedBlockShingler extracts instruction patterns from basic blocks using SMDA’s instruction escaping.
Key features:
- Groups mnemonics by category (move, arithmetic, control, etc.)
- Escapes operands to remove absolute addresses and register specifics
- Creates n-grams of escaped instructions
- Filters stack operations (push/pop, esp/rsp)
Resilient to:
- Different compilers (MSVC, GCC, Clang)
- Optimization levels
- Minor code variations
mcrit/shinglers/EscapedBlockShingler.py:9
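To make operand escaping and n-gram creation concrete, here is a simplified sketch. It is not SMDA's escaper or the code in EscapedBlockShingler.py: the regular expressions, the `REG`/`CONST` placeholders, and the function names are assumptions, and mnemonic grouping by category is omitted for brevity.

```python
import re

# Simplified x86 patterns (assumptions for this sketch only).
REGISTER = re.compile(r"\b(?:[er]?[abcd]x|[er]?[sd]i|[er]?[sb]p|r(?:8|9|1[0-5])[dwb]?)\b")
CONSTANT = re.compile(r"\b(?:0x[0-9a-fA-F]+|\d+)\b")
STACK_POINTER = re.compile(r"\b[er]sp\b")


def escape_instruction(mnemonic, operands):
    """Replace concrete registers and constants with generic placeholders."""
    escaped = CONSTANT.sub("CONST", REGISTER.sub("REG", operands))
    return f"{mnemonic} {escaped}".strip()


def block_shingles(block, n=2):
    """Build byte n-grams over one basic block, skipping stack operations."""
    kept = [
        escape_instruction(mnemonic, operands)
        for mnemonic, operands in block
        if mnemonic not in ("push", "pop") and not STACK_POINTER.search(operands)
    ]
    return [";".join(kept[i:i + n]).encode("ascii") for i in range(len(kept) - n + 1)]


# Two blocks that differ in prologue, registers, and addresses.
block_a = [("push", "ebp"), ("mov", "ebp, esp"), ("mov", "eax, 0x401000"),
           ("call", "0x402000"), ("pop", "ebp"), ("ret", "")]
block_b = [("mov", "ecx, 0x500000"), ("call", "0x600000"), ("ret", "")]
same = block_shingles(block_a) == block_shingles(block_b)
```

Both blocks reduce to the same two shingles, which is the property that lets escaped n-grams match code across differing register choices and load addresses.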
FuzzyStatPairShingler
Purpose: Captures statistical properties of functions with fuzzy bucketing
The FuzzyStatPairShingler extracts control flow and instruction statistics, using logarithmic bucketing to introduce fuzziness.
Extracted statistics:
- Number of calls (num_calls)
- Number of specific instruction types (Control, Stack, Arithmetic, Move)
- Stack frame size
- Maximum basic block size
- Strongly connected components (SCCs)
Resilient to:
- Register allocation differences
- Instruction reordering
- Minor optimizations
mcrit/shinglers/FuzzyStatPairShingler.py:12
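Logarithmic bucketing and pairing can be sketched as follows. This is an illustration under assumptions, not the code in FuzzyStatPairShingler.py: the base-2 bucketing, the pair encoding, and the statistic names other than `num_calls` are hypothetical, and the real `_logbucket()` helper may use a different base or rounding.

```python
from itertools import combinations


def logbucket(value):
    """Map a non-negative integer to a base-2 logarithmic bucket."""
    # bit_length() is 0 for 0 and floor(log2(value)) + 1 for value > 0,
    # so 1 -> 1, 2..3 -> 2, 4..7 -> 3, 8..15 -> 4, and so on.
    return int(value).bit_length()


def stat_pair_shingles(stats):
    """Emit one byte sequence per pair of bucketed statistics."""
    bucketed = sorted((name, logbucket(value)) for name, value in stats.items())
    return [
        f"{a}:{bucket_a}|{b}:{bucket_b}".encode("ascii")
        for (a, bucket_a), (b, bucket_b) in combinations(bucketed, 2)
    ]


# Statistic names besides num_calls are hypothetical.
original = {"num_calls": 5, "max_block_size": 12, "stack_size": 40}
variant = {"num_calls": 6, "max_block_size": 13, "stack_size": 44}
same = stat_pair_shingles(original) == stat_pair_shingles(variant)
```

Small numeric differences fall into the same bucket, so the two functions above yield identical shingles, while a jump from 5 to 9 calls would change every pair involving `num_calls`.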
Shingler Processing Pipeline
Step-by-Step Example
Shingler Execution
Each active shingler processes the function:
- EscapedBlockShingler → 45 instruction n-grams
- FuzzyStatPairShingler → 18 statistical shingles
Advanced: Shingler Configuration
Creating Custom Shinglers
You can implement custom shinglers by extending AbstractShingler.
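A minimal sketch of the pattern follows. So that it runs without MCRIT installed, it uses a hypothetical stand-in base class instead of importing the real AbstractShingler; the constructor signature, the `process` method, the BLAKE2b hash (standing in for MurmurHash3), and the function representation as a list of tuples are all assumptions. Only the `_generateByteSequences()` method name is taken from the description above.

```python
import hashlib
from abc import ABC, abstractmethod


class StandInAbstractShingler(ABC):
    """Hypothetical stand-in base class (not MCRIT's AbstractShingler)."""

    def __init__(self, weight=1):
        self.weight = weight

    @abstractmethod
    def _generateByteSequences(self, function_object):
        """Return an iterable of bytes objects, one per extracted feature."""

    def process(self, function_object):
        # Stand-in for MurmurHash3: 4 bytes of BLAKE2b read as a 32-bit integer.
        return [
            int.from_bytes(hashlib.blake2b(seq, digest_size=4).digest(), "little")
            for seq in self._generateByteSequences(function_object)
        ]


class MyCustomShingler(StandInAbstractShingler):
    """Emits one shingle per distinct mnemonic used in the function."""

    def _generateByteSequences(self, function_object):
        # Assumption: function_object is a list of (mnemonic, operands) tuples.
        mnemonics = sorted({mnemonic for mnemonic, _ in function_object})
        return [mnemonic.encode("ascii") for mnemonic in mnemonics]


shingles = MyCustomShingler(weight=1).process(
    [("mov", "eax, 1"), ("mov", "ebx, 2"), ("ret", "")]
)
```

The subclass only decides which byte sequences describe a function; hashing stays in the base class, so every shingler produces values in the same range.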
Place your custom shingler in mcrit/shinglers/MyCustomShingler.py and it will be automatically discovered by ShingleLoader.
Configuration Options
Archived Shinglers
MCRIT includes many experimental shinglers in mcrit/shinglers/archived/:
- MnemHistShingler - Mnemonic histograms
- MnemSeqShingler - Mnemonic sequences
- NgramShingler - Instruction n-grams
- TreeBfsShingler - Control flow tree traversal
- CallgraphStatsShingler - Call graph properties
Shingler Performance Characteristics
EscapedBlockShingler
Speed: Fast (processes ~1000 functions/second)
Accuracy: High for similar code, resilient to:
- ✅ Compiler variations
- ✅ Optimization levels
- ✅ Register allocation
- ❌ Sensitive to instruction reordering
- ❌ Can miss semantically equivalent code
FuzzyStatPairShingler
Speed: Very fast (processes ~5000 functions/second)
Accuracy: Moderate, good for:
- ✅ Finding functions with similar complexity
- ✅ Matching across heavy optimizations
- ✅ Complementing instruction-based shinglers
- ❌ Less discriminative (many functions have similar stats)
- ❌ Not sufficient alone for accurate matching
Best Practices
Weight Balance
Give instruction-based shinglers (EscapedBlock) higher weight than statistical shinglers
Complementary Features
Combine shinglers that capture different aspects (syntax + statistics)
Testing
Test custom shinglers on diverse sample sets to validate effectiveness
Performance
Monitor shingler execution time; a slow shingler becomes the bottleneck of the analysis
Related Concepts
MinHash
How shingles are combined into similarity signatures
Architecture
Where shinglers fit in MCRIT’s pipeline