Shinglers are the foundation of MCRIT’s MinHash-based similarity detection. They transform disassembled functions into sets of features (shingles) that can be hashed and compared.

What is a Shingler?

A shingler extracts specific features from a disassembled function and converts them into hashable byte sequences. Each shingler focuses on different aspects of code:
  • Instruction patterns (escaped opcodes and operands)
  • Control flow characteristics (block counts, call patterns)
  • Statistical properties (instruction type distributions)
By combining multiple shinglers, MCRIT captures diverse code characteristics that remain stable across compilation variations.
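Because each function ends up as a set of shingles, comparing two functions amounts to measuring set overlap. The following is a minimal sketch of that idea using exact Jaccard similarity; the shingle strings and the `jaccard` helper are hypothetical and only illustrate the concept, they are not MCRIT's internal representation.

```python
def jaccard(a: set, b: set) -> float:
    """Share of shingles that two functions have in common."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical shingle sets for two builds of the same function.
func_v1 = {"mov;add;ret", "push;call;pop", "calls:2", "blocks:4"}
func_v2 = {"mov;add;ret", "push;call;pop", "calls:2", "blocks:5"}

# Three of the five distinct shingles are shared.
similarity = jaccard(func_v1, func_v2)
print(f"similarity: {similarity:.2f}")
```

MinHash exists to approximate this overlap without storing or comparing the full sets.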

Shingler Architecture

AbstractShingler Base Class

All shinglers inherit from AbstractShingler, which defines the interface and common functionality.
# From mcrit/shinglers/AbstractShingler.py (simplified)
from abc import abstractmethod

class AbstractShingler:
    def __init__(self, plugin_name):
        self._name = plugin_name
        self._config = {}
        self._weight = 0
        self._use_weights = True
    
    @abstractmethod
    def _generateByteSequences(self, function_object):
        """Generate shingles from a function (must implement)"""
        raise NotImplementedError
    
    def process(self, function_object, hash_seed):
        """Hash the byte sequences into numeric shingles"""
        # Calls _generateByteSequences, then hashes results
        pass
Key responsibilities:
  • Define the _generateByteSequences() method to extract features
  • Automatically hash byte sequences using MurmurHash3
  • Support weighted contributions via XOR variants
  • Provide helper methods like _logbucket() for fuzzy bucketing
Source: mcrit/shinglers/AbstractShingler.py:11
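The split between `_generateByteSequences()` and `process()` is a template-method pattern: subclasses decide what to extract, the base class decides how to hash it. Below is a self-contained sketch of that pattern. `SketchShingler` and `MnemonicShingler` are hypothetical names, the toy function is just a list of mnemonic lists, and `zlib.crc32` stands in for MurmurHash3 so the example runs with the standard library only.

```python
import zlib
from abc import ABC, abstractmethod


class SketchShingler(ABC):
    """Hypothetical stand-in for AbstractShingler, not the real class."""

    def __init__(self, name):
        self._name = name

    @abstractmethod
    def _generateByteSequences(self, function_object):
        """Subclasses return an iterable of strings describing the function."""

    def process(self, function_object, hash_seed):
        # Hash every sequence to an unsigned 32-bit value and deduplicate.
        # crc32 is an assumption here; the document names MurmurHash3.
        sequences = self._generateByteSequences(function_object)
        hashes = {zlib.crc32(s.encode(), hash_seed) & 0xFFFFFFFF for s in sequences}
        return sorted(hashes)


class MnemonicShingler(SketchShingler):
    def _generateByteSequences(self, function_object):
        # One sequence per block: its mnemonics joined with ';'.
        return [";".join(block) for block in function_object]


toy_function = [["push", "mov"], ["call", "ret"], ["push", "mov"]]
shingles = MnemonicShingler("MnemonicShingler").process(toy_function, hash_seed=0)
print(len(shingles))  # two distinct blocks, so two shingles
```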

Shingler Weights

Each shingler has a weight determining how many signature positions it can influence:
# Example configuration
SHINGLERS_WEIGHTS = {
    "EscapedBlockShingler": 3,
    "FuzzyStatPairShingler": 2,
}
  • Weight of 3 = shingler contributes its base shingles plus two XOR variants, so it can influence more signature positions
  • Weight of 0 = shingler is disabled
  • Higher weight = more influence on similarity scores
Weights are implemented through XOR variants: a shingler with weight 3 generates its base shingles, then creates 2 additional variants by XORing with random values.
Source: mcrit/minhash/ShingleLoader.py:38
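The XOR-variant idea can be sketched as follows. This is an illustration under stated assumptions, not the ShingleLoader code: `expand_with_xor` is a hypothetical helper, and the XOR values are drawn from a seeded `random.Random` so that every run produces the same variants.

```python
import random


def expand_with_xor(base_shingles, weight, seed=12345):
    """Return the base shingles plus (weight - 1) XOR-ed copies of them."""
    if weight <= 0:
        return []  # weight 0 disables the shingler
    rng = random.Random(seed)
    xor_values = [rng.getrandbits(32) for _ in range(weight - 1)]
    expanded = list(base_shingles)
    for xor_value in xor_values:
        expanded.extend(s ^ xor_value for s in base_shingles)
    return expanded


base = [0x7A4B3C1D, 0x00C0FFEE]
weighted = expand_with_xor(base, weight=3)
print(len(weighted))  # 2 base shingles + 2 variants of each
```

Since the same XOR values are applied to every function, matching base shingles still produce matching variants, which is what lets a higher weight translate into more influence on the signature.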

ShingleLoader

The ShingleLoader dynamically loads and initializes shinglers based on configuration.
1. Scan Directory

Finds all *Shingler.py files in the shingler directory

2. Import Classes

Dynamically imports each shingler class

3. Apply Weights

Instantiates shinglers according to configured weights

4. Generate XOR Values

Creates random XOR values for weighted variants
# Weight strategies
WEIGHT_STRATEGY_ALL_SHINGLERS_EQUAL = 1  # All active shinglers, weight=1
WEIGHT_STRATEGY_SHINGLER_WEIGHTS = 2     # Use configured weights
Source: mcrit/minhash/ShingleLoader.py:12
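The scan, import, and weighting steps described above can be approximated with the standard library. The sketch below is an assumption about how such a loader could work, not a copy of ShingleLoader: `load_shingler_classes` and `select_active` are hypothetical helpers, and the convention that each file defines a class named after the file is assumed.

```python
import importlib.util
from pathlib import Path


def load_shingler_classes(directory):
    """Import every *Shingler.py file and return {class_name: class}."""
    classes = {}
    for path in sorted(Path(directory).glob("*Shingler.py")):
        name = path.stem
        if name.startswith("Abstract"):
            continue  # the base class is not a usable shingler
        spec = importlib.util.spec_from_file_location(name, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        # Assumed convention: the file defines a class with the same name.
        classes[name] = getattr(module, name)
    return classes


def select_active(classes, weights):
    """Keep only shinglers whose configured weight is greater than zero."""
    return {name: cls for name, cls in classes.items() if weights.get(name, 0) > 0}
```

Calling `select_active(load_shingler_classes("mcrit/shinglers"), SHINGLERS_WEIGHTS)` would then yield the classes to instantiate, with weight 0 acting as the off switch.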

Built-in Shinglers

MCRIT includes two primary shinglers in active use:

EscapedBlockShingler

Purpose: Captures instruction sequences with normalized operands
The EscapedBlockShingler extracts instruction patterns from basic blocks using SMDA’s instruction escaping.
Key features:
  • Groups mnemonics by category (move, arithmetic, control, etc.)
  • Escapes operands to remove absolute addresses and register specifics
  • Creates n-grams of escaped instructions
  • Filters stack operations (push/pop, esp/rsp)
Why it works: Instruction patterns remain similar across:
  • Different compilers (MSVC, GCC, Clang)
  • Optimization levels
  • Minor code variations
Source: mcrit/shinglers/EscapedBlockShingler.py:9
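To make the escaping idea concrete, here is a toy sketch. The category letters, the `escape_operand` rules, and the stack filter are simplified assumptions for illustration; SMDA's actual escaping is more detailed and uses its own tables.

```python
import re

# Hypothetical mnemonic categories: M = move, A = arithmetic, C = control.
CATEGORY = {"mov": "M", "lea": "M", "add": "A", "sub": "A", "xor": "A",
            "cmp": "C", "jmp": "C", "jz": "C", "call": "C", "ret": "C"}
STACK_MNEMONICS = {"push", "pop"}


def escape_operand(operand):
    operand = operand.strip()
    if re.fullmatch(r"(0x[0-9a-fA-F]+|\d+)", operand):
        return "i"  # immediate or absolute address
    if "[" in operand:
        return "m"  # memory reference
    return "r"      # anything else is treated as a register


def escape_instruction(mnemonic, operands):
    escaped = ",".join(escape_operand(o) for o in operands.split(",") if o.strip())
    return f"{CATEGORY.get(mnemonic, '?')} {escaped}".strip()


def block_ngrams(instructions, n=3):
    """Escape a block, drop push/pop, and emit n-grams joined by ';'."""
    escaped = [escape_instruction(m, o) for m, o in instructions
               if m not in STACK_MNEMONICS]
    return [";".join(escaped[i:i + n]) for i in range(len(escaped) - n + 1)]


build_a = [("push", "ebp"), ("mov", "eax, 0x401000"), ("add", "eax, ebx"),
           ("cmp", "eax, 5"), ("jz", "0x401020")]
build_b = [("push", "esi"), ("mov", "ecx, 0x7ff000"), ("add", "ecx, edx"),
           ("cmp", "ecx, 9"), ("jz", "0x7ff040")]
print(block_ngrams(build_a))
```

Both toy builds use different registers and addresses yet yield identical n-grams, which is the property the shingler relies on.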

FuzzyStatPairShingler

Purpose: Captures statistical properties of functions with fuzzy bucketing
The FuzzyStatPairShingler extracts control flow and instruction statistics, using logarithmic bucketing to introduce fuzziness.
Extracted statistics:
  • Number of calls (num_calls)
  • Number of specific instruction types (Control, Stack, Arithmetic, Move)
  • Stack frame size
  • Maximum basic block size
  • Strongly connected components (SCCs)
Why it works: Statistical properties are resilient to:
  • Register allocation differences
  • Instruction reordering
  • Minor optimizations
Source: mcrit/shinglers/FuzzyStatPairShingler.py:12
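Logarithmic bucketing maps nearby values to the same bucket so that small differences between builds do not change the shingle. The sketch below assumes base-2 buckets and assumes that "pair" means combining two bucketed statistics into one shingle; `logbucket` and `stat_pair_shingles` are hypothetical helpers rather than the shingler's real `_logbucket()` logic, and the ± bucket variants from the configuration are not modeled.

```python
from itertools import combinations


def logbucket(value: int) -> int:
    """Base-2 bucket: 0 -> 0, 1 -> 1, 2..3 -> 2, 4..7 -> 3, 8..15 -> 4, ..."""
    return max(value, 0).bit_length()


def stat_pair_shingles(stats):
    """Bucket every statistic, then emit one shingle per pair of statistics."""
    buckets = {name: logbucket(value) for name, value in stats.items()}
    return {f"{a}:{buckets[a]}|{b}:{buckets[b]}"
            for a, b in combinations(sorted(buckets), 2)}


# Hypothetical statistics for two slightly different builds of one function.
build_a = {"num_calls": 5, "max_block_size": 12, "stack_size": 40}
build_b = {"num_calls": 6, "max_block_size": 13, "stack_size": 44}

shingles_a = stat_pair_shingles(build_a)
shingles_b = stat_pair_shingles(build_b)
print(shingles_a == shingles_b)
```

The trade-off is visible at bucket boundaries: 7 and 8 land in different buckets even though they differ by one, which is presumably why the configuration offers neighbouring-bucket variants.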

Shingler Processing Pipeline

Step-by-Step Example

1. Function Disassembly

SMDA disassembles a function into blocks and instructions.

2. Shingler Execution

Each active shingler processes the function, for example:
  • EscapedBlockShingler → 45 instruction n-grams
  • FuzzyStatPairShingler → 18 statistical shingles

3. Hashing

Each byte sequence is hashed with MurmurHash3:
import mmh3
shingle_hash = mmh3.hash("A v,i;C;C i", seed) & 0xFFFFFFFF  # mask to an unsigned 32-bit value
# Result: an unsigned 32-bit integer such as 0x7A4B3C1D (illustrative value)

4. MinHash Generation

MinHasher combines the shingles using the selected strategy to produce a 256-value signature.
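The last step can be illustrated with a textbook MinHash. This is a generic sketch, not MCRIT's MinHasher or any of its strategies: it assumes a non-empty set of 32-bit shingle hashes and uses seeded linear hash functions modulo a Mersenne prime as the permutation family.

```python
import random

PRIME = (1 << 61) - 1  # Mersenne prime, larger than any 32-bit shingle


def minhash_signature(shingles, length=256, seed=12345):
    """For each of `length` hash functions, keep the smallest hashed shingle."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
              for _ in range(length)]
    return [min((a * s + b) % PRIME for s in shingles) for a, b in params]


def estimate_similarity(sig_a, sig_b):
    """Fraction of matching positions approximates the Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


shingles_a = {101, 202, 303, 404, 505}
shingles_b = {101, 202, 303, 404, 999}
estimate = estimate_similarity(minhash_signature(shingles_a),
                               minhash_signature(shingles_b))
print(f"estimated similarity: {estimate:.2f}")
```

Both signatures must be built with the same seed and length; otherwise their positions are not comparable.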

Advanced: Shingler Configuration

Creating Custom Shinglers

You can implement custom shinglers by extending AbstractShingler:
from mcrit.shinglers.AbstractShingler import AbstractShingler

class MyCustomShingler(AbstractShingler):
    def __init__(self, config, weight=1):
        super().__init__(__class__.__name__)
        self._config = config
        self._weight = weight
    
    def _generateByteSequences(self, function_object):
        sequences = []
        # Extract your custom features
        for block in function_object.blocks.values():
            # Example: hash block structure
            sequences.append(f"block_size:{block.length}")
        return sequences
Place your custom shingler in mcrit/shinglers/MyCustomShingler.py and it will be automatically discovered by ShingleLoader.

Configuration Options

# From ShinglerConfig
SHINGLER_DIR = "mcrit/shinglers"
SHINGLER_WEIGHT_STRATEGY = 2  # Use configured weights
SHINGLERS_SEED = 12345
SHINGLERS_XOR_VALUES = []  # Auto-generated

# Fuzzy bucketing for FuzzyStatPairShingler
SHINGLER_LOGBUCKETS = 256
SHINGLER_LOGBUCKET_RANGE = 2  # Create ±2 bucket variants
SHINGLER_LOGBUCKET_CENTERED = True

Archived Shinglers

MCRIT includes many experimental shinglers in mcrit/shinglers/archived/:
  • MnemHistShingler - Mnemonic histograms
  • MnemSeqShingler - Mnemonic sequences
  • NgramShingler - Instruction n-grams
  • TreeBfsShingler - Control flow tree traversal
  • CallgraphStatsShingler - Call graph properties
These are kept for research purposes but are not actively used due to performance or accuracy concerns.

Shingler Performance Characteristics

EscapedBlockShingler

Speed: Fast (processes ~1000 functions/second)
Accuracy: High for similar code, resilient to:
  • ✅ Compiler variations
  • ✅ Optimization levels
  • ✅ Register allocation
Weaknesses:
  • ❌ Sensitive to instruction reordering
  • ❌ Can miss semantically equivalent code

FuzzyStatPairShingler

Speed: Very fast (processes ~5000 functions/second)
Accuracy: Moderate, good for:
  • ✅ Finding functions with similar complexity
  • ✅ Matching across heavy optimizations
  • ✅ Complementing instruction-based shinglers
Weaknesses:
  • ❌ Less discriminative (many functions have similar stats)
  • ❌ Not sufficient alone for accurate matching

Best Practices

Weight Balance

Give instruction-based shinglers (EscapedBlock) higher weight than statistical shinglers

Complementary Features

Combine shinglers that capture different aspects (syntax + statistics)

Testing

Test custom shinglers on diverse sample sets to validate effectiveness

Performance

Monitor shingler execution time; a slow shingler becomes the bottleneck of the analysis
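One lightweight way to follow the performance advice is to time each shingler over the same batch of functions. The sketch below is hypothetical: `time_shinglers` is not an MCRIT function, and the "shinglers" are plain callables standing in for calls to a shingler's `process()` method.

```python
import time


def time_shinglers(shinglers, functions):
    """Return {name: seconds spent} for each callable over all functions."""
    timings = {}
    for name, shingle_fn in shinglers.items():
        start = time.perf_counter()
        for function in functions:
            shingle_fn(function)
        timings[name] = time.perf_counter() - start
    return timings


# Toy stand-ins: lists of integers instead of disassembled functions.
toy_functions = [list(range(n)) for n in range(1, 200)]
toy_shinglers = {
    "cheap": lambda f: len(f),
    "expensive": lambda f: sorted(f, reverse=True),
}

timings = time_shinglers(toy_shinglers, toy_functions)
slowest = max(timings, key=timings.get)
print(f"slowest shingler: {slowest}")
```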

MinHash

How shingles are combined into similarity signatures

Architecture

Where shinglers fit in MCRIT’s pipeline
