PicHash (Position-Independent Code Hash) provides fast, exact matching of code functions in MCRIT. Unlike MinHash’s approximate similarity, PicHash identifies identical code regardless of where it’s loaded in memory.

What is PicHash?

PicHash is a 64-bit hash of a function’s bytes after normalizing position-dependent elements:
  • Absolute addresses → Escaped/removed
  • Intraprocedural jumps → Normalized to offsets
  • Everything else → Hashed as-is
This creates a fingerprint that’s stable across:
  • Different base addresses (ASLR)
  • Different positions in the binary
  • Different builds (if code unchanged)

PicHash vs PicBlockHash

MCRIT uses two levels of position-independent hashing:

PicHash

Function-level hash
  • Computed from the entire function’s bytes
  • One hash per function
  • Used for: finding identical functions

PicBlockHash

Basic block-level hash
  • Computed for each basic block
  • Multiple hashes per function
  • Used for: finding code reuse, partial matches

How PicHash Works

Function-Level PicHash

Step 1: Escape Instructions

Use SMDA’s IntelInstructionEscaper to normalize position-dependent bytes:
# Example instruction transformation
call 0x401000 → call <OFFSET>
mov eax, [0x403000] → mov eax, [<ADDR>]
jnz 0x4010A0 → jnz <OFFSET>
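As a rough illustration (not SMDA’s actual implementation, which escapes decoded instruction operands rather than text), the normalization can be sketched with regexes over textual disassembly:

```python
import re

# Hypothetical text-level sketch of instruction escaping; SMDA's real
# IntelInstructionEscaper works on decoded operands, not strings.
def escape_instruction(ins: str) -> str:
    # Normalize absolute targets of calls/jumps to a placeholder offset
    ins = re.sub(r"\b(call|jmp|jnz|jz|je|jne)\s+0x[0-9a-fA-F]+",
                 r"\1 <OFFSET>", ins)
    # Normalize absolute memory addresses to a placeholder
    ins = re.sub(r"\[0x[0-9a-fA-F]+\]", "[<ADDR>]", ins)
    return ins

print(escape_instruction("call 0x401000"))        # call <OFFSET>
print(escape_instruction("mov eax, [0x403000]"))  # mov eax, [<ADDR>]
```

Any two copies of the function, regardless of load address, now escape to the same byte sequence and therefore hash to the same value.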
Step 2: Concatenate Blocks

Join the escaped bytes from all basic blocks in the function.
Step 3: Hash with SHA-256

Take the first 8 bytes of the SHA-256 digest:
import hashlib
import struct

# Interpret the first 8 bytes of the digest as an unsigned 64-bit integer
pic_hash = struct.unpack("Q",
    hashlib.sha256(escaped_bytes).digest()[:8]
)[0]
Result: A 64-bit integer that’s stable across position changes. MCRIT automatically computes PicHash for every function during import. It’s stored in the FunctionEntry:
function_entry.pichash  # 64-bit integer
Source: mcrit/storage/FunctionEntry.py:34

Block-Level PicBlockHash

Each basic block gets its own hash, enabling detection of:
  • Code reuse (same block in different functions)
  • Partial matches (some blocks match, others don’t)
  • Unique blocks (blocks appearing in only one family)
Source: mcrit/matchers/FunctionCfgMatcher.py:33
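A minimal sketch of the per-block construction, assuming each block’s escaped bytes are available and reusing the same first-8-bytes-of-SHA-256 scheme as the function-level PicHash:

```python
import hashlib
import struct

def pic_block_hash(escaped_block_bytes: bytes) -> int:
    """First 8 bytes of SHA-256 over one basic block's escaped bytes,
    mirroring the function-level PicHash construction."""
    return struct.unpack("Q",
                         hashlib.sha256(escaped_block_bytes).digest()[:8])[0]

# One hash per basic block; a function yields a list of (offset, hash, size)
blocks = {0x1000: b"\x55\x89\xe5", 0x1003: b"\x5d\xc3"}
picblockhashes = [
    {"offset": off, "hash": pic_block_hash(data), "size": len(data)}
    for off, data in blocks.items()
]
```

Because each block is hashed independently, two functions can share some block hashes while differing in others, which is exactly what enables partial matching.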

Integration with picblocks Library

MCRIT uses the external picblocks library for some PicHash operations:
from picblocks.blockhasher import BlockHasher

# BlockHasher handles position-independent hashing
# Integrates with SMDA for instruction escaping
The picblocks library provides:
  • BlockHasher - Generates PicBlockHashes
  • YARA rule generation - Creates rules from unique blocks
  • Visualization - Shows block reuse patterns
Source: mcrit/storage/MongoDbStorage.py (import statement)

Use Cases

1. Exact Function Matching

PicHash enables instant lookup of known functions:
# Check if a function is already indexed
matches = index.getMatchesForPicHash(0x7A3B2C1D9E4F5A6B)
# Returns: [(family_id, sample_id, function_id), ...]
Use cases:
  • Library function identification
  • Finding exact code clones
  • Deduplication across samples
PicHash matching is much faster than MinHash (~1000x) because it’s a simple hash table lookup instead of LSH candidate generation.
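The speed difference comes from plain hash-table semantics. A toy in-memory index (names are illustrative, not MCRIT’s actual classes) shows the idea:

```python
from collections import defaultdict

# Hypothetical in-memory index illustrating why PicHash lookup is O(1):
# a plain hash table mapping pichash -> [(family_id, sample_id, function_id)]
index = defaultdict(list)

def add_function(pichash, family_id, sample_id, function_id):
    index[pichash].append((family_id, sample_id, function_id))

def get_matches_for_pichash(pichash):
    # Single dictionary lookup; no LSH candidate generation needed
    return index.get(pichash, [])

add_function(0x7A3B2C1D9E4F5A6B, 1, 100, 5000)
add_function(0x7A3B2C1D9E4F5A6B, 2, 150, 7890)
print(get_matches_for_pichash(0x7A3B2C1D9E4F5A6B))
# [(1, 100, 5000), (2, 150, 7890)]
```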

2. Unique Block Identification

Find basic blocks that appear only in a specific family:
unique_blocks = index.getUniqueBlocks(sample_ids=[1, 2, 3])
# Returns blocks that are unique to these samples
Applications:
  • YARA rule generation - Create signatures for malware families
  • Code attribution - Identify distinctive code patterns
  • Threat hunting - Find specific implementations
Source: mcrit/storage/UniqueBlocksResult.py:26
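Conceptually, this is a set difference: the target samples’ block hashes minus every block hash seen elsewhere. A sketch, assuming a mapping from sample IDs to their PicBlockHashes:

```python
# Sketch: blocks unique to the target samples = target block hashes minus
# block hashes observed in any other sample (data shapes are illustrative).
def unique_blocks(blocks_by_sample: dict, target_sample_ids: list) -> set:
    target, others = set(), set()
    for sample_id, hashes in blocks_by_sample.items():
        if sample_id in target_sample_ids:
            target |= set(hashes)
        else:
            others |= set(hashes)
    return target - others

blocks_by_sample = {1: {0xA, 0xB}, 2: {0xB, 0xC}, 3: {0xC, 0xD}}
result = unique_blocks(blocks_by_sample, [1, 2])  # returns {0xA, 0xB}
```

Note that 0xC is excluded even though it occurs in target sample 2, because it also appears in sample 3.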

3. Partial Function Matching

Compare which basic blocks match between two functions:
matcher = FunctionCfgMatcher(sample_a, func_a, sample_b, func_b)
block_matches = matcher.getAllPicblockMatches()
# Returns: {"a": {block_offsets}, "b": {block_offsets}}
Use cases:
  • Understanding code evolution
  • Finding partially modified functions
  • Visualizing code reuse
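Block-level matching boils down to intersecting the two functions’ PicBlockHash sets. A sketch mirroring the documented result shape (helper names are illustrative, not the FunctionCfgMatcher API):

```python
# Two functions' blocks match when their PicBlockHashes are equal;
# the result maps each side to the offsets of its matched blocks.
def picblock_matches(blocks_a: dict, blocks_b: dict) -> dict:
    # blocks_x maps block offset -> PicBlockHash
    shared = set(blocks_a.values()) & set(blocks_b.values())
    return {
        "a": {off for off, h in blocks_a.items() if h in shared},
        "b": {off for off, h in blocks_b.items() if h in shared},
    }

a = {0x1000: 111, 0x1010: 222, 0x1020: 333}
b = {0x2000: 222, 0x2040: 999}
matches = picblock_matches(a, b)  # {"a": {0x1010}, "b": {0x2000}}
```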

Query Endpoints

MCRIT’s REST API provides PicHash query endpoints:

Query by PicHash

curl http://localhost:8000/query/pichash/7A3B2C1D9E4F5A6B
Returns all functions with this PicHash:
{
  "status": "successful",
  "data": [
    [1, 100, 5000],  // [family_id, sample_id, function_id]
    [1, 101, 5234],
    [2, 150, 7890]
  ]
}
Source: mcrit/server/QueryResource.py:90

Query by PicBlockHash

curl http://localhost:8000/query/picblockhash/9D4E5F6A7B8C9D0E
Returns all blocks with this hash:
{
  "status": "successful",
  "data": [
    [1, 100, 5000, 0x1000],  // [family_id, sample_id, function_id, offset]
    [1, 101, 5234, 0x1050],
    [3, 200, 9876, 0x2000]
  ]
}
Note: Block results include the offset because a function contains many blocks, each with its own hash, so the hash alone does not locate the matching block within the function. Source: mcrit/server/QueryResource.py:133

Summary Endpoints

Get statistics without full match lists:
curl http://localhost:8000/query/pichash/7A3B2C1D9E4F5A6B/summary
{
  "status": "successful",
  "data": {
    "families": 3,
    "samples": 45,
    "functions": 234
  }
}
Useful for quick prevalence checks without transferring large result sets. Source: mcrit/server/QueryResource.py:109

PicHash Storage and Indexing

MCRIT maintains separate indices for PicHash and PicBlockHash:

MongoDB Schema

// FunctionEntry document
{
  function_id: 12345,
  pichash: NumberLong("8845632100997654321"),
  picblockhashes: [
    {offset: 0x1000, hash: NumberLong("..."), size: 15},
    {offset: 0x100F, hash: NumberLong("..."), size: 8}
  ],
  // ... other fields
}

// Separate indices for fast lookup
db.functions.createIndex({pichash: 1})
db.picblockhashes.createIndex({hash: 1})

YARA Rule Generation

MCRIT can generate YARA rules from unique blocks identified via PicBlockHash:
Step 1: Identify Unique Blocks

Find blocks that appear only in the target samples.

Step 2: Select Representative Blocks

Choose blocks that best cover all samples.

Step 3: Generate Rule

Create YARA signatures from the block bytes.
Source: mcrit/storage/UniqueBlocksResult.py:35
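The selection in step 2 is essentially a set-cover problem. A hedged sketch using a greedy heuristic (MCRIT’s actual selection strategy may differ):

```python
# Greedy set cover: repeatedly pick the unique block that covers the most
# still-uncovered samples, until every target sample is covered.
def select_covering_blocks(block_to_samples: dict, samples: set) -> list:
    chosen, uncovered = [], set(samples)
    while uncovered:
        # block covering the most still-uncovered samples
        best = max(block_to_samples,
                   key=lambda b: len(block_to_samples[b] & uncovered))
        gain = block_to_samples[best] & uncovered
        if not gain:
            break  # remaining samples cannot be covered by any block
        chosen.append(best)
        uncovered -= gain
    return chosen

block_to_samples = {"blk1": {1, 2}, "blk2": {2, 3}, "blk3": {3}}
print(select_covering_blocks(block_to_samples, {1, 2, 3}))
# ['blk1', 'blk2']
```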

Performance Characteristics

PicHash computation
  • Speed: Very fast (~10,000 functions/second)
  • When: Computed once during function import
  • Cost: Minimal (a single SHA-256 hash per function)

PicHash lookup
  • Speed: Extremely fast (~100,000 lookups/second)
  • Method: Direct hash table lookup (O(1))
  • Use case: Checking if a function is already known

PicBlockHash computation
  • Speed: Fast (~5,000 functions/second)
  • When: Computed on-demand or during matching
  • Cost: One SHA-256 hash per basic block

Unique block queries
  • Speed: Moderate (depends on dataset size)
  • Method: Set operations across all blocks
  • Use case: YARA generation, family analysis

Limitations

PicHash is not resilient to:
  • Code modifications (even single instruction changes)
  • Compiler differences
  • Optimization level changes
  • Instruction reordering
For fuzzy matching, use MinHash instead.

PicHash vs MinHash: When to Use Each

| Scenario | Use PicHash | Use MinHash |
| --- | --- | --- |
| Exact library function lookup | ✅ | |
| Finding code clones | ✅ | |
| Compiler variation tolerance | | ✅ |
| Optimization resilience | | ✅ |
| Speed critical | ✅ | |
| Initial filtering | ✅ | |
| Similarity scoring | | ✅ |
| Partial matching | Blocks only | ✅ Full function |
Best practice: Use PicHash for fast exact matching first, then fall back to MinHash for similarity matching. MCRIT does this automatically.
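The two-stage strategy can be sketched as follows (index structures and names are illustrative, not MCRIT’s internal API):

```python
# Two-stage matching: exact PicHash lookup first; fall back to MinHash
# similarity only when no identical function is already indexed.
def match_function(pichash, minhash, pichash_index, minhash_matcher):
    exact = pichash_index.get(pichash)
    if exact:
        # identical code found: score 1.0, no LSH candidate generation needed
        return [(function_id, 1.0) for function_id in exact]
    # approximate path: hand the MinHash to the fuzzy matcher
    return minhash_matcher(minhash)

pichash_index = {0x7A3B2C1D9E4F5A6B: [5000, 5234]}
hits = match_function(0x7A3B2C1D9E4F5A6B, None, pichash_index, lambda m: [])
print(hits)  # [(5000, 1.0), (5234, 1.0)]
```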

Further Reading

MinHash

Fuzzy similarity matching for modified code

Architecture

How PicHash fits into MCRIT’s workflow
