MCRIT is a distributed system designed for large-scale binary code analysis. This page explains the architecture, components, and how data flows through the system.

Architecture Overview

MCRIT follows a client-server architecture with asynchronous job processing.

Core Components

Server Component (REST API)

The server provides a REST API built with Falcon, a high-performance Python web framework.

  • FamilyResource: Manage malware families (/families, /family/{id})
  • SampleResource: Import and manage samples (/samples, /sample/{id})
  • FunctionResource: Query function details (/functions, /function/{id})
  • QueryResource: Query by binary or PicHash (/query/*)
  • MatchResource: Get matching results (/matches/*)
  • JobResource: Monitor job status (/jobs, /job/{id})

Key characteristics:
  • Stateless - No session management
  • Asynchronous - Long operations return job IDs
  • REST-compliant - Standard HTTP methods and status codes
Source: mcrit/server/application_routes.py
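The asynchronous pattern behind these routes can be sketched with the standard library. The host, port, and request payload format below are assumptions, not the documented wire format; only the route paths come from the resource table above.

```python
import urllib.request

# Hypothetical base URL; host and port depend on your deployment.
BASE = "http://127.0.0.1:8000"

# Submit a binary for import. Treating POST /samples as the import
# endpoint with the raw bytes as the body is an assumption.
binary_bytes = b"\x4d\x5a..."  # placeholder, not a real PE file
submit = urllib.request.Request(f"{BASE}/samples", data=binary_bytes, method="POST")

# The server answers immediately with a job id (asynchronous pattern).
job_id = "a3f5e8b2-c1d4-e6f7-a9b0-123456789abc"  # as returned by the server
poll = urllib.request.Request(f"{BASE}/job/{job_id}", method="GET")

# Against a running server you would urlopen(submit), read the job id
# from the response, then urlopen(poll) until the state is "finished".
print(submit.get_method(), submit.full_url)
print(poll.get_method(), poll.full_url)
```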

MinHashIndex

The MinHashIndex is the main entry point for all analysis operations. It coordinates between the API, storage, and workers.
  • Sample Management - Import/delete samples
  • Query Coordination - Route queries to workers
  • Result Retrieval - Fetch completed job results
  • Index Maintenance - Rebuild indices, cleanup
  • Configuration - Manage thresholds and settings
Source: mcrit/index/MinHashIndex.py:61

Worker Components

MCRIT supports three worker types for different deployment scenarios:

Worker (Base Class)

Standard worker that processes jobs synchronously in its own context.
from mcrit.Worker import Worker

with Worker() as worker:
    worker.start()  # Process jobs from queue
Characteristics:
  • One job at a time
  • Keeps state in memory
  • Good for: Docker deployments, simple setups
Source: mcrit/Worker.py:50

SpawningWorker

Spawns a new process (SingleJobWorker) for each job, providing isolation.
from mcrit.SpawningWorker import SpawningWorker

with SpawningWorker() as worker:
    worker.start()  # Spawns child processes for jobs
Characteristics:
  • Spawns subprocess for each job
  • Better fault isolation
  • Memory cleanup between jobs
  • Good for: Long-running services, memory leak protection
Implementation:
def _executeJobPayload(self, job_payload, job):
    # Spawn a new process
    console_handle = subprocess.Popen(
        ["python", "-m", "mcrit", "singlejobworker", "--job_id", str(job.job_id)],
        stdout=subprocess.PIPE, 
        stderr=subprocess.PIPE
    )
    stdout_result, stderr_result = console_handle.communicate(
        timeout=QUEUE_SPAWNINGWORKER_CHILDREN_TIMEOUT
    )
Source: mcrit/SpawningWorker.py:51

SingleJobWorker

Processes exactly one job then exits. Used by SpawningWorker or for manual job processing.
from mcrit.SingleJobWorker import SingleJobWorker

worker = SingleJobWorker(job_id=12345)
worker.run()  # Process this specific job, then exit
Characteristics:
  • One job, then terminate
  • Clean process per job
  • Good for: Spawned by SpawningWorker, debugging specific jobs
Source: mcrit/SingleJobWorker.py:52

Storage Layer (MongoDB)

MCRIT uses MongoDB for all persistent storage with specialized indices for performance.

Collections

Malware family metadata
{
  family_id: 1,
  family_name: "Emotet",
  num_samples: 456,
  num_functions: 12890,
  is_library: false
}
Binary sample metadata
{
  sample_id: 100,
  family_id: 1,
  sha256: "a3f5e8b2...",
  filename: "emotet.exe",
  architecture: "intel",
  bitness: 32,
  base_addr: 0x400000,
  statistics: {
    num_functions: 234,
    num_instructions: 45678
  }
}
Function entries with MinHash signatures
{
  function_id: 5000,
  sample_id: 100,
  family_id: 1,
  offset: 0x401000,
  function_name: "sub_401000",
  pichash: NumberLong("8845632100997654321"),
  minhash: BinData(0, "..."),  // 256-byte signature
  num_instructions: 45,
  num_blocks: 8
}
Indices:
  • {function_id: 1} - Primary key
  • {sample_id: 1} - Functions by sample
  • {pichash: 1} - Fast PicHash lookup
MinHash band index for LSH
{
  band_id: "hash_of_band_values",
  band_number: 15,
  function_ids: [5000, 5123, 7890, ...]
}
Each function’s signature is split into bands (e.g., 64 bands × 4 values). Functions sharing a band hash are candidates.
Index:
  • {band_id: 1, band_number: 1} - Fast candidate lookup
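Under standard locality-sensitive hashing analysis, a function with true MinHash similarity s to the query shares at least one of b bands (r rows each) with probability 1 - (1 - s^r)^b. A quick sketch with the 64 × 4 example configuration (pure math, no MCRIT code):

```python
def candidate_probability(s: float, bands: int = 64, rows: int = 4) -> float:
    """Probability that a function with true similarity s shares at
    least one band with the query under a (bands x rows) LSH scheme."""
    return 1.0 - (1.0 - s ** rows) ** bands

# Low-similarity functions rarely surface; high-similarity ones almost always do.
for s in (0.3, 0.5, 0.7, 0.9):
    print(f"s={s:.1f}: P(candidate)={candidate_probability(s):.3f}")
```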
Basic block hash index
{
  hash: NumberLong("9876543210123456789"),
  family_id: 1,
  sample_id: 100,
  function_id: 5000,
  offset: 0x401010
}
Index:
  • {hash: 1} - Fast block lookup
Asynchronous job tracking
{
  job_id: "a3f5e8b2-c1d4-e6f7-a9b0-123456789abc",
  method: "getMatchesForSmdaReport",
  state: "finished",
  parameters: {...},
  result_id: "result_sha256",
  created_at: ISODate("2026-03-04T10:30:00Z"),
  started_at: ISODate("2026-03-04T10:30:05Z"),
  finished_at: ISODate("2026-03-04T10:35:23Z")
}
Source: mcrit/storage/MongoDbStorage.py

Band Index Structure

The band index is critical for MinHash performance:
# Example: 256-value signature, 64 bands, 4 rows per band
for band_num in range(64):
    band_values = minhash.minhash_int[band_num*4:(band_num+1)*4]
    band_hash = hash(tuple(band_values))
    
    # Store in MongoDB
    db.bands.update_one(
        {"band_id": band_hash, "band_number": band_num},
        {"$addToSet": {"function_ids": function_id}}
    )
Query process:
  1. Calculate bands for query function
  2. Lookup all bands in index
  3. Collect all function IDs appearing in any band
  4. These are MinHash candidates
  5. Score candidates against query
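The lookup side of this cycle can be sketched in memory. This is a simplified stand-in for the MongoDB band collection, not MCRIT's actual implementation:

```python
from collections import defaultdict

BANDS, ROWS = 64, 4  # matches the 256-value example configuration above

def split_into_bands(signature):
    """Return (band_number, band_hash) pairs for a 256-value signature."""
    return [
        (band, hash(tuple(signature[band * ROWS:(band + 1) * ROWS])))
        for band in range(BANDS)
    ]

# In-memory stand-in for the band collection.
band_index = defaultdict(set)  # (band_number, band_hash) -> function ids

def insert(function_id, signature):
    for key in split_into_bands(signature):
        band_index[key].add(function_id)

def query(signature, stored_signatures):
    # Steps 2-3: look up every band, union the function ids.
    candidates = set()
    for key in split_into_bands(signature):
        candidates |= band_index.get(key, set())
    # Step 5: score each candidate as the fraction of matching positions
    # (the MinHash estimate of Jaccard similarity).
    return {
        fid: sum(a == b for a, b in zip(signature, stored_signatures[fid]))
        / len(signature)
        for fid in candidates
    }

# Demo: a signature resurfaces as its own candidate with score 1.0.
sig = list(range(256))
insert(5000, sig)
print(query(sig, {5000: sig}))  # {5000: 1.0}
```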

Job Queue System

MCRIT uses a job queue for asynchronous processing of expensive operations.

Queue Implementations

LocalQueue: In-memory queue using Python multiprocessing
  • Use case: Development, testing, single-node deployments
  • Persistence: None (jobs lost on restart)
  • Scalability: Single machine only
from mcrit.queue.LocalQueue import LocalQueue
queue = LocalQueue()

Job Lifecycle

Job Types

Common jobs in MCRIT:
  • addBinarySample - Disassemble and import binary
  • updateMinHashesForSample - Calculate MinHash signatures
  • getMatchesForSmdaReport - Match query against database
  • getMatchesForSample - Match sample vs all others
  • combineMatchesToCross - Cross-matching analysis
  • deleteSample - Remove sample and cleanup indices
  • rebuildIndex - Recreate band index from scratch
Source: mcrit/queue/LocalQueue.py, mcrit/Worker.py
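A minimal model of the lifecycle a job moves through can be sketched as follows. The field names follow the jobs collection example above; the state names "created" and "started" are illustrative assumptions, and the real logic lives in mcrit/queue/ and mcrit/Worker.py:

```python
import uuid
from datetime import datetime, timezone

def create_job(method, parameters):
    """Client/API side: enqueue a job and return immediately."""
    return {
        "job_id": str(uuid.uuid4()),
        "method": method,
        "parameters": parameters,
        "state": "created",  # state name is illustrative
        "created_at": datetime.now(timezone.utc),
        "started_at": None,
        "finished_at": None,
        "result_id": None,
    }

def run_job(job, handler):
    """Worker side: claim the job, execute it, record the result."""
    job["state"] = "started"
    job["started_at"] = datetime.now(timezone.utc)
    job["result_id"] = handler(**job["parameters"])
    job["state"] = "finished"
    job["finished_at"] = datetime.now(timezone.utc)
    return job

job = create_job("getMatchesForSample", {"sample_id": 100})
run_job(job, lambda sample_id: f"result_for_{sample_id}")
print(job["state"])  # finished
```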

Data Flow: Sample Import

Let’s trace what happens when you import a binary:
1. Client Submits Binary

client.addBinarySample(
    binary_bytes,
    family="Emotet",
    is_dump=False
)
2. API Creates Job

The REST API receives the request and creates a job:
job_id = index.addBinarySample(...)
# Returns immediately with job_id
3. Worker Picks Up Job

A worker claims the job from the queue and executes:
  • Disassemble with SMDA
  • Create SampleEntry in MongoDB
  • Create FunctionEntry for each function
4. Calculate Hashes

If calculate_hashes=True, the worker spawns a follow-up job to:
  • Generate MinHash for each function
  • Calculate PicHash for each function
  • Store signatures in MongoDB
5. Build Index

  • Split MinHash into bands
  • Insert into band index
  • Insert into PicHash index
6. Job Complete

The worker marks the job as finished and stores the result.

7. Client Polls Result

status = client.getStatusForJob(job_id)
if status["state"] == "finished":
    result = client.getResultForJob(job_id)

Data Flow: Query Matching

When querying a new binary:
1. Submit Query

job_id = client.getMatchesForUnmappedBinary(binary_bytes)
2. Disassemble

The worker disassembles the binary with SMDA (the query sample is not stored).

3. Calculate MinHash

Generate MinHash signatures for all functions
4. PicHash Lookup

Check PicHash index for exact matches (very fast)
5. Band Lookup

Query band index for MinHash candidates
6. Score Candidates

Calculate MinHash scores for all candidates
7. Apply Threshold

Filter matches by configured threshold (e.g., 50%)
8. Return Results

Aggregate matches by sample and family, then return the matching report.
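Steps 6 to 8 (scoring, thresholding, aggregation) can be sketched as follows. The match tuples and report shape are simplified illustrations, not MCRIT's actual report format:

```python
from collections import defaultdict

# Hypothetical per-function match tuples: (sample_id, family_id, score).
function_matches = [
    (100, 1, 0.92), (100, 1, 0.61), (100, 1, 0.34),
    (205, 3, 0.55), (205, 3, 0.12),
]

THRESHOLD = 0.50  # e.g. the configured 50% minimum score

# Step 7: drop everything below the threshold.
kept = [m for m in function_matches if m[2] >= THRESHOLD]

# Step 8: aggregate the surviving matches by sample (and family).
by_sample = defaultdict(lambda: {"family_id": None, "matched": 0, "best": 0.0})
for sample_id, family_id, score in kept:
    entry = by_sample[sample_id]
    entry["family_id"] = family_id
    entry["matched"] += 1
    entry["best"] = max(entry["best"], score)

print(dict(by_sample))
```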

Scalability Considerations

Horizontal Scaling

Run multiple workers to parallelize matching. Each worker operates independently.

Database Sharding

MongoDB supports sharding for very large datasets. Shard by family_id or sample_id.

Query Optimization

Use PicHash for initial filtering. Adjust the band configuration to trade speed against recall.

Storage Optimization

Use 8-bit MinHash signatures (4x less storage). Optionally drop the xcfg data after indexing.
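The 4x figure follows directly from the per-value encoding, using the 256-value signatures described on this page (the 32-bit baseline is an assumption):

```python
SIGNATURE_VALUES = 256

bytes_32bit = SIGNATURE_VALUES * 4  # 4 bytes per value
bytes_8bit = SIGNATURE_VALUES * 1   # 1 byte per value -> the 256-byte
                                    # signatures shown in the functions
                                    # collection above
print(bytes_32bit, bytes_8bit, bytes_32bit // bytes_8bit)  # 1024 256 4
```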

Configuration Points

Key configuration files:
  • McritConfig - Overall system config
  • MinHashConfig - Signature length, bits, thresholds
  • ShinglerConfig - Shingler weights and parameters
  • StorageConfig - MongoDB connection and settings
  • QueueConfig - Job queue configuration
Source: mcrit/config/
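As a hypothetical sketch of the kind of parameters MinHashConfig might carry (all field names here are assumptions; the values echo the examples on this page, and mcrit/config/ is authoritative):

```python
from dataclasses import dataclass

@dataclass
class MinHashConfigSketch:
    """Illustrative only; real field names in mcrit/config/ may differ."""
    signature_length: int = 256   # number of MinHash values
    signature_bits: int = 8       # bits per value (8 -> 256-byte signatures)
    band_count: int = 64          # LSH bands
    band_rows: int = 4            # values per band
    match_threshold: float = 0.5  # minimum score to report a match

cfg = MinHashConfigSketch()
# Bands x rows must cover the whole signature.
assert cfg.band_count * cfg.band_rows == cfg.signature_length
```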

Deployment Patterns

All-in-one deployment:
  • Server + Worker + MongoDB on one machine
  • Use LocalQueue or MongoQueue
  • Good for: Small datasets (under 100k functions), development
# Terminal 1: Start MongoDB
mongod --dbpath ./data

# Terminal 2: Start Server
mcrit server

# Terminal 3: Start Worker
mcrit worker

MinHash

How signatures are generated and indexed

Shinglers

Feature extraction in the worker

PicHash

Exact matching in the index
