# Architecture Overview

MCRIT follows a client-server architecture with asynchronous job processing.

## Core Components
### Server Component (REST API)

The server provides a REST API built with Falcon, a high-performance Python web framework.

| Resource | Purpose | Endpoints |
| --- | --- | --- |
| FamilyResource | Manage malware families | `/families`, `/family/{id}` |
| SampleResource | Import and manage samples | `/samples`, `/sample/{id}` |
| FunctionResource | Query function details | `/functions`, `/function/{id}` |
| QueryResource | Query by binary or PicHash | `/query/*` |
| MatchResource | Get matching results | `/matches/*` |
| JobResource | Monitor job status | `/jobs`, `/job/{id}` |

Design principles:

- Stateless - No session management
- Asynchronous - Long-running operations return job IDs
- REST-compliant - Standard HTTP methods and status codes

Source: `mcrit/server/application_routes.py`
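The asynchronous design can be illustrated with a toy job server (illustrative Python only, not MCRIT's actual server code): a submit call returns a job ID immediately, and the client polls for status until the result is ready.

```python
import threading
import time
import uuid

class ToyJobServer:
    """Toy illustration of the 'long operations return job IDs' pattern."""

    def __init__(self):
        self._jobs = {}  # job_id -> {"status": ..., "result": ...}

    def submit(self, func, *args):
        """Start the work in the background and immediately return a job ID."""
        job_id = str(uuid.uuid4())
        self._jobs[job_id] = {"status": "pending", "result": None}

        def run():
            self._jobs[job_id]["result"] = func(*args)
            self._jobs[job_id]["status"] = "finished"

        threading.Thread(target=run, daemon=True).start()
        return job_id

    def status(self, job_id):
        return self._jobs[job_id]["status"]

    def result(self, job_id):
        return self._jobs[job_id]["result"]

server = ToyJobServer()
job_id = server.submit(lambda n: sum(range(n)), 1000)

# The client polls the job status until the job completes.
while server.status(job_id) != "finished":
    time.sleep(0.01)

print(server.result(job_id))  # 499500
```

Because the server keeps no session state, any client holding the job ID can poll for the result later.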
### MinHashIndex

The MinHashIndex is the main entry point for all analysis operations. It coordinates between the API, the storage layer, and the workers, following a delegation pattern: expensive work is handed off to workers via the job queue rather than executed in-process.

Responsibilities:

- Sample Management - Import/delete samples
- Query Coordination - Route queries to workers
- Result Retrieval - Fetch completed job results
- Index Maintenance - Rebuild indices, cleanup
- Configuration - Manage thresholds and settings

Source: `mcrit/index/MinHashIndex.py:61`
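The delegation pattern can be sketched as follows (class and method names here are illustrative stand-ins, not MCRIT's actual signatures): the coordinator owns no heavy logic itself, it only routes calls to storage and queue.

```python
class InMemoryStorage:
    """Stand-in for the MongoDB storage layer."""
    def __init__(self):
        self.samples = {}

    def add_sample(self, sample_id, meta):
        self.samples[sample_id] = meta

    def delete_sample(self, sample_id):
        self.samples.pop(sample_id, None)

class InMemoryQueue:
    """Stand-in for the job queue."""
    def __init__(self):
        self.jobs = []

    def enqueue(self, job_type, payload):
        self.jobs.append((job_type, payload))
        return len(self.jobs) - 1  # toy job ID

class ToyIndex:
    """Coordinator: delegates persistence to storage and heavy work to the queue."""
    def __init__(self, storage, queue):
        self._storage = storage
        self._queue = queue

    def add_binary_sample(self, sample_id, meta):
        self._storage.add_sample(sample_id, meta)
        # Expensive hashing is deferred to a worker via the queue.
        return self._queue.enqueue("updateMinHashesForSample", sample_id)

    def delete_sample(self, sample_id):
        self._storage.delete_sample(sample_id)

storage, queue = InMemoryStorage(), InMemoryQueue()
index = ToyIndex(storage, queue)
job_id = index.add_binary_sample(1, {"family": "demo"})
print(job_id, storage.samples, queue.jobs)
```

The payoff of this split is testability and deployment flexibility: storage and queue backends can be swapped without touching the coordination logic.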
## Worker Components

MCRIT supports three worker types for different deployment scenarios.

### Worker (Base Class)

A standard worker that processes jobs synchronously in its own context.

- One job at a time
- Keeps state in memory
- Good for: Docker deployments, simple setups

Source: `mcrit/Worker.py:50`

### SpawningWorker

Spawns a new process (a SingleJobWorker) for each job, providing isolation.

- Spawns a subprocess for each job
- Better fault isolation
- Memory cleanup between jobs
- Good for: long-running services, memory leak protection

Source: `mcrit/SpawningWorker.py:51`

### SingleJobWorker

Processes exactly one job, then exits. Used by SpawningWorker or for manual job processing.

- One job, then terminate
- Clean process per job
- Good for: being spawned by SpawningWorker, debugging specific jobs

Source: `mcrit/SingleJobWorker.py:52`
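The process-per-job idea behind SpawningWorker and SingleJobWorker can be sketched by running each job in a fresh interpreter (a simplification: MCRIT's workers pull jobs from the queue rather than taking code strings, but the isolation property is the same — the child's memory is fully released when the job ends, and a crash cannot take down the parent).

```python
import subprocess
import sys

def run_job_in_fresh_process(job_code: str) -> str:
    """Run one job in a brand-new interpreter, SpawningWorker-style."""
    proc = subprocess.run(
        [sys.executable, "-c", job_code],
        capture_output=True, text=True, timeout=30,
    )
    if proc.returncode != 0:
        # A crashing job kills only its own process, never the parent worker.
        return "failed"
    return proc.stdout.strip()

ok = run_job_in_fresh_process("print(2 ** 16)")
bad = run_job_in_fresh_process("raise RuntimeError('job crashed')")
print(ok, bad)  # 65536 failed
```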
## Storage Layer (MongoDB)

MCRIT uses MongoDB for all persistent storage, with specialized indices for performance.

### Collections

| Collection | Purpose |
| --- | --- |
| `families` | Malware family metadata |
| `samples` | Binary sample metadata |
| `functions` | Function entries with MinHash signatures |
| `bands` | MinHash band index for LSH |
| `picblockhashes` | Basic block hash index |
| `jobs` | Asynchronous job tracking |

Indices on `functions`:

- `{function_id: 1}` - primary key
- `{sample_id: 1}` - functions by sample
- `{pichash: 1}` - fast PicHash lookup

For the `bands` collection, each function's signature is split into bands (e.g., 64 bands × 4 values); functions sharing a band hash are match candidates. Index:

- `{band_id: 1, band_number: 1}` - fast candidate lookup

Index on `picblockhashes`:

- `{hash: 1}` - fast block lookup

Source: `mcrit/storage/MongoDbStorage.py`
### Band Index Structure

The band index is critical for MinHash performance. Candidate lookup proceeds as follows:

1. Calculate the bands of the query function's signature
2. Look up each band in the index
3. Collect all function IDs appearing in any matching band
4. These functions are the MinHash candidates
5. Score the candidates against the query
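The steps above can be sketched with a tiny in-memory band index (the band configuration and signatures below are made up for illustration; MCRIT stores the index in MongoDB):

```python
from collections import defaultdict

NUM_BANDS, BAND_SIZE = 4, 4  # e.g. a 16-value signature split into 4 bands

def bands_of(signature):
    """Split a MinHash signature into hashable (band_number, values) tuples."""
    return [
        (n, tuple(signature[n * BAND_SIZE:(n + 1) * BAND_SIZE]))
        for n in range(NUM_BANDS)
    ]

# Build the band index: band -> set of function IDs sharing that band.
functions = {
    101: [1, 2, 3, 4,  5, 6, 7, 8,  9, 9, 9, 9,  0, 0, 0, 0],
    102: [1, 2, 3, 4,  8, 8, 8, 8,  7, 7, 7, 7,  0, 0, 0, 0],
    103: [5, 5, 5, 5,  6, 6, 6, 6,  7, 7, 7, 7,  8, 8, 8, 8],
}
band_index = defaultdict(set)
for function_id, signature in functions.items():
    for band in bands_of(signature):
        band_index[band].add(function_id)

def candidates(query_signature):
    """Any function sharing at least one full band with the query is a candidate."""
    result = set()
    for band in bands_of(query_signature):
        result |= band_index[band]
    return result

query = [1, 2, 3, 4,  0, 0, 0, 0,  9, 9, 9, 9,  1, 1, 1, 1]
print(sorted(candidates(query)))  # [101, 102]: both share band 0, 101 also band 2
```

Only the candidates returned here need full MinHash scoring, which is what makes the lookup cheap relative to comparing the query against every stored function.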
## Job Queue System

MCRIT uses a job queue for asynchronous processing of expensive operations.

### Queue Implementations

LocalQueue - an in-memory queue using Python multiprocessing:

- Use case: development, testing, single-node deployments
- Persistence: none (jobs are lost on restart)
- Scalability: single machine only

MongoQueue - a MongoDB-backed queue; jobs persist across restarts and can be shared by multiple workers.
### Job Lifecycle

### Job Types

Common jobs in MCRIT:

- `addBinarySample` - disassemble and import a binary
- `updateMinHashesForSample` - calculate MinHash signatures
- `getMatchesForSmdaReport` - match a query report against the database
- `getMatchesForSample` - match a sample against all others
- `combineMatchesToCross` - cross-matching analysis
- `deleteSample` - remove a sample and clean up indices
- `rebuildIndex` - recreate the band index from scratch

Source: `mcrit/queue/LocalQueue.py`, `mcrit/Worker.py`
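A worker's handling of these job types can be pictured as a dispatch table keyed by job-type name (the handlers and payloads below are hypothetical, not MCRIT's actual worker methods):

```python
# Hypothetical handlers keyed by the job-type names listed above.
def add_binary_sample(payload):
    return f"imported {payload}"

def delete_sample(payload):
    return f"deleted {payload}"

JOB_HANDLERS = {
    "addBinarySample": add_binary_sample,
    "deleteSample": delete_sample,
}

def process(job_type, payload):
    """Look up the handler for a queued job and run it."""
    handler = JOB_HANDLERS.get(job_type)
    if handler is None:
        raise ValueError(f"unknown job type: {job_type}")
    return handler(payload)

print(process("addBinarySample", "sample_7"))  # imported sample_7
```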
## Data Flow: Sample Import

Let's trace what happens when you import a binary.

### Worker Picks Up Job

A worker claims the job from the queue and executes it:

1. Disassemble the binary with SMDA
2. Create a SampleEntry in MongoDB
3. Create a FunctionEntry for each function

### Calculate Hashes

If `calculate_hashes=True`, a follow-up job is spawned:

1. Generate a MinHash for each function
2. Calculate a PicHash for each function
3. Store the signatures in MongoDB
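The import flow can be mimicked end to end with stand-ins for SMDA and the hash functions (everything here is a toy: the "disassembler" just chunks bytes, the "PicHash" is a plain digest, and MCRIT's real entries carry far more metadata):

```python
import hashlib

def toy_disassemble(binary: bytes):
    """Stand-in for SMDA: pretend every 4-byte chunk is one function."""
    return [binary[i:i + 4] for i in range(0, len(binary), 4)]

def toy_pichash(code: bytes) -> str:
    """Stand-in for PicHash: a plain hash over the function bytes."""
    return hashlib.sha256(code).hexdigest()[:16]

samples, functions = {}, []  # stand-ins for the MongoDB collections

def import_binary(sample_id, binary, calculate_hashes=True):
    # Step 1: disassemble and create the SampleEntry.
    funcs = toy_disassemble(binary)
    samples[sample_id] = {"sample_id": sample_id, "num_functions": len(funcs)}
    # Step 2: create one FunctionEntry per function.
    for offset, code in enumerate(funcs):
        entry = {"sample_id": sample_id, "offset": offset, "pichash": None}
        # Step 3 (a follow-up job in MCRIT): calculate and store the hashes.
        if calculate_hashes:
            entry["pichash"] = toy_pichash(code)
        functions.append(entry)

import_binary(1, b"\x90" * 12)
print(samples[1]["num_functions"], len(functions))  # 3 3
```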
## Data Flow: Query Matching

When querying a new binary, it goes through the same disassembly and hashing steps, and the resulting signatures are matched against the database via the band index.

## Scalability Considerations

- Horizontal Scaling - run multiple workers to parallelize matching; each worker operates independently
- Database Sharding - MongoDB supports sharding for very large datasets; shard by `family_id` or `sample_id`
- Query Optimization - use PicHash for initial filtering; adjust the band configuration for a speed/recall tradeoff
- Storage Optimization - use 8-bit MinHash signatures (4x less storage); optionally drop `xcfg` after indexing
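The 8-bit saving is easy to verify: storing 64 MinHash values as single bytes instead of 32-bit integers cuts the signature to a quarter of the size (the sizes below are the generic uint32-vs-uint8 arithmetic, not MCRIT's exact on-disk layout):

```python
import struct

# A 64-value MinHash signature, with each value already reduced to 8 bits.
signature = [v % 256 for v in range(1000, 1064)]

packed_32bit = struct.pack("<64I", *signature)  # one uint32 per value
packed_8bit = bytes(signature)                  # one byte per value

print(len(packed_32bit), len(packed_8bit))  # 256 64 -> 4x smaller
assert list(packed_8bit) == signature       # round-trips losslessly
```

The tradeoff is coarser hash values, which slightly reduces the resolution of similarity estimates in exchange for the 4x storage saving.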
## Configuration Points

Key configuration classes:

- `McritConfig` - overall system config
- `MinHashConfig` - signature length, bits, thresholds
- `ShinglerConfig` - shingler weights and parameters
- `StorageConfig` - MongoDB connection and settings
- `QueueConfig` - job queue configuration

Source: `mcrit/config/`
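As a sketch of how such a config object ties into the storage math above (field names here are assumptions for illustration, not MCRIT's actual attributes):

```python
from dataclasses import dataclass

@dataclass
class ToyMinHashConfig:
    """Illustrative stand-in; the real config lives in mcrit/config/."""
    signature_length: int = 64    # number of MinHash values per function
    signature_bits: int = 8       # bits per value (8 -> compact storage)
    match_threshold: float = 0.5  # minimum score to report a match

cfg = ToyMinHashConfig()
bytes_per_signature = cfg.signature_length * cfg.signature_bits // 8
print(bytes_per_signature)  # 64
```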
## Deployment Patterns

MCRIT supports several deployment patterns: Single Node, Multi-Worker, and Docker Compose.

### Single Node

All-in-one deployment:

- Server + Worker + MongoDB on one machine
- Use LocalQueue or MongoQueue
- Good for: small datasets (under 100k functions), development
## Related Concepts

- MinHash - how signatures are generated and indexed
- Shinglers - feature extraction in the worker
- PicHash - exact matching in the index