Overview
MCRIT’s behavior can be customized through configuration files and environment variables. Configuration is organized into several domain-specific classes:
- MinHashConfig: MinHash algorithm parameters
- StorageConfig: Database and storage settings
- QueueConfig: Job queue configuration
- GunicornConfig: Web server settings (Linux only)
- ShinglerConfig: Shingler algorithm settings
Configuration Methods
MCRIT can be configured in multiple ways:
- Configuration files: Modify Python configuration classes in mcrit/config/
- Environment variables: Set environment variables (recommended for Docker)
- Runtime parameters: Pass configuration dictionaries to components
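The environment-variable method listed above can be illustrated with a minimal sketch. The class `ExampleConfig`, its field names, and the helper `apply_env_overrides` are hypothetical and are not part of MCRIT; the sketch only shows the general idea of overriding dataclass defaults with same-named environment variables.

```python
import os
from dataclasses import dataclass, fields


@dataclass
class ExampleConfig:
    # Hypothetical stand-in for a configuration class; names are illustrative only.
    EXAMPLE_HOST: str = "127.0.0.1"
    EXAMPLE_PORT: int = 27017


def apply_env_overrides(config, environ=None):
    """Overwrite dataclass fields with same-named environment variables, if set."""
    environ = os.environ if environ is None else environ
    for field in fields(config):
        raw = environ.get(field.name)
        if raw is None:
            continue
        # Cast the string to the type of the current value (works for str and int;
        # booleans and None defaults would need dedicated parsing).
        current = getattr(config, field.name)
        setattr(config, field.name, type(current)(raw))
    return config


# A plain dict stands in for os.environ to keep the example deterministic.
config = apply_env_overrides(ExampleConfig(), {"EXAMPLE_PORT": "27018"})
print(config)
```

Fields without a matching variable keep their defaults, so only the values that differ per deployment need to be set.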
Configuration classes are dataclasses that extend ConfigInterface, providing serialization and logging capabilities.
Main Configuration
McritConfig
The main configuration class that aggregates all other configurations. Location: mcrit/config/McritConfig.py
MCRIT version number
API authentication token. When set, clients must include this token in the apitoken header field. An empty string disables authentication.
Directory path for log files
Logging verbosity level. Options: logging.DEBUG, logging.INFO, logging.WARNING, logging.ERROR
Python logging format string
MinHash Configuration
Controls the MinHash algorithm behavior for code similarity detection. Location: mcrit/config/MinHashConfig.py
Core MinHash Settings
Number of hash values in each MinHash signature. Higher values increase accuracy but consume more memory and processing time.
Number of bits per signature element (1-32 bits). Affects hash value range and collision probability.
Random seed for initializing XOR values for MinHash seeds. Ensures deterministic hash generation.
MinHash calculation strategy to use
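To show how signature length, bits per element, and the seed interact, here is a minimal MinHash sketch. It is not MCRIT's implementation: the function name, the default values, and the use of SHA-256 as the base hash are assumptions chosen for illustration.

```python
import hashlib
import random


def minhash_signature(features, signature_length=64, bits=8, seed=0xDEADBEEF):
    """Return a MinHash signature as a list of `signature_length` small integers.

    `features` must be a non-empty iterable of strings. All defaults are
    illustrative assumptions, not MCRIT's configured values.
    """
    rng = random.Random(seed)
    # One XOR mask per signature position, derived deterministically from the seed.
    xor_masks = [rng.getrandbits(32) for _ in range(signature_length)]
    # Base 32-bit hash per feature (first 4 bytes of SHA-256, for reproducibility).
    base_hashes = [
        int.from_bytes(hashlib.sha256(f.encode()).digest()[:4], "big")
        for f in features
    ]
    bit_mask = (1 << bits) - 1
    # Per position: minimum of the XOR-masked hashes, truncated to `bits` bits.
    return [min(h ^ m for h in base_hashes) & bit_mask for m in xor_masks]


signature = minhash_signature(["push ebp", "mov ebp, esp", "ret"])
print(len(signature), max(signature) < 256)
```

A longer signature gives more positions to compare, and fewer bits per element shrink storage at the cost of more accidental collisions between unrelated values.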
Function Filtering
Minimum number of instructions required for a function to be MinHashed. Functions below this threshold are ignored.
Alternative minimum: number of basic blocks required for MinHashing. Set to 0 to disable.
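The two thresholds can be read as alternatives: a function qualifies if it meets either minimum. The sketch below encodes that reading; the function name, parameter names, and default values are hypothetical, and the "either threshold suffices" interpretation is an assumption based on the word "alternative" above.

```python
def should_minhash(num_instructions, num_blocks, min_instructions=10, min_blocks=0):
    """Decide whether a function is large enough to be MinHashed.

    Assumed semantics: the instruction threshold is primary; the basic-block
    threshold is an alternative that is ignored when set to 0.
    """
    if num_instructions >= min_instructions:
        return True
    return min_blocks > 0 and num_blocks >= min_blocks


print(should_minhash(12, 1))                # enough instructions
print(should_minhash(5, 4))                 # too small, block check disabled
print(should_minhash(5, 4, min_blocks=3))   # accepted via the block threshold
```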
Matching Parameters
Lower bound percentage (0-100) at which paired MinHashes are considered a match
Minimum number of band matches required before a MinHash is considered a candidate for matching. Higher values reduce false positives but may miss some matches.
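The two matching parameters act as successive filters: the band-match count cheaply selects candidates, and the percentage threshold decides the final match. A minimal sketch of that flow follows, with hypothetical function names and default values that are not taken from MCRIT.

```python
def signature_similarity(sig_a, sig_b):
    """Percentage (0-100) of positions where two equal-length signatures agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have equal length")
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return 100.0 * matches / len(sig_a)


def is_match(sig_a, sig_b, band_matches, threshold=50.0, min_band_matches=2):
    """Apply the band filter first, then the full signature comparison."""
    if band_matches < min_band_matches:
        return False  # not even a candidate
    return signature_similarity(sig_a, sig_b) >= threshold


a = [1, 2, 3, 4]
b = [1, 2, 9, 9]
print(signature_similarity(a, b))
print(is_match(a, b, band_matches=2), is_match(a, b, band_matches=1))
```

Raising the band-match minimum discards more pairs before the full comparison, which is where the trade-off between false positives and missed matches comes from.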
Performance Tuning
Enable multiprocessing for MinHash indexing operations. May need to be disabled when using Gunicorn/Falcon to avoid multiprocessing conflicts.
Enable multiprocessing for MinHash matching operations
Number of functions to process in each batch during matching
Number of candidate pairs to process per work unit during matching
Number of functions to process into MinHashes per work iteration
Batch size when rebuilding MinHash bands
PicHash Settings
Minimum function size (in instructions) for considering PicHash matching
When true, PicHash matches automatically imply MinHash matches without performing additional MinHash comparison
Advanced Options
Store the combination of shingles (unsorted) used to create each MinHash. Increases storage requirements but enables detailed analysis.
Storage Configuration
Configures the database backend and storage behavior. Location: mcrit/config/StorageConfig.py
Database Connection
Storage backend to use. Options:
- StorageFactory.STORAGE_METHOD_MONGODB: MongoDB (recommended for production)
- StorageFactory.STORAGE_METHOD_MEMORY: In-memory storage (testing only)
MongoDB server hostname or IP address
MongoDB server port
MongoDB database name for storage collections
MongoDB authentication username. Set to None to disable authentication.
MongoDB authentication password. Set to None to disable authentication.
Additional MongoDB connection string flags (e.g., ?authSource=admin&ssl=true)
MongoDB Connection String Format
When authentication is enabled, the username and password are included in the connection string.
The connection to MongoDB is fully configurable. Set STORAGE_MONGODB_USERNAME and STORAGE_MONGODB_PASSWORD to enable authentication.
Banding Strategy
Strategy for selecting MinHash fields for banding:
- random: Randomly sample from MinHash fields (more fuzziness, may not use all fields)
- linear: Sequential selection of MinHash fields (requires size × number = MINHASH_SIGNATURE_LENGTH)
Random seed for deriving band sequences. Ensures deterministic band generation.
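The difference between the linear and random strategies can be sketched as follows. The function `derive_bands` is hypothetical and not MCRIT's code; it takes the band layout as a dictionary mapping band size to number of bands, and it assumes that the linear constraint generalizes to the sum over all sizes when several band sizes are combined.

```python
import random


def derive_bands(signature_length, bands, strategy="linear", seed=0):
    """Return one tuple of signature indices per band.

    `bands` maps band size -> number of bands, e.g. {4: 2}.
    """
    if strategy == "linear":
        # Linear banding walks the signature once, so it must be covered exactly.
        if sum(size * number for size, number in bands.items()) != signature_length:
            raise ValueError("linear banding must cover the signature exactly")
        indices = iter(range(signature_length))
        return [
            tuple(next(indices) for _ in range(size))
            for size, number in bands.items()
            for _ in range(number)
        ]
    # Random banding samples indices per band; the seed keeps it reproducible,
    # and some signature fields may never be selected.
    rng = random.Random(seed)
    return [
        tuple(sorted(rng.sample(range(signature_length), size)))
        for size, number in bands.items()
        for _ in range(number)
    ]


print(derive_bands(8, {4: 2}))
print(derive_bands(8, {4: 2}, strategy="random", seed=1))
```

Because the random variant is seeded, the same seed always yields the same bands, which is what allows stored band data to stay comparable across runs.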
Banding configuration as a dictionary with size: number structure. Multiple band sizes can be used to increase scatter effect and randomness. Example: {4: 20} means 20 bands of 4 hash values each.
Total number of bands (computed from STORAGE_BANDS). Read-only property.
Performance and Optimization
Use a hashmap to cache all banding data. Very memory-intensive but provides significant speedups.
Discard disassembly data from function entries after MinHashes are calculated. Reduces storage size but prevents future disassembly access.
Cleanup Configuration
Enable periodic deletion of queried samples and their results after a specified time
Time in seconds between cleanup operations (default: 7 days)
Time-to-live in seconds for queried samples (default: 7 days)
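As a quick check of the arithmetic, 7 days correspond to 7 × 24 × 60 × 60 = 604800 seconds. The sketch below shows how a time-to-live check of this kind could work; the function name and the shape of the input (a mapping from sample id to query timestamp) are assumptions for illustration only.

```python
import time

SEVEN_DAYS = 7 * 24 * 60 * 60  # 604800 seconds


def expired_samples(query_timestamps, ttl=SEVEN_DAYS, now=None):
    """Return ids of queried samples whose age exceeds `ttl` seconds.

    `query_timestamps` maps a sample id to its creation time in seconds
    (hypothetical structure, not MCRIT's storage schema).
    """
    now = time.time() if now is None else now
    return [sid for sid, ts in query_timestamps.items() if now - ts > ttl]


# Sample 1 is older than the TTL at the chosen reference time, sample 2 is not.
print(expired_samples({1: 0, 2: 604800}, now=604801))
```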
Export Limits
Maximum export size in bytes to protect against out-of-memory crashes (default: 1 GB)
Queue Configuration
Configures the job queue system for managing asynchronous tasks. Location: mcrit/config/QueueConfig.py
Queue Connection
Queue backend to use. Options:
- QueueFactory.QUEUE_METHOD_MONGODB: MongoDB-based queue (recommended)
- QueueFactory.QUEUE_METHOD_FAKE: Fake queue for testing
MongoDB server hostname for the queue
MongoDB server port for the queue
MongoDB database name for queue collections
MongoDB authentication username for queue
MongoDB authentication password for queue
Additional MongoDB connection flags for queue
MongoDB collection name for storing queue items
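Host, port, credentials, and flags like those above combine into a connection URI of the standard `mongodb://` form. The helper below is a hypothetical sketch of that assembly, not the code MCRIT uses; it percent-encodes the credentials, as is generally advised for MongoDB URIs.

```python
from urllib.parse import quote_plus


def build_mongodb_uri(host, port, username=None, password=None, flags=""):
    """Assemble a mongodb:// URI; credentials are added only when both are set."""
    credentials = ""
    if username is not None and password is not None:
        # Percent-encode so characters such as '@' or ':' cannot break the URI.
        credentials = f"{quote_plus(username)}:{quote_plus(password)}@"
    return f"mongodb://{credentials}{host}:{port}/{flags}"


print(build_mongodb_uri("localhost", 27017))
print(build_mongodb_uri("localhost", 27017, "user", "p@ss", "?authSource=admin"))
```

Leaving the username and password as None produces an unauthenticated URI, mirroring the "set to None to disable authentication" behavior described for the storage connection.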
Job Processing
Job timeout in seconds. Jobs exceeding this duration are marked as failed.
Maximum number of retry attempts for failed jobs
Time in seconds each worker waits between queue cleanup operations (default: 20 minutes)
Timeout in seconds for child processes spawned by SpawningWorker (default: 1 hour)
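How a timeout and a retry limit interact can be sketched with a simple loop. The function `run_job` is hypothetical and much simpler than a real worker: it counts total attempts rather than retries, and it checks the elapsed time only after the job returns, whereas a production worker would interrupt an overrunning job.

```python
import time


def run_job(job, timeout=3600, max_attempts=3):
    """Run `job()` until it succeeds within `timeout` or attempts are exhausted."""
    for attempt in range(1, max_attempts + 1):
        started = time.monotonic()
        try:
            result = job()
        except Exception:
            continue  # failed attempt, try again if attempts remain
        if time.monotonic() - started > timeout:
            continue  # overran the timeout, treated like a failure
        return {"status": "finished", "attempts": attempt, "result": result}
    return {"status": "failed", "attempts": max_attempts, "result": None}


calls = {"count": 0}


def flaky_job():
    # Fails on the first two calls, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ValueError("transient failure")
    return "ok"


print(run_job(flaky_job))
```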
Gunicorn Configuration
Configures the Gunicorn WSGI server (Linux only). Location: mcrit/config/GunicornConfig.py
Enable Gunicorn as the WSGI server. Only works on Linux. When disabled or on Windows, waitress is used instead.
Host and port binding specified as host:port
Number of Gunicorn worker processes to spawn
Number of threads per Gunicorn worker
Timeout in seconds before silent workers are killed and restarted
Gunicorn provides significantly better performance than waitress on Linux systems and is recommended for production deployments.
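The server selection described in this section amounts to a small decision rule. The sketch below uses a hypothetical helper, `choose_wsgi_server`, and assumes that every non-Linux platform falls back to waitress; the text above only states this explicitly for Windows.

```python
import platform


def choose_wsgi_server(use_gunicorn, system=None):
    """Return 'gunicorn' only when it is enabled and the platform is Linux."""
    system = platform.system() if system is None else system
    return "gunicorn" if use_gunicorn and system == "Linux" else "waitress"


print(choose_wsgi_server(True, "Linux"))
print(choose_wsgi_server(True, "Windows"))
print(choose_wsgi_server(False, "Linux"))
```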
Shingler Configuration
Configures the shingling algorithms for feature extraction. Location: mcrit/config/ShinglerConfig.py
Directory path to search for files matching pattern *Shingler.py
Strategy for applying shingler weights
Expected range for logbucket matching
Number of values to include in the range left and right of the center
Add additional counts for values closer to the center
Weights for different shinglers. Higher weights effectively run the shingler multiple times, increasing its influence.
Random seed for initializing XOR values for shingler hash seeds
Automatically set when shinglers are loaded. Listed for completeness only.
Environment Variables
Configuration values can be set via environment variables. The variable names typically match the configuration parameter names:
Common Environment Variables
Docker Environment Example
Configuration Examples
Development Configuration
Production Configuration
High-Performance Configuration
Configuration Hash
Some configuration classes provide a getConfigHash() method that generates a SHA256 hash of critical parameters, which helps ensure configuration consistency across database migrations. The hashed parameters include:
- MinHash strategy, signature length, and seed
- Function filtering thresholds
- PicHash parameters
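The idea behind such a hash can be sketched as follows: serialize only the critical parameters in a stable order and hash the result. The function `config_hash` and the dictionary keys other than MINHASH_SIGNATURE_LENGTH are hypothetical; this is not the implementation of getConfigHash().

```python
import hashlib
import json


def config_hash(config, critical_fields):
    """SHA256 hex digest over the selected fields, independent of key order."""
    selected = {name: config[name] for name in critical_fields}
    payload = json.dumps(selected, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Illustrative values only; EXAMPLE_SEED and EXAMPLE_LOG_PATH are made-up keys.
config = {"MINHASH_SIGNATURE_LENGTH": 64, "EXAMPLE_SEED": 1, "EXAMPLE_LOG_PATH": "./logs"}
critical = ["MINHASH_SIGNATURE_LENGTH", "EXAMPLE_SEED"]
print(config_hash(config, critical))
```

Changing a non-critical value such as a log path leaves the digest unchanged, while changing any hashed parameter produces a different digest, which makes mismatches between a database and the active configuration easy to detect.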
Validating Configuration
Check your current configuration using the MCRIT client:
Next Steps
- Deploy using Docker or Standalone
- Learn about MinHash concepts
- Explore the API reference