
Overview

MCRIT’s behavior can be customized through configuration files and environment variables. Configuration is organized into several domain-specific classes:
  • MinHashConfig: MinHash algorithm parameters
  • StorageConfig: Database and storage settings
  • QueueConfig: Job queue configuration
  • GunicornConfig: Web server settings (Linux only)
  • ShinglerConfig: Shingler algorithm settings

Configuration Methods

MCRIT can be configured in multiple ways:
  1. Configuration files: Modify Python configuration classes in mcrit/config/
  2. Environment variables: Set environment variables (recommended for Docker)
  3. Runtime parameters: Pass configuration dictionaries to components
Configuration classes are dataclasses that extend ConfigInterface, providing serialization and logging capabilities.

Main Configuration

McritConfig

The main configuration class that aggregates all other configurations. Location: mcrit/config/McritConfig.py
VERSION
string
default:"1.4.5"
MCRIT version number
AUTH_TOKEN
string
default:""
API authentication token. When set, clients must include this token in the apitoken header field. Empty string disables authentication.
LOG_PATH
string
default:"./"
Directory path for log files
LOG_LEVEL
logging level
default:"logging.INFO"
Logging verbosity level. Options: logging.DEBUG, logging.INFO, logging.WARNING, logging.ERROR
LOG_FORMAT
string
default:"%(asctime)-15s: %(name)-32s - %(message)s"
Python logging format string
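As a minimal sketch of how a client might attach AUTH_TOKEN, the snippet below builds a request with the apitoken header using only the Python standard library. The server address follows the default bind port, while the /status path and the build_request helper are assumptions made for illustration, not confirmed parts of the MCRIT API.

```python
import urllib.request


def build_request(base_url: str, path: str, token: str = "") -> urllib.request.Request:
    """Build a GET request that carries the token in the `apitoken` header."""
    headers = {}
    if token:
        # An empty AUTH_TOKEN disables authentication, so no header is sent.
        headers["apitoken"] = token
    return urllib.request.Request(base_url.rstrip("/") + path, headers=headers)


# Hypothetical server address and endpoint; adjust to your deployment.
request = build_request("http://127.0.0.1:8000", "/status", token="your_secret_api_token_here")

# urllib stores header names capitalized ("Apitoken"); HTTP header names are case-insensitive.
print(request.full_url)
print(request.get_header("Apitoken"))
```

The request is only constructed here, not sent; pass it to urllib.request.urlopen once the server address and path are confirmed for your setup.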

MinHash Configuration

Controls the MinHash algorithm behavior for code similarity detection. Location: mcrit/config/MinHashConfig.py

Core MinHash Settings

MINHASH_SIGNATURE_LENGTH
integer
default:"64"
Number of hash values in each MinHash signature. Higher values increase accuracy but consume more memory and processing time.
MINHASH_SIGNATURE_BITS
integer
default:"8"
Number of bits per signature element (1-32 bits). Affects hash value range and collision probability.
MINHASH_SEED
integer
default:"0xDEADBEEF"
Random seed for initializing XOR values for MinHash seeds. Ensures deterministic hash generation.
MINHASH_STRATEGY
constant
default:"MinHasher.MINHASH_STRATEGY_SEGMENTED"
MinHash calculation strategy to use
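The trade-off between MINHASH_SIGNATURE_LENGTH and MINHASH_SIGNATURE_BITS can be illustrated with a little arithmetic. The sketch below assumes that signature elements are packed tightly and that each element keeps only the low bits of a wider hash value. Both are assumptions for illustration, and the helper names are hypothetical rather than part of MCRIT.

```python
def signature_size_bytes(length: int = 64, bits: int = 8) -> int:
    """Bytes needed for one signature if its elements are packed tightly."""
    return (length * bits + 7) // 8


def truncate_hash(value: int, bits: int = 8) -> int:
    """Keep only the lowest `bits` bits of a hash value."""
    if not 1 <= bits <= 32:
        raise ValueError("bits must be between 1 and 32")
    return value & ((1 << bits) - 1)


print(signature_size_bytes())        # defaults: 64 elements of 8 bits each
print(signature_size_bytes(128, 8))  # doubling the length doubles the size
print(truncate_hash(0xDEADBEEF, 8))  # low byte of the value
```

With 8 bits an element can take 256 distinct values, so two unrelated signatures agree on a given position by chance roughly once in 256 comparisons, assuming uniformly distributed hash values.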

Function Filtering

MINHASH_FN_MIN_INS
integer
default:"10"
Minimum number of instructions required for a function to be MinHashed. Functions below this threshold are ignored.
MINHASH_FN_MIN_BLOCKS
integer
default:"0"
Alternative minimum: number of basic blocks required for MinHashing. Set to 0 to disable.

Matching Parameters

MINHASH_MATCHING_THRESHOLD
integer
default:"50"
Lower bound percentage (0-100) at which paired MinHashes are considered a match
BAND_MATCHES_REQUIRED
integer
default:"2"
Minimum number of band matches required before a MinHash is considered a candidate for matching. Higher values reduce false positives but may miss some matches.
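As a rough illustration of the threshold, the standard MinHash similarity estimate is the share of signature positions on which two signatures agree. The sketch below applies that generic estimate. It is not taken from MCRIT's code, and the function names are hypothetical.

```python
from typing import Sequence


def minhash_score(sig_a: Sequence[int], sig_b: Sequence[int]) -> float:
    """Percentage (0-100) of positions where both signatures hold the same value."""
    if len(sig_a) != len(sig_b) or not sig_a:
        raise ValueError("signatures must be non-empty and of equal length")
    equal = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return 100.0 * equal / len(sig_a)


def is_match(sig_a: Sequence[int], sig_b: Sequence[int], threshold: int = 50) -> bool:
    """Apply a MINHASH_MATCHING_THRESHOLD-style lower bound to the score."""
    return minhash_score(sig_a, sig_b) >= threshold


sig_a = list(range(64))
sig_b = sig_a[:40] + [value + 1000 for value in sig_a[40:]]  # 40 of 64 positions agree

print(minhash_score(sig_a, sig_b))
print(is_match(sig_a, sig_b))                # default threshold of 50
print(is_match(sig_a, sig_b, threshold=70))  # stricter threshold
```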

Performance Tuning

MINHASH_POOL_INDEXING
boolean
default:"true"
Enable multiprocessing for MinHash indexing operations. May need to be disabled when using Gunicorn/Falcon to avoid multiprocessing conflicts.
MINHASH_POOL_MATCHING
boolean
default:"true"
Enable multiprocessing for MinHash matching operations
MINHASH_MATCHING_FUNCTION_BATCH_SIZE
integer
default:"10000"
Number of functions to process in each batch during matching
MINHASH_MATCHING_CANDIDATE_WORKPACK_SIZE
integer
default:"20000"
Number of candidate pairs to process per work unit during matching
MINHASH_GENERATION_WORKPACK_SIZE
integer
default:"10000"
Number of functions to process into MinHashes per work iteration
MINHASH_BAND_REBUILD_WORK_PACKAGE_SIZE
integer
default:"100000"
Batch size when rebuilding MinHash bands

PicHash Settings

PICHASH_SIZE
integer
default:"10"
Minimum function size (in instructions) for considering PicHash matching
PICHASH_IMPLIES_MINHASH_MATCH
boolean
default:"true"
When true, PicHash matches automatically imply MinHash matches without performing additional MinHash comparison

Advanced Options

MINHASH_TRACK_SHINGLES
boolean
default:"false"
Store the combination of shingles (unsorted) used to create each MinHash. Increases storage requirements but enables detailed analysis.

Storage Configuration

Configures the database backend and storage behavior. Location: mcrit/config/StorageConfig.py

Database Connection

STORAGE_METHOD
constant
default:"StorageFactory.STORAGE_METHOD_MONGODB"
Storage backend to use. Options:
  • StorageFactory.STORAGE_METHOD_MONGODB: MongoDB (recommended for production)
  • StorageFactory.STORAGE_METHOD_MEMORY: In-memory storage (testing only)
STORAGE_SERVER
string
default:"127.0.0.1"
MongoDB server hostname or IP address
STORAGE_PORT
string
default:"27017"
MongoDB server port
STORAGE_MONGODB_DBNAME
string
default:"mcrit"
MongoDB database name for storage collections
STORAGE_MONGODB_USERNAME
string
default:"None"
MongoDB authentication username. Set to None to disable authentication.
STORAGE_MONGODB_PASSWORD
string
default:"None"
MongoDB authentication password. Set to None to disable authentication.
STORAGE_MONGODB_FLAGS
string
default:""
Additional MongoDB connection string flags (e.g., ?authSource=admin&ssl=true)

MongoDB Connection String Format

When authentication is enabled, the connection string is constructed as:
mongodb://[username]:[password]@[server]:[port]/[dbname][flags]
Example:
mongodb://mcrit:[password]@[server]:27017/mcrit?authSource=admin
Set both STORAGE_MONGODB_USERNAME and STORAGE_MONGODB_PASSWORD to enable authentication; leave them as None to connect without credentials. Server, port, database name, and flags are configurable independently of the credentials.
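The format above can be sketched as a small helper. This is an illustration rather than MCRIT's own code: the function name is hypothetical, the form used when no credentials are set is an assumption, and the percent-encoding of credentials is an addition that keeps characters such as @ or : from being misread as URI delimiters.

```python
from typing import Optional
from urllib.parse import quote_plus


def build_mongodb_uri(server: str, port: str, dbname: str,
                      username: Optional[str] = None,
                      password: Optional[str] = None,
                      flags: str = "") -> str:
    """Assemble mongodb://[username]:[password]@[server]:[port]/[dbname][flags]."""
    credentials = ""
    if username is not None and password is not None:
        # Percent-encode credentials so reserved characters stay unambiguous.
        credentials = f"{quote_plus(username)}:{quote_plus(password)}@"
    return f"mongodb://{credentials}{server}:{port}/{dbname}{flags}"


# Without credentials (assumed form when username and password are None).
print(build_mongodb_uri("127.0.0.1", "27017", "mcrit"))

# With credentials and an extra flag; the password here is a made-up example.
print(build_mongodb_uri("127.0.0.1", "27017", "mcrit",
                        username="mcrit", password="p@ss:word",
                        flags="?authSource=admin"))
```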

Banding Strategy

STORAGE_BAND_STRATEGY
string
default:"random"
Strategy for selecting MinHash fields for banding:
  • random: Randomly sample from MinHash fields (more fuzziness, may not use all fields)
  • linear: Sequential selection of MinHash fields (requires size × number = MINHASH_SIGNATURE_LENGTH)
STORAGE_BAND_SEED
integer
default:"0xDEADBEEF"
Random seed for deriving band sequences. Ensures deterministic band generation.
STORAGE_BANDS
dict
default:"{4: 20}"
Banding configuration as a dictionary with size: number structure. Multiple band sizes can be used to increase scatter effect and randomness. Example: {4: 20} means 20 bands of 4 hash values each.
STORAGE_NUM_BANDS
integer
Total number of bands (computed from STORAGE_BANDS). Read-only property.
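To make the two strategies concrete, the sketch below derives band index lists from a STORAGE_BANDS-style dictionary. It is an illustration under assumptions: the function names are hypothetical, and the way MCRIT actually derives its band sequences from STORAGE_BAND_SEED is not shown in this document.

```python
import random
from typing import Dict, List


def num_bands(bands: Dict[int, int]) -> int:
    """Total band count for a {size: number} mapping."""
    return sum(bands.values())


def linear_bands(bands: Dict[int, int], signature_length: int) -> List[List[int]]:
    """Consecutive index ranges; the bands must tile the signature exactly."""
    if sum(size * number for size, number in bands.items()) != signature_length:
        raise ValueError("size x number must equal the signature length")
    result, start = [], 0
    for size, number in bands.items():
        for _ in range(number):
            result.append(list(range(start, start + size)))
            start += size
    return result


def random_bands(bands: Dict[int, int], signature_length: int, seed: int) -> List[List[int]]:
    """Seeded random index samples; some indices may never be picked."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(signature_length), size))
            for size, number in bands.items()
            for _ in range(number)]


print(num_bands({4: 20}))
print(linear_bands({4: 16}, 64)[:2])
print(len(random_bands({4: 20}, 64, seed=0xDEADBEEF)))
```

Note that the default {4: 20} covers 80 positions in total, which does not tile a 64-element signature, so in this sketch it only works with the random strategy.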

Performance and Optimization

STORAGE_CACHE
boolean
default:"false"
Use a hashmap to cache all banding data. Very memory-intensive but provides significant speedups.
STORAGE_DROP_DISASSEMBLY
boolean
default:"false"
Discard disassembly data from function entries after MinHashes are calculated. Reduces storage size but prevents future disassembly access.

Cleanup Configuration

STORAGE_MONGODB_ENABLE_CLEANUP
boolean
default:"false"
Enable periodic deletion of queried samples and their results after a specified time
STORAGE_MONGODB_CLEANUP_DELTA
integer
default:"604800"
Time in seconds between cleanup operations (default: 7 days)
STORAGE_MONGODB_CLEANUP_TTL
integer
default:"604800"
Time-to-live in seconds for queried samples (default: 7 days)
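The defaults are easier to read once converted from seconds. The sketch below shows that conversion and a possible expiry check. The is_expired helper is hypothetical, and whether MCRIT treats the boundary as inclusive or exclusive is not stated here.

```python
from datetime import datetime, timedelta, timezone

CLEANUP_TTL_SECONDS = 604800  # default STORAGE_MONGODB_CLEANUP_TTL


def is_expired(queried_at: datetime, now: datetime,
               ttl_seconds: int = CLEANUP_TTL_SECONDS) -> bool:
    """True once a queried sample is older than the TTL."""
    return now - queried_at > timedelta(seconds=ttl_seconds)


now = datetime(2024, 1, 15, tzinfo=timezone.utc)

print(timedelta(seconds=CLEANUP_TTL_SECONDS))                       # 7 days, 0:00:00
print(is_expired(datetime(2024, 1, 1, tzinfo=timezone.utc), now))   # 14 days old
print(is_expired(datetime(2024, 1, 10, tzinfo=timezone.utc), now))  # 5 days old
```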

Export Limits

STORAGE_MAX_EXPORT_SIZE
integer
default:"1073741824"
Maximum export size in bytes to protect against out-of-memory crashes (default: 1 GB)

Queue Configuration

Configures the job queue system for managing asynchronous tasks. Location: mcrit/config/QueueConfig.py

Queue Connection

QUEUE_METHOD
constant
default:"QueueFactory.QUEUE_METHOD_MONGODB"
Queue backend to use. Options:
  • QueueFactory.QUEUE_METHOD_MONGODB: MongoDB-based queue (recommended)
  • QueueFactory.QUEUE_METHOD_FAKE: Fake queue for testing
QUEUE_SERVER
string
default:"127.0.0.1"
MongoDB server hostname for the queue
QUEUE_PORT
string
default:"27017"
MongoDB server port for the queue
QUEUE_MONGODB_DBNAME
string
default:"mcrit"
MongoDB database name for queue collections
QUEUE_MONGODB_USERNAME
string
default:"None"
MongoDB authentication username for queue
QUEUE_MONGODB_PASSWORD
string
default:"None"
MongoDB authentication password for queue
QUEUE_MONGODB_FLAGS
string
default:""
Additional MongoDB connection flags for queue
QUEUE_MONGODB_COLLECTION_NAME
string
default:"queue"
MongoDB collection name for storing queue items
Storage and queue use separate database settings. Changing STORAGE_MONGODB_DBNAME at runtime does not change QUEUE_MONGODB_DBNAME, and vice versa; to move both components to another database, change both values.

Job Processing

QUEUE_TIMEOUT
integer
default:"300"
Job timeout in seconds. Jobs exceeding this duration are marked as failed.
QUEUE_MAX_ATTEMPTS
integer
default:"3"
Maximum number of retry attempts for failed jobs
QUEUE_CLEAN_INTERVAL
integer
default:"1200"
Time in seconds each worker waits between queue cleanup operations (default: 20 minutes)
QUEUE_SPAWNINGWORKER_CHILDREN_TIMEOUT
integer
default:"3600"
Timeout in seconds for child processes spawned by SpawningWorker (default: 1 hour)
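How QUEUE_TIMEOUT and QUEUE_MAX_ATTEMPTS interact can be sketched with two small checks. This is a hypothetical model: the JobState class and both helpers are invented for illustration, and whether MCRIT counts the first run as an attempt is an assumption.

```python
from dataclasses import dataclass


@dataclass
class JobState:
    """Hypothetical bookkeeping for one queued job."""
    attempts: int      # how many times the job has been started so far
    started_at: float  # start time of the current attempt, in seconds


def has_timed_out(job: JobState, now: float, timeout: int = 300) -> bool:
    """True if the current attempt has run longer than the timeout."""
    return now - job.started_at > timeout


def may_retry(job: JobState, max_attempts: int = 3) -> bool:
    """True while the job has attempts left (first run counted as an attempt)."""
    return job.attempts < max_attempts


job = JobState(attempts=1, started_at=0.0)
print(has_timed_out(job, now=301.0), may_retry(job))

exhausted = JobState(attempts=3, started_at=0.0)
print(has_timed_out(exhausted, now=120.0), may_retry(exhausted))
```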

Gunicorn Configuration

Configures the Gunicorn WSGI server (Linux only). Location: mcrit/config/GunicornConfig.py
USE_GUNICORN
boolean
default:"false"
Enable Gunicorn as the WSGI server. Only works on Linux. When disabled or on Windows, waitress is used instead.
BIND
string
default:"0.0.0.0:8000"
Host and port binding specified as host:port
WORKERS
integer
default:"4"
Number of Gunicorn worker processes to spawn
THREADS
integer
default:"8"
Number of threads per Gunicorn worker
TIMEOUT
integer
default:"120"
Timeout in seconds before silent workers are killed and restarted
Gunicorn provides significantly better performance than waitress on Linux systems and is recommended for production deployments.
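Two quick checks can help when choosing these values: splitting BIND into host and port, and estimating how many requests can be in flight at once. The helpers below are hypothetical, and the concurrency figure assumes threaded workers where each worker serves up to THREADS requests in parallel.

```python
from typing import Tuple


def parse_bind(bind: str) -> Tuple[str, int]:
    """Split a host:port string such as the default BIND value."""
    host, separator, port = bind.rpartition(":")
    if not separator or not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {bind!r}")
    return host, int(port)


def max_concurrent_requests(workers: int = 4, threads: int = 8) -> int:
    """Upper bound on requests handled at once by threaded workers."""
    return workers * threads


print(parse_bind("0.0.0.0:8000"))
print(max_concurrent_requests())       # defaults: 4 workers, 8 threads
print(max_concurrent_requests(16, 4))  # 16 workers, 4 threads
```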

Shingler Configuration

Configures the shingling algorithms for feature extraction. Location: mcrit/config/ShinglerConfig.py
SHINGLER_DIR
string
default:"{PROJECT_ROOT}/shinglers"
Directory path to search for files matching pattern *Shingler.py
SHINGLER_WEIGHT_STRATEGY
constant
default:"ShingleLoader.WEIGHT_STRATEGY_SHINGLER_WEIGHTS"
Strategy for applying shingler weights
SHINGLER_LOGBUCKETS
integer
default:"100000"
Expected range for logbucket matching
SHINGLER_LOGBUCKET_RANGE
integer
default:"1"
Number of values to include in the range left and right of the center
SHINGLER_LOGBUCKET_CENTERED
boolean
default:"true"
Add additional counts for values closer to the center
SHINGLERS_WEIGHTS
dict
Weights for different shinglers. Higher weights effectively run the shingler multiple times, increasing its influence.
SHINGLERS_SEED
integer
default:"0xDEADBEEF"
Random seed for initializing XOR values for shingler hash seeds
SHINGLERS_XOR_VALUES
list
default:"[]"
Automatically set when shinglers are loaded. Listed for completeness only.
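The effect of SHINGLERS_WEIGHTS and SHINGLERS_SEED can be pictured with a small sketch: a weight repeats a shingler, and a seed yields one reproducible XOR value per run. The shingler names, the helper functions, and the use of Python's random module are all assumptions for illustration; MCRIT's actual derivation is not shown in this document.

```python
import random
from typing import Dict, List


def expand_by_weight(weights: Dict[str, int]) -> List[str]:
    """Repeat each shingler name according to its integer weight."""
    return [name for name, weight in weights.items() for _ in range(weight)]


def derive_xor_values(seed: int, count: int, bits: int = 32) -> List[int]:
    """Derive a reproducible list of XOR values from a seed."""
    rng = random.Random(seed)
    return [rng.getrandbits(bits) for _ in range(count)]


# Hypothetical shingler names and weights, chosen only for this example.
weights = {"ExampleShinglerA": 2, "ExampleShinglerB": 1}

runs = expand_by_weight(weights)
xor_values = derive_xor_values(0xDEADBEEF, len(runs))

print(runs)
print(len(xor_values))
```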

Environment Variables

Configuration values can be set via environment variables. The variable names typically match the configuration parameter names:

Common Environment Variables

# Storage configuration
export STORAGE_SERVER=mongodb.example.com
export STORAGE_PORT=27017
export STORAGE_MONGODB_USERNAME=mcrit_user
export STORAGE_MONGODB_PASSWORD=secure_password
export STORAGE_MONGODB_DBNAME=mcrit_production

# Queue configuration
export QUEUE_SERVER=mongodb.example.com
export QUEUE_PORT=27017
export QUEUE_MONGODB_USERNAME=mcrit_user
export QUEUE_MONGODB_PASSWORD=secure_password

# Authentication
export AUTH_TOKEN=your_secret_api_token_here

# Gunicorn
export USE_GUNICORN=true
export WORKERS=8
export THREADS=4

Docker Environment Example

environment:
  - STORAGE_SERVER=mongodb
  - STORAGE_PORT=27017
  - STORAGE_MONGODB_USERNAME=mcrit
  - STORAGE_MONGODB_PASSWORD=${MONGODB_PASSWORD}
  - QUEUE_SERVER=mongodb
  - QUEUE_PORT=27017
  - AUTH_TOKEN=${MCRIT_API_TOKEN}
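Environment variables always arrive as strings, so typed settings such as WORKERS or USE_GUNICORN need converting. The sketch below shows one way to do that with the standard library. The helper names and the accepted spellings for true values are assumptions, and MCRIT's own parsing may differ.

```python
import os
from typing import Mapping


def env_str(name: str, default: str, environ: Mapping[str, str] = os.environ) -> str:
    """Return the variable as-is, or the default if it is unset."""
    return environ.get(name, default)


def env_int(name: str, default: int, environ: Mapping[str, str] = os.environ) -> int:
    """Return the variable converted to int, or the default if it is unset."""
    raw = environ.get(name)
    return default if raw is None else int(raw)


def env_bool(name: str, default: bool, environ: Mapping[str, str] = os.environ) -> bool:
    """Treat common spellings of 'true' as True; anything else is False."""
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


# A plain dict stands in for os.environ so the example does not depend on the shell.
example_env = {"STORAGE_SERVER": "mongodb", "WORKERS": "8", "USE_GUNICORN": "true"}

print(env_str("STORAGE_SERVER", "127.0.0.1", example_env))
print(env_int("WORKERS", 4, example_env))
print(env_bool("USE_GUNICORN", False, example_env))
print(env_int("THREADS", 8, example_env))  # unset, falls back to the default
```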

Configuration Examples

Development Configuration

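from mcrit.config.MinHashConfig import MinHashConfig
from mcrit.config.StorageConfig import StorageConfig
from mcrit.storage.StorageFactory import StorageFactory  # import path assumed from the package layout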

# Smaller signature for faster processing
minhash_config = MinHashConfig(
    MINHASH_SIGNATURE_LENGTH=32,
    MINHASH_MATCHING_THRESHOLD=40,
    MINHASH_POOL_INDEXING=True
)

# In-memory storage for testing
storage_config = StorageConfig(
    STORAGE_METHOD=StorageFactory.STORAGE_METHOD_MEMORY
)

Production Configuration

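from mcrit.config.MinHashConfig import MinHashConfig
from mcrit.config.StorageConfig import StorageConfig
from mcrit.config.GunicornConfig import GunicornConfig
from mcrit.storage.StorageFactory import StorageFactory  # import path assumed from the package layout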

# High-precision MinHash
minhash_config = MinHashConfig(
    MINHASH_SIGNATURE_LENGTH=128,
    MINHASH_MATCHING_THRESHOLD=60,
    MINHASH_POOL_INDEXING=False  # Disable for Gunicorn
)

# MongoDB with authentication
storage_config = StorageConfig(
    STORAGE_METHOD=StorageFactory.STORAGE_METHOD_MONGODB,
    STORAGE_SERVER="mongodb-cluster.internal",
    STORAGE_MONGODB_USERNAME="mcrit_prod",
    STORAGE_MONGODB_PASSWORD="secure_password",
    STORAGE_CACHE=True,  # Enable caching
    STORAGE_BANDS={4: 32}  # More bands for better accuracy
)

# Gunicorn for performance
gunicorn_config = GunicornConfig(
    USE_GUNICORN=True,
    WORKERS=16,
    THREADS=4,
    TIMEOUT=300
)

High-Performance Configuration

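from mcrit.config.MinHashConfig import MinHashConfig
from mcrit.config.QueueConfig import QueueConfig

# Optimize for speed with multiple workers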
minhash_config = MinHashConfig(
    MINHASH_GENERATION_WORKPACK_SIZE=50000,
    MINHASH_MATCHING_CANDIDATE_WORKPACK_SIZE=100000,
    MINHASH_POOL_MATCHING=True
)

queue_config = QueueConfig(
    QUEUE_TIMEOUT=600,  # 10 minutes for large jobs
    QUEUE_MAX_ATTEMPTS=5
)

Configuration Hash

Some configuration classes provide a getConfigHash() method that returns a SHA256 hash of their critical parameters. Comparing hashes shows whether stored data was produced with the same configuration, which keeps configurations consistent across database migrations:
from mcrit.config.MinHashConfig import MinHashConfig

config = MinHashConfig()
config_hash = config.getConfigHash()
print(config_hash)  # e.g., "a3f2b1c9d8e7f6..."
The configuration hash includes:
  • MinHash strategy, signature length, and seed
  • Function filtering thresholds
  • PicHash parameters
Changing configuration parameters that affect hash generation will invalidate existing MinHashes. A database migration and recalculation may be required.
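The idea behind such a hash can be sketched in a few lines: serialize the relevant parameters in a canonical order and hash the result. The function below is a generic illustration, not getConfigHash() itself; the chosen parameter subset and the JSON serialization are assumptions.

```python
import hashlib
import json
from typing import Any, Dict


def config_hash(params: Dict[str, Any]) -> str:
    """SHA256 over a canonical JSON rendering of the given parameters."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


baseline = {
    "MINHASH_SIGNATURE_LENGTH": 64,
    "MINHASH_SEED": 0xDEADBEEF,
    "MINHASH_FN_MIN_INS": 10,
}
reordered = dict(reversed(list(baseline.items())))
changed = dict(baseline, MINHASH_SIGNATURE_LENGTH=128)

print(config_hash(baseline) == config_hash(reordered))  # key order does not matter
print(config_hash(baseline) == config_hash(changed))    # a changed value alters the hash
```

Sorting the keys before hashing makes the result independent of the order in which parameters are listed, so only a real change in values produces a different hash.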

Validating Configuration

Check your current configuration using the MCRIT client:
mcrit client status
This returns information about the database state and storage configuration:
{
  "status": {
    "db_state": 187,
    "storage_type": "mongodb",
    "num_bands": 20,
    "num_samples": 137,
    "num_families": 14,
    "num_functions": 129110,
    "num_pichashes": 25385
  }
}
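A response like the one above can be checked automatically. The sketch below parses the sample JSON and compares two fields against expected values. The check_status helper is hypothetical, and it assumes that storage_type and num_bands reflect STORAGE_METHOD and STORAGE_NUM_BANDS respectively.

```python
import json
from typing import List

# Sample response copied from the status output shown above.
status_json = """
{
  "status": {
    "db_state": 187,
    "storage_type": "mongodb",
    "num_bands": 20,
    "num_samples": 137,
    "num_families": 14,
    "num_functions": 129110,
    "num_pichashes": 25385
  }
}
"""


def check_status(raw: str, expected_storage: str, expected_bands: int) -> List[str]:
    """Return a list of human-readable mismatches; empty means all checks passed."""
    status = json.loads(raw)["status"]
    problems = []
    if status["storage_type"] != expected_storage:
        problems.append(f"storage_type is {status['storage_type']!r}, expected {expected_storage!r}")
    if status["num_bands"] != expected_bands:
        problems.append(f"num_bands is {status['num_bands']}, expected {expected_bands}")
    return problems


print(check_status(status_json, "mongodb", 20))
print(check_status(status_json, "mongodb", 32))
```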
