Overview
Ghidra’s BSim (Behavioral Similarity) Database allows reverse engineers to ingest metadata about previously analyzed binary executables to a central server or local database. The database can then be queried to quickly discover previously seen functions and libraries in new, unknown executables.
Key Features
Compilation Tolerant
Queries tolerate variations in function compilation
Fast Indexing
All records are indexed for quick queries, even with millions of functions
Decompiler-Based
Uses p-code from Ghidra’s decompiler for robust matching
Nearest Neighbor
Supports fuzzy matching with configurable similarity thresholds
How It Works
Feature Extraction
BSim extracts features from a concise description of function data-flow, not explicit machine instructions:
- Based on Ghidra’s intermediate representation language (p-code)
- Generated by the Ghidra decompiler
- Graph-based abstract syntax tree representation
- Normalized to minimize compilation variation impact
Normalized Comparisons
The resulting function descriptions are normalized to tolerate variations due to:
Equivalent Instructions
Different machine instructions that perform the same operation
Storage Locations
Variations in register allocation, stack usage, and memory locations
Instruction Ordering
Compiler-dependent instruction reordering
Compiler Transformations
Many forms of compiler optimization and transformation
Obfuscation
Even some forms of deliberate code obfuscation
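To make the tolerance for storage locations and instruction ordering concrete, here is a toy sketch in Python. It is not BSim's algorithm, which works on the decompiler's p-code data-flow; the tuple-based operation format and the `normalize` helper are hypothetical. Each operation is rewritten as the expression that feeds it, so register names and the order of independent operations drop out of the result.

```python
# Toy model: each operation is (output_location, opcode, inputs).
# Inputs are storage-location names (str) or integer constants.
COMMUTATIVE = {"ADD", "MULT", "XOR", "AND", "OR"}


def normalize(ops):
    """Return the set of data-flow expressions computed by a list of operations."""
    defs = {}  # storage location -> expression currently held there
    features = set()
    for out, opcode, inputs in ops:
        terms = []
        for operand in inputs:
            if isinstance(operand, int):
                terms.append(f"const:{operand}")
            else:
                # A location that was never written is a function input;
                # its concrete name (register, stack slot) is dropped.
                terms.append(defs.get(operand, "input"))
        if opcode in COMMUTATIVE:
            terms.sort()  # operand order of commutative ops carries no meaning
        expr = f"{opcode}({','.join(terms)})"
        defs[out] = expr
        features.add(expr)
    return features


# Same computation, different registers and a different instruction order.
build_a = [("eax", "ADD", ("edi", 4)), ("ecx", "MULT", ("esi", 2)), ("eax", "XOR", ("eax", "ecx"))]
build_b = [("r9", "MULT", ("rdx", 2)), ("r8", "ADD", (4, "rcx")), ("r8", "XOR", ("r9", "r8"))]
# A genuinely different computation.
other = [("eax", "SUB", ("edi", 4))]

same = normalize(build_a) == normalize(build_b)
different = normalize(build_a) != normalize(other)
```

Both `same` and `different` evaluate to `True`: the two builds collapse to one feature set, while the subtraction does not. The toy treats every input as interchangeable, which a real feature extractor would not do.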
Text Retrieval Strategies
Records are indexed using text retrieval strategies enabling:
- Nearest neighbor queries: Features don’t need exact matches
- Configurable similarity: Set percentage thresholds for matches
- Functional tolerance: Match even when source code has changed slightly
- Fast queries: Results for a single function typically return in a fraction of a second, even for a database containing millions of functions
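The nearest-neighbor idea can be illustrated with cosine similarity over feature-count vectors. This is a brute-force sketch for intuition only: a real database relies on its index rather than a linear scan, and the feature hashes, function names, and `nearest` helper below are made up.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two feature-count mappings."""
    dot = sum(count * b.get(feature, 0) for feature, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values()) * sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def nearest(query, database, threshold=0.7):
    """Return (name, similarity) pairs at or above the threshold, best first."""
    hits = [(name, cosine(query, vec)) for name, vec in database.items()]
    hits = [hit for hit in hits if hit[1] >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical feature hashes with occurrence counts.
database = {
    "memcpy_v1": Counter({0xA1: 2, 0xB2: 1, 0xC3: 1}),
    "memcpy_v2": Counter({0xA1: 2, 0xB2: 1, 0xD4: 1}),  # one feature changed
    "strlen": Counter({0xE5: 3}),                        # nothing in common
}
query = Counter({0xA1: 2, 0xB2: 1, 0xC3: 1})

results = nearest(query, database, threshold=0.7)
```

Here `results` lists `memcpy_v1` first, then `memcpy_v2` at roughly 0.83, and omits `strlen`. Raising the threshold to 0.9 would leave only the identical function.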
Database Technologies
BSim supports three database backends:

| Backend | Use Case | Features |
|---|---|---|
| PostgreSQL | Production | Robust, multi-connection, fault-tolerant server |
| Elasticsearch | Distributed | Scalable across clusters, distributed indexing |
| H2 (local) | Development | Convenience for small personal collections |
PostgreSQL server software is currently only supported on Linux and macOS. Elasticsearch must be obtained separately. H2 databases are supported on all platforms.
Integration with Ghidra
Ghidra Server Integration
Plugin Client
Ghidra includes a plugin client that integrates:
- Query dialog directly in the main CodeBrowser
- Results windows with side-by-side comparison
- Direct navigation to matching functions
Query API
Ghidra provides a Java API for:
- Incorporating queries into analyst scripts
- Programmatic ingestion of executables
- Marshaling queries and results between Ghidra sessions and BSim servers
Database Configuration
BSim databases can be configured for different scenarios:
Database Setup
Create and configure PostgreSQL, H2, or Elasticsearch backends
Feature Weights
Customize feature weights for domain-specific matching
Ingest Process
Batch ingest executables from repositories
Query Interface
Interactive and programmatic query options
Querying BSim Database
Query Types
- Single Function Query: Search for similar functions to a specific function
- Batch Query: Query multiple functions at once
- Overview Query: Get database statistics and metadata
- Executable Query: Find similar executables in the database
Query Parameters
- Similarity threshold (0.0 to 1.0) for matching functions
- Confidence threshold to filter low-quality matches
- Maximum number of results to return
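How the three parameters interact can be sketched in a few lines of Python. The `Match` record and `apply_query_parameters` function are hypothetical stand-ins, not part of Ghidra's API; the sketch only shows thresholds being applied before the result cap.

```python
from dataclasses import dataclass


@dataclass
class Match:
    function: str
    executable: str
    similarity: float  # 0.0 to 1.0
    confidence: float  # higher means more evidence behind the match


def apply_query_parameters(matches, min_similarity=0.7, min_confidence=0.0, max_results=100):
    """Drop weak matches, rank the rest, and cap the number returned."""
    kept = [
        m for m in matches
        if m.similarity >= min_similarity and m.confidence >= min_confidence
    ]
    kept.sort(key=lambda m: (m.similarity, m.confidence), reverse=True)
    return kept[:max_results]


# Made-up candidate matches for one queried function.
candidates = [
    Match("parse_header", "libfoo.so", 0.95, 40.0),
    Match("parse_header", "libfoo_old.so", 0.82, 35.0),
    Match("tiny_stub", "libbar.so", 0.99, 2.0),   # similar, but too little evidence
    Match("unrelated", "libbaz.so", 0.40, 50.0),  # plenty of evidence, not similar
]

filtered = apply_query_parameters(candidates, min_similarity=0.7, min_confidence=10.0, max_results=10)
```

The tiny stub shows why a confidence threshold is useful alongside similarity: very small functions can score as near-identical while carrying too few features for the match to mean much.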
Using the Plugin
Ingesting Executables
Prerequisites
- Executables must be analyzed in Ghidra
- Decompilation must be run on functions
- Database must be created and accessible
Ingest Workflow
Ingest Options
Function Filtering
Filter which functions to ingest based on size, complexity, or other criteria
Metadata Tags
Batch Processing
Process large numbers of executables automatically
Update Mode
Re-ingest updated executables without duplicates
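Function filtering and update mode can be pictured with a small Python sketch. It is an illustration under stated assumptions, not BSim's ingest code: the dict-based `catalog`, the `min_size` cutoff, and keying records on an MD5 digest of the file contents are all choices made for this example.

```python
import hashlib


def select_functions(functions, min_size=16):
    """Keep only functions large enough to carry useful features."""
    return [f for f in functions if f["size"] >= min_size]


def ingest(catalog, name, image_bytes, functions, min_size=16):
    """Add an executable to the catalog, or replace its record if already present."""
    digest = hashlib.md5(image_bytes).hexdigest()  # assumed identity key
    is_update = digest in catalog
    catalog[digest] = {
        "name": name,
        "functions": select_functions(functions, min_size),
    }
    return "updated" if is_update else "added"


# Made-up executable contents and function list.
catalog = {}
image = b"\x7fELF-demo-contents"
functions = [{"name": "main", "size": 120}, {"name": "thunk", "size": 4}]

first = ingest(catalog, "libfoo.so", image, functions)
second = ingest(catalog, "libfoo.so", image, functions)
```

After both calls the catalog still holds a single record containing only `main`; `first` is `"added"` and `second` is `"updated"`.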
Command-Line Reference
The bsim command-line utility provides comprehensive database management:

| Command | Description |
|---|---|
| createdatabase | Initialize a new BSim database |
| dropdatabase | Delete an existing database |
| ingest | Add executables to database |
| update | Update existing executable records |
| delete | Remove executables from database |
| queryfunctions | Query for similar functions |
| queryexe | Query for similar executables |
| dumpdb | Export database contents |
| installmetadata | Install database schema |
Advanced Features
Features and Weights
Customize how BSim weights different aspects of function behavior:
- Data-flow features: Weight importance of data operations
- Control-flow features: Emphasize branching patterns
- Call graph features: Consider function call relationships
- Constant features: Factor in constant values used
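The effect of re-weighting a feature category can be shown with a weighted cosine similarity. This Python sketch assumes a simple per-category multiplier and tags each feature with its category; both are simplifications for illustration, and the category names and `weighted_cosine` helper are hypothetical.

```python
import math


def weighted_cosine(a, b, weights):
    """Cosine similarity where each feature count is scaled by its category weight.

    Features are (category, feature_hash) tuples mapped to occurrence counts.
    Categories missing from `weights` default to 1.0.
    """
    def scaled(vector, key):
        return weights.get(key[0], 1.0) * vector.get(key, 0)

    keys = set(a) | set(b)
    dot = sum(scaled(a, k) * scaled(b, k) for k in keys)
    norm_a = math.sqrt(sum(scaled(a, k) ** 2 for k in keys))
    norm_b = math.sqrt(sum(scaled(b, k) ** 2 for k in keys))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Two functions with the same data-flow feature but different constants.
func_a = {("data", 0x11): 1, ("const", 0x09): 1}
func_b = {("data", 0x11): 1, ("const", 0x07): 1}

equal_weights = weighted_cosine(func_a, func_b, {})
constants_downweighted = weighted_cosine(func_a, func_b, {"const": 0.5})
```

With equal weights the pair scores about 0.5; halving the weight of constant features raises it to about 0.8, because the differing constants now contribute less to each vector's length.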
Performance Optimization
Use Cases
Malware Analysis
Identify known malware families and variants
Vulnerability Research
Find vulnerable code patterns across executables
Library Detection
Recognize commercial and open-source libraries
Code Reuse
Track code reuse and software lineage
Source Code References
Database Maintenance
Regular Tasks
- Backup: Regularly backup your BSim database
- Vacuum: Run database optimization (PostgreSQL)
- Monitor: Track database size and query performance
- Update: Keep ingested executables synchronized with analysis
Troubleshooting
Slow Queries
Check database indexes, increase memory allocation, or optimize feature weights
Connection Issues
Verify network connectivity, database server status, and authentication credentials
Ingest Failures
Ensure executables are fully analyzed and decompiled in Ghidra
Next Steps
Debugger
Perform dynamic analysis on executables
Version Tracking
Track changes between program versions
