
Overview

Ghidra’s BSim (Behavioral Similarity) Database allows reverse engineers to ingest metadata about previously analyzed binary executables into a central server or a local database. The database can then be queried to quickly identify previously seen functions and libraries in new, unknown executables.

Key Features

Compilation Tolerant

Queries tolerate variations in function compilation

Fast Indexing

All records are indexed for quick queries, even with millions of functions

Decompiler-Based

Uses p-code from Ghidra’s decompiler for robust matching

Nearest Neighbor

Supports fuzzy matching with configurable similarity thresholds

How It Works

Feature Extraction

BSim extracts features from a concise description of function data-flow, not explicit machine instructions:
  • Based on Ghidra’s intermediate representation language (p-code)
  • Generated by the Ghidra decompiler
  • Graph-based abstract syntax tree representation
  • Normalized to minimize compilation variation impact

Normalized Comparisons

The resulting function descriptions are normalized to tolerate variations due to:
  • Different machine instructions that perform the same operation
  • Variations in register allocation, stack usage, and memory locations
  • Compiler-dependent instruction reordering
  • Many forms of compiler optimization and transformation
  • Even some forms of deliberate code obfuscation
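
To make the idea of normalization concrete, here is a toy sketch in Python. It is not BSim's algorithm: real features are derived from the decompiler's data-flow representation of p-code. The sketch only shows how renaming storage locations in order of first use lets two register allocations of the same computation produce the same feature set. The operation tuples and the `normalize` helper are hypothetical.

```python
def normalize(ops):
    """Rename each storage location to a placeholder in order of first use."""
    names = {}
    normalized = []
    for opcode, *operands in ops:
        renamed = []
        for operand in operands:
            if operand not in names:
                names[operand] = f"v{len(names)}"
            renamed.append(names[operand])
        normalized.append((opcode, *renamed))
    return normalized


# The same computation under two different register allocations (made-up input).
build_a = [("INT_ADD", "eax", "ebx"), ("INT_MULT", "eax", "ecx")]
build_b = [("INT_ADD", "r8", "r9"), ("INT_MULT", "r8", "r10")]

features_a = set(normalize(build_a))
features_b = set(normalize(build_b))
print(features_a == features_b)
```

Once the register names are abstracted away, both builds yield identical features, which is the property that lets a query tolerate compiler differences.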

Text Retrieval Strategies

Records are indexed using text retrieval strategies enabling:
  • Nearest neighbor queries: Features don’t need exact matches
  • Configurable similarity: Set percentage thresholds for matches
  • Functional tolerance: Match even when source code has changed slightly
  • Fast queries: Results for a single function typically return in a fraction of a second, even for a database containing millions of functions
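
A nearest neighbor query with a similarity threshold can be sketched as follows. This is a minimal illustration, assuming each function is a sparse vector of weighted features compared by cosine similarity. It scans every record linearly, whereas a real database relies on an index. The feature names (`h1`, `h2`, ...) and function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse feature vectors (dict: feature -> weight)."""
    dot = sum(weight * b.get(feature, 0.0) for feature, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(query, database, threshold=0.7):
    """Return (name, similarity) pairs at or above the threshold, best match first."""
    scored = [(name, cosine(query, vector)) for name, vector in database.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


database = {
    "memcpy_v1": {"h1": 1.0, "h2": 1.0, "h3": 1.0},
    "memcpy_v2": {"h1": 1.0, "h2": 1.0, "h4": 1.0},  # shares two of three features
    "strlen": {"h5": 1.0, "h6": 1.0},                # shares none
}
query = {"h1": 1.0, "h2": 1.0, "h3": 1.0}

for name, score in nearest(query, database, threshold=0.6):
    print(f"{name}: {score:.2f}")
```

Lowering the threshold from 0.7 to 0.6 admits `memcpy_v2`, whose similarity is about 0.67; this is the trade-off between recall and noise that a configurable threshold controls.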

Database Technologies

BSim supports three database backends:
| Backend | Use Case | Features |
| --- | --- | --- |
| PostgreSQL | Production | Robust, multi-connection, fault-tolerant server |
| Elasticsearch | Distributed | Scalable across clusters, distributed indexing |
| H2 (local) | Development | Convenience for small personal collections |
PostgreSQL server software is currently only supported on Linux and macOS. Elasticsearch must be obtained separately. H2 databases are supported on all platforms.
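
Each backend is addressed by a URL. The helper below is a small sketch of how such URLs might be assembled. The `postgresql://host:port/name` form appears elsewhere on this page; the `elastic://` and `file:` schemes are assumptions here and should be checked against the BSim help for your Ghidra version. The `bsim_url` function itself is hypothetical and not part of Ghidra.

```python
def bsim_url(backend, location, host="localhost", port=None):
    """Build a database URL for a backend (schemes other than postgresql are assumed)."""
    if backend == "h2":
        # Local H2 database: 'location' is an absolute filesystem path.
        return f"file:{location}"
    schemes = {"postgresql": "postgresql", "elasticsearch": "elastic"}
    if backend not in schemes:
        raise ValueError(f"unknown backend: {backend}")
    authority = host if port is None else f"{host}:{port}"
    return f"{schemes[backend]}://{authority}/{location}"


print(bsim_url("postgresql", "malware_db", port=5432))
print(bsim_url("elasticsearch", "malware_db", port=9200))
print(bsim_url("h2", "/home/user/bsim/malware_db"))
```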

Integration with Ghidra

Ghidra Server Integration

  1. Repository Integration: Ingest from Ghidra Server or local project repositories
  2. Query Results: Results reference executables within repositories
  3. Command-Line Tools: Easy ingestion using the bsim command script

Plugin Client

Ghidra includes a plugin client that integrates:
  • Query dialog directly in the main CodeBrowser
  • Results windows with side-by-side comparison
  • Direct navigation to matching functions
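# Command-line ingestion example
# createdatabase takes a configuration template such as medium_nosize
bsim createdatabase postgresql://localhost/mydb medium_nosize
# Generate signatures from a local Ghidra project into a scratch directory, then commit them
bsim generatesigs ghidra:/path/to/ghidra/project /tmp/bsim_sigs --bsim postgresql://localhost/mydb
bsim commitsigs postgresql://localhost/mydb /tmp/bsim_sigs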

Query API

Ghidra provides a Java API for:
  • Incorporating queries into analyst scripts
  • Programmatic ingestion of executables
  • Marshaling queries and results between Ghidra sessions and BSim servers
// Source: BSimOverview.html:168-177
// The API allows queries and ingest to be incorporated
// into analyst scripts, marshaling data between an active
// Ghidra session and a BSim server

Database Configuration

BSim databases can be configured for different scenarios:

Database Setup

Create and configure PostgreSQL, H2, or Elasticsearch backends

Feature Weights

Customize feature weights for domain-specific matching

Ingest Process

Batch ingest executables from repositories

Query Interface

Interactive and programmatic query options

Querying BSim Database

Query Types

  1. Single Function Query: Search for similar functions to a specific function
  2. Batch Query: Query multiple functions at once
  3. Overview Query: Summarize how many database matches each function in the current program has
  4. Executable Query: Find similar executables in the database

Query Parameters

  • similarity (number, default 0.7): Similarity threshold (0.0 to 1.0) for matching functions
  • confidence (number, default 0.0): Confidence threshold to filter low-quality matches
  • maxResults (number, default 100): Maximum number of results to return
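
The way these three parameters interact can be sketched in a few lines of Python. This is an illustration only, not the plugin's implementation: the `Match` record, the `apply_parameters` helper, and all scores below are made up, and the sort order is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Match:
    function: str
    similarity: float
    confidence: float


def apply_parameters(matches, similarity=0.7, confidence=0.0, max_results=100):
    """Keep matches that meet both thresholds, best first, capped at max_results."""
    kept = [
        m for m in matches
        if m.similarity >= similarity and m.confidence >= confidence
    ]
    kept.sort(key=lambda m: (m.similarity, m.confidence), reverse=True)
    return kept[:max_results]


matches = [
    Match("parse_header", 0.95, 42.0),
    Match("parse_hdr_old", 0.81, 3.5),
    Match("checksum", 0.72, 12.0),
    Match("init", 0.40, 1.0),
]

for m in apply_parameters(matches, confidence=10.0):
    print(m.function, m.similarity, m.confidence)
```

Raising the confidence threshold drops `parse_hdr_old` even though its similarity is high, which is the usual reason to set it: small functions can match closely by coincidence.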

Using the Plugin

  1. Select Function: Right-click on a function in the CodeBrowser listing
  2. Launch Query: Select BSim → Search for Similar Functions
  3. Configure Search: Set similarity threshold and other parameters
  4. Review Results: Examine matches in the results window with similarity scores

Ingesting Executables

Prerequisites

  • Executables must be analyzed in Ghidra
  • Decompilation must be run on functions
  • Database must be created and accessible

Ingest Workflow

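# Create database (the second argument is a configuration template)
bsim createdatabase postgresql://localhost:5432/malware_db medium_nosize

# Generate signatures from a Ghidra Server repository, then commit them
bsim generatesigs ghidra://server/repo /tmp/bsim_sigs --bsim postgresql://localhost:5432/malware_db
bsim commitsigs postgresql://localhost:5432/malware_db /tmp/bsim_sigs

# Generate signatures from a local project, then commit them
bsim generatesigs ghidra:/home/user/ghidra_project /tmp/bsim_sigs_local --bsim postgresql://localhost:5432/malware_db
bsim commitsigs postgresql://localhost:5432/malware_db /tmp/bsim_sigs_local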

Ingest Options

  • Filter which functions to ingest based on size, complexity, or other criteria
  • Associate custom metadata tags with ingested executables
  • Process large numbers of executables automatically
  • Re-ingest updated executables without duplicates

Command-Line Reference

The bsim command-line utility provides comprehensive database management:
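| Command | Description |
| --- | --- |
| createdatabase | Initialize a new BSim database from a configuration template |
| setmetadata | Set the name, owner, or description of a database |
| generatesigs | Generate function signatures from a Ghidra project or repository |
| commitsigs | Add generated signatures to a database |
| generateupdates | Generate metadata updates for executables already in a database |
| commitupdates | Apply generated metadata updates to a database |
| delete | Remove an executable from a database |
| listexes | List the executables in a database |
| listfuncs | List the functions of an executable in a database |
| dumpsigs | Export signatures from a database to XML files |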
Ensure you have proper database permissions before running administrative commands.

Advanced Features

Features and Weights

Customize how BSim weights different aspects of function behavior:
  • Data-flow features: Weight importance of data operations
  • Control-flow features: Emphasize branching patterns
  • Call graph features: Consider function call relationships
  • Constant features: Factor in constant values used
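
As an illustration of what a feature weight does, the sketch below derives weights with inverse document frequency, a generic text retrieval scheme in which features that occur in many functions count for less. It is an assumption for teaching purposes, not a description of the weight files BSim ships or of how they were produced. The feature names and the `idf_weights` helper are hypothetical.

```python
import math
from collections import Counter


def idf_weights(functions):
    """Weight each feature by log(1 + N / df), where df is the number of functions containing it."""
    total = len(functions)
    doc_freq = Counter(
        feature for features in functions for feature in set(features)
    )
    return {
        feature: math.log(1.0 + total / count)
        for feature, count in doc_freq.items()
    }


corpus = [
    ["prologue", "loop", "crc_table"],
    ["prologue", "loop"],
    ["prologue", "call_malloc"],
    ["prologue"],
]

weights = idf_weights(corpus)
for feature, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{feature}: {weight:.3f}")
```

A feature present in every function (`prologue`) ends up with the lowest weight, so it contributes little to a match, while a rare feature (`crc_table`) dominates.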

Performance Optimization

For optimal performance with large databases:
  • Use PostgreSQL or Elasticsearch for production
  • Configure appropriate database indexes
  • Allocate sufficient memory to database server
  • Use batch queries when analyzing multiple functions

Use Cases

Malware Analysis

Identify known malware families and variants

Vulnerability Research

Find vulnerable code patterns across executables

Library Detection

Recognize commercial and open-source libraries

Code Reuse

Track code reuse and software lineage

Source Code References

# Main implementation
Ghidra/Features/BSim/

# Help documentation
Ghidra/Features/BSim/src/main/help/help/topics/BSim/

# Database schemas
Ghidra/Features/BSim/data/

# Command-line scripts
Ghidra/Features/BSim/support/

Database Maintenance

Regular Tasks

  • Backup: Regularly backup your BSim database
  • Vacuum: Run database optimization (PostgreSQL)
  • Monitor: Track database size and query performance
  • Update: Keep ingested executables synchronized with analysis
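
For a PostgreSQL backend, the backup and vacuum tasks can be scripted with the standard PostgreSQL client tools. The sketch below only builds the command lines for `pg_dump` and `vacuumdb`; it assumes those tools are on the PATH and that a plain dump is an adequate backup for your deployment, which this page does not confirm. The helper names, database name, and output path are hypothetical.

```python
import shlex


def backup_command(dbname, outfile, host="localhost", port=5432):
    """pg_dump in custom format (-F c), written to outfile."""
    return ["pg_dump", "-h", host, "-p", str(port), "-F", "c", "-f", outfile, dbname]


def vacuum_command(dbname, host="localhost", port=5432):
    """vacuumdb with --analyze to reclaim space and refresh planner statistics."""
    return ["vacuumdb", "-h", host, "-p", str(port), "--analyze", dbname]


# Print the commands for review; pass a list to subprocess.run(cmd, check=True) to execute it.
print(shlex.join(backup_command("malware_db", "/backups/malware_db.dump")))
print(shlex.join(vacuum_command("malware_db")))
```

Building the argument list separately from running it makes the commands easy to log, review, or schedule from cron.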

Troubleshooting

  • Slow queries: Check database indexes, increase memory allocation, or optimize feature weights
  • Connection failures: Verify network connectivity, database server status, and authentication credentials
  • Ingest problems: Ensure executables are fully analyzed and decompiled in Ghidra

Next Steps

Debugger

Perform dynamic analysis on executables

Version Tracking

Track changes between program versions
