Overview
Each function is fingerprinted using three complementary approaches:

| Hash Type | Focus | Algorithm | Use Case |
|---|---|---|---|
| Structural Hash | Instruction flow | SHA256 + MinHash | Control-flow matching |
| Content IR1 Hash | String literals | SHA256 + MinHash | Literal value matching |
| Content IR2 Hash | Identifiers & objects | SHA256 + MinHash | API structure matching |

All hashes are computed from intermediate representations (IRs) produced by the ToIR() method.
See IR Normalization for details on how IRs are generated.
Hash Type 1: Structural Hash
Captures the control-flow shape of a function independent of concrete values.
Input
Structural IR: pipe-delimited sequence of instruction names with a parameter count prefix.
Exact Match Hash
Computes SHA256 of the structural IR string
Output: 64-character hexadecimal hash
Field: structural_hash
Fuzzy Match Hash
Generates 128 hash permutations from instruction bigrams
Creates tokens like LoadParam→Add
Output: array of 128 hash signatures
Field: structural_minhash
Source: pkg/analyzer/minhasher.go
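
The two structural fingerprints can be illustrated with a minimal Go sketch. It assumes the instruction names are already available as an ordered slice; `ExactHash` and `Bigrams` are hypothetical helper names, and the exact layout of the parameter count prefix is deliberately not modeled.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// ExactHash returns the 64-character lowercase hex SHA256 digest of an IR string.
func ExactHash(ir string) string {
	sum := sha256.Sum256([]byte(ir))
	return hex.EncodeToString(sum[:])
}

// Bigrams pairs each instruction with its successor, joined by "→".
// Fewer than two instructions yield no tokens.
func Bigrams(instrs []string) []string {
	if len(instrs) < 2 {
		return nil
	}
	out := make([]string, 0, len(instrs)-1)
	for i := 0; i+1 < len(instrs); i++ {
		out = append(out, instrs[i]+"→"+instrs[i+1])
	}
	return out
}
```

For example, `Bigrams([]string{"LoadParam", "Add", "Ret"})` yields the tokens `LoadParam→Add` and `Add→Ret`, which would then feed the MinHash signature.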
Hash Type 2: Content IR1 Hash
Captures string literals (non-identifier strings) used in the function.
Input
Content IR1: pipe-delimited, sorted list of lowercased string literals.
Exact Match Hash
Computes SHA256 of the content IR1 string
Output: 64-character hexadecimal hash (empty string if no literals)
Field: content_ir1_hash
Fuzzy Match Hash
Generates 128 hash permutations from string tokens and trigrams
Each string ≥3 chars generates shingled tokens: error → ["err", "rro", "ror"]
Output: array of 128 hash signatures
Field: content_ir1_minhash
Hash Type 3: Content IR2 Hash
Captures identifiers and object structures referenced in the function.
Input
Content IR2: pipe-delimited, sorted list of identifiers and object references.
Exact Match Hash
Computes SHA256 of the content IR2 string
Output: 64-character hexadecimal hash (empty string if no identifiers)
Field: content_ir2_hash
Fuzzy Match Hash
Generates 128 hash permutations from identifier tokens and trigrams
Each identifier ≥3 chars generates shingled tokens
Output: array of 128 hash signatures
Field: content_ir2_minhash
Hash Generation Pipeline
1. Disassemble Bytecode
Parse the HBC file and create FunctionObject representations.
Source: pkg/hbc/normalizer.go:20
2. Generate IRs
Extract structural and content IRs.
Source: pkg/hbc/types/functionobject.go:144
3. Compute Hashes
Generate both exact and fuzzy hashes.
Source: pkg/analyzer/compute.go
4. Store in Database
Hashes are stored per package per React Native version.
Source: pkg/database/models/package_hashes_model.go
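
For the content IRs, steps 2 and 3 can be sketched roughly as follows. This is an illustration under assumptions, not the code in the files cited above: `BuildContentIR`, `ContentHash`, and `Trigrams` are hypothetical names, the sketch does not deduplicate literals, and it shingles by Unicode code point.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"sort"
	"strings"
)

// BuildContentIR lowercases the literals, sorts them, and joins them with "|".
func BuildContentIR(literals []string) string {
	lowered := make([]string, len(literals))
	for i, s := range literals {
		lowered[i] = strings.ToLower(s)
	}
	sort.Strings(lowered)
	return strings.Join(lowered, "|")
}

// ContentHash returns the hex SHA256 of the IR, or "" when the IR is empty.
func ContentHash(ir string) string {
	if ir == "" {
		return ""
	}
	sum := sha256.Sum256([]byte(ir))
	return hex.EncodeToString(sum[:])
}

// Trigrams returns every 3-character window of s.
// Strings shorter than three characters yield no shingles.
func Trigrams(s string) []string {
	r := []rune(s)
	if len(r) < 3 {
		return nil
	}
	out := make([]string, 0, len(r)-2)
	for i := 0; i+3 <= len(r); i++ {
		out = append(out, string(r[i:i+3]))
	}
	return out
}
```

With these helpers, `Trigrams("error")` produces `err`, `rro`, `ror`, and a function with no literals ends up with an empty exact hash rather than the digest of an empty string.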
Database Schema
Each function hash document contains:

| Field | Type | Description |
|---|---|---|
| function_name | string | Function identifier |
| bytecode_size | int | Size in bytes |
| param_count | int | Number of parameters |
| structural_hash | string | SHA256 of structural IR |
| structural_minhash | []uint64 | MinHash signatures (128) |
| content_ir1_hash | string | SHA256 of content IR1 |
| content_ir1_minhash | []uint64 | MinHash signatures (128) |
| content_ir2_hash | string | SHA256 of content IR2 |
| content_ir2_minhash | []uint64 | MinHash signatures (128) |
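
As an illustration, the document could map onto a Go struct like the one below. The struct name, the `bson` tags, and the `HasFullSignatures` helper are assumptions made for this sketch; the actual model lives in pkg/database/models/package_hashes_model.go and may differ.

```go
package main

// SignatureSize is the MinHash signature length listed in the table.
const SignatureSize = 128

// FunctionHash mirrors the per-function fields listed in the table (hypothetical type).
type FunctionHash struct {
	FunctionName      string   `bson:"function_name"`
	BytecodeSize      int      `bson:"bytecode_size"`
	ParamCount        int      `bson:"param_count"`
	StructuralHash    string   `bson:"structural_hash"`
	StructuralMinHash []uint64 `bson:"structural_minhash"`
	ContentIR1Hash    string   `bson:"content_ir1_hash"`
	ContentIR1MinHash []uint64 `bson:"content_ir1_minhash"`
	ContentIR2Hash    string   `bson:"content_ir2_hash"`
	ContentIR2MinHash []uint64 `bson:"content_ir2_minhash"`
}

// HasFullSignatures reports whether all three MinHash arrays hold 128 values.
func (f FunctionHash) HasFullSignatures() bool {
	return len(f.StructuralMinHash) == SignatureSize &&
		len(f.ContentIR1MinHash) == SignatureSize &&
		len(f.ContentIR2MinHash) == SignatureSize
}
```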
Matching Strategy
During analysis, functions are matched using a cascading approach:
1. Length Pre-filtering
2. Exact Match (SHA256)
3. Fuzzy Match (MinHash)
Only if fuzzy matching is enabled (-f flag).
Source: pkg/cmd/analyze.go
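
The cascade can be sketched as below. This is a simplified illustration, not the logic in pkg/cmd/analyze.go: the `Candidate` type, the use of bytecode size as the "length", and the `sizeTolerance` and `threshold` parameters are all assumptions. Similarity is estimated as the fraction of signature positions that agree, which is the standard MinHash estimate of Jaccard similarity.

```go
package main

// Candidate is a hypothetical stored fingerprint for one IR of one function.
type Candidate struct {
	Name         string
	BytecodeSize int
	Hash         string   // SHA256 of the IR
	MinHash      []uint64 // MinHash signature of the same IR
}

// similarity returns the fraction of positions where the two signatures agree.
func similarity(a, b []uint64) float64 {
	if len(a) == 0 || len(a) != len(b) {
		return 0
	}
	equal := 0
	for i := range a {
		if a[i] == b[i] {
			equal++
		}
	}
	return float64(equal) / float64(len(a))
}

// Match applies the length pre-filter, then exact matching, then (optionally) fuzzy matching.
func Match(q Candidate, db []Candidate, fuzzy bool, sizeTolerance, threshold float64) (string, bool) {
	// 1. Length pre-filtering: keep candidates whose size is close to the query's.
	var shortlist []Candidate
	for _, c := range db {
		diff := float64(c.BytecodeSize - q.BytecodeSize)
		if diff < 0 {
			diff = -diff
		}
		if diff <= sizeTolerance*float64(q.BytecodeSize) {
			shortlist = append(shortlist, c)
		}
	}
	// 2. Exact match on the SHA256 hash.
	for _, c := range shortlist {
		if c.Hash != "" && c.Hash == q.Hash {
			return c.Name, true
		}
	}
	// 3. Fuzzy match, only when enabled.
	if !fuzzy {
		return "", false
	}
	best, bestScore, found := "", 0.0, false
	for _, c := range shortlist {
		if s := similarity(q.MinHash, c.MinHash); s >= threshold && s > bestScore {
			best, bestScore, found = c.Name, s, true
		}
	}
	return best, found
}
```

Running the cheap checks first means the comparatively expensive signature comparison only touches candidates that survive the size filter and have no exact match.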
Performance Characteristics
Exact match: MongoDB indexed lookup on SHA256 hash fields
Fuzzy match: linear scan; length pre-filtering reduces the comparison space by ~80%
Storage: 6 hash fields per function (3 SHA256 + 3 MinHash arrays)
MinHash Implementation
Source:pkg/analyzer/minhasher.go
Parameters
Permutations: 128, the number of permutations used for signature generation
Hash function: a fast non-cryptographic hash used for token processing
Algorithm
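
A minimal sketch of how a 128-value MinHash signature can be computed from a token set. FNV-1a stands in for the fast non-cryptographic hash, and the per-permutation constants are illustrative; neither is taken from minhasher.go, and `Signature` is a hypothetical name.

```go
package main

import (
	"hash/fnv"
	"math"
)

// numPerm is the number of permutations, i.e. the signature length.
const numPerm = 128

// Signature computes a MinHash signature over a set of tokens.
// Each token is hashed once with FNV-1a; permutation i is simulated by the
// affine map a*h + b (mod 2^64) with an odd multiplier a, which is a bijection.
// Slot i keeps the minimum permuted value seen across all tokens.
func Signature(tokens []string) []uint64 {
	sig := make([]uint64, numPerm)
	for i := range sig {
		sig[i] = math.MaxUint64 // an empty token set leaves every slot at the maximum
	}
	for _, tok := range tokens {
		h := fnv.New64a()
		h.Write([]byte(tok))
		base := h.Sum64()
		for i := 0; i < numPerm; i++ {
			a := uint64(2*i+1) * 0x9E3779B97F4A7C15 // odd, so the map permutes uint64
			b := uint64(i) * 0xBF58476D1CE4E5B9
			if v := a*base + b; v < sig[i] {
				sig[i] = v
			}
		}
	}
	return sig
}
```

Because each slot stores a minimum, the signature depends only on the set of tokens, not on their order or multiplicity. Two functions with similar token sets agree in many slots, and the fraction of agreeing slots estimates their Jaccard similarity.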
Use Cases by Hash Type
| Scenario | Primary Hash | Fallback Hash |
|---|---|---|
| Exact code match | Structural | - |
| Renamed variables | Structural | Content IR1 |
| Internationalized strings | Structural | Content IR2 |
| Obfuscated code | Content IR1/IR2 | Structural |
| Refactored code | Content IR2 | Content IR1 |