Overview
The normalization process extracts different aspects of function bytecode:

| IR Type | Focus | Use Case |
|---|---|---|
| Structural IR | Instruction sequences | Control-flow shape matching |
| Content IR1 | Non-identifier strings | Literal value matching |
| Content IR2 | Identifiers & objects | API and structure matching |

The IRs are generated by the `ToIR()` method in `pkg/hbc/types/functionobject.go:144`.
Normalization Levels
Level 0: No Normalization
Raw bytecode disassembly without any IR generation. Used for direct inspection.
Level 1: Content IR1 (Non-Identifier Strings)
Extracts and normalizes non-identifier string literals from the bytecode.
Input: resolved instructions with `ResolvedRichData` entries.
Process:
- Extract all `STRING` type entries where `IsIdentifier = false`
- Convert to lowercase
- Strip pipe characters (`|`)
- Sort alphabetically
- Join with pipe delimiter

Output: pipe-delimited, sorted list of lowercased string literals.
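The Level 1 steps can be sketched in Go as follows. This is a minimal illustration, not the project's `ToIR()` code: the `richEntry` type and the `contentIR1` function are hypothetical stand-ins for a `ResolvedRichData` entry and its processing, and the sketch assumes duplicate strings are kept.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// richEntry is a hypothetical stand-in for one ResolvedRichData entry.
type richEntry struct {
	Type         string // e.g. "STRING" or "OBJECT"
	Value        string
	IsIdentifier bool
}

// contentIR1 builds the Level 1 IR from non-identifier STRING entries.
// Assumption: duplicates are kept; the source does not say otherwise.
func contentIR1(entries []richEntry) string {
	var parts []string
	for _, e := range entries {
		if e.Type != "STRING" || e.IsIdentifier {
			continue
		}
		s := strings.ToLower(e.Value)
		s = strings.ReplaceAll(s, "|", "") // the pipe is reserved as the delimiter
		parts = append(parts, s)
	}
	sort.Strings(parts)
	return strings.Join(parts, "|")
}

func main() {
	entries := []richEntry{
		{Type: "STRING", Value: "Hello|World"},
		{Type: "STRING", Value: "fetch", IsIdentifier: true},
		{Type: "STRING", Value: "API Error"},
	}
	fmt.Println(contentIR1(entries)) // api error|helloworld
}
```

Stripping pipes before joining keeps the delimiter unambiguous, so the IR can later be split back into its literals.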
Level 2: Content IR2 (Identifiers & Objects)
Extracts identifiers and object structures from the bytecode.
Input: resolved instructions with `ResolvedRichData` entries.
Process:
- Extract all `STRING` entries where `IsIdentifier = true`
- Extract all `OBJECT` entries where `IsIdentifier = false`
- Convert to lowercase
- Strip pipe characters (`|`)
- Sort alphabetically
- Join with pipe delimiter

Output: pipe-delimited, sorted list of identifiers and object references.
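Level 2 differs from Level 1 only in which entries it selects, so the sketch below isolates that selection in a predicate. The `richEntry` type, `keepForIR2`, and `contentIR2` are hypothetical names for illustration, and the textual form of an `OBJECT` value (here simply `Response`) is an assumption.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// richEntry is a hypothetical stand-in for one ResolvedRichData entry.
type richEntry struct {
	Type         string
	Value        string
	IsIdentifier bool
}

// keepForIR2 reports whether an entry belongs in Content IR2:
// identifier strings, plus objects not flagged as identifiers.
func keepForIR2(e richEntry) bool {
	switch e.Type {
	case "STRING":
		return e.IsIdentifier
	case "OBJECT":
		return !e.IsIdentifier
	}
	return false
}

// contentIR2 lowercases, strips pipes, sorts, and joins the kept entries.
func contentIR2(entries []richEntry) string {
	var parts []string
	for _, e := range entries {
		if keepForIR2(e) {
			s := strings.ReplaceAll(strings.ToLower(e.Value), "|", "")
			parts = append(parts, s)
		}
	}
	sort.Strings(parts)
	return strings.Join(parts, "|")
}

func main() {
	entries := []richEntry{
		{Type: "STRING", Value: "fetch", IsIdentifier: true},
		{Type: "STRING", Value: "Hello"},
		{Type: "OBJECT", Value: "Response"},
	}
	fmt.Println(contentIR2(entries)) // fetch|response
}
```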
Structural IR
The structural IR captures instruction flow independent of concrete values.
Input: sequence of normalized instructions.
Process:
- Prepend parameter count: `pc=N|`
- Append each instruction name followed by `|`
- Create bigrams for tokenization: `Inst1→Inst2`

Output: pipe-delimited sequence of instruction names with parameter count prefix.
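As a rough sketch of the structural steps, the Go snippet below builds the IR string and the bigram tokens from a list of instruction names. The function names `structuralIR` and `bigrams` are hypothetical, the instruction names are only illustrative, and the sketch assumes the `pc=N` prefix is not itself part of any bigram.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// structuralIR renders "pc=N|" followed by each instruction name and a pipe.
func structuralIR(paramCount int, names []string) string {
	var b strings.Builder
	b.WriteString("pc=" + strconv.Itoa(paramCount) + "|")
	for _, n := range names {
		b.WriteString(n)
		b.WriteByte('|')
	}
	return b.String()
}

// bigrams pairs each instruction name with its successor.
// Assumption: the parameter-count prefix is excluded from the pairs.
func bigrams(names []string) []string {
	var out []string
	for i := 0; i+1 < len(names); i++ {
		out = append(out, names[i]+"→"+names[i+1])
	}
	return out
}

func main() {
	names := []string{"LoadParam", "GetById", "Call", "Ret"}
	fmt.Println(structuralIR(2, names)) // pc=2|LoadParam|GetById|Call|Ret|
	fmt.Println(bigrams(names))         // [LoadParam→GetById GetById→Call Call→Ret]
}
```

Bigrams preserve local ordering, so two functions that use the same instructions in a different order produce different token sets.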
Implementation Details
ToIR Method
Source: `pkg/hbc/types/functionobject.go:144`
Tokenization Methods
Structural IR Tokenization: Source: `pkg/hbc/types/functionobject.go:71`
Content IR Tokenization: Source: `pkg/hbc/types/functionobject.go:86`
Hashing Strategy
Each IR is hashed using SHA256 for exact matching and MinHash for fuzzy similarity:

| Hash Type | Algorithm | Purpose |
|---|---|---|
| Exact Match | SHA256 | Fast database lookup |
| Fuzzy Match | MinHash (128 permutations) | Similarity scoring |

Source: `pkg/analyzer/compute.go`
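To make the two hash types concrete, here is a minimal Go sketch using only the standard library. It is not the code in `pkg/analyzer/compute.go`: the function names are hypothetical, and seeding FNV-1a with the slot index is a simplification standing in for 128 independent permutations.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"hash/fnv"
	"math"
)

const numPerm = 128 // number of MinHash slots, matching the table above

// exactHash returns the hex-encoded SHA256 digest of an IR string.
func exactHash(ir string) string {
	sum := sha256.Sum256([]byte(ir))
	return hex.EncodeToString(sum[:])
}

// tokenHash hashes a token together with a slot index used as a seed.
func tokenHash(seed uint32, token string) uint64 {
	h := fnv.New64a()
	h.Write([]byte{byte(seed), byte(seed >> 8), byte(seed >> 16), byte(seed >> 24)})
	h.Write([]byte(token))
	return h.Sum64()
}

// minHash keeps, for each slot, the smallest hash seen across all tokens.
func minHash(tokens []string) [numPerm]uint64 {
	var sig [numPerm]uint64
	for i := range sig {
		sig[i] = math.MaxUint64
	}
	for _, t := range tokens {
		for i := range sig {
			if h := tokenHash(uint32(i), t); h < sig[i] {
				sig[i] = h
			}
		}
	}
	return sig
}

// similarity estimates Jaccard similarity as the fraction of equal slots.
func similarity(a, b [numPerm]uint64) float64 {
	match := 0
	for i := range a {
		if a[i] == b[i] {
			match++
		}
	}
	return float64(match) / numPerm
}

func main() {
	a := []string{"A→B", "B→C", "C→D"}
	b := []string{"A→B", "B→C", "C→E"}
	fmt.Println(exactHash("pc=1|A|B|C|D|"))
	fmt.Printf("estimated similarity: %.2f\n", similarity(minHash(a), minHash(b)))
}
```

The two token sets in `main` share two of four distinct tokens, so their true Jaccard similarity is 0.5; the printed estimate should land near that value, with some variance from the limited number of slots.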
Use in Analysis
During the `analyze` command, all three IRs are compared:
- Exact matching - SHA256 lookup in MongoDB
- Fuzzy matching - MinHash Jaccard similarity with threshold (default 0.8)
- Length filtering - ±20% bytecode size tolerance before comparison
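The comparison logic above might look roughly like the following Go sketch. Several things here are assumptions made for illustration: the `record` type is hypothetical, an in-memory value replaces the MongoDB lookup, an exact Jaccard over token sets stands in for the MinHash estimate, and the ±20% tolerance is taken relative to the query's bytecode size.

```go
package main

import (
	"fmt"
	"math"
)

const (
	sizeTolerance    = 0.20 // ±20% bytecode size
	defaultThreshold = 0.8  // default fuzzy-match threshold
)

// record is a hypothetical stored entry for one known function.
type record struct {
	Name   string
	Hash   string // SHA256 of the IR
	Size   int    // bytecode size
	Tokens []string
}

// withinTolerance reports whether candSize lies within ±tol of querySize.
func withinTolerance(querySize, candSize int, tol float64) bool {
	diff := math.Abs(float64(candSize - querySize))
	return diff <= tol*float64(querySize)
}

// jaccard computes |A∩B| / |A∪B| over the distinct tokens of a and b.
func jaccard(a, b []string) float64 {
	setA := make(map[string]bool, len(a))
	for _, t := range a {
		setA[t] = true
	}
	setB := make(map[string]bool, len(b))
	for _, t := range b {
		setB[t] = true
	}
	inter := 0
	for t := range setA {
		if setB[t] {
			inter++
		}
	}
	union := len(setA) + len(setB) - inter
	if union == 0 {
		return 0
	}
	return float64(inter) / float64(union)
}

// match returns "exact", "fuzzy", or "" for a query against one record.
func match(q, c record, threshold float64) string {
	if q.Hash == c.Hash {
		return "exact"
	}
	if !withinTolerance(q.Size, c.Size, sizeTolerance) {
		return "" // skipped before any similarity computation
	}
	if jaccard(q.Tokens, c.Tokens) >= threshold {
		return "fuzzy"
	}
	return ""
}

func main() {
	q := record{Hash: "aa", Size: 100, Tokens: []string{"a", "b", "c", "d", "e"}}
	known := []record{
		{Name: "same", Hash: "aa", Size: 100, Tokens: q.Tokens},
		{Name: "near", Hash: "bb", Size: 110, Tokens: []string{"a", "b", "c", "d", "e", "f"}},
		{Name: "big", Hash: "cc", Size: 300, Tokens: q.Tokens},
	}
	for _, c := range known {
		fmt.Printf("%s: %q\n", c.Name, match(q, c, defaultThreshold))
	}
}
```

Running the length filter before the similarity check means candidates of very different size are rejected cheaply, which is why `big` yields no match even though its tokens are identical.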
Performance Considerations
- Operand details are excluded from structural IR to reduce storage size
- IRs are computed once during disassembly and cached in the database
- Trigram shingling increases token count 3x but enables partial string matching