Overview
The Pipeline (pkg/pipeline/) orchestrates the fingerprinting of npm packages across 11 React Native environments (versions 0.69-0.79). It’s designed for batch efficiency and resume capability, processing hundreds of packages with minimal database round-trips and the ability to recover from failures.
Pipeline Workflow
Each package goes through a 4-step transformation.

Step 1: Metro Bundling
pkg/pipeline/main.go:389
- Resolve all dependencies
- Tree-shake unused code
- Bundle into a single .js file
The global.MODULE_TO_BUNDLE = ModuleToBundle assignment prevents tree-shaking from removing the import.
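The entry file that pins the target module to a global can be generated with a small helper. This is an illustrative sketch, not the pipeline's actual code; the function name entryFileContents is hypothetical.

```go
package main

import "fmt"

// entryFileContents emits a minimal Metro entry point that imports the target
// package and assigns it to a global, so tree-shaking cannot drop the import.
// (Hypothetical helper; the real generation lives in pkg/pipeline/main.go.)
func entryFileContents(pkgName string) string {
	return fmt.Sprintf(
		"import ModuleToBundle from %q;\nglobal.MODULE_TO_BUNDLE = ModuleToBundle;\n",
		pkgName)
}

func main() {
	// Print the entry file that would be handed to Metro for the "axios" package.
	fmt.Print(entryFileContents("axios"))
}
```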
Step 2: Hermes Compilation
pkg/pipeline/main.go:483
- -O — enable optimizations (dead code elimination, constant folding)
- -emit-binary — output bytecode instead of an AST
- -out — output file path
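The compilation step boils down to one external command. A minimal sketch of how the flags above might be assembled and invoked, assuming the Hermes compiler binary is named hermesc and is on PATH (the real invocation is in pkg/pipeline/main.go):

```go
package main

import (
	"fmt"
	"os/exec"
)

// hermesArgs builds the hermesc argument list described above: optimize,
// emit bytecode rather than an AST, and write to outPath.
// (Illustrative helper; paths and the function name are assumptions.)
func hermesArgs(bundlePath, outPath string) []string {
	return []string{
		"-O",           // dead code elimination, constant folding
		"-emit-binary", // output Hermes bytecode instead of an AST
		"-out", outPath,
		bundlePath,
	}
}

func main() {
	cmd := exec.Command("hermesc", hermesArgs("bundle.js", "bundle.hbc")...)
	fmt.Println(cmd.String()) // show the command line; Run() would execute it
}
```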
Step 3: Disassembly
pkg/pipeline/main.go:529
The disassembly exposes per-function structures such as CreateFunctionObjects.
Step 4: Hash Generation
pkg/pipeline/main.go:595
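The hashing step can be sketched as SHA-256 over each disassembled function body, combined order-independently into one package fingerprint. This is an illustrative sketch only; the actual scheme lives in pkg/pipeline/main.go and may differ (function names here are assumptions).

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// functionHash fingerprints one disassembled function body.
func functionHash(body string) string {
	sum := sha256.Sum256([]byte(body))
	return hex.EncodeToString(sum[:])
}

// packageHash combines per-function hashes after sorting, so the result is
// stable regardless of the order the disassembler emits functions.
func packageHash(functionBodies []string) string {
	hashes := make([]string, len(functionBodies))
	for i, b := range functionBodies {
		hashes[i] = functionHash(b)
	}
	sort.Strings(hashes) // order-independent combination
	h := sha256.New()
	for _, fh := range hashes {
		h.Write([]byte(fh))
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	fmt.Println(packageHash([]string{"LoadConstString r0", "Ret r0"}))
}
```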
Parallel Processing Strategy
Multi-Environment Concurrency
The pipeline processes 11 RN versions in parallel using goroutines with semaphore-based throttling:

pkg/pipeline/rnprocessor.go:108
- RN environments: 11 environments run in parallel (limited to 4 concurrent by semaphore)
- Package groups: Within each environment, packages are processed sequentially by group
- Package versions: All versions of a package are processed sequentially
Within each environment, processing is sequential because all packages share a single node_modules directory; parallel npm installs would cause race conditions.
Package Grouping
pkg/pipeline/packages.go:62
Example: axios with 3 vulnerable versions:
- Without grouping: the three versions can be scattered across the run, each paying its own environment restore and npm install
- With grouping: all three versions are processed consecutively, so per-package setup work is paid once
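Grouping amounts to clustering the work list by package name before processing. A sketch of what pkg/pipeline/packages.go might look like (type and field names are illustrative assumptions):

```go
package main

import "fmt"

// Package identifies one version of an npm package to fingerprint.
type Package struct {
	Name    string
	Version string
}

// groupByName clusters all versions of the same package so they are
// processed back-to-back within an environment.
func groupByName(pkgs []Package) map[string][]Package {
	groups := make(map[string][]Package)
	for _, p := range pkgs {
		groups[p.Name] = append(groups[p.Name], p)
	}
	return groups
}

func main() {
	groups := groupByName([]Package{
		{"axios", "0.21.0"}, {"axios", "0.21.1"}, {"axios", "1.6.0"},
		{"lodash", "4.17.20"},
	})
	fmt.Println(len(groups["axios"]), "axios versions in one group")
}
```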
Environment Backup/Restore
Each RN environment is backed up before processing begins:

pkg/pipeline/clean.go
Why restore from backups instead of npm ci?
- Speed: copying directories (~2 seconds) vs. npm ci (~30 seconds)
- Reliability: no network dependency between packages
- Disk usage: ~300MB per RN environment × 11 = 3.3GB (acceptable)
Database Batching
BatchedWriter
pkg/pipeline/batcher.go:15
| Approach | Packages | Total Time | DB Calls |
|---|---|---|---|
| Individual inserts | 100 | ~45 seconds | 100 |
| Batched (100/batch) | 100 | ~8 seconds | 1 |
| Speedup | — | 5.6x faster | 99% fewer calls |
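The batching pattern behind these numbers can be sketched as a small buffer that flushes every 100 rows. This is illustrative only; the real BatchedWriter is in pkg/pipeline/batcher.go, and the flush callback here stands in for the actual MongoDB bulk write:

```go
package main

import "fmt"

// PackageHash is one fingerprint row destined for the database.
// (Field names are illustrative.)
type PackageHash struct {
	Name, Version, Hash string
}

// BatchedWriter buffers rows and flushes them in one bulk insert per
// batchSize rows, turning 100 round-trips into 1.
type BatchedWriter struct {
	batchSize int
	buf       []PackageHash
	flush     func([]PackageHash) error
}

func NewBatchedWriter(batchSize int, flush func([]PackageHash) error) *BatchedWriter {
	return &BatchedWriter{batchSize: batchSize, flush: flush}
}

func (w *BatchedWriter) Add(h PackageHash) error {
	w.buf = append(w.buf, h)
	if len(w.buf) >= w.batchSize {
		return w.Flush()
	}
	return nil
}

// Flush writes any buffered rows; call it once more at end of run so at
// most one partial batch is ever lost on a crash.
func (w *BatchedWriter) Flush() error {
	if len(w.buf) == 0 {
		return nil
	}
	batch := w.buf
	w.buf = nil // detach so the callback may retain the slice safely
	return w.flush(batch)
}

func main() {
	calls := 0
	w := NewBatchedWriter(100, func(batch []PackageHash) error {
		calls++
		fmt.Printf("bulk insert of %d rows (call %d)\n", len(batch), calls)
		return nil
	})
	for i := 0; i < 250; i++ {
		w.Add(PackageHash{Name: "pkg", Version: "1.0.0", Hash: "abc"})
	}
	w.Flush() // final partial batch of 50
}
```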
Progress Tracking
The pipeline supports resume capability via JSON-based progress files:

pkg/pipeline/progress.go:23
React Native Environments
The pipeline expects 11 pre-configured RN environments in pipeline/react-natives/. Each environment contains:
- package.json with a react-native dependency
- node_modules/ with RN + Hermes installed
- baseline_entry.js for baseline fingerprinting
- metro.config.js with the Metro bundler configuration
pipeline/setup_all_environments.sh
Design Decisions
Why process all RN versions instead of just the latest?
Real-world distribution: According to npm stats, React Native adoption is spread across 5+ major versions:
- RN 0.71-0.73: 45% of apps
- RN 0.74-0.76: 35% of apps
- RN 0.77+: 15% of apps
- RN < 0.71: 5% of apps
Why not parallelize package processing within an environment?
npm install race conditions: Each RN environment has a single node_modules directory. Running npm install pkg1 and npm install pkg2 in parallel would cause:
- File system conflicts (both writing to node_modules/.package-lock.json)
- Dependency resolution conflicts (shared dependencies)
- Metro bundler cache corruption
Why batch size of 100 instead of larger?
Trade-off: Memory usage vs. database efficiency
- Larger batches (1000+): fewer DB calls, but requires holding 1000+ PackageHash objects in memory (~50MB)
- Smaller batches (10): more DB calls (~10x overhead), but minimal memory usage
A batch size of 100 gives:
- 90% of the efficiency gain of larger batches
- memory usage under 5MB per batch
- fast flush on crashes (only lose 100 packages max)
Monitoring & Observability
Progress Summary
pkg/pipeline/progress.go:85
Next Steps
Analyzer Architecture
How fingerprints are matched against the database
Database Schema
MongoDB collections and indexes