## Overview

Performance optimization in React Native ExecuTorch involves:

- Model quantization to reduce size and increase speed
- Backend delegation for hardware acceleration
- Runtime configuration tuning
- Application-level optimizations
## Model Quantization

Quantization reduces model precision from 32-bit floating point to lower-bit representations, significantly improving performance.

### XNNPACK Quantization

XNNPACK is the recommended CPU backend for both iOS and Android.

### Dynamic Quantization

Dynamic quantization quantizes weights ahead of time and activations at runtime, which suits models where static quantization is challenging.

### Per-Channel vs Per-Tensor

Per-channel quantization uses a separate scale per output channel and typically preserves accuracy better than a single per-tensor scale, at a small runtime cost.
### LLM Quantization Techniques

For Large Language Models, specialized quantization methods provide significant benefits.

#### SpinQuant (Recommended)

SpinQuant offers an excellent quality-to-size ratio:

- Base model: 3.3 GB
- SpinQuant: 1.9 GB (42% reduction)

#### QLoRA Quantization

QLoRA provides another quantization option.

#### Choosing Quantization for LLMs
| Method | Memory Usage | Quality | Best For |
|---|---|---|---|
| Base (no quant) | Highest | Best | Devices with 6GB+ RAM |
| SpinQuant | Medium | Excellent | Balanced performance/quality |
| QLoRA | Medium-Low | Good | Memory-constrained devices |
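The table above can be turned into a simple selection rule; a hedged sketch (the RAM thresholds and variant labels are illustrative assumptions, not library constants):

```typescript
// Hypothetical helper: pick an LLM variant from the table above based on
// the RAM you can realistically spend on the model. Thresholds are assumptions.
type LlmVariant = "base" | "spinquant" | "qlora";

function chooseLlmVariant(deviceRamGb: number): LlmVariant {
  if (deviceRamGb >= 6) return "base";      // highest quality, highest memory
  if (deviceRamGb >= 4) return "spinquant"; // balanced performance/quality
  return "qlora";                           // memory-constrained devices
}
```

In practice you would query device memory at startup and pick the model asset to download accordingly.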
## Backend Delegation

### XNNPACK Backend

XNNPACK provides optimized CPU inference:

- Highly optimized for ARM CPUs
- Excellent operator coverage
- Works on both iOS and Android
- Mature and stable
### Core ML Backend (iOS Only)

Core ML can use the GPU and the Apple Neural Engine (ANE) for acceleration:

- Can leverage GPU and Neural Engine
- Lower power consumption
- Better thermal characteristics

Limitations:

- iOS only
- Limited operator support vs XNNPACK
- May require fallback to CPU for some ops
### Choosing a Backend
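The trade-offs above reduce to a simple rule of thumb; a hedged sketch (the backend labels and the operator-coverage flag are illustrative, not library identifiers):

```typescript
// Sketch of a per-platform backend decision. "coreml" / "xnnpack" are
// labels for this example, not react-native-executorch identifiers.
type Backend = "coreml" | "xnnpack";

function chooseBackend(
  platform: "ios" | "android",
  fullCoreMlSupport: boolean, // whether the model's ops all run on GPU/ANE
): Backend {
  // Core ML only exists on iOS, and only pays off when the model avoids
  // falling back to the CPU for unsupported operators.
  if (platform === "ios" && fullCoreMlSupport) return "coreml";
  return "xnnpack"; // consistent cross-platform default
}
```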
## Runtime Optimization

### LLM Generation Configuration

Optimize text generation parameters to balance response quality against latency.

#### Temperature and Sampling
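A sketch of the kinds of knobs involved — the exact option names accepted by react-native-executorch may differ, so treat these keys as illustrative:

```typescript
// Illustrative generation settings; key names are assumptions, not the
// library's actual option names.
interface GenerationConfig {
  temperature: number;  // < 1.0 = more deterministic, > 1.0 = more random
  topP: number;         // nucleus sampling: keep the smallest token set with mass >= topP
  maxNewTokens: number; // hard cap on generated tokens; lower = faster responses
}

// Low temperature + a token cap favors fast, focused answers.
const fastFactualConfig: GenerationConfig = {
  temperature: 0.3,
  topP: 0.9,
  maxNewTokens: 256,
};
```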
### Context Management
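A minimal sliding-window sketch, assuming a simple `Message` shape and a rough characters-per-token estimate (neither is a library API):

```typescript
// Keep only the most recent conversation turns that fit in a token budget.
interface Message {
  role: "user" | "assistant" | "system";
  content: string;
}

function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4); // rough heuristic, not a real tokenizer
}

function trimHistory(messages: Message[], maxTokens: number): Message[] {
  const kept: Message[] = [];
  let used = 0;
  // Walk backwards so the newest turns are kept first.
  for (let i = messages.length - 1; i >= 0; i--) {
    const cost = estimateTokens(messages[i].content);
    if (used + cost > maxTokens) break;
    kept.unshift(messages[i]);
    used += cost;
  }
  return kept;
}
```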
Trimming or summarizing older conversation turns keeps both memory use and prompt-processing time bounded.

## Application-Level Optimizations
### Preload Models
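One way to hide load latency is to warm models up while the app is idle; a sketch with an injected loader (`loadModel` is an assumed stand-in, not a react-native-executorch API):

```typescript
// Warm up models sequentially during idle time. `Loader` stands in for
// whatever loading call your model hook/module exposes.
type Loader = (modelName: string) => Promise<void>;

async function preloadModels(modelNames: string[], loadModel: Loader): Promise<string[]> {
  const loaded: string[] = [];
  for (const name of modelNames) {
    // Sequential on purpose: parallel loads can spike memory on low-end devices.
    await loadModel(name);
    loaded.push(name);
  }
  return loaded;
}
```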
Loading models during app startup or idle time hides the load cost from the user.

### Cache Models Locally
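A download-once sketch with the file-system calls injected, so the same logic works with expo-file-system, react-native-fs, or similar (none of the names below are library APIs):

```typescript
// Download a model once, then reuse the local copy on subsequent runs.
interface FileOps {
  exists: (path: string) => Promise<boolean>;
  download: (url: string, path: string) => Promise<void>;
}

async function getCachedModel(url: string, cachePath: string, fs: FileOps): Promise<string> {
  if (await fs.exists(cachePath)) return cachePath; // reuse the cached file
  await fs.download(url, cachePath);                // first run only
  return cachePath;
}
```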
Downloading a model once and reusing the local copy avoids repeated network transfers.

### Batch Processing
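Full parallelism can exhaust memory while strict sequencing wastes time; a sketch that processes images in small concurrent batches (`processOne` is an assumed stand-in for your vision model call):

```typescript
// Process items in fixed-size concurrent batches: concurrency within a
// batch, a wait between batches to cap peak memory.
async function processInBatches<T, R>(
  items: T[],
  batchSize: number,
  processOne: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    results.push(...(await Promise.all(batch.map(processOne))));
  }
  return results;
}
```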
For computer vision tasks, process multiple images in controlled batches rather than all at once.

### Interrupt Long Operations

Long-running generation should be interruptible so the UI stays responsive.
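Some hooks expose an interrupt method; where yours does not, cooperative cancellation between steps works. A sketch (`nextToken` is an assumed stand-in for one generation step):

```typescript
// Cooperative cancellation: the UI flips a shared flag, and the generation
// loop checks it between tokens.
class CancelToken {
  private cancelled = false;
  cancel(): void { this.cancelled = true; }
  get isCancelled(): boolean { return this.cancelled; }
}

async function generateUntilCancelled(
  nextToken: () => Promise<string | null>,
  token: CancelToken,
): Promise<string> {
  let out = "";
  while (!token.isCancelled) {
    const t = await nextToken();
    if (t === null) break; // model finished naturally
    out += t;
  }
  return out;
}
```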
## Monitoring Performance
### Track Token Generation Speed
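A small tracker you can feed from a token callback; timestamps are injected to keep the math testable (in the app you would pass `Date.now()`):

```typescript
// Tracks tokens/second from the first observed token onward.
class TokenSpeedTracker {
  private startMs: number | null = null;
  private tokens = 0;

  onToken(nowMs: number): void {
    if (this.startMs === null) this.startMs = nowMs; // clock starts at first token
    this.tokens++;
  }

  tokensPerSecond(nowMs: number): number {
    if (this.startMs === null || nowMs <= this.startMs) return 0;
    return this.tokens / ((nowMs - this.startMs) / 1000);
  }
}
```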
### Monitor Download Progress
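Progress callbacks can fire on every chunk; throttling them avoids needless re-renders. A sketch (the 0..1 fraction input is an assumption about your downloader's callback):

```typescript
// Wrap a progress handler so it only fires when progress advances by at
// least `stepPct` percentage points.
function throttleProgress(
  onProgress: (pct: string) => void,
  stepPct: number,
): (fraction: number) => void {
  let lastReported = -stepPct; // guarantees the first report fires
  return (fraction: number) => {
    const pct = Math.round(Math.min(1, Math.max(0, fraction)) * 100);
    if (pct - lastReported >= stepPct) {
      lastReported = pct;
      onProgress(`${pct}%`);
    }
  };
}
```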
## Platform-Specific Optimizations

### iOS

Prefer the Core ML backend on iOS when operator coverage allows; otherwise fall back to XNNPACK.
### Android
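Emulator RAM can be raised from AVD settings or at launch; a config-fragment sketch (the AVD name is illustrative):

```sh
# Launch the emulator with 4 GB of RAM (the -memory value is in MB)
emulator -avd Pixel_7_API_34 -memory 4096
```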
Increase RAM allocation for emulators.

## Benchmarking Results

Based on measurements from the source repository:

### LLM Performance (iPhone 17 Pro)
| Model | Memory (GB) | Speed (est.) |
|---|---|---|
| LLAMA3_2_1B | 3.1 | Fast |
| LLAMA3_2_1B_SPINQUANT | 2.4 | Faster |
| LLAMA3_2_3B | 7.3 | Medium |
| LLAMA3_2_3B_SPINQUANT | 3.8 | Fast |
### Computer Vision (iPhone 17 Pro)
| Model | Memory (MB) | Backend |
|---|---|---|
| EFFICIENTNET_V2_S | 87 | Core ML |
| SSDLITE_320_MOBILENET_V3_LARGE | 132 | XNNPACK |
## Best Practices
- Always Quantize: Use quantization for production models
- Choose the Right Backend: XNNPACK for consistency, Core ML for iOS performance
- Limit Context: Use context strategies to manage memory
- Monitor Performance: Track metrics to identify bottlenecks
- Test on Real Devices: Emulators don’t reflect real-world performance
- Cache Models: Download once, use repeatedly
- Profile Your App: Use React Native DevTools to identify performance issues
## Next Steps
- Learn about Memory Management strategies
- Explore Debugging performance issues
- Read about Custom Models export optimization