# Core Design Patterns
## 1. Singleton Services
All core services (llmService, activeModelService, generationService, imageGenerationService) are singleton instances exported directly from their modules.
Why: Prevents duplicate model loading, concurrent inference conflicts, memory leaks from orphaned contexts, and state desynchronization.
- Thread safety: Promise deduplication ensures only one load operation at a time
- Consistency: All callers see the same loaded model state
- Resource safety: Only one native context exists at a time
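A minimal sketch of the promise-deduplication idea described above. The class and method names are illustrative, not the app's actual API; the point is that concurrent callers share one in-flight load instead of starting duplicates:

```typescript
// Hypothetical sketch of a singleton service with promise deduplication.
class LlmService {
  private loadPromise: Promise<string> | null = null;
  private loadedModel: string | null = null;
  private loadCount = 0; // counts real loads, to show deduplication

  // Concurrent callers await the same in-flight promise.
  async loadModel(path: string): Promise<string> {
    if (this.loadedModel === path) return this.loadedModel;
    if (!this.loadPromise) {
      this.loadPromise = this.doLoad(path).finally(() => {
        this.loadPromise = null;
      });
    }
    return this.loadPromise;
  }

  private async doLoad(path: string): Promise<string> {
    this.loadCount++;
    await new Promise((r) => setTimeout(r, 10)); // stand-in for native model load
    this.loadedModel = path;
    return path;
  }

  get loads(): number {
    return this.loadCount;
  }
}

// The real module would export this instance as the singleton.
const llmService = new LlmService();
```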
## 2. Background-Safe Orchestration
generationService and imageGenerationService maintain state independently of React component lifecycle. Generation continues even when the user navigates away.
Implementation: Listener pattern with immediate state delivery on subscription.
- Services hold state in private fields (not React state)
- Listeners are weakly held via cleanup functions
- Subscribers receive current state immediately on mount
- No memory leaks — cleanup functions remove listeners on unmount
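The listener pattern with immediate state delivery can be sketched as follows. Type and field names are assumptions for illustration, not the app's real shapes:

```typescript
// Sketch of a background-safe service: state lives in private fields,
// subscribers get the current state immediately, and the returned cleanup
// function removes the listener.
type GenerationState = { status: "idle" | "generating"; progress: number };
type Listener = (state: GenerationState) => void;

class GenerationService {
  private state: GenerationState = { status: "idle", progress: 0 };
  private listeners = new Set<Listener>();

  subscribe(listener: Listener): () => void {
    this.listeners.add(listener);
    listener(this.state); // immediate delivery on subscription
    return () => this.listeners.delete(listener); // cleanup prevents leaks
  }

  setState(next: GenerationState): void {
    this.state = next;
    this.listeners.forEach((l) => l(this.state));
  }
}

const generationService = new GenerationService();
```

A screen's `useEffect` would call `subscribe` on mount and invoke the returned cleanup on unmount; generation state survives either way because it lives in the service, not in React.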
Example lifecycle:
- User starts generation in ChatScreen
- ChatScreen subscribes to generationService
- User navigates to HomeScreen → ChatScreen unmounts, unsubscribes
- Generation continues, service maintains state
- HomeScreen mounts, subscribes, immediately receives current state (progress 15/20)
- User navigates back to ChatScreen
- ChatScreen re-subscribes, receives current state (progress 18/20)
- Generation completes, all subscribers notified
## 3. Memory-First Loading Strategy
All model loads check available RAM before proceeding, preventing OOM crashes by blocking loads that would exceed safe limits.

RAM Budget: 60% of device RAM

Estimation Multipliers:
- Text models: fileSize × 1.5 (KV cache + activations)
- Vision models: (modelFileSize + mmProjSize) × 1.5
- Image models: fileSize × 1.8 (MNN/QNN runtime overhead)
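The budget check described above can be sketched as below. The multipliers and the 60% budget come from the text; the function names and how device RAM is obtained are assumptions:

```typescript
// Sketch of the memory-first load check using the stated multipliers.
const RAM_BUDGET_FRACTION = 0.6; // 60% of device RAM

const MULTIPLIER = { text: 1.5, vision: 1.5, image: 1.8 } as const;
type ModelKind = keyof typeof MULTIPLIER;

function estimateLoadBytes(kind: ModelKind, fileSizeBytes: number, mmProjBytes = 0): number {
  // Vision models count both the main GGUF and the mmproj file.
  const base = kind === "vision" ? fileSizeBytes + mmProjBytes : fileSizeBytes;
  return base * MULTIPLIER[kind];
}

function canLoad(kind: ModelKind, fileSizeBytes: number, deviceRamBytes: number, mmProjBytes = 0): boolean {
  return estimateLoadBytes(kind, fileSizeBytes, mmProjBytes) <= deviceRamBytes * RAM_BUDGET_FRACTION;
}
```

For example, a 3GB image model estimates to 5.4GB, which exceeds the 4.8GB budget on an 8GB device, so the load would be blocked.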
Oversized GPU allocations can trigger a native abort() during Metal/OpenCL allocation, killing the app before JavaScript catches the error. To prevent this:
| Device RAM | GPU Layers | Context Cap | CLIP GPU |
|---|---|---|---|
| ≤4GB | 0 (CPU-only) | 2048 | Off |
| 4-6GB | Requested | 2048 | On |
| 6-8GB | Requested | 4096 | On |
| >8GB | Requested | 8192 | On |
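The table above can be expressed as a tier-selection function. The tier boundaries and caps come directly from the table; the function and type names are illustrative:

```typescript
// Maps device RAM to the GPU safety caps from the table above.
type GpuConfig = {
  gpuLayers: "cpu-only" | "requested";
  contextCap: number;
  clipOnGpu: boolean;
};

function gpuConfigFor(deviceRamGb: number): GpuConfig {
  if (deviceRamGb <= 4) return { gpuLayers: "cpu-only", contextCap: 2048, clipOnGpu: false };
  if (deviceRamGb <= 6) return { gpuLayers: "requested", contextCap: 2048, clipOnGpu: true };
  if (deviceRamGb <= 8) return { gpuLayers: "requested", contextCap: 4096, clipOnGpu: true };
  return { gpuLayers: "requested", contextCap: 8192, clipOnGpu: true };
}
```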
## 4. Combined Asset Tracking
Vision models track both the main GGUF and the mmproj file as a single logical unit.

Why: mmproj files (100-700MB) are downloaded separately but required for vision inference. Memory estimates must include both.

- User selects vision model (e.g., SmolVLM-500M)
- System downloads main GGUF file
- System detects vision capability, downloads mmproj automatically
- Both files linked in store: mmProjPath, mmProjFileSize
- On load, both passed to llmService
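An illustrative shape for such a store entry is below. The `mmProjPath` and `mmProjFileSize` field names come from the text; the rest of the interface is assumed:

```typescript
// Hypothetical store entry linking a vision model's two assets.
interface ModelEntry {
  path: string;          // main GGUF file
  fileSize: number;      // bytes
  mmProjPath?: string;   // set once the mmproj download completes
  mmProjFileSize?: number;
}

// Memory estimates must cover both files (× 1.5 per the vision multiplier).
function visionLoadEstimate(entry: ModelEntry): number {
  return (entry.fileSize + (entry.mmProjFileSize ?? 0)) * 1.5;
}
```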
## 5. State Cleanup Patterns
After prompt enhancement (which uses llmService), explicit cleanup ensures text generation doesn’t hang.
Why: Prompt enhancement runs a separate LLM generation to expand simple prompts (“a dog” → detailed 75-word description). Without cleanup, the LLM service remains in “generating” state, blocking subsequent text generation.
- ✅ Vision inference stays fast (cached tokens reused)
- ⚠️ KV cache grows over time (cleared manually via settings or on model unload)
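A sketch of the cleanup discipline described above, using a stubbed service. The stub and function names are hypothetical; the point is that a `finally` block always resets the generating state, even if enhancement throws:

```typescript
// Stub standing in for the LLM service's generating flag.
class LlmServiceStub {
  generating = false;

  async generate(prompt: string): Promise<string> {
    this.generating = true;
    return "enhanced: " + prompt; // stand-in for the expanded prompt
  }

  reset(): void {
    this.generating = false;
  }
}

// Explicit cleanup: the service is always left ready for chat generation.
async function enhancePrompt(svc: LlmServiceStub, prompt: string): Promise<string> {
  try {
    return await svc.generate(prompt);
  } finally {
    svc.reset(); // without this, the service stays "generating" and blocks text gen
  }
}
```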
## Token Batching (Performance Optimization)
Streaming token generation produces 10-50 tokens/second. Updating React state on every token causes excessive renders and frame drops.

Solution: Batch tokens and flush to the UI at a controlled rate (50ms intervals ≈ 20 updates/second).

Without batching:
- 30 tok/s × 60 s = 1,800 renders/minute, nearly all unnecessary
- Janky scrolling, high CPU usage

With batching:
- 20 UI updates/second regardless of generation speed
- Smooth scrolling, low overhead
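A minimal sketch of the batching approach, with illustrative names: tokens accumulate in a buffer and a timer flushes the joined chunk to the UI callback every 50ms.

```typescript
// Buffers streamed tokens and flushes them at a fixed interval.
class TokenBatcher {
  private buffer: string[] = [];
  private timer: ReturnType<typeof setInterval> | null = null;

  constructor(
    private onFlush: (chunk: string) => void, // e.g. a setState call
    private intervalMs = 50,                  // ~20 UI updates/second
  ) {}

  push(token: string): void {
    this.buffer.push(token);
    if (!this.timer) {
      this.timer = setInterval(() => this.flush(), this.intervalMs);
    }
  }

  flush(): void {
    if (this.buffer.length) {
      this.onFlush(this.buffer.join(""));
      this.buffer = [];
    }
  }

  // Stop on generation end: clear the timer and flush any remaining tokens.
  stop(): void {
    if (this.timer) clearInterval(this.timer);
    this.timer = null;
    this.flush();
  }
}
```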
## Message Queue (Non-Blocking Input)
Users can send messages while the LLM is still generating. Messages are queued and processed automatically after the current generation completes.

- User sends “What is AI?” → Generation starts
- User sends “Explain neural networks” → Added to queue (count: 1)
- User sends “Give me an example” → Added to queue (count: 2)
- First generation completes
- Queue processor aggregates the queued messages into a single combined prompt
- Combined message sent to LLM
- Send button stays active during generation
- Stop button visible alongside send button
- Queue count badge shows “2 queued messages”
- Tap badge to clear queue