Core Design Patterns

1. Singleton Services

All core services (llmService, activeModelService, generationService, imageGenerationService) are singleton instances exported directly from their modules. Why: Prevents duplicate model loading, concurrent inference conflicts, memory leaks from orphaned contexts, and state desynchronization.
class ActiveModelService {
  private loadedTextModelId: string | null = null;
  private textLoadPromise: Promise<void> | null = null;
  private loadingState = { text: false };

  async loadTextModel(modelId: string) {
    // Guard against concurrent loads: wait for any in-flight load first
    if (this.textLoadPromise) {
      await this.textLoadPromise;
      if (this.loadedTextModelId === modelId) return;
    }

    // Only one load at a time
    this.loadingState.text = true;
    this.textLoadPromise = doLoadTextModel(...);
    try {
      await this.textLoadPromise;
      this.loadedTextModelId = modelId;
    } finally {
      this.textLoadPromise = null;
      this.loadingState.text = false;
    }
  }
}

// Single instance exported
export const activeModelService = new ActiveModelService();
Benefits:
  • Thread safety: Promise deduplication ensures only one load operation at a time
  • Consistency: All callers see the same loaded model state
  • Resource safety: Only one native context exists at a time
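The deduplication guarantee can be exercised in isolation. The sketch below is a stripped-down stand-in (the `TinyModelService` class, `doLoad` stub, and timings are illustrative, not the actual service): two concurrent callers request the same model, but only one load operation runs.

```typescript
// Minimal sketch of promise deduplication; the load function is a stub.
let loadCount = 0;

class TinyModelService {
  private loadedId: string | null = null;
  private loadPromise: Promise<void> | null = null;

  async load(id: string): Promise<void> {
    // If a load is already in flight, wait for it instead of starting another.
    if (this.loadPromise) {
      await this.loadPromise;
      if (this.loadedId === id) return;
    }
    this.loadPromise = this.doLoad(id);
    try {
      await this.loadPromise;
      this.loadedId = id;
    } finally {
      this.loadPromise = null;
    }
  }

  private async doLoad(_id: string): Promise<void> {
    loadCount++; // counts real load operations
    await new Promise(res => setTimeout(res, 10)); // simulate native model load
  }
}

const tiny = new TinyModelService();

async function demo(): Promise<number> {
  // Two concurrent callers request the same model...
  await Promise.all([tiny.load('qwen'), tiny.load('qwen')]);
  return loadCount; // ...but only one load actually ran
}
```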

2. Background-Safe Orchestration

generationService and imageGenerationService maintain state independently of React component lifecycle. Generation continues even when the user navigates away. Implementation: Listener pattern with immediate state delivery on subscription.
class GenerationService {
  private state: GenerationState = { isGenerating: false, ... };
  private listeners: Set<GenerationListener> = new Set();

  subscribe(listener: GenerationListener): () => void {
    this.listeners.add(listener);
    listener(this.getState()); // Immediate state on subscribe
    return () => this.listeners.delete(listener);
  }

  private notifyListeners(): void {
    const state = this.getState();
    this.listeners.forEach(listener => listener(state));
  }
}
Key Points:
  • Services hold state in private fields (not React state)
  • Listeners are removed via the cleanup function returned by subscribe
  • Subscribers receive current state immediately on mount
  • No memory leaks — cleanup functions remove listeners on unmount
Flow Example:
  1. User starts generation in ChatScreen
  2. ChatScreen subscribes to generationService
  3. User navigates to HomeScreen → ChatScreen unmounts, unsubscribes
  4. Generation continues, service maintains state
  5. HomeScreen mounts, subscribes, immediately receives current state (progress 15/20)
  6. User navigates back to ChatScreen
  7. ChatScreen re-subscribes, receives current state (progress 18/20)
  8. Generation completes, all subscribers notified
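The subscribe-with-immediate-delivery contract behind this flow can be demonstrated with a stripped-down service (the state shape and method names below are simplified for illustration):

```typescript
// Stripped-down listener pattern; state shape is simplified for illustration.
type State = { isGenerating: boolean; progress: number };
type Listener = (s: State) => void;

class TinyGenerationService {
  private state: State = { isGenerating: false, progress: 0 };
  private listeners = new Set<Listener>();

  subscribe(listener: Listener): () => void {
    this.listeners.add(listener);
    listener(this.getState()); // immediate delivery on subscribe
    return () => this.listeners.delete(listener); // cleanup removes listener
  }

  setProgress(progress: number): void {
    this.state = { ...this.state, isGenerating: true, progress };
    this.listeners.forEach(l => l(this.getState()));
  }

  private getState(): State {
    return { ...this.state }; // copy, so listeners can't mutate internals
  }
}

const svc = new TinyGenerationService();
svc.setProgress(15); // generation advances while no screen is mounted

const seen: number[] = [];
const unsubscribe = svc.subscribe(s => seen.push(s.progress)); // receives 15 immediately
svc.setProgress(18); // live update
unsubscribe();       // screen unmounts
svc.setProgress(20); // not delivered after cleanup
// seen is now [15, 18]
```

A late subscriber misses nothing important: it always starts from the current state, exactly as in the HomeScreen step of the flow above.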

3. Memory-First Loading Strategy

All model loads check available RAM before proceeding. This prevents OOM crashes by blocking loads that would exceed safe limits.
RAM Budget: 60% of total device RAM
Estimation Multipliers:
  • Text models: fileSize × 1.5 (KV cache + activations)
  • Vision models: (modelFileSize + mmProjSize) × 1.5
  • Image models: fileSize × 1.8 (MNN/QNN runtime overhead)
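A minimal sketch of these multipliers (the function name, model shape, and field names here are assumptions, not the actual implementation):

```typescript
// Hypothetical sketch of the estimation multipliers above; field names are assumed.
const BYTES_PER_GB = 1024 ** 3;

interface ModelInfo {
  fileSize: number;        // bytes
  mmProjFileSize?: number; // bytes, vision models only
}

type ModelType = 'text' | 'vision' | 'image';

function estimateModelMemoryGB(model: ModelInfo, type: ModelType): number {
  switch (type) {
    case 'text':
      return (model.fileSize / BYTES_PER_GB) * 1.5; // KV cache + activations
    case 'vision':
      // Main GGUF and mmproj both count toward the estimate
      return ((model.fileSize + (model.mmProjFileSize ?? 0)) / BYTES_PER_GB) * 1.5;
    case 'image':
      return (model.fileSize / BYTES_PER_GB) * 1.8; // MNN/QNN runtime overhead
  }
}

// A 2GB text model is budgeted at ~3GB
const est = estimateModelMemoryGB({ fileSize: 2 * BYTES_PER_GB }, 'text');
```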
export async function checkMemoryForModel(
  params: CheckMemoryParams
): Promise<MemoryCheckResult> {
  const { modelId, modelType, ids, lists } = params;
  const model = findModel(modelId, modelType, lists);

  // Estimate required RAM
  const estimatedRAM = estimateModelMemory(model, modelType);

  // Get device RAM budget (60% of total)
  const deviceInfo = await hardware.getDeviceInfo();
  const budget = (deviceInfo.totalMemory / (1024**3)) * 0.60;

  // Check against currently loaded models
  const currentlyLoaded = getCurrentlyLoadedMemoryGB(ids, lists);
  const totalRequired = estimatedRAM + currentlyLoaded;

  if (totalRequired > budget) {
    return {
      canLoad: false,
      severity: 'critical',
      message: `Cannot load ${model.name} (~${estimatedRAM.toFixed(1)}GB required) - would exceed device safe limit of ${budget.toFixed(1)}GB.`
    };
  }

  return { canLoad: true, severity: 'ok', message: '' };
}
User-Friendly Messages:
⚠️  Warning: Loading Qwen3-7B-Q4_K_M will use ~5.2GB of 6.0GB budget (87%)
❌  Cannot load Qwen3-7B-Q4_K_M (~5.5GB required) - would exceed device safe limit of 4.8GB. Unload current model or choose smaller.
RAM-Aware Runtime Safeguards: On low-RAM devices (≤4GB), llama.cpp can call abort() during Metal/OpenCL allocation, killing the app before JavaScript catches the error. To prevent this:
private async initWithAutoContext(
  params: { baseParams: object; ctxLen: number; nGpuLayers: number }
): Promise<{ context: LlamaContext; gpuAttemptFailed: boolean; actualLength: number }> {
  const deviceInfo = await hardwareService.getDeviceInfo();

  // Cap GPU layers based on device RAM
  const safeGpuLayers = getGpuLayersForDevice(
    deviceInfo.totalMemory,
    params.nGpuLayers
  );

  if (safeGpuLayers !== params.nGpuLayers) {
    logger.log(
      `[LLM] Low RAM (${(deviceInfo.totalMemory / BYTES_PER_GB).toFixed(1)}GB), GPU layers ${params.nGpuLayers} → ${safeGpuLayers}`
    );
  }

  // Load with safe GPU layer count
  const initial = await initContextWithFallback(
    params.baseParams,
    params.ctxLen,
    safeGpuLayers
  );

  return initial;
}
Device RAM | GPU Layers   | Context Cap | CLIP GPU
≤4GB       | 0 (CPU-only) | 2048        | Off
4-6GB      | Requested    | 2048        | On
6-8GB      | Requested    | 4096        | On
>8GB       | Requested    | 8192        | On
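The RAM tiers above could be encoded as follows. This is a sketch of the idea behind getGpuLayersForDevice, not its actual source; the exact boundary handling and the combined return shape are assumptions:

```typescript
// Sketch of the RAM-tier table; boundary handling is an assumption.
const BYTES_PER_GB = 1024 ** 3;

interface DeviceCaps {
  gpuLayers: number;  // capped GPU layer count
  contextCap: number; // max context length
  clipGpu: boolean;   // CLIP vision encoder on GPU
}

function capsForDevice(totalMemoryBytes: number, requestedGpuLayers: number): DeviceCaps {
  const ramGB = totalMemoryBytes / BYTES_PER_GB;
  if (ramGB <= 4) return { gpuLayers: 0, contextCap: 2048, clipGpu: false }; // CPU-only
  if (ramGB <= 6) return { gpuLayers: requestedGpuLayers, contextCap: 2048, clipGpu: true };
  if (ramGB <= 8) return { gpuLayers: requestedGpuLayers, contextCap: 4096, clipGpu: true };
  return { gpuLayers: requestedGpuLayers, contextCap: 8192, clipGpu: true };
}

const lowEnd = capsForDevice(4 * BYTES_PER_GB, 99);   // CPU-only tier
const highEnd = capsForDevice(12 * BYTES_PER_GB, 99); // full-GPU tier
```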

4. Combined Asset Tracking

Vision models track both main GGUF and mmproj files as a single logical unit. Why: mmproj files (100-700MB) are downloaded separately but required for vision inference. Memory estimates must include both.
interface DownloadedModel {
  id: string;
  filePath: string;
  fileSize: number;

  // Vision-specific
  mmProjPath?: string;
  mmProjFileSize?: number;
  isVisionModel: boolean;
}
Download Flow:
  1. User selects vision model (e.g., SmolVLM-500M)
  2. System downloads main GGUF file
  3. System detects vision capability, downloads mmproj automatically
  4. Both files linked in store: mmProjPath, mmProjFileSize
  5. On load, both passed to llmService
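Because both files count toward disk and memory accounting, any size calculation has to sum them. A minimal helper (the name totalAssetSize and the sample model are hypothetical) might look like:

```typescript
// Hypothetical helper: total on-disk footprint of a (possibly vision) model.
interface DownloadedModel {
  id: string;
  filePath: string;
  fileSize: number;
  mmProjPath?: string;
  mmProjFileSize?: number;
  isVisionModel: boolean;
}

function totalAssetSize(model: DownloadedModel): number {
  // A vision model is only usable with its mmproj companion file,
  // so the companion's size is always included when present.
  return model.fileSize + (model.isVisionModel ? model.mmProjFileSize ?? 0 : 0);
}

// Illustrative sizes, not real figures for SmolVLM-500M
const smolvlm: DownloadedModel = {
  id: 'smolvlm-500m',
  filePath: '/models/smolvlm.gguf',
  fileSize: 500_000_000,
  mmProjPath: '/models/smolvlm-mmproj.gguf',
  mmProjFileSize: 200_000_000,
  isVisionModel: true,
};
const total = totalAssetSize(smolvlm); // both files tracked as one logical unit
```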

5. State Cleanup Patterns

After prompt enhancement (which uses llmService), explicit cleanup ensures text generation doesn’t hang. Why: Prompt enhancement runs a separate LLM generation to expand simple prompts (“a dog” → detailed 75-word description). Without cleanup, the LLM service remains in “generating” state, blocking subsequent text generation.
private async _resetLlmAfterEnhancement(): Promise<void> {
  logger.log('[ImageGen] 🔄 Starting cleanup - generating:', llmService.isCurrentlyGenerating());
  try {
    // Stop generation flag, clear abort state
    await llmService.stopGeneration();
    logger.log('[ImageGen] ✓ stopGeneration() called');

    // NOTE: KV cache NOT cleared to preserve vision inference speed
    // Vision inference can be 30-60s slower if KV cache cleared after every enhancement

    logger.log('[ImageGen] ✅ LLM service reset complete');
  } catch (resetError) {
    logger.error('[ImageGen] ❌ Failed to reset LLM service:', resetError);
  }
}
Key Decision: KV cache is NOT cleared after enhancement. Clearing the cache would invalidate all cached tokens, forcing full re-computation on the next vision inference. For large vision models, this adds 30-60 seconds to inference time. Trade-off:
  • ✅ Vision inference stays fast (cached tokens reused)
  • ⚠️ KV cache grows over time (cleared manually via settings or on model unload)

Token Batching (Performance Optimization)

Streaming token generation produces 10-50 tokens/second. Updating React state on every token causes excessive renders and frame drops. Solution: Batch tokens and flush to UI at a controlled rate (50ms intervals = ~20 updates/second).
class GenerationService {
  private tokenBuffer: string = '';
  private flushTimer: ReturnType<typeof setTimeout> | null = null;
  private static readonly FLUSH_INTERVAL_MS = 50; // ~20 updates/sec

  private appendToken(token: string): void {
    this.tokenBuffer += token;
    // Schedule a flush if one isn't already pending
    if (!this.flushTimer) {
      this.flushTimer = setTimeout(
        () => this.flushTokenBuffer(),
        GenerationService.FLUSH_INTERVAL_MS
      );
    }
  }

  private flushTokenBuffer(): void {
    if (this.tokenBuffer) {
      useChatStore.getState().appendToStreamingMessage(this.tokenBuffer);
      this.tokenBuffer = '';
    }
    this.flushTimer = null;
  }

  private forceFlushTokens(): void {
    if (this.flushTimer) {
      clearTimeout(this.flushTimer);
      this.flushTimer = null;
    }
    this.flushTokenBuffer();
  }
}
Before:
  • 30 tok/s × 60 s = 1800 unnecessary renders/minute
  • Janky scrolling, high CPU usage
After:
  • 20 UI updates/second regardless of generation speed
  • Smooth scrolling, low overhead
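The before/after figures follow directly from the token rate and flush interval given in this section; a quick arithmetic check:

```typescript
// Sanity check of the render-rate figures: per-token updates vs. 50ms batching.
const TOKENS_PER_SECOND = 30; // typical streaming rate from this section
const DURATION_S = 60;        // one minute of generation
const FLUSH_INTERVAL_MS = 50; // batching interval (~20 updates/sec)

const perTokenUpdates = TOKENS_PER_SECOND * DURATION_S;         // 1800 renders/minute
const batchedUpdates = (DURATION_S * 1000) / FLUSH_INTERVAL_MS; // 1200 renders/minute
```

Batching caps UI work at a fixed rate, so a faster model (say 50 tok/s) still produces the same 20 updates per second.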

Message Queue (Non-Blocking Input)

Users can send messages while the LLM is still generating. Messages are queued and processed automatically after the current generation completes.
enqueueMessage(entry: QueuedMessage): void {
  this.state = {
    ...this.state,
    queuedMessages: [...this.state.queuedMessages, entry]
  };
  this.notifyListeners();
}

private processNextInQueue(): void {
  if (this.state.queuedMessages.length === 0 || !this.queueProcessor) return;

  // Aggregate all queued messages
  const all = this.state.queuedMessages;
  this.state = { ...this.state, queuedMessages: [] };
  this.notifyListeners();

  const combined: QueuedMessage = all.length === 1 ? all[0] : {
    id: all[0].id,
    conversationId: all[0].conversationId,
    text: all.map(m => m.text).join('\n\n'),
    attachments: all.flatMap(m => m.attachments || []),
    messageText: all.map(m => m.messageText).join('\n\n'),
  };

  this.queueProcessor(combined).catch(e => {
    logger.error('[GenerationService] Queue processor error:', e);
  });
}
Flow:
  1. User sends “What is AI?” → Generation starts
  2. User sends “Explain neural networks” → Added to queue (count: 1)
  3. User sends “Give me an example” → Added to queue (count: 2)
  4. First generation completes
  5. Queue processor aggregates messages:
    Explain neural networks
    
    Give me an example
    
  6. Combined message sent to LLM
UI Indicators:
  • Send button stays active during generation
  • Stop button visible alongside send button
  • Queue count badge shows “2 queued messages”
  • Tap badge to clear queue
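The aggregation step can be checked in isolation. Here QueuedMessage is reduced to the fields used by the join logic (the real interface has more fields, as shown in processNextInQueue above):

```typescript
// Reduced QueuedMessage for illustration; the real shape has more fields.
interface QueuedMessage {
  id: string;
  text: string;
  attachments?: string[];
}

function combineQueued(all: QueuedMessage[]): QueuedMessage {
  if (all.length === 1) return all[0];
  return {
    id: all[0].id, // combined message reuses the first message's id
    text: all.map(m => m.text).join('\n\n'),
    attachments: all.flatMap(m => m.attachments ?? []),
  };
}

const combined = combineQueued([
  { id: 'q1', text: 'Explain neural networks' },
  { id: 'q2', text: 'Give me an example' },
]);
// combined.text === 'Explain neural networks\n\nGive me an example'
```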
