
Context Limits

Context limits control how much information from linked documents is included when using Enhanced Actions (RAG). Choosing the right limit is crucial for balancing answer quality, performance, and cost.

Available Presets

Local GPT offers four context limit presets, each designed for different use cases:

Local models

10,000 characters. Optimized for local models with smaller context windows (e.g., 8K-32K tokens).

Cloud models

32,000 characters. Suitable for standard cloud models with medium context windows (32K-64K tokens).

Top: GPT, Claude, Gemini

100,000 characters. For advanced models with large context windows (100K+ tokens).

No limits (danger)

3,000,000 characters. Effectively unlimited. Use with extreme caution.

Configuration

Context limits are configured globally in the plugin settings:
1. Open Settings
   Navigate to Settings → Local GPT → Advanced settings
2. Locate RAG Context
   Find the Enhanced Actions section and look for RAG context
3. Select Preset
   Choose the appropriate preset for your primary AI model
The setting is stored in settings.defaults.contextLimit with values: "local", "cloud", "advanced", or "max"
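Sketched as a TypeScript shape, the stored setting looks roughly like this (the interface name and exact field layout are assumptions based on the field path above, not taken from the plugin source):

```typescript
// Hypothetical sketch of the relevant settings shape.
// Field names follow the path described above: settings.defaults.contextLimit.
type ContextLimitPreset = "local" | "cloud" | "advanced" | "max";

interface LocalGPTSettingsSketch {
  _version?: number;
  defaults: {
    contextLimit: ContextLimitPreset;
  };
}

const settings: LocalGPTSettingsSketch = {
  _version: 8,
  defaults: { contextLimit: "cloud" },
};
```

The union type mirrors the four preset values listed above; any other string would be rejected at compile time.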

Implementation Details

The context limit is resolved in src/main.ts:783-796:
```typescript
private resolveContextLimit(): number {
  const preset = this.settings?.defaults?.contextLimit as
    | "local"
    | "cloud"
    | "advanced"
    | "max";
  const map: Record<string, number> = {
    local: 10_000,
    cloud: 32_000,
    advanced: 100_000,
    max: 3_000_000,
  };
  return map[preset];
}
```
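Standalone, the mapping can be exercised like this (a sketch re-deriving the values from the snippet above, not the plugin's own code; the fallback for unknown presets is an added assumption):

```typescript
// Character budget per preset, mirroring the map in resolveContextLimit().
const CONTEXT_LIMITS: Record<string, number> = {
  local: 10_000,
  cloud: 32_000,
  advanced: 100_000,
  max: 3_000_000,
};

function resolveContextLimit(preset: string): number {
  // Fall back to the conservative "local" budget for unknown presets
  // (an assumption; the plugin reads the preset from its own settings).
  return CONTEXT_LIMITS[preset] ?? CONTEXT_LIMITS.local;
}

console.log(resolveContextLimit("cloud")); // 32000
```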
The limit is enforced during context formatting in src/rag.ts:368-390:
```typescript
for (const [basename, groupResults] of groups) {
  if (totalLength >= contextLimit) break;

  formattedResults += `[[${basename}]]\n`;
  // Add chunks until limit reached
}
```
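Filled out into a runnable sketch (the `groups` shape, newline accounting, and chunk contents are invented for illustration; the real function also formats headings and scores):

```typescript
// Append retrieved chunks per file until the character budget is spent.
// Chunks are appended whole, never split. Note that, as in the loop above,
// the outer check only guards the file header, so a header can still be
// emitted for a file whose chunks no longer fit.
function formatContext(
  groups: Map<string, string[]>,
  contextLimit: number,
): string {
  let formattedResults = "";
  let totalLength = 0;
  for (const [basename, groupResults] of groups) {
    if (totalLength >= contextLimit) break;
    formattedResults += `[[${basename}]]\n`;
    for (const chunk of groupResults) {
      // Skip any chunk that would push us past the limit.
      if (totalLength + chunk.length > contextLimit) break;
      formattedResults += chunk + "\n";
      totalLength += chunk.length + 1; // +1 for the trailing newline
    }
  }
  return formattedResults;
}

const groups = new Map([
  ["Note A", ["12345", "67890"]],
  ["Note B", ["abcde"]],
]);
const out = formatContext(groups, 10); // first chunk fits, the rest do not
```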

How to Choose the Right Preset

Local models

Best for:
  • Ollama models (Llama 3, Mistral, Gemma, etc.)
  • LM Studio
  • Other local inference servers
  • Models with 8K-32K token context windows
Why this limit:
  • Most local models have limited context windows
  • Prevents out-of-memory errors
  • Maintains fast inference speed
  • Focuses on only the most relevant chunks
Approximate token count: ~2,500-3,000 tokens (assuming 4 chars/token)
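The token estimates on this page all follow the same rule of thumb, which can be written down directly:

```typescript
// Rough token estimate using the ~4 characters per token heuristic
// stated above. Real tokenizers vary by model and language.
const estimateTokens = (chars: number): number => Math.round(chars / 4);

console.log(estimateTokens(10_000));  // 2500: "Local models" preset
console.log(estimateTokens(32_000));  // 8000: "Cloud models" preset
console.log(estimateTokens(100_000)); // 25000: "Top" preset
```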
Cloud models

Best for:
  • GPT-3.5-turbo
  • Claude 3 Haiku
  • Gemini 1.5 Flash
  • Standard API models
Why this limit:
  • Balances context richness with cost
  • Fits comfortably in most cloud model windows
  • Good performance/quality trade-off
Approximate token count: ~8,000-10,000 tokens
Cost consideration: At this limit, each request uses a moderate number of tokens, keeping costs reasonable for pay-per-token services.
Top: GPT, Claude, Gemini

Best for:
  • GPT-4 Turbo (128K context)
  • Claude 3 Opus/Sonnet (200K context)
  • Gemini 1.5 Pro (1M+ context)
  • Specialized long-context models
Why this limit:
  • Leverages extended context capabilities
  • Provides rich, comprehensive context
  • Enables complex reasoning across many documents
Approximate token count: ~25,000-30,000 tokens
This setting significantly increases:
  • API costs (3x+ compared to “Cloud models”)
  • Request latency
  • Token usage
Only use with models that have proven long-context performance.
No limits (danger)

Best for:
  • Extreme edge cases
  • Testing and development
  • Models with multi-million token contexts
Why this limit:
  • Essentially removes the limit
  • Includes all retrieved chunks
Danger Zone: This setting can:
  • Cause requests to fail due to token limits
  • Result in extremely high API costs
  • Slow down or crash local models
  • Provide too much context, reducing quality
Not recommended for production use.
Approximate token count: Up to 750,000+ tokens

Impact on Performance

Quality

Symptoms of a limit that is too low:
  • AI lacks necessary context
  • Answers are generic or incomplete
  • Important linked information is missed
Solution: Increase to the next preset level

Speed

Processing time increases with context size:
```
Local (10K):     Fast    ████░░░░░░ 40%
Cloud (32K):     Medium  ██████░░░░ 60%
Advanced (100K): Slow    █████████░ 90%
Max (3M):        Slowest ██████████ 100%
```
Processing time includes:
  • Document extraction
  • Chunking
  • Embedding generation
  • Vector search
  • API request time

Cost

For paid APIs, token usage directly impacts cost:
| Preset          | Est. Tokens | Cost Multiplier |
| --------------- | ----------- | --------------- |
| Local (10K)     | ~2.5K       | 1x (baseline)   |
| Cloud (32K)     | ~8K         | ~3x             |
| Advanced (100K) | ~25K        | ~10x            |
| Max (3M)        | Variable    | Up to 300x+     |
These multipliers apply to input tokens only. Output tokens depend on response length and are not affected by context size.
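Under the same 4 chars/token heuristic, these multipliers fall out of simple division against the Local baseline (a sketch; actual spend depends on your provider's per-token pricing):

```typescript
// Input-token cost relative to the Local preset baseline (~2.5K tokens).
const BASELINE_TOKENS = 2_500;

function costMultiplier(estTokens: number): number {
  return estTokens / BASELINE_TOKENS;
}

console.log(costMultiplier(8_000));  // 3.2 -> the "~3x" for Cloud
console.log(costMultiplier(25_000)); // 10  -> the "~10x" for Advanced
```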

Monitoring Context Usage

In development mode, Local GPT logs context statistics:
```
Passed contextLimit for context: 100000
...
📊 Total length of context: 87345
```
To enable development mode, set NODE_ENV=development in your build.

Advanced: When Context is Truncated

When the context limit is reached:
  1. Graceful Truncation: The system stops adding chunks mid-file if needed
  2. No Partial Chunks: Individual chunks are never split
  3. Highest Scores First: Within each file group, highest-scoring chunks are prioritized
  4. File Order Preserved: Newer files (by creation time) are processed first
Example scenario: With a 32K limit and 5 linked files:
  • File A (created today): 15K characters, all included
  • File B (yesterday): 12K characters, all included
  • File C (last week): 8K characters, 5K included, 3K truncated
  • File D & E: Excluded entirely
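The scenario above can be replayed as a sketch (sizes for files A-C come from the example, sizes for D and E are invented; character-level truncation here glosses over chunk boundaries, which the real system keeps whole):

```typescript
// Files in processing order (newest first), sizes in characters.
const files = [
  { name: "A", size: 15_000 },
  { name: "B", size: 12_000 },
  { name: "C", size: 8_000 },
  { name: "D", size: 6_000 }, // size assumed; excluded either way
  { name: "E", size: 4_000 }, // size assumed; excluded either way
];
const contextLimit = 32_000;

let remaining = contextLimit;
const included = files.map((f) => {
  const take = Math.min(remaining, f.size);
  remaining -= take;
  return { name: f.name, included: take };
});

console.log(included);
// A: 15000, B: 12000, C: 5000 (3000 truncated), D: 0, E: 0
```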

Migration from Previous Versions

If you upgraded from an earlier version, your context limit is set to "local" (10K) by default. This migration happens in src/main.ts:1102-1112:
```typescript
private migrateToVersion8(settings: LocalGPTSettings): boolean {
  if (settings._version && settings._version >= 8) {
    return false;
  }

  (settings as any).defaults = (settings as any).defaults || {};
  (settings as any).defaults.contextLimit =
    (settings as any).defaults.contextLimit || "local";

  settings._version = 8;
  return true;
}
```
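The migration can be exercised in isolation (a standalone sketch with the settings interface reduced to the fields used here):

```typescript
interface LocalGPTSettings {
  _version?: number;
  defaults?: { contextLimit?: string };
}

// Mirrors migrateToVersion8 above: stamp version 8 and default the
// contextLimit preset to "local" if it is not already set.
function migrateToVersion8(settings: LocalGPTSettings): boolean {
  if (settings._version && settings._version >= 8) {
    return false; // already migrated
  }
  settings.defaults = settings.defaults || {};
  settings.defaults.contextLimit = settings.defaults.contextLimit || "local";
  settings._version = 8;
  return true;
}

const settings: LocalGPTSettings = {};
console.log(migrateToVersion8(settings));     // true: migration ran
console.log(settings.defaults?.contextLimit); // "local"
console.log(migrateToVersion8(settings));     // false: already at version 8
```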
Review your context limit after upgrading to ensure it matches your AI provider’s capabilities.

Best Practices

Start Conservative

Begin with “Local models” or “Cloud models” and increase only if needed.

Match Your Model

Choose the preset that matches your AI model’s context window.

Monitor Quality

If answers lack context, increase the limit. If they’re unfocused, decrease it.

Consider Cost

Higher limits = more tokens = higher costs for paid APIs.

Next Steps

RAG System

Learn how the RAG system processes and ranks context

Troubleshooting

Fix issues with context limits and embedding
