
Context Limits

Context limits control how much information from linked documents is included when using Enhanced Actions (RAG). Choosing the right limit is crucial for balancing answer quality, performance, and cost.

Available Presets

Local GPT offers four context limit presets, each designed for different use cases:

Local models

10,000 characters. Optimized for local models with smaller context windows (e.g., 8K-32K tokens).

Cloud models

32,000 characters. Suitable for standard cloud models with medium context windows (32K-64K tokens).

Top: GPT, Claude, Gemini

100,000 characters. For advanced models with large context windows (100K+ tokens).

No limits (danger)

3,000,000 characters. Effectively unlimited. Use with extreme caution.

Configuration

Context limits are configured globally in the plugin settings:
1. Open Settings
   Navigate to Settings → Local GPT → Advanced settings
2. Locate RAG Context
   Find the Enhanced Actions section and look for RAG context
3. Select Preset
   Choose the appropriate preset for your primary AI model
The setting is stored in settings.defaults.contextLimit with values: "local", "cloud", "advanced", or "max"
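Sketched as a TypeScript shape, the stored setting looks roughly like this (the interface name and exact field layout are assumptions based on the field path above, not taken from the plugin source):

```typescript
// Hypothetical sketch of the relevant settings shape.
// Field names follow the path described above: settings.defaults.contextLimit.
type ContextLimitPreset = "local" | "cloud" | "advanced" | "max";

interface LocalGPTSettingsSketch {
  _version?: number;
  defaults: {
    contextLimit: ContextLimitPreset;
  };
}

const settings: LocalGPTSettingsSketch = {
  _version: 8,
  defaults: { contextLimit: "cloud" },
};
```

The union type mirrors the four preset values listed above; any other string would be rejected at compile time.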

Implementation Details

The context limit is resolved in src/main.ts:783-796:
```typescript
private resolveContextLimit(): number {
  const preset = this.settings?.defaults?.contextLimit as
    | "local"
    | "cloud"
    | "advanced"
    | "max";
  const map: Record<string, number> = {
    local: 10_000,
    cloud: 32_000,
    advanced: 100_000,
    max: 3_000_000,
  };
  return map[preset];
}
```
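Standalone, the mapping can be exercised like this (a sketch re-deriving the values from the snippet above, not the plugin's own code; the fallback for unknown presets is an added assumption):

```typescript
// Character budget per preset, mirroring the map in resolveContextLimit().
const CONTEXT_LIMITS: Record<string, number> = {
  local: 10_000,
  cloud: 32_000,
  advanced: 100_000,
  max: 3_000_000,
};

function resolveContextLimit(preset: string): number {
  // Fall back to the conservative "local" budget for unknown presets
  // (an assumption; the plugin reads the preset from its own settings).
  return CONTEXT_LIMITS[preset] ?? CONTEXT_LIMITS.local;
}

console.log(resolveContextLimit("cloud")); // 32000
```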
The limit is enforced during context formatting in src/rag.ts:368-390:
```typescript
for (const [basename, groupResults] of groups) {
  if (totalLength >= contextLimit) break;

  formattedResults += `[[${basename}]]\n`;
  // Add chunks until limit reached
}
```
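Filled out into a runnable sketch (the `groups` shape, newline accounting, and chunk contents are invented for illustration; the real function also formats headings and scores):

```typescript
// Append retrieved chunks per file until the character budget is spent.
// Chunks are appended whole, never split. Note that, as in the loop above,
// the outer check only guards the file header, so a header can still be
// emitted for a file whose chunks no longer fit.
function formatContext(
  groups: Map<string, string[]>,
  contextLimit: number,
): string {
  let formattedResults = "";
  let totalLength = 0;
  for (const [basename, groupResults] of groups) {
    if (totalLength >= contextLimit) break;
    formattedResults += `[[${basename}]]\n`;
    for (const chunk of groupResults) {
      // Skip any chunk that would push us past the limit.
      if (totalLength + chunk.length > contextLimit) break;
      formattedResults += chunk + "\n";
      totalLength += chunk.length + 1; // +1 for the trailing newline
    }
  }
  return formattedResults;
}

const groups = new Map([
  ["Note A", ["12345", "67890"]],
  ["Note B", ["abcde"]],
]);
const out = formatContext(groups, 10); // first chunk fits, the rest do not
```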

How to Choose the Right Preset

Local models

Best for:
  • Ollama models (Llama 3, Mistral, Gemma, etc.)
  • LM Studio
  • Other local inference servers
  • Models with 8K-32K token context windows
Why this limit:
  • Most local models have limited context windows
  • Prevents out-of-memory errors
  • Maintains fast inference speed
  • Focuses on only the most relevant chunks
Approximate token count: ~2,500-3,000 tokens (assuming 4 chars/token)
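The token estimates on this page all follow the same rule of thumb, which can be written down directly:

```typescript
// Rough token estimate using the ~4 characters per token heuristic
// stated above. Real tokenizers vary by model and language.
const estimateTokens = (chars: number): number => Math.round(chars / 4);

console.log(estimateTokens(10_000));  // 2500: "Local models" preset
console.log(estimateTokens(32_000));  // 8000: "Cloud models" preset
console.log(estimateTokens(100_000)); // 25000: "Top" preset
```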
Cloud models

Best for:
  • GPT-3.5-turbo
  • Claude 3 Haiku
  • Gemini 1.5 Flash
  • Standard API models
Why this limit:
  • Balances context richness with cost
  • Fits comfortably in most cloud model windows
  • Good performance/quality trade-off
Approximate token count: ~8,000-10,000 tokens
Cost consideration: At this limit, each request uses a moderate number of tokens, keeping costs reasonable for pay-per-token services.
Top: GPT, Claude, Gemini

Best for:
  • GPT-4 Turbo (128K context)
  • Claude 3 Opus/Sonnet (200K context)
  • Gemini 1.5 Pro (1M+ context)
  • Specialized long-context models
Why this limit:
  • Leverages extended context capabilities
  • Provides rich, comprehensive context
  • Enables complex reasoning across many documents
Approximate token count: ~25,000-30,000 tokens
This setting significantly increases:
  • API costs (3x+ compared to “Cloud models”)
  • Request latency
  • Token usage
Only use with models that have proven long-context performance.
No limits (danger)

Best for:
  • Extreme edge cases
  • Testing and development
  • Models with multi-million token contexts
Why this limit:
  • Essentially removes the limit
  • Includes all retrieved chunks
Danger Zone: This setting can:
  • Cause requests to fail due to token limits
  • Result in extremely high API costs
  • Slow down or crash local models
  • Provide too much context, reducing quality
Not recommended for production use.
Approximate token count: Up to 750,000+ tokens

Impact on Performance

Quality

Symptoms of a limit that is too low:
  • AI lacks necessary context
  • Answers are generic or incomplete
  • Important linked information is missed
Solution: Increase to the next preset level

Speed

Processing time increases with context size:
```
Local (10K):     Fast    ████░░░░░░ 40%
Cloud (32K):     Medium  ██████░░░░ 60%
Advanced (100K): Slow    █████████░ 90%
Max (3M):        Slowest ██████████ 100%
```
Processing time includes:
  • Document extraction
  • Chunking
  • Embedding generation
  • Vector search
  • API request time

Cost

For paid APIs, token usage directly impacts cost:
| Preset          | Est. Tokens | Cost Multiplier |
| --------------- | ----------- | --------------- |
| Local (10K)     | ~2.5K       | 1x (baseline)   |
| Cloud (32K)     | ~8K         | ~3x             |
| Advanced (100K) | ~25K        | ~10x            |
| Max (3M)        | Variable    | Up to 300x+     |
These multipliers apply to input tokens only. Output tokens depend on response length and are not affected by context size.
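Under the same 4 chars/token heuristic, these multipliers fall out of simple division against the Local baseline (a sketch; actual spend depends on your provider's per-token pricing):

```typescript
// Input-token cost relative to the Local preset baseline (~2.5K tokens).
const BASELINE_TOKENS = 2_500;

function costMultiplier(estTokens: number): number {
  return estTokens / BASELINE_TOKENS;
}

console.log(costMultiplier(8_000));  // 3.2 -> the "~3x" for Cloud
console.log(costMultiplier(25_000)); // 10  -> the "~10x" for Advanced
```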

Monitoring Context Usage

In development mode, Local GPT logs context statistics:
```
Passed contextLimit for context: 100000
...
📊 Total length of context: 87345
```
To enable development mode, set NODE_ENV=development in your build.

Advanced: When Context is Truncated

When the context limit is reached:
  1. Graceful Truncation: The system stops adding chunks mid-file if needed
  2. No Partial Chunks: Individual chunks are never split
  3. Highest Scores First: Within each file group, highest-scoring chunks are prioritized
  4. File Order Preserved: Newer files (by creation time) are processed first
Example scenario: With a 32K limit and 5 linked files:
  • File A (created today): 15K characters, all included
  • File B (yesterday): 12K characters, all included
  • File C (last week): 8K characters, 5K included, 3K truncated
  • File D & E: Excluded entirely
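The scenario above can be replayed as a sketch (sizes for files A-C come from the example, sizes for D and E are invented; character-level truncation here glosses over chunk boundaries, which the real system keeps whole):

```typescript
// Files in processing order (newest first), sizes in characters.
const files = [
  { name: "A", size: 15_000 },
  { name: "B", size: 12_000 },
  { name: "C", size: 8_000 },
  { name: "D", size: 6_000 }, // size assumed; excluded either way
  { name: "E", size: 4_000 }, // size assumed; excluded either way
];
const contextLimit = 32_000;

let remaining = contextLimit;
const included = files.map((f) => {
  const take = Math.min(remaining, f.size);
  remaining -= take;
  return { name: f.name, included: take };
});

console.log(included);
// A: 15000, B: 12000, C: 5000 (3000 truncated), D: 0, E: 0
```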

Migration from Previous Versions

If you upgraded from an earlier version, your context limit is set to "local" (10K) by default. This migration happens in src/main.ts:1102-1112:
```typescript
private migrateToVersion8(settings: LocalGPTSettings): boolean {
  if (settings._version && settings._version >= 8) {
    return false;
  }

  (settings as any).defaults = (settings as any).defaults || {};
  (settings as any).defaults.contextLimit =
    (settings as any).defaults.contextLimit || "local";

  settings._version = 8;
  return true;
}
```
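The migration can be exercised in isolation (a standalone sketch with the settings interface reduced to the fields used here):

```typescript
interface LocalGPTSettings {
  _version?: number;
  defaults?: { contextLimit?: string };
}

// Mirrors migrateToVersion8 above: stamp version 8 and default the
// contextLimit preset to "local" if it is not already set.
function migrateToVersion8(settings: LocalGPTSettings): boolean {
  if (settings._version && settings._version >= 8) {
    return false; // already migrated
  }
  settings.defaults = settings.defaults || {};
  settings.defaults.contextLimit = settings.defaults.contextLimit || "local";
  settings._version = 8;
  return true;
}

const settings: LocalGPTSettings = {};
console.log(migrateToVersion8(settings));     // true: migration ran
console.log(settings.defaults?.contextLimit); // "local"
console.log(migrateToVersion8(settings));     // false: already at version 8
```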
Review your context limit after upgrading to ensure it matches your AI provider’s capabilities.

Best Practices

Start Conservative

Begin with “Local models” or “Cloud models” and increase only if needed.

Match Your Model

Choose the preset that matches your AI model’s context window.

Monitor Quality

If answers lack context, increase the limit. If they’re unfocused, decrease it.

Consider Cost

Higher limits = more tokens = higher costs for paid APIs.

Next Steps

RAG System

Learn how the RAG system processes and ranks context

Troubleshooting

Fix issues with context limits and embedding
