
Memory Usage

Problem: Backend processes consuming excessive RAM (>4GB).

Solution:
  1. Monitor memory usage:
    # Check process memory
    ps aux | grep -E "langgraph|uvicorn|python" | awk '{print $4"% "$11}'
    
    # Docker container memory
    docker stats --no-stream
    
    # System memory
    free -h  # Linux
    vm_stat  # macOS
    
  2. Enable context summarization:
    # In config.yaml
    summarization:
      enabled: true  # ✅ Enable to reduce memory
      trigger:
        - type: tokens
          value: 15564  # Trigger when reaching ~16K tokens
      keep:
        type: messages
        value: 10  # Keep recent 10 messages only
      trim_tokens_to_summarize: 15564
    
  3. Reduce memory injection:
    # In config.yaml
    memory:
      enabled: true
      max_injection_tokens: 2000  # ✅ Limit memory in prompt
      max_facts: 100  # Limit stored facts
      fact_confidence_threshold: 0.7  # Only high-confidence facts
    
  4. Limit concurrent subagents:
    # In backend/src/subagents/executor.py (default: 3)
    MAX_CONCURRENT_SUBAGENTS = 2  # Reduce from 3 to 2
    
  5. Configure thread pool sizes:
    # In backend/src/subagents/executor.py
    _scheduler_pool = ThreadPoolExecutor(max_workers=2)  # Default: 3
    _execution_pool = ThreadPoolExecutor(max_workers=2)  # Default: 3
    
  6. Use lightweight models for non-critical tasks:
    models:
      - name: gpt-4  # Primary model for agent
        # ...
      - name: gpt-4o-mini  # Lightweight for summarization/memory
        use: langchain_openai:ChatOpenAI
        model: gpt-4o-mini
        max_tokens: 2048
    
    # Use lightweight model for background tasks
    summarization:
      model_name: gpt-4o-mini
    memory:
      model_name: gpt-4o-mini
    title:
      model_name: gpt-4o-mini
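As a rough illustration of step 1, the `ps aux` output can also be checked programmatically. The helper below and its 5% threshold are illustrative, not part of DeerFlow:

```python
# Parse `ps aux`-style output and flag processes whose memory percentage
# exceeds a threshold (the %MEM column is field 4, the command is field 11).
def flag_heavy_processes(ps_output: str, mem_pct_threshold: float = 5.0):
    """Return (command, mem%) pairs for processes above the threshold."""
    heavy = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue
        mem_pct = float(fields[3])
        if mem_pct > mem_pct_threshold:
            heavy.append((fields[10], mem_pct))
    return heavy

sample = """USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
dev 101 1.2 8.4 900 700 ? S 10:00 0:05 uvicorn src.gateway.app:app
dev 102 0.3 1.1 300 100 ? S 10:00 0:01 python worker.py"""
print(flag_heavy_processes(sample))  # [('uvicorn src.gateway.app:app', 8.4)]
```

Feed it live output from `subprocess.run(["ps", "aux"], capture_output=True)` to build a simple watchdog.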
    
Problem: Memory usage grows continuously over multiple conversations.

Solution:
  1. Clear conversation history periodically:
    # Start fresh thread for new task
    # Frontend: Click "New Conversation"
    # API: Use new thread_id
    
  2. Enable aggressive summarization:
    summarization:
      enabled: true
      trigger:
        - type: messages
          value: 20  # Summarize after 20 messages (more aggressive)
      keep:
        type: messages
        value: 5  # Keep only 5 recent messages
    
  3. Clean up old threads:
    # Remove old thread data
    find backend/.deer-flow/threads -type d -mtime +7 -exec rm -rf {} \;
    # Removes threads older than 7 days
    
  4. Restart services periodically:
    # Automated restart script
    make stop
    sleep 5
    make dev
    
  5. Monitor and set memory limits (Docker):
    # In docker-compose-dev.yaml
    services:
      langgraph:
        deploy:
          resources:
            limits:
              memory: 4G  # Limit to 4GB
            reservations:
              memory: 2G  # Reserve 2GB
      gateway:
        deploy:
          resources:
            limits:
              memory: 2G
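The `find`-based cleanup in step 3 can also be done from Python, which is easier to extend (e.g. to skip active threads). A minimal sketch, using a temp directory as a stand-in for `backend/.deer-flow/threads`:

```python
import os
import shutil
import tempfile
import time
from pathlib import Path

def prune_old_threads(threads_dir: Path, max_age_days: int = 7) -> list[str]:
    """Remove thread directories not modified within max_age_days."""
    # Equivalent in spirit to: find .../threads -type d -mtime +7 -exec rm -rf {} \;
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for entry in sorted(threads_dir.iterdir()):
        if entry.is_dir() and entry.stat().st_mtime < cutoff:
            shutil.rmtree(entry)
            removed.append(entry.name)
    return removed

# Demo against a throwaway directory with one stale and one fresh thread
root = Path(tempfile.mkdtemp())
(root / "thread-old").mkdir()
(root / "thread-new").mkdir()
stale = time.time() - 10 * 86400                 # backdate by 10 days
os.utime(root / "thread-old", (stale, stale))
removed = prune_old_threads(root)
print(removed)  # ['thread-old']
```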
    
Problem: OOMKilled or memory errors in sandbox containers.

Solution:
  1. Increase Docker memory limit:
    # Docker Desktop: Settings → Resources → Memory
    # Increase to at least 4GB (8GB recommended)
    
  2. Configure container resource limits:
    # In config.yaml for Docker sandbox
    sandbox:
      use: src.community.aio_sandbox:AioSandboxProvider
      # Add resource limits (if supported by your setup)
      memory_limit: "2g"  # 2GB per sandbox
      cpu_limit: 2  # 2 CPU cores
    
  3. For Kubernetes provisioner mode:
    # In docker/provisioner/app/provisioner.py
    # Adjust Pod resource limits:
    resources=client.V1ResourceRequirements(
        requests={"memory": "512Mi", "cpu": "500m"},
        limits={"memory": "2Gi", "cpu": "2000m"}  # Increase limits
    )
    
  4. Clean up sandbox artifacts:
    # Clear outputs from old threads
    find backend/.deer-flow/threads -type f -name "*.png" -mtime +1 -delete
    find backend/.deer-flow/threads -type f -name "*.pdf" -mtime +1 -delete
    
  5. Limit file sizes in sandbox:
    # Agent instruction in system prompt
    # "Keep generated files under 10MB each"
    # "Use compression for large data files"
    

Context Window Optimization

Problem: "Context length exceeded" or "Maximum token limit" errors.

Solution:
  1. Enable summarization (most important):
    # In config.yaml
    summarization:
      enabled: true  # ✅ Critical for long conversations
      trigger:
        - type: fraction
          value: 0.8  # Trigger at 80% of model's limit
      keep:
        type: messages
        value: 10
    
  2. Use models with larger context windows:
    models:
      # GPT-4 Turbo: 128K context
      - name: gpt-4-turbo
        use: langchain_openai:ChatOpenAI
        model: gpt-4-turbo-preview
    
      # Claude 3.5 Sonnet: 200K context
      - name: claude-3.5-sonnet
        use: langchain_anthropic:ChatAnthropic
        model: claude-3-5-sonnet-20241022
    
      # Gemini 2.5 Pro: 1M context
      - name: gemini-2.5-pro
        use: langchain_google_genai:ChatGoogleGenerativeAI
        model: gemini-2.5-pro
    
  3. Reduce memory injection tokens:
    memory:
      max_injection_tokens: 1000  # Reduce from 2000
    
  4. Limit skill injection:
    // In extensions_config.json
    // Disable unused skills to reduce prompt size
    {
      "skills": {
        "research": {"enabled": true},
        "report-generation": {"enabled": true},
        "image-generation": {"enabled": false},  // Disable if not needed
        "video-generation": {"enabled": false},
        "slide-creation": {"enabled": false}
      }
    }
    
  5. Optimize tool descriptions:
    # Keep tool descriptions concise
    # Tools with long descriptions add to context
    # Edit tool docstrings in backend/src/sandbox/tools.py
    
  6. Use subagents for isolated context:
    # Each subagent has its own context
    # Break complex tasks into subagents instead of long single conversation
    # Agent will automatically delegate when appropriate
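The `fraction` trigger from step 1 can be approximated as below. The 4-characters-per-token estimate is a common rule of thumb, not the backend's actual tokenizer, and the real trigger logic lives in DeerFlow's summarization code:

```python
# Decide whether summarization should fire: estimate the conversation's
# token count and compare it against a fraction of the model's context limit.
def should_summarize(messages: list[str], context_limit: int,
                     fraction: float = 0.8) -> bool:
    est_tokens = sum(len(m) // 4 for m in messages)  # ~4 chars per token
    return est_tokens >= context_limit * fraction

history = ["x" * 4000] * 30   # ~30K estimated tokens
print(should_summarize(history, context_limit=32_000))  # True at the 80% mark
```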
    
Problem: Model takes >30 seconds to respond as the conversation grows.

Solution:
  1. Aggressive summarization:
    summarization:
      enabled: true
      trigger:
        - type: tokens
          value: 8000  # Lower threshold for faster response
        - type: messages
          value: 15  # Or after 15 messages
      keep:
        type: messages
        value: 5  # Keep fewer messages
    
  2. Start new thread for new topics:
    # Don't try to do everything in one thread
    # Use frontend "New Conversation" button
    # Or create new thread_id via API
    
  3. Offload data to files:
    # Instead of keeping large data in conversation:
    # 1. Write to /mnt/user-data/workspace/data.json
    # 2. Reference file path in subsequent messages
    # 3. Read only what's needed
    
  4. Disable thinking mode for simple queries:
    # Thinking mode adds significant tokens
    # Only enable for complex reasoning tasks
    models:
      - name: deepseek-v3
        supports_thinking: true
        # Manually toggle thinking via frontend toggle
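Step 3 above can be sketched as follows. A temp directory stands in for the sandbox's `/mnt/user-data/workspace` mount, and the helper names are illustrative:

```python
import json
import tempfile
from pathlib import Path

workspace = Path(tempfile.mkdtemp())  # stand-in for /mnt/user-data/workspace

def offload(records: list[dict], name: str = "data.json") -> str:
    """Persist bulky data to the workspace; keep only the path in chat."""
    path = workspace / name
    path.write_text(json.dumps(records))
    return str(path)          # reference this path in subsequent messages

def read_needed(path: str, limit: int = 5) -> list[dict]:
    """Read back only the slice that's actually needed."""
    return json.loads(Path(path).read_text())[:limit]

ref = offload([{"row": i} for i in range(10_000)])
print(read_needed(ref, 2))   # [{'row': 0}, {'row': 1}]
```

The conversation then carries a short path string instead of 10,000 rows of data.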
    

Sandbox Performance

Problem: Sandbox container takes >10 seconds to start.

Solution:
  1. Pre-pull sandbox image (most important):
    # Run before first use
    make setup-sandbox
    
    # Or manually:
    docker pull enterprise-public-cn-beijing.cr.volces.com/vefaas-public/all-in-one-sandbox:latest
    
  2. Use Apple Container on macOS (faster than Docker):
    # Install Apple Container
    # Download from: https://github.com/apple/container/releases
    
    # DeerFlow automatically uses it if available
    container --version
    container system start
    
  3. Keep existing sandbox instead of recreating:
    # In config.yaml - use existing sandbox URL
    sandbox:
      use: src.community.aio_sandbox:AioSandboxProvider
      base_url: http://localhost:8080  # Reuse existing sandbox
      auto_start: false  # Don't start new containers
    
  4. Optimize Docker storage driver:
    # Check current driver
    docker info | grep "Storage Driver"
    
    # overlay2 is fastest (default on modern systems)
    # If using devicemapper or aufs, consider migration
    
  5. Use local sandbox for development:
    # Fastest option - no container overhead
    sandbox:
      use: src.sandbox.local:LocalSandboxProvider
    # Note: Less isolation, use only for development
    
Problem: Reading/writing files in the sandbox is sluggish.

Solution:
  1. Check mount type (macOS):
    # macOS: Avoid volumes, use bind mounts
    # Already configured correctly in DeerFlow
    # Verify in docker inspect output
    
  2. Reduce file I/O:
    # Read files once and cache in conversation
    # Avoid repeated read_file calls for same file
    
    # Write batch files together:
    # - Better: Write all outputs at end
    # - Worse: Write each file individually during processing
    
  3. Use smaller files:
    # Large files (>100MB) slow down sandbox
    # Compress before uploading
    # Split large datasets into chunks
    
  4. For Kubernetes provisioner - use local paths:
    # Ensure SKILLS_HOST_PATH and THREADS_HOST_PATH
    # point to local SSD, not network storage
    
  5. Optimize Docker Desktop settings (macOS):
    # Docker Desktop → Settings → Resources
    # Use VirtioFS instead of gRPC FUSE (faster)
    # Enable "Use new Virtualization framework"
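The read-once-and-cache advice in step 2 can be sketched with a memoized reader. Note the cache never invalidates, so this only suits files that don't change mid-conversation:

```python
import tempfile
from functools import lru_cache
from pathlib import Path

# Repeated reads of the same path hit memory instead of the (slow) mount.
@lru_cache(maxsize=64)
def read_cached(path: str) -> str:
    return Path(path).read_text()

tmp = Path(tempfile.mkdtemp()) / "report.md"
tmp.write_text("# results")
read_cached(str(tmp))                  # first call: disk
read_cached(str(tmp))                  # second call: served from cache
print(read_cached.cache_info().hits)   # 1
```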
    
Problem: Shell commands take longer than expected in the sandbox.

Solution:
  1. Increase container CPU allocation:
    # Docker Desktop → Settings → Resources
    # Increase CPUs to 4+ (from default 2)
    
  2. Use local sandbox for CPU-intensive tasks:
    # For development/testing
    sandbox:
      use: src.sandbox.local:LocalSandboxProvider
    
  3. Optimize commands:
    # Avoid: Slow commands
    find / -name "*.py"  # Searches entire filesystem
    
    # Better: Targeted commands
    find /mnt/user-data -name "*.py"  # Only search workspace
    
  4. Use bash agent for complex command sequences:
    # Instead of multiple bash tool calls:
    # Delegate to bash subagent for multi-step shell tasks
    # Bash agent is optimized for command execution
    

Scaling Considerations

Problem: Performance degrades with multiple simultaneous conversations.

Solution:
  1. Use provisioner mode with Kubernetes:
    # In config.yaml
    sandbox:
      use: src.community.aio_sandbox:AioSandboxProvider
      provisioner_url: http://provisioner:8002
      # Each thread gets isolated Pod
    
  2. Configure resource limits per thread:
    # In docker/provisioner/app/provisioner.py
    # Set appropriate limits based on your cluster capacity
    resources=client.V1ResourceRequirements(
        requests={"memory": "256Mi", "cpu": "250m"},  # Per sandbox
        limits={"memory": "1Gi", "cpu": "1000m"}
    )
    
  3. Implement request queuing (advanced):
    # Add rate limiting in backend/src/gateway/app.py
    from fastapi import FastAPI, Request
    from slowapi import Limiter, _rate_limit_exceeded_handler
    from slowapi.errors import RateLimitExceeded
    from slowapi.util import get_remote_address
    
    limiter = Limiter(key_func=get_remote_address)
    app = FastAPI()
    app.state.limiter = limiter
    app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)
    
    @app.post("/api/chat")
    @limiter.limit("10/minute")  # Max 10 requests per minute
    async def chat(request: Request):
        # ...
    
  4. Use multiple model providers:
    # Distribute load across providers
    models:
      - name: openai-gpt4
        use: langchain_openai:ChatOpenAI
        # ...
      - name: anthropic-claude
        use: langchain_anthropic:ChatAnthropic
        # ...
      - name: google-gemini
        use: langchain_google_genai:ChatGoogleGenerativeAI
        # ...
    
  5. Horizontal scaling (advanced):
    # Run multiple backend instances behind load balancer
    # Use Redis for shared state/checkpointing
    # Configure in docker-compose or Kubernetes
    
Problem: Slow memory updates or thread data access.

Solution:
  1. Increase memory debounce time:
    # In config.yaml
    memory:
      debounce_seconds: 60  # Increase from 30 to reduce writes
    
  2. Use SSD for thread storage:
    # Ensure backend/.deer-flow/ is on SSD, not HDD
    # Check with:
    df -Th backend/.deer-flow/
    
  3. Clean up old threads regularly:
    # Cron job to clean threads older than 30 days
    0 2 * * * find /path/to/backend/.deer-flow/threads -type d -mtime +30 -exec rm -rf {} \;
    
  4. Limit fact storage:
    memory:
      max_facts: 50  # Reduce from 100
      fact_confidence_threshold: 0.8  # Higher threshold = fewer facts
    
  5. Use external database (advanced):
    # Replace file-based memory with PostgreSQL/MongoDB
    # Implement custom memory backend in backend/src/agents/memory/
    # Use LangGraph's PostgresSaver for checkpointing
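To make step 1 concrete, here is a toy sketch of what debouncing buys you (assumed semantics: bursts of updates coalesce so the store is written at most once per window). The class is illustrative, not DeerFlow's implementation:

```python
import time

class DebouncedWriter:
    """Write at most once per debounce window; later updates wait."""
    def __init__(self, debounce_seconds: float, clock=time.monotonic):
        self.debounce_seconds = debounce_seconds
        self.clock = clock                  # injectable for testing
        self._last_write = float("-inf")
        self.writes = 0

    def maybe_write(self) -> bool:
        now = self.clock()
        if now - self._last_write >= self.debounce_seconds:
            self._last_write = now
            self.writes += 1                # real code would flush facts to disk
            return True
        return False

# Four memory updates over 70 simulated seconds, 60-second debounce:
fake_now = [0.0]
w = DebouncedWriter(60, clock=lambda: fake_now[0])
for t in (0, 10, 20, 70):
    fake_now[0] = t
    w.maybe_write()
print(w.writes)  # 2 — only the updates at t=0 and t=70 reach disk
```

Raising `debounce_seconds` trades write volume for staleness of the stored facts.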
    
Problem: Slow responses due to network issues.

Solution:
  1. Use geographically closer endpoints:
    # For providers with regional endpoints
    models:
      - name: gpt-4
        use: langchain_openai:ChatOpenAI
        base_url: https://api.openai.com/v1  # Default
        # Some providers offer regional endpoints
    
  2. Increase timeout for slow connections:
    models:
      - name: gpt-4
        timeout: 120  # Increase from 60 to 120 seconds
        max_retries: 3  # Retry on timeout
    
  3. Use local models (Ollama, LM Studio):
    models:
      - name: ollama-local
        use: langchain_openai:ChatOpenAI
        model: llama3.2
        base_url: http://localhost:11434/v1
        api_key: ollama
        # Zero network latency
    
  4. Implement caching (advanced):
    # Cache LLM responses for repeated queries
    # Use LangChain's caching layer:
    from langchain.cache import SQLiteCache  # or InMemoryCache for per-process caching
    from langchain.globals import set_llm_cache
    
    set_llm_cache(SQLiteCache(database_path=".langchain.db"))
    
  5. Monitor provider status:
    # Check provider status pages
    # OpenAI: https://status.openai.com/
    # Anthropic: https://status.anthropic.com/
    # If provider is degraded, switch to backup model
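The "switch to backup model" advice in step 5 can be sketched provider-agnostically. `degraded_primary` and `healthy_backup` are hypothetical stand-ins for real model clients; with LangChain specifically, `Runnable.with_fallbacks` gives you this behavior natively:

```python
# Try providers in order; fall back when one raises (e.g. timeout, outage).
def invoke_with_fallback(prompt: str, providers: list) -> str:
    last_error = None
    for call in providers:
        try:
            return call(prompt)
        except Exception as exc:   # production code should catch timeouts/5xx only
            last_error = exc
    raise RuntimeError("all providers failed") from last_error

def degraded_primary(prompt):      # simulates a provider outage
    raise TimeoutError("provider degraded")

def healthy_backup(prompt):
    return f"backup answered: {prompt}"

print(invoke_with_fallback("ping", [degraded_primary, healthy_backup]))
# backup answered: ping
```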
    

Performance Monitoring

Use these techniques to identify and track performance bottlenecks:
  1. Enable detailed logging:
    # In backend/src/ files
    import logging
    logging.basicConfig(level=logging.DEBUG)
    # Check logs/backend.log for bottlenecks
    
  2. Monitor resource usage:
    # Real-time monitoring
    watch -n 1 'ps aux | grep -E "langgraph|uvicorn" | grep -v grep'
    
    # Docker stats
    docker stats
    
    # System monitoring
    htop  # Linux
    Activity Monitor  # macOS
    
  3. Track API response times:
    # Add timing to requests
    time curl -X POST http://localhost:2026/api/langgraph/threads/xxx/runs/stream \
      -H "Content-Type: application/json" \
      -d '{...}'
    
  4. Profile Python code (advanced):
    # Add profiling to backend
    import cProfile
    import pstats
    
    profiler = cProfile.Profile()
    profiler.enable()
    # ... code to profile ...
    profiler.disable()
    stats = pstats.Stats(profiler)
    stats.sort_stats('cumulative')
    stats.print_stats(20)  # Top 20 slowest functions
    
  5. Use LangSmith for LLM tracing (advanced):
    # Set environment variables
    export LANGCHAIN_TRACING_V2=true
    export LANGCHAIN_API_KEY=your-langsmith-key
    export LANGCHAIN_PROJECT=deerflow
    
    # Restart DeerFlow
    # View traces at https://smith.langchain.com
    
  6. Benchmark specific operations:
    # Test sandbox performance
    time docker exec <sandbox-container> bash -c "for i in {1..100}; do echo test > /tmp/test_$i.txt; done"
    
    # Test model latency
    cd backend
    time uv run python -c "from langchain_openai import ChatOpenAI; llm = ChatOpenAI(model='gpt-4'); print(llm.invoke('test').content)"
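As an in-process alternative to wrapping `curl` in `time` (step 3), a small decorator can record per-handler latency. This is a generic sketch; `fake_chat_handler` and the `timings` dict are illustrative, not part of the backend:

```python
import time
from functools import wraps

timings: dict[str, float] = {}  # function name -> last observed duration (s)

def timed(fn):
    """Record how long each call takes, even when the function raises."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            timings[fn.__name__] = time.perf_counter() - start
    return wrapper

@timed
def fake_chat_handler():
    time.sleep(0.05)   # stand-in for model + sandbox work
    return "ok"

fake_chat_handler()
print(f"fake_chat_handler took {timings['fake_chat_handler']:.3f}s")
```

Attach the decorator to suspect functions, then compare the recorded durations against the `docker stats` and log output above.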
    
