This guide covers common Vespa operational issues, their symptoms, root causes, and solutions.

Diagnostic Approach

1. Identify Symptoms

Gather error messages, metrics, and logs
2. Check Service Health

Verify all services are running:
vespa-model-inspect services
curl http://localhost:19050/state/v1/health
3. Review Metrics

Look for anomalies in key metrics
4. Examine Logs

Check logs for errors:
vespa-logfmt -l all -f /opt/vespa/logs/vespa/vespa.log
5. Apply Fix

Implement solution and verify resolution
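Steps 2 and 5 can be partially automated. A minimal sketch in Python, assuming the /state/v1/health endpoint returns JSON with a `status.code` field as shown by the default state API (the `check_health` wrapper and its URL are illustrative):

```python
import json
import urllib.request

def is_up(health: dict) -> bool:
    """True if a /state/v1/health response reports status code 'up'."""
    return health.get("status", {}).get("code") == "up"

def check_health(url: str = "http://localhost:19050/state/v1/health") -> bool:
    """Fetch a health endpoint and evaluate it (illustrative wrapper)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return is_up(json.load(resp))
```

Run `check_health()` against each service's state port before and after applying a fix to verify resolution.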

Query Performance Issues

High Query Latency

Symptoms:
  • Slow query responses (> 1 second)
  • Increasing query latency over time
  • Timeout errors
Check these metrics:
QUERY_LATENCY                                    // Overall latency
QUERY_CONTAINER_LATENCY                          // Container processing time
CONTENT_PROTON_DOCUMENTDB_MATCHING_QUERY_LATENCY // Content node latency
CONTENT_PROTON_DOCSUM_LATENCY                    // Document summary latency
1. Inefficient Rank Profiles
# Identify slow rank profiles
curl http://localhost:19050/state/v1/metrics | \
  jq '.metrics.values[] | select(.name | contains("rank_profile.query_latency"))'
Solution:
rank-profile optimized {
  first-phase {
    expression: bm25(title)  # Fast first phase
  }
  second-phase {
    expression: lightgbm("model.json")
    rerank-count: 100  # Limit expensive reranking
  }
}
2. Thread Pool Saturation
// Check for thread pool issues
JDISC_THREAD_POOL_WORK_QUEUE_SIZE    // Growing queue
JDISC_THREAD_POOL_REJECTED_TASKS     // Rejected requests
CONTENT_PROTON_EXECUTOR_MATCH_QUEUESIZE  // Match queue depth
Solution:
<content version="1.0" id="my-content">
  <tuning>
    <searchnode>
      <requestthreads>
        <count>16</count>  <!-- Increase threads -->
      </requestthreads>
    </searchnode>
  </tuning>
</content>
3. Large Result Sets

Symptom: Queries with hits=1000 are slow.

Solution:
# Limit result size
/search/?query=foo&hits=100

# Use pagination
/search/?query=foo&hits=20&offset=20
4. Expensive Grouping Operations

Solution: Optimize grouping queries:
# Before (slow)
select * from sources * where true 
  all(group(category) each(output(count())))

# After (faster)
select * from sources * where true 
  all(group(category) max(100) each(output(count())))
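The pagination pattern from item 3 can be scripted. A sketch of the hits/offset arithmetic (note that Vespa caps deep pagination by default, so very deep walks of a result set should use visiting instead):

```python
def page_params(query: str, page: int, page_size: int = 20) -> dict:
    """Build hits/offset query parameters for 0-based result page `page`."""
    return {"query": query, "hits": page_size, "offset": page * page_size}

def iter_pages(query: str, total_hits: int, page_size: int = 20):
    """Yield the parameter dicts needed to walk a result set page by page."""
    num_pages = (total_hits + page_size - 1) // page_size  # ceiling division
    for page in range(num_pages):
        yield page_params(query, page, page_size)
```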

Query Errors

Error: Query timed out after 5000ms

Check:
- ERROR_TIMEOUT metric
- CONTENT_PROTON_DOCUMENTDB_MATCHING_SOFT_DOOMED_QUERIES

Solutions:
1. Increase query timeout
2. Optimize slow queries
3. Add more content nodes
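For solution 1, the timeout can be raised per query. A sketch that builds a search URL with an explicit `timeout` parameter (10s here, versus the 5s that timed out above; the helper name is illustrative):

```python
import urllib.parse

def search_url(base: str, query: str, timeout: str = "10s") -> str:
    """Build a /search/ URL with a per-query timeout parameter."""
    params = urllib.parse.urlencode({"query": query, "timeout": timeout})
    return f"{base}/search/?{params}"
```

Prefer fixing the slow query first: a larger timeout only hides latency from the client, it does not reduce load on the content nodes.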

Feeding Issues

Slow Document Ingestion

Symptoms:
  • Low feed throughput (< 100 docs/sec on capable hardware)
  • High feed latency
  • Growing queue of pending operations
Check these metrics:
HTTPAPI_LATENCY              // Feed operation latency
HTTPAPI_PENDING              // Pending operations
HTTPAPI_QUEUED_OPERATIONS    // Queued operations
HTTPAPI_FAILED_TIMEOUT       // Timeout failures
1. Use Async Operations
import asyncio
import aiohttp

async def feed_document(session, doc):
    async with session.post(
        'http://localhost:8080/document/v1/namespace/doctype/docid',
        json=doc,
        params={'timeout': '180s'}
    ) as response:
        return await response.json()

async def batch_feed(documents):
    async with aiohttp.ClientSession() as session:
        tasks = [feed_document(session, doc) for doc in documents]
        results = await asyncio.gather(*tasks)
        return results
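If the container is already saturated, unbounded concurrency makes feeding slower, not faster. A hedged sketch that caps in-flight requests with a semaphore (`feed_one` stands in for a coroutine like `feed_document` above):

```python
import asyncio

async def bounded_feed(documents, feed_one, max_concurrent: int = 32):
    """Feed documents with at most `max_concurrent` requests in flight."""
    sem = asyncio.Semaphore(max_concurrent)

    async def guarded(doc):
        async with sem:  # blocks when max_concurrent feeds are in flight
            return await feed_one(doc)

    return await asyncio.gather(*(guarded(d) for d in documents))
```

Tune `max_concurrent` against the HTTPAPI_PENDING and HTTPAPI_QUEUED_OPERATIONS metrics rather than guessing.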
2. Check Resource Limits
// Monitor feeding blocked status
CONTENT_PROTON_RESOURCE_USAGE_FEEDING_BLOCKED  // 1 = blocked
CONTENT_PROTON_RESOURCE_USAGE_MEMORY
CONTENT_PROTON_RESOURCE_USAGE_DISK
If feeding is blocked:
<content version="1.0" id="my-content">
  <tuning>
    <searchnode>
      <resource-limits>
        <memory>0.90</memory>  <!-- Increase limit -->
        <disk>0.85</disk>
      </resource-limits>
    </searchnode>
  </tuning>
</content>
3. Optimize Document Structure
  • Remove unnecessary fields
  • Use appropriate field types
  • Enable compression for large text fields

Feed Failures

Error: Condition did not match document

Metric: HTTPAPI_CONDITION_NOT_MET

Cause: Test-and-set condition failed

Solution:
- Verify document exists
- Check condition logic
- Handle concurrent updates
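Concurrent updates are typically handled by retrying: document/v1 signals a failed test-and-set condition with HTTP 412 (Precondition Failed). A sketch of the retry loop, with `put_once` as a stand-in callable returning an HTTP status code:

```python
import time

def conditional_put(put_once, max_retries: int = 3, backoff_s: float = 0.1) -> int:
    """Retry a conditional put while the test-and-set condition fails (412)."""
    status = 412
    for attempt in range(max_retries):
        status = put_once()
        if status != 412:
            return status
        # In real code: re-read the document and rebuild the update here.
        time.sleep(backoff_s * (2 ** attempt))  # simple exponential backoff
    return status
```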

Memory Issues

High Memory Usage

Critical Symptoms:
  • Memory usage > 90%
  • Feeding blocked due to memory
  • Frequent GC pauses
  • OOM errors
1. Identify Memory Consumers

# Check memory metrics
curl http://localhost:19050/state/v1/metrics | \
  jq '.metrics.values[] | select(.name | contains("memory_usage"))'
Key metrics:
CONTENT_PROTON_DOCUMENTDB_MEMORY_USAGE_USED_BYTES
CONTENT_PROTON_DOCUMENTDB_ATTRIBUTE_MEMORY_USAGE_USED_BYTES
CONTENT_PROTON_DOCUMENTDB_INDEX_MEMORY_USAGE_USED_BYTES
CONTENT_PROTON_DOCUMENTDB_READY_DOCUMENT_STORE_MEMORY_USAGE_USED_BYTES
2. Check Attribute Usage

Attributes are stored in memory. Large or high-cardinality attributes consume significant memory.
# Check attribute address space usage
curl http://localhost:19050/state/v1/metrics | \
  jq '.metrics.values[] | select(.name == "content.proton.documentdb.attribute.resource_usage.address_space")'
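To rank the biggest memory consumers rather than eyeballing raw JSON, the metrics response can be sorted in a few lines. This assumes the /state/v1/metrics shape the jq filters above rely on, with a `last` value per metric (an assumption about the snapshot format):

```python
def top_memory_metrics(metrics: dict, substring: str = "memory_usage", n: int = 5):
    """Return the n largest metric values whose name contains `substring`."""
    rows = [
        (m["name"], m.get("values", {}).get("last", 0))
        for m in metrics.get("metrics", {}).get("values", [])
        if substring in m.get("name", "")
    ]
    return sorted(rows, key=lambda r: r[1], reverse=True)[:n]
```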
3. Apply Solutions

Reduce attribute memory:
schema product {
  document product {
    # Remove attribute from large fields
    field description type string {
      indexing: summary | index  # Don't use attribute
    }
    
    # Use paged attributes for large datasets
    field tags type array<string> {
      indexing: summary | attribute
      attribute: paged  # Store on disk, page into memory
    }
  }
}
Increase memory limits:
<content version="1.0" id="my-content">
  <tuning>
    <searchnode>
      <resource-limits>
        <memory>0.90</memory>
      </resource-limits>
    </searchnode>
  </tuning>
</content>

JVM Memory Issues (Containers)

// Check GC metrics
JDISC_GC_MS        // GC pause time
JDISC_GC_COUNT     // GC frequency
MEM_HEAP_USED      // Heap usage
Symptoms:
  • GC pauses > 500ms
  • GC consuming > 10% CPU
  • Heap consistently > 80% used
Solutions:
  1. Increase heap size:
<container version="1.0" id="default">
  <nodes>
    <jvm options="-Xms16g -Xmx16g"/>
  </nodes>
</container>
  2. Tune GC:
<jvm options="
  -Xms16g -Xmx16g
  -XX:+UseG1GC
  -XX:MaxGCPauseMillis=200
  -XX:InitiatingHeapOccupancyPercent=70
"/>
  3. Identify memory leaks:
# Generate heap dump
jmap -dump:format=b,file=heap.bin <pid>

# Analyze with MAT or VisualVM

Disk Issues

Disk Full

1. Check Disk Usage

CONTENT_PROTON_RESOURCE_USAGE_DISK                // Overall usage
CONTENT_PROTON_DOCUMENTDB_DISK_USAGE              // Per document DB
CONTENT_PROTON_DOCUMENTDB_READY_DOCUMENT_STORE_DISK_USAGE
CONTENT_PROTON_DOCUMENTDB_READY_DOCUMENT_STORE_DISK_BLOAT
CONTENT_PROTON_TRANSACTIONLOG_DISK_USAGE
2. Identify Causes

High bloat:
# Check disk bloat
curl http://localhost:19050/state/v1/metrics | \
  jq '.metrics.values[] | select(.name | contains("disk_bloat"))'
Bloat > 30% indicates inefficient storage.
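The 30% rule of thumb is easy to check programmatically, assuming bloat is reported in bytes alongside total usage (as the READY_DOCUMENT_STORE_DISK_USAGE and _DISK_BLOAT metrics above suggest):

```python
def bloat_ratio(disk_usage_bytes: int, disk_bloat_bytes: int) -> float:
    """Fraction of document-store disk that is bloat (stale data)."""
    if disk_usage_bytes == 0:
        return 0.0
    return disk_bloat_bytes / disk_usage_bytes

def needs_compaction(usage: int, bloat: int, threshold: float = 0.30) -> bool:
    """Flag stores whose bloat exceeds the 30% threshold discussed above."""
    return bloat_ratio(usage, bloat) > threshold
```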
3. Apply Solutions

1. Enable compression:
<content version="1.0" id="my-content">
  <tuning>
    <searchnode>
      <summary>
        <store>
          <compression>
            <type>lz4</type>
            <level>9</level>
          </compression>
        </store>
      </summary>
    </searchnode>
  </tuning>
</content>
2. Trigger compaction:
# Compact document store
vespa-visit -v -s 'id.user==1234' --fieldset '[document]' > /dev/null
3. Remove old documents:
# Remove documents by selection
curl -X DELETE 'http://localhost:8080/document/v1/namespace/doctype/docid?selection=timestamp<1640000000'
4. Add more nodes: Scale horizontally to distribute data.

Slow Disk I/O

Symptoms:
  • High query latency
  • Slow feed operations
  • High disk queue depth
Check:
CONTENT_PROTON_DOCUMENTDB_INDEX_IO_SEARCH_READ_BYTES
CONTENT_PROTON_DOCUMENTDB_INDEX_IO_SEARCH_CACHED_READ_BYTES
Solutions:
  1. Optimize cache usage:
<content version="1.0" id="my-content">
  <tuning>
    <searchnode>
      <!-- Increase document store cache -->
      <summary>
        <store>
          <cache>
            <maxsize-percent>10</maxsize-percent>
          </cache>
        </store>
      </summary>
      
      <!-- Increase index cache -->
      <diskindexcache>
        <size>4294967296</size>  <!-- 4GB -->
      </diskindexcache>
    </searchnode>
  </tuning>
</content>
  2. Monitor cache hit rates:
CONTENT_PROTON_DOCUMENTDB_READY_DOCUMENT_STORE_CACHE_HIT_RATE
CONTENT_PROTON_INDEX_CACHE_POSTINGLIST_HIT_RATE
Target hit rate > 90%.
  3. Use faster storage:
  • NVMe SSDs for best performance
  • Ensure sufficient IOPS provisioned
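Checking hit rates against the 90% target can be scripted over a metrics snapshot. This assumes the same /state/v1/metrics shape used by the jq examples earlier (entries with a `name` and a `last` value):

```python
def caches_below_target(metrics: dict, target: float = 0.90):
    """List cache metrics (name containing 'hit_rate') below `target`."""
    return [
        (m["name"], m["values"]["last"])
        for m in metrics.get("metrics", {}).get("values", [])
        if "hit_rate" in m.get("name", "")
        and m.get("values", {}).get("last", 1.0) < target
    ]
```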

Network Issues

Connection Timeouts

Symptom: Connection timeout to container

Check:
- Container service status
- Network connectivity: ping <container-host>
- Port accessibility: telnet <container-host> 8080

Solution:
- Verify container is running
- Check firewall rules
- Review load balancer configuration

High Connection Count

// Monitor connections
SERVER_NUM_OPEN_CONNECTIONS      // Current connections
SERVER_CONNECTIONS_OPEN_MAX      // Peak connections
SERVER_CONNECTION_DURATION_MEAN  // Average duration
If connections are high:
  1. Check for connection leaks in clients
  2. Tune connection timeouts:
<container version="1.0" id="default">
  <http>
    <server id="default" port="8080">
      <config name="jdisc.http.connector">
        <idleTimeout>30.0</idleTimeout>
        <maxConnectionLife>300.0</maxConnectionLife>
      </config>
    </server>
  </http>
</container>

Cluster State Issues

Node Down

1. Identify Down Node

vespa-model-inspect services
curl http://localhost:19050/state/v1/health
2. Check Node Logs

# On the affected node
vespa-logfmt -l all -f /opt/vespa/logs/vespa/vespa.log | tail -100
3. Common Issues

Out of memory:
java.lang.OutOfMemoryError: Java heap space
Solution: Increase heap size or reduce memory usage.

Disk full:
No space left on device
Solution: Free disk space or add capacity.

Configuration error:
Config error: Invalid configuration
Solution: Review and fix services.xml
4. Restart Service

# Restart Vespa on the node
vespa-stop-services
vespa-start-services

Split Brain / Cluster State Divergence

Symptoms:
  • Nodes report different cluster states
  • Inconsistent query results
  • Feed operations fail intermittently
Solution:
  1. Check cluster controller:
vespa-get-cluster-state
  2. Force cluster state update:
vespa-set-node-state -c <cluster> -t storage -i <node> up
  3. If issues persist, restart cluster controller

Performance Regression

1. Compare metrics before/after:
# Export current metrics
curl http://localhost:19092/metrics/v2/values > metrics-current.json

# Compare with baseline
# Look for changes in:
# - query_latency
# - throughput (QPS)
# - resource utilization
# - error rates
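Comparing snapshots by hand is error-prone; a small diff script surfaces the regressions directly. This sketch assumes the two snapshots have been flattened into `{metric_name: value}` dicts first (the extraction step from the metrics JSON is omitted):

```python
def metric_deltas(baseline: dict, current: dict, threshold: float = 0.10):
    """Metrics whose value changed by more than `threshold` (as a fraction)
    between two snapshots, mapped to their relative change."""
    deltas = {}
    for name, base in baseline.items():
        cur = current.get(name)
        if cur is None or base == 0:  # skip missing metrics and zero baselines
            continue
        change = (cur - base) / base
        if abs(change) > threshold:
            deltas[name] = change
    return deltas
```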
2. Review recent changes:
  • Configuration changes
  • Schema modifications
  • Application updates
  • Infrastructure changes
3. Check resource saturation:
// CPU
CONTENT_PROTON_RESOURCE_USAGE_CPU_UTIL_READ
CONTENT_PROTON_RESOURCE_USAGE_CPU_UTIL_WRITE

// Memory
MEM_HEAP_USED
CONTENT_PROTON_RESOURCE_USAGE_MEMORY

// Thread pools
JDISC_THREAD_POOL_ACTIVE_THREADS
CONTENT_PROTON_EXECUTOR_MATCH_UTILIZATION
4. Profile queries:
# Enable query tracing
/search/?query=test&trace.level=5

# Analyze trace output for bottlenecks

Debugging Tools

Log Analysis

# Filter by log level
vespa-logfmt -l error,warning -f /opt/vespa/logs/vespa/vespa.log

# Filter by component
vespa-logfmt -c searchnode -f /opt/vespa/logs/vespa/vespa.log

# Follow logs in real-time
vespa-logfmt -f /opt/vespa/logs/vespa/vespa.log

Metric Queries

# Get specific metric
curl http://localhost:19050/state/v1/metrics | \
  jq '.metrics.values[] | select(.name == "query_latency")'

# Get all metrics for a component
curl http://localhost:19050/state/v1/metrics | \
  jq '.metrics.values[] | select(.name | startswith("content.proton"))'

Query Tracing

# Basic trace
curl 'http://localhost:8080/search/?query=test&trace.level=2'

# Detailed trace
curl 'http://localhost:8080/search/?query=test&trace.level=5'

# Trace specific components
curl 'http://localhost:8080/search/?query=test&trace.rules=rank:5'

Getting Help

Vespa Slack

Join the community for real-time help

GitHub Issues

Report bugs and request features

Stack Overflow

Search existing questions or ask new ones

Documentation

Browse official documentation

Next Steps

Monitoring

Set up proactive monitoring

Tuning

Optimize performance

Scaling

Scale your cluster
