Tuning Vespa involves optimizing resource allocation, thread pools, caching, and query execution to maximize throughput and minimize latency.
Query Performance: Optimize search and ranking latency
Feed Performance: Maximize document indexing throughput
Resource Efficiency: Optimize CPU, memory, and disk usage
Thread Pool Configuration
Optimize container thread pools for query handling:
<container version="1.0" id="default">
  <search/>

  <!-- Configure search handler threads -->
  <handler id="com.yahoo.search.handler.SearchHandler">
    <binding>http://*/search/*</binding>
  </handler>

  <nodes>
    <!-- JVM heap tuning -->
    <jvm options="-Xms8g -Xmx8g -XX:+UseG1GC -XX:MaxGCPauseMillis=200"/>
  </nodes>
</container>
Content Node Tuning
Configure executor threads on content nodes:
<content version="1.0" id="my-content">
  <tuning>
    <searchnode>
      <!-- Match (search) executor threads -->
      <requestthreads>
        <count>8</count>          <!-- Number of search threads -->
        <persearch>2</persearch>  <!-- Threads per search -->
      </requestthreads>

      <!-- Summary (docsum) threads -->
      <summary>
        <io>
          <threads>8</threads>  <!-- Summary fetch threads -->
        </io>
      </summary>
    </searchnode>
  </tuning>
  <nodes>
    <node hostalias="node1" distribution-key="0"/>
    <node hostalias="node2" distribution-key="1"/>
  </nodes>
</content>
// Executor metrics to watch (from SearchNodeMetrics.java)
CONTENT_PROTON_EXECUTOR_MATCH_QUEUESIZE // Match queue depth
CONTENT_PROTON_EXECUTOR_MATCH_UTILIZATION // Match thread utilization
CONTENT_PROTON_EXECUTOR_DOCSUM_QUEUESIZE // Docsum queue depth
CONTENT_PROTON_EXECUTOR_DOCSUM_UTILIZATION // Docsum utilization
// Threading service per document DB
CONTENT_PROTON_DOCUMENTDB_THREADING_SERVICE_MASTER_QUEUESIZE
CONTENT_PROTON_DOCUMENTDB_THREADING_SERVICE_INDEX_QUEUESIZE
CONTENT_PROTON_DOCUMENTDB_THREADING_SERVICE_SUMMARY_QUEUESIZE
Rule of thumb: start with one thread per CPU core, then adjust based on the utilization metrics above.
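These utilization and queue metrics can be read from a node's /state/v1/metrics endpoint. A minimal Python sketch, assuming the payload is an object with a metrics.values[] array of {name, values} entries (the exact metric names and port are assumptions about your deployment):

```python
import json
import urllib.request

def executor_utilization(metrics_json):
    """Pick executor utilization averages out of a /state/v1/metrics payload."""
    report = {}
    for metric in metrics_json.get("metrics", {}).get("values", []):
        name = metric.get("name", "")
        # Keep only executor utilization gauges,
        # e.g. content.proton.executor.match.utilization
        if "executor" in name and name.endswith("utilization"):
            report[name] = metric.get("values", {}).get("average")
    return report

def fetch_executor_utilization(host="localhost", port=19050):
    """Fetch a content node's state API and summarize executor utilization."""
    with urllib.request.urlopen(f"http://{host}:{port}/state/v1/metrics") as resp:
        return executor_utilization(json.load(resp))
```

If match utilization stays near 1.0 while queue sizes grow, add threads; if it stays low, fewer threads may reduce contention.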
Rank Profile Optimization
Use Phased Ranking
Optimize expensive ranking with two phases:

rank-profile optimized {
    first-phase {
        expression: bm25(title) + bm25(body)
    }
    second-phase {
        expression: xgboost("model.json")
        rerank-count: 100  # Only rerank top 100
    }
}
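Conceptually, phased ranking trades a cheap scoring pass over all matches for an expensive pass over only the best candidates. A plain-Python model of the idea (a conceptual sketch, not Vespa's implementation; function names are illustrative):

```python
def two_phase_rank(docs, first_phase, second_phase, rerank_count=100):
    """Model of phased ranking: score every matched document with the cheap
    first-phase function, then re-score only the top rerank_count with the
    expensive second-phase function."""
    ranked = sorted(docs, key=first_phase, reverse=True)
    head, tail = ranked[:rerank_count], ranked[rerank_count:]
    return sorted(head, key=second_phase, reverse=True) + tail
```

Lowering rerank-count cuts second-phase cost proportionally, at the risk of missing documents the expensive model would have promoted.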
Monitor Ranking Metrics
// From SearchNodeMetrics.java
CONTENT_PROTON_DOCUMENTDB_MATCHING_DOCS_MATCHED // First phase
CONTENT_PROTON_DOCUMENTDB_MATCHING_DOCS_RANKED // First phase ranked
CONTENT_PROTON_DOCUMENTDB_MATCHING_DOCS_RERANKED // Second phase
// Per rank profile
CONTENT_PROTON_DOCUMENTDB_MATCHING_RANK_PROFILE_QUERY_LATENCY
CONTENT_PROTON_DOCUMENTDB_MATCHING_RANK_PROFILE_RERANK_TIME
Optimize Match Phase
Cap how many documents are fully matched and ranked, keeping the best by a quality attribute:

rank-profile with-match-phase inherits default {
    match-phase {
        attribute: quality_score
        max-hits: 10000
        max-filter-coverage: 0.5
    }
}
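What match-phase does can be sketched in plain Python (a conceptual model only; the real implementation degrades inside the matching loop rather than sorting all candidates):

```python
def match_phase_limit(candidates, quality, max_hits=10000):
    """Model of match-phase degradation: when more than max_hits documents
    would match, only the max_hits best by the quality attribute are fully
    matched and ranked."""
    if len(candidates) <= max_hits:
        return candidates
    return sorted(candidates, key=quality, reverse=True)[:max_hits]
```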
Attribute vs Index Trade-offs
Add fast-search to attributes that are frequently searched or filtered on:

schema product {
    document product {
        field category type string {
            indexing: summary | attribute
            attribute: fast-search  # Build index structures for fast lookup
        }
        field price type int {
            indexing: summary | attribute
            attribute: fast-search
        }
    }
}
When to use fast-search:
Low cardinality fields (< 10,000 unique values)
Frequently used in filters or grouping
Need fast counting/aggregation
Memory impact:

// Monitor attribute memory
CONTENT_PROTON_DOCUMENTDB_ATTRIBUTE_MEMORY_USAGE_ALLOCATED_BYTES
CONTENT_PROTON_DOCUMENTDB_ATTRIBUTE_MEMORY_USAGE_USED_BYTES
CONTENT_PROTON_DOCUMENTDB_ATTRIBUTE_RESOURCE_USAGE_ADDRESS_SPACE
Document Processing
Optimize feeding throughput:
<content version="1.0" id="my-content">
  <tuning>
    <searchnode>
      <index>
        <!-- Flush tuning -->
        <io>
          <write>directio</write>
        </io>
        <!-- Memory index flush threshold -->
        <maxflushed>2</maxflushed>
      </index>
    </searchnode>
  </tuning>
</content>
Feed Metrics
Container Feed Metrics
// From ContainerMetrics.java
FEED_OPERATIONS // Total feed operations
FEED_LATENCY // Feed latency
FEED_HTTP_REQUESTS // Feed HTTP requests
HTTPAPI_NUM_PUTS // Put operations
HTTPAPI_NUM_UPDATES // Update operations
HTTPAPI_NUM_REMOVES // Remove operations
HTTPAPI_LATENCY // Operation latency
Batch Feeding
Optimize for high-throughput ingestion:
import requests
from concurrent.futures import ThreadPoolExecutor

def batch_feed(documents, batch_size=1000, max_workers=8):
    """Feed documents in batches, posting each batch concurrently."""
    def put(doc):
        # Each document gets its own /document/v1 path, keyed by its id
        # (assumes each doc is a dict with 'id' and 'fields' keys)
        url = f"http://localhost:8080/document/v1/namespace/doctype/docid/{doc['id']}"
        return requests.post(
            url,
            json={'fields': doc['fields']},
            params={'timeout': '180s'},  # generous timeout under feed load
            headers={'Content-Type': 'application/json'},
        )

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for i in range(0, len(documents), batch_size):
            batch = documents[i:i + batch_size]
            # Post concurrently instead of one blocking request at a time
            for response in pool.map(put, batch):
                response.raise_for_status()
Feed rate limits : Monitor HTTPAPI_FAILED_TIMEOUT and CONTENT_PROTON_RESOURCE_USAGE_FEEDING_BLOCKED to detect throttling.
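A minimal throttling probe against the same state API, assuming the /state/v1/metrics payload shape and a metric whose name contains feeding_blocked that is nonzero while feed operations are being rejected:

```python
def feeding_blocked(metrics_json):
    """True when the content node reports feed blocking (resource limit hit)."""
    for metric in metrics_json.get("metrics", {}).get("values", []):
        if "feeding_blocked" in metric.get("name", ""):
            # A nonzero last value means feeding is currently blocked
            if (metric.get("values", {}).get("last") or 0) > 0:
                return True
    return False
```

Calling this between batches and backing off while it returns True avoids hammering a node that is already rejecting writes.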
Memory Optimization
Content Node Memory
Configure Memory Limits
<content version="1.0" id="my-content">
  <tuning>
    <searchnode>
      <resource-limits>
        <memory>0.85</memory>  <!-- 85% threshold -->
      </resource-limits>
    </searchnode>
  </tuning>
</content>
Optimize Document Store Cache
<content version="1.0" id="my-content">
  <tuning>
    <searchnode>
      <summary>
        <store>
          <cache>
            <maxsize>1073741824</maxsize>  <!-- 1GB cache -->
            <maxsize-percent>5</maxsize-percent>
          </cache>
        </store>
      </summary>
    </searchnode>
  </tuning>
</content>
Monitor cache effectiveness:

CONTENT_PROTON_DOCUMENTDB_READY_DOCUMENT_STORE_CACHE_HIT_RATE
CONTENT_PROTON_DOCUMENTDB_READY_DOCUMENT_STORE_CACHE_MEMORY_USAGE
CONTENT_PROTON_DOCUMENTDB_READY_DOCUMENT_STORE_CACHE_ELEMENTS
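One way to turn these metrics into a sizing decision (the 0.80 hit-rate and 0.95 fullness thresholds are illustrative heuristics, not Vespa defaults):

```python
def cache_undersized(hit_rate, memory_usage_bytes, max_size_bytes,
                     min_hit_rate=0.80, full_fraction=0.95):
    """Heuristic: the document store cache is likely undersized when the hit
    rate is poor while the cache is already nearly full. A low hit rate with
    a half-empty cache points at access patterns, not capacity."""
    nearly_full = memory_usage_bytes >= full_fraction * max_size_bytes
    return hit_rate < min_hit_rate and nearly_full
```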
Monitor Memory Pressure
# Check memory metrics
curl http://localhost:19050/state/v1/metrics | \
jq '.metrics.values[] | select(.name | contains("memory"))'
JVM Tuning (Container Nodes)
<container version="1.0" id="default">
  <nodes>
    <jvm options="-Xms8g -Xmx8g
                  -XX:+UseG1GC
                  -XX:MaxGCPauseMillis=200
                  -XX:InitiatingHeapOccupancyPercent=70
                  -XX:+ParallelRefProcEnabled
                  -XX:MaxTenuringThreshold=8"/>
  </nodes>
</container>
GC Metrics to Monitor:
JDISC_GC_COUNT // GC frequency
JDISC_GC_MS // GC pause time
MEM_HEAP_USED // Heap utilization
Heap size: Set to 50-75% of container node RAM
GC algorithm: Use G1GC for heaps > 4GB
Pause target: 200ms is reasonable for most applications
Monitor: GC should be < 5% of total CPU time
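The 5% rule can be checked from JDISC_GC_MS. A simplification: this compares summed GC pause time against wall-clock time over the metrics window, which approximates the CPU share for stop-the-world collections:

```python
def gc_time_fraction(gc_ms_total, window_seconds):
    """Fraction of the window spent in GC pauses (gc_ms_total is the
    summed JDISC_GC_MS over that window)."""
    return gc_ms_total / (window_seconds * 1000.0)

def gc_healthy(gc_ms_total, window_seconds, budget=0.05):
    """Apply the 'less than 5% in GC' rule of thumb."""
    return gc_time_fraction(gc_ms_total, window_seconds) < budget
```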
Storage Configuration
<content version="1.0" id="my-content">
  <tuning>
    <searchnode>
      <!-- Document store tuning -->
      <summary>
        <store>
          <!-- Compression -->
          <compression>
            <type>lz4</type>
            <level>6</level>  <!-- 1-9, higher = more compression -->
          </compression>
          <!-- File size -->
          <logstore>
            <maxfilesize>4000000000</maxfilesize>  <!-- 4GB files -->
          </logstore>
        </store>
      </summary>

      <!-- Index tuning -->
      <index>
        <io>
          <write>directio</write>  <!-- Bypass OS cache -->
          <read>directio</read>
        </io>
      </index>
    </searchnode>
  </tuning>
</content>
Disk Metrics
// Monitor disk usage (from SearchNodeMetrics.java)
CONTENT_PROTON_DOCUMENTDB_DISK_USAGE // Total disk usage
CONTENT_PROTON_DOCUMENTDB_READY_DOCUMENT_STORE_DISK_USAGE
CONTENT_PROTON_DOCUMENTDB_READY_DOCUMENT_STORE_DISK_BLOAT
CONTENT_PROTON_DOCUMENTDB_INDEX_DISK_USAGE
// Transaction log
CONTENT_PROTON_TRANSACTIONLOG_DISK_USAGE
CONTENT_PROTON_TRANSACTIONLOG_ENTRIES
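A useful derived signal from the metrics above is the bloat ratio: how much of the document store's disk footprint is dead data awaiting compaction. A small helper (the function name is illustrative):

```python
def document_store_bloat_ratio(disk_usage_bytes, disk_bloat_bytes):
    """Share of document store disk that is bloat, computed from the
    DISK_USAGE and DISK_BLOAT metrics above."""
    if disk_usage_bytes <= 0:
        return 0.0
    return disk_bloat_bytes / disk_usage_bytes
```

A persistently high ratio suggests flushing and compaction are falling behind the update rate.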
Index Cache Tuning
<content version="1.0" id="my-content">
  <tuning>
    <searchnode>
      <!-- Posting list cache -->
      <diskindexcache>
        <size>2147483648</size>  <!-- 2GB -->
      </diskindexcache>
    </searchnode>
  </tuning>
</content>
Monitor cache performance:
CONTENT_PROTON_INDEX_CACHE_POSTINGLIST_HIT_RATE
CONTENT_PROTON_INDEX_CACHE_POSTINGLIST_MEMORY_USAGE
CONTENT_PROTON_INDEX_CACHE_BITVECTOR_HIT_RATE
Network Optimization
Connection Tuning
<container version="1.0" id="default">
  <http>
    <server id="default" port="8080">
      <config name="jdisc.http.connector">
        <maxConnectionLife>300.0</maxConnectionLife>
        <idleTimeout>60.0</idleTimeout>
      </config>
    </server>
  </http>
</container>
Connection Metrics
// From ContainerMetrics.java
SERVER_NUM_OPEN_CONNECTIONS // Current open connections
SERVER_NUM_CONNECTIONS // Total connections
SERVER_CONNECTIONS_OPEN_MAX // Max concurrent connections
SERVER_CONNECTION_DURATION_MEAN // Average connection duration
Query Timeout Configuration
<container version="1.0" id="default">
  <search>
    <chain id="default" inherits="vespa">
      <searcher id="com.yahoo.search.searchers.TimeoutSearcher">
        <config name="search.searchers.timeout">
          <timeout>5.0</timeout>  <!-- 5 second query timeout -->
        </config>
      </searcher>
    </chain>
  </search>
</container>
Monitor timeout-related metrics:
ERROR_TIMEOUT // Timeout errors
CONTENT_PROTON_DOCUMENTDB_MATCHING_SOFT_DOOMED_QUERIES // Soft timeouts
QUERY_TIMEOUT // Configured timeout
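Timeouts can also be set per request through the query API's timeout parameter, which overrides the configured default for that query only. A sketch that builds such a request URL (the endpoint and YQL are illustrative):

```python
from urllib.parse import urlencode

def search_url(query, timeout="2s", endpoint="http://localhost:8080/search/"):
    """Build a query API URL with an explicit per-request timeout."""
    params = {
        "yql": "select * from sources * where userQuery()",
        "query": query,
        "timeout": timeout,  # overrides the configured default for this query
    }
    return endpoint + "?" + urlencode(params)
```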
Resource Prioritization
Feeding vs Queries
Balance resource allocation:
<content version="1.0" id="my-content">
  <tuning>
    <searchnode>
      <!-- Lower feeding priority during peak query times -->
      <feeding>
        <concurrency>0.5</concurrency>  <!-- 50% of capacity -->
      </feeding>
    </searchnode>
  </tuning>
</content>
Benchmarking and Testing
Load Testing
Establish Baseline
Measure performance before tuning:

# Use vespa-fbench or a custom load generator
vespa-fbench -n 100 -q queries.txt -s 30 -c 10 localhost 8080
Apply Tuning
Make one configuration change at a time
Measure Impact
Compare metrics:
Query latency (p50, p95, p99)
Throughput (QPS)
Resource utilization
Error rates
Iterate
Continue tuning based on results
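Comparing latency percentiles before and after each change can be done with a simple nearest-rank computation over the recorded samples:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile, e.g. p=95 for p95 latency."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]
```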
Avoid these common mistakes:
Over-provisioning threads: More threads != better performance
Ignoring cache metrics: Poor cache hit rates waste resources
Synchronous feeding: Always use async operations for high throughput
No query timeouts: Can cause resource exhaustion
Tuning without measuring: Always benchmark before and after changes
Next Steps
Monitoring: Track performance metrics
Scaling: Scale resources when tuning isn’t enough
Troubleshooting: Debug performance issues