Vespa is designed to scale from single-node development setups to large production clusters serving billions of documents and thousands of queries per second.

Scaling Strategies

Horizontal Scaling

Add more nodes to distribute load and data

Vertical Scaling

Increase resources (CPU, memory, disk) on existing nodes

Content Cluster Scaling

Content clusters store and index documents. Scaling them affects both capacity and query throughput.

Adding Content Nodes

1. Update services.xml

Add new nodes to your content cluster configuration:
<content version="1.0" id="my-content">
  <redundancy>2</redundancy>
  <documents>
    <document type="mydoc" mode="index"/>
  </documents>
  <nodes>
    <node hostalias="node1" distribution-key="0"/>
    <node hostalias="node2" distribution-key="1"/>
    <node hostalias="node3" distribution-key="2"/>  <!-- New node -->
    <node hostalias="node4" distribution-key="3"/>  <!-- New node -->
  </nodes>
</content>
2. Deploy Configuration

Deploy the updated configuration:
vespa deploy
Vespa will automatically redistribute data to the new nodes.
3. Monitor Redistribution

Track the redistribution progress:
# Check bucket distribution
vespa-stat --document mydoc

# Monitor merge operations
curl http://localhost:19050/state/v1/metrics | \
  jq '.metrics.values[] | select(.name | contains("merge"))'

Data Distribution

Vespa uses bucket distribution to spread data across nodes. The system automatically:
  • Splits buckets when they grow too large
  • Joins buckets when they’re too small
  • Migrates buckets during topology changes
  • Maintains configured redundancy levels
Data redistribution happens automatically but can take time for large datasets. Plan scaling operations during low-traffic periods when possible.
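To build intuition for why a topology change migrates only part of the data, here is a toy model of hash-based bucket placement. This is an illustration of the concept only, not Vespa's actual ideal-state distribution algorithm; all names and constants below are hypothetical.

```python
import hashlib

NUM_BUCKETS = 256   # toy value; real clusters have far more buckets
REDUNDANCY = 2

def bucket_of(doc_id: str) -> int:
    """Hash a document id into one of NUM_BUCKETS buckets."""
    digest = hashlib.md5(doc_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_BUCKETS

def nodes_for_bucket(bucket: int, nodes: list) -> list:
    """Pick REDUNDANCY nodes per bucket via rendezvous (highest-random-weight) hashing."""
    scored = sorted(
        nodes,
        key=lambda n: hashlib.md5(f"{bucket}:{n}".encode()).hexdigest(),
        reverse=True,
    )
    return scored[:REDUNDANCY]

# Adding a node only moves the buckets that now rank the new node in their
# top-REDUNDANCY set; most assignments stay put, bounding redistribution.
before = {b: nodes_for_bucket(b, [0, 1, 2]) for b in range(NUM_BUCKETS)}
after = {b: nodes_for_bucket(b, [0, 1, 2, 3]) for b in range(NUM_BUCKETS)}
moved = sum(1 for b in before if set(before[b]) != set(after[b]))
print(f"buckets with changed placement after adding a node: {moved}/{NUM_BUCKETS}")
```

With this scheme, roughly half the buckets gain the new node as one of their two copies while the rest are untouched, which is why redistribution is bounded rather than a full reshuffle.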

Bucket Metrics During Scaling

// Monitor these metrics during scaling (from StorageMetrics.java)
VDS_DATASTORED_ALLDISKS_BUCKETS        // Total buckets
VDS_DATASTORED_BUCKET_SPACE_BUCKETS_TOTAL  // Buckets per space
VDS_MERGETHROTTLER_QUEUESIZE           // Pending merges
VDS_MERGETHROTTLER_ACTIVE_WINDOW_SIZE  // Active merges

// Bucket operations
VDS_FILESTOR_ALLTHREADS_SPLITBUCKETS_COUNT // Bucket splits
VDS_FILESTOR_ALLTHREADS_JOINBUCKETS_COUNT  // Bucket joins
VDS_FILESTOR_ALLTHREADS_MERGEBUCKETS_COUNT // Bucket merges

Container Cluster Scaling

Container nodes handle queries and feeding. They’re stateless and can be scaled easily.

Horizontal Scaling

<container version="1.0" id="default">
  <search/>
  <document-api/>
  
  <nodes>
    <node hostalias="container1"/>
    <node hostalias="container2"/>
    <node hostalias="container3"/>  <!-- Scale out -->
    <node hostalias="container4"/>  <!-- Scale out -->
  </nodes>
</container>

Load Balancing

Distribute traffic across container nodes:
nginx.conf
upstream vespa_containers {
    # Round-robin by default
    server container1:8080 weight=1;
    server container2:8080 weight=1;
    server container3:8080 weight=1;
    
    # Keep upstream connections alive (connection pooling, not a health check)
    keepalive 32;
}

server {
    listen 80;
    
    location /search/ {
        proxy_pass http://vespa_containers;
        # Required for upstream keepalive to take effect
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
    
    location /document/ {
        proxy_pass http://vespa_containers;
    }
}

Container Metrics for Scaling Decisions

// From ContainerMetrics.java
PEAK_QPS                // Peak queries per second
QUERIES                 // Total queries
QUERY_LATENCY           // Query latency
FAILED_QUERIES          // Failed query count

// Thread pool saturation
JDISC_THREAD_POOL_ACTIVE_THREADS    // Active threads
JDISC_THREAD_POOL_SIZE              // Pool size
JDISC_THREAD_POOL_WORK_QUEUE_SIZE   // Queue depth
JDISC_THREAD_POOL_REJECTED_TASKS    // Rejected tasks
Scale out when:
  • Thread pool utilization > 80% consistently
  • Query latency increases under load
  • Rejected tasks > 0

// Document feeding (from ContainerMetrics.java)
HTTPAPI_NUM_OPERATIONS    // Document operations/sec
HTTPAPI_PENDING           // Pending operations
HTTPAPI_QUEUED_OPERATIONS // Queued operations
HTTPAPI_FAILED_TIMEOUT    // Timeout failures
Scale out when:
  • Queued operations consistently > 1000
  • Operation timeouts increasing
  • Feed latency > 100ms for simple puts
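The two "scale out when" lists above can be codified into a simple decision helper. This is a sketch using the rule-of-thumb thresholds from the lists, not a Vespa API:

```python
def should_scale_out(active_threads: int, pool_size: int,
                     rejected_tasks: int, queued_operations: int) -> bool:
    """Apply the rule-of-thumb thresholds from the lists above."""
    utilization = active_threads / pool_size if pool_size else 0.0
    return utilization > 0.80 or rejected_tasks > 0 or queued_operations > 1000

print(should_scale_out(90, 100, 0, 0))     # sustained ~90% thread pool utilization
print(should_scale_out(20, 100, 0, 2500))  # deep feed queue
print(should_scale_out(20, 100, 0, 0))     # healthy
```

In practice you would evaluate this over a sustained window (e.g. several consecutive metric snapshots) rather than on a single sample, to avoid flapping.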

Vertical Scaling

Increase resources on existing nodes when:
  • Individual nodes are bottlenecked
  • Adding nodes isn’t cost-effective
  • You need more memory for in-memory data structures

CPU Scaling

1. Identify CPU Bottlenecks

// Monitor CPU utilization (from SearchNodeMetrics.java)
CONTENT_PROTON_RESOURCE_USAGE_CPU_UTIL_READ     // Read operations
CONTENT_PROTON_RESOURCE_USAGE_CPU_UTIL_WRITE    // Write operations
CONTENT_PROTON_RESOURCE_USAGE_CPU_UTIL_COMPACT  // Compaction
2. Adjust Thread Pools

Configure executor threads in services.xml:
<content version="1.0" id="my-content">
  <tuning>
    <searchnode>
      <summary>
        <io>
          <threads>8</threads>  <!-- Match CPU cores -->
        </io>
      </summary>
    </searchnode>
  </tuning>
</content>
3. Scale Instance Size

Increase vCPUs on your nodes and redeploy.

Memory Scaling

Content nodes use memory for:
  • Document attribute vectors
  • Memory index
  • Document store cache
  • Query execution
<content version="1.0" id="my-content">
  <tuning>
    <searchnode>
      <!-- Configure memory limits -->
      <resource-limits>
        <memory>0.85</memory>  <!-- 85% memory limit -->
      </resource-limits>
    </searchnode>
  </tuning>
</content>
Memory metrics to monitor:
CONTENT_PROTON_RESOURCE_USAGE_MEMORY               // Overall memory usage
CONTENT_PROTON_DOCUMENTDB_MEMORY_USAGE_USED_BYTES  // Used memory
CONTENT_PROTON_DOCUMENTDB_ATTRIBUTE_MEMORY_USAGE_USED_BYTES
CONTENT_PROTON_RESOURCE_USAGE_FEEDING_BLOCKED      // Feeding blocked by memory
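When memory utilization crosses the configured resource limit, Vespa blocks feeding, which the FEEDING_BLOCKED metric above surfaces. A minimal sketch of that rule for alerting purposes, assuming the 0.85 limit from the configuration above (this mirrors the behavior for monitoring; it is not Vespa's code):

```python
def feeding_at_risk(memory_util: float, memory_limit: float = 0.85,
                    margin: float = 0.05) -> bool:
    """Alert before feeding gets blocked: fire once utilization is
    within `margin` of the configured resource limit."""
    return memory_util >= memory_limit - margin

print(feeding_at_risk(0.82))  # within 5% of the 0.85 limit
print(feeding_at_risk(0.60))
```

Alerting ahead of the limit gives you time to add memory or nodes before writes start failing.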
Container nodes use memory for:
  • JVM heap
  • Query result caching
  • Component instances
<container version="1.0" id="default">
  <nodes>
    <jvm options="-Xms4g -Xmx4g"/>
  </nodes>
</container>
Memory metrics to monitor:
MEM_HEAP_USED          // Heap memory used
MEM_HEAP_TOTAL         // Total heap
JDISC_GC_MS            // GC pause time
JDISC_GC_COUNT         // GC frequency
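A useful signal derived from these metrics is GC overhead: the fraction of wall-clock time spent in GC pauses. Sustained overhead above a few percent usually means the heap is undersized. A minimal sketch:

```python
def gc_overhead(gc_ms: float, interval_ms: float) -> float:
    """Fraction of the reporting interval spent in GC pauses."""
    return gc_ms / interval_ms

# e.g. 1.2 s of accumulated GC pauses over a 60 s metrics window
overhead = gc_overhead(1200, 60_000)
print(f"GC overhead: {overhead:.1%}")
```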

Disk Scaling

1. Monitor Disk Usage

CONTENT_PROTON_DOCUMENTDB_DISK_USAGE         // Disk usage per DB
CONTENT_PROTON_RESOURCE_USAGE_DISK           // Overall disk usage
CONTENT_PROTON_TRANSACTIONLOG_DISK_USAGE     // Transaction log
2. Optimize Storage

Enable compression and tune document store:
<content version="1.0" id="my-content">
  <tuning>
    <searchnode>
      <summary>
        <store>
          <compression>
            <type>lz4</type>
            <level>6</level>
          </compression>
        </store>
      </summary>
    </searchnode>
  </tuning>
</content>
3. Scale Disk Capacity

  • Increase disk size on existing nodes, or
  • Add more content nodes to distribute data

Redundancy and Availability

Configure redundancy to balance availability and resource usage:
<content version="1.0" id="my-content">
  <!-- Number of copies of each document -->
  <redundancy>2</redundancy>
  
  <!-- Minimum copies before marking bucket out of sync -->
  <min-redundancy>1</min-redundancy>
  
  <nodes>
    <node hostalias="node1" distribution-key="0"/>
    <node hostalias="node2" distribution-key="1"/>
    <node hostalias="node3" distribution-key="2"/>
  </nodes>
</content>

Redundancy Guidelines

Cluster Size | Recommended Redundancy | Notes
1-2 nodes    | 1                      | Development only
3-5 nodes    | 2                      | Standard production
6-10 nodes   | 2-3                    | High availability
10+ nodes    | 3                      | Large-scale production
Redundancy > 3 rarely provides additional benefits and increases resource usage linearly.
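Since resource usage grows linearly with redundancy, capacity planning is simple arithmetic. A back-of-the-envelope helper (the 25% headroom figure is an assumption for illustration, not a Vespa recommendation):

```python
def raw_storage_needed(logical_data_gb: float, redundancy: int,
                       headroom: float = 0.25) -> float:
    """Raw cluster disk needed: each document is stored `redundancy` times,
    plus headroom for transaction logs, index overhead, and growth."""
    return logical_data_gb * redundancy * (1 + headroom)

# 1 TB of documents at redundancy 2 with 25% headroom
print(raw_storage_needed(1000, 2))  # -> 2500.0 GB of raw disk
```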

Scaling Patterns

Progressive Scaling

1. Start Small

Begin with minimal resources:
  • 3 content nodes (redundancy=2)
  • 2 container nodes
2. Monitor and Measure

Collect baseline metrics under realistic load
3. Scale Incrementally

Add resources based on bottlenecks:
  • High CPU → Add container nodes
  • High memory → Scale content node memory
  • Large dataset → Add content nodes
4. Validate

Test performance after each scaling operation

Geographic Distribution

Scale across regions for global applications:
<!-- Parent services.xml -->
<services>
  <container version="1.0" id="query">
    <nodes>
      <node hostalias="us-east-1"/>
      <node hostalias="us-west-1"/>
      <node hostalias="eu-west-1"/>
    </nodes>
  </container>
  
  <content version="1.0" id="global">
    <redundancy>3</redundancy>
    <nodes>
      <!-- Distribute across regions -->
      <node hostalias="us-east-1" distribution-key="0"/>
      <node hostalias="us-west-1" distribution-key="1"/>
      <node hostalias="eu-west-1" distribution-key="2"/>
    </nodes>
  </content>
</services>

Auto-Scaling Considerations

Vespa does not auto-scale out of the box, but you can build auto-scaling on top of its metrics:

Metrics-Based Auto-Scaling

# Example: poll metrics and trigger scaling
# (sketch; adapt endpoints, metric names, and thresholds to your deployment)
import requests

def get_metric(metrics, name, default=0.0):
    # Walk the /metrics/v2/values structure: nodes -> services -> metrics -> values,
    # returning the first value whose metric name contains `name`
    for node in metrics.get('nodes', []):
        for service in node.get('services', []):
            for entry in service.get('metrics', []):
                for key, value in entry.get('values', {}).items():
                    if name in key:
                        return value
    return default

def check_scaling_needed():
    metrics = requests.get('http://localhost:19092/metrics/v2/values').json()

    # Extract key metrics
    cpu_util = get_metric(metrics, 'cpu')
    query_latency = get_metric(metrics, 'query_latency.average')
    rejected_tasks = get_metric(metrics, 'jdisc.thread_pool.rejected_tasks')

    # Scaling logic
    if cpu_util > 80 or query_latency > 1000 or rejected_tasks > 10:
        scale_out_containers()
    elif cpu_util < 30 and query_latency < 100:
        scale_in_containers()

def scale_out_containers():
    # Update services.xml and deploy; implementation depends on infrastructure
    print("Scaling out container nodes...")

def scale_in_containers():
    print("Scaling in container nodes...")

Best Practices

  • Measure before scaling: Establish baseline metrics
  • Test scaling operations: Validate in staging first
  • Plan for growth: Scale before hitting limits (70-80% capacity)
  • Consider redundancy: Factor in failure scenarios
  • Container scaling: Immediate (stateless)
  • Content node addition: Minutes to hours (data redistribution)
  • Vertical scaling: Requires restart (plan downtime or rolling upgrade)
  • Start with vertical scaling if under-provisioned
  • Use horizontal scaling for better availability
  • Monitor resource utilization to avoid over-provisioning
  • Consider tiered storage for large datasets

Next Steps

Monitoring

Set up metrics to inform scaling decisions

Tuning

Optimize performance before scaling

Troubleshooting

Debug scaling issues
