Overview

Load testing helps you understand your system’s performance characteristics under various loads. This guide covers three popular tools for load testing ML APIs.
Important: Always test in a staging environment first. Never run load tests against production without proper planning.

Load Testing Tools

Locust

Python-based load testing tool with a web UI for monitoring.
Pros:
  • Easy to write tests in Python
  • Real-time web dashboard
  • Distributed load generation
  • Great for complex user behaviors
Cons:
  • Higher resource overhead
  • Slower than Go-based tools

k6

Load testing tool from Grafana with tests scripted in JavaScript.
Pros:
  • High performance (written in Go)
  • Beautiful web dashboard
  • CLI and CI/CD friendly
  • Extensive metrics and reporting
Cons:
  • Learning curve for JavaScript
  • Less flexible than Python

Vegeta

Minimalist HTTP load testing tool written in Go.
Pros:
  • Extremely fast and lightweight
  • Simple command-line interface
  • Easy to run in Kubernetes
  • Low resource consumption
Cons:
  • Limited to simple HTTP scenarios
  • No web UI
  • Less flexible for complex tests

Locust Example

Installation

pip install locust numpy

Test Script

Create locustfile.py:
import numpy as np
from locust import HttpUser, between, task

movie_reviews = [
    "A rollercoaster of emotions with stunning visuals and remarkable performances. A must-see!",
    "Despite its high production values, the plot is predictable and lacks originality.",
    "An epic space opera that pulls you in with its intricate plot and complex characters.",
    "Too reliant on CGI, and the storyline feels disjointed and hard to follow.",
    "An extraordinary cinematic experience that beautifully captures the human spirit.",
    "The pacing is too slow, and it tends to feel more like a documentary than a feature film.",
    "A superb adaptation with a gripping plot and fascinating characters. Truly unforgettable.",
    "Though the scenery is beautiful, the characters feel flat and the storyline lacks depth.",
    "A touching story of love and loss, paired with phenomenal acting. It will leave you teary-eyed.",
    "The script is clichéd, and the chemistry between the lead actors feels forced.",
    "A thrilling and suspenseful journey that keeps you on the edge of your seat till the end.",
    "The plot twists feel contrived, and the horror elements seem more comical than scary.",
    "A poignant exploration of life and love, combined with a mesmerizing soundtrack.",
    "The narrative is overly sentimental and fails to deliver a strong message.",
    "An underwater adventure that's both visually stunning and emotionally resonant.",
    "The visual effects overshadow the story, which is lacking in depth and originality.",
    "An action-packed thrill ride with memorable characters and an engaging plot.",
    "The action scenes are overdone and the storyline is paper thin.",
    "A captivating sci-fi thriller that challenges your perception of reality.",
    "The plot is confusing and the ending leaves too many questions unanswered.",
]


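class PredictUser(HttpUser):
    # Each simulated user pauses 1 to 5 seconds between requests.
    wait_time = between(1, 5)

    @task
    def predict(self):
        # Send a batch of 1 to 99 reviews (randint's upper bound is exclusive),
        # sampled with replacement from the list above.
        num_of_review = np.random.randint(1, 100)
        reviews = np.random.choice(movie_reviews, size=num_of_review, replace=True)
        self.client.post("/predict", json={"text": reviews.tolist()})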

Run Test

locust -f locustfile.py \
  --host=http://0.0.0.0:8080 \
  --users 50 \
  --spawn-rate 10 \
  --autostart \
  --run-time 600s

Parameters

  • --host: Target API endpoint
  • --users: Number of concurrent users
  • --spawn-rate: Users spawned per second
  • --run-time: Test duration
  • --autostart: Start the test immediately while keeping the web UI available (use --headless instead to run without the web UI)
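
As a rough sanity check on these parameters, the expected ramp-up time and steady-state request rate can be estimated before running anything. The sketch below assumes each user alternates between one request and one wait drawn uniformly from the between(1, 5) range; the 0.2 second average latency is a made-up figure for illustration, and the function names are hypothetical.

```python
def ramp_up_seconds(users: int, spawn_rate: float) -> float:
    """Time needed to reach the full user count at the given spawn rate."""
    return users / spawn_rate


def estimate_rps(users: int, min_wait: float, max_wait: float,
                 avg_latency: float) -> float:
    """Approximate steady-state requests per second for a closed-loop test.

    Each user completes one request per (average wait + average latency)
    seconds, so the total rate is users divided by that cycle time.
    """
    avg_wait = (min_wait + max_wait) / 2  # mean of a uniform wait
    return users / (avg_wait + avg_latency)


if __name__ == "__main__":
    # Values from the command above: 50 users, spawn rate 10, wait 1-5 s.
    print("ramp-up seconds:", ramp_up_seconds(50, 10))
    # Assumed average latency of 0.2 s (illustrative only).
    print("approx. RPS:", round(estimate_rps(50, 1, 5, 0.2), 1))
```

If the measured request rate is far below this estimate, the service latency is probably much higher than assumed, or the load generator itself is struggling.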

Web UI

For interactive testing, run without --autostart:
locust -f locustfile.py --host=http://0.0.0.0:8080
Open http://localhost:8089 to access the dashboard.

k6 Example

Installation

# macOS
brew install k6

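# Debian/Ubuntu (apt-key is deprecated, so the key goes into a dedicated keyring)
sudo gpg -k
sudo gpg --no-default-keyring --keyring /usr/share/keyrings/k6-archive-keyring.gpg --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys C5AD17C747E3415A3642D57D77C6C491D6AC1D69
echo "deb [signed-by=/usr/share/keyrings/k6-archive-keyring.gpg] https://dl.k6.io/deb stable main" | sudo tee /etc/apt/sources.list.d/k6.list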
sudo apt-get update
sudo apt-get install k6

Test Script

Create load_test.js:
import http from 'k6/http';
import { sleep } from 'k6';

const movie_reviews = [
    "A rollercoaster of emotions with stunning visuals and remarkable performances. A must-see!",
    "Despite its high production values, the plot is predictable and lacks originality.",
    "An epic space opera that pulls you in with its intricate plot and complex characters.",
    "Too reliant on CGI, and the storyline feels disjointed and hard to follow.",
    "An extraordinary cinematic experience that beautifully captures the human spirit.",
    "The pacing is too slow, and it tends to feel more like a documentary than a feature film.",
    "A superb adaptation with a gripping plot and fascinating characters. Truly unforgettable.",
    "Though the scenery is beautiful, the characters feel flat and the storyline lacks depth.",
    "A touching story of love and loss, paired with phenomenal acting. It will leave you teary-eyed.",
    "The script is clichéd, and the chemistry between the lead actors feels forced.",
    "A thrilling and suspenseful journey that keeps you on the edge of your seat till the end.",
    "The plot twists feel contrived, and the horror elements seem more comical than scary.",
    "A poignant exploration of life and love, combined with a mesmerizing soundtrack.",
    "The narrative is overly sentimental and fails to deliver a strong message.",
    "An underwater adventure that's both visually stunning and emotionally resonant.",
    "The visual effects overshadow the story, which is lacking in depth and originality.",
    "An action-packed thrill ride with memorable characters and an engaging plot.",
    "The action scenes are overdone and the storyline is paper thin.",
    "A captivating sci-fi thriller that challenges your perception of reality.",
    "The plot is confusing and the ending leaves too many questions unanswered.",
];

export const options = {
    vus: 10,
    duration: '10m',
};

export default function () {
    sleep(Math.random() * 4 + 1);
    const num_of_review = Math.floor(Math.random() * 100) + 1;
    const reviews = [];
    for (let i = 0; i < num_of_review; i++) {
        const random_index = Math.floor(Math.random() * movie_reviews.length);
        reviews.push(movie_reviews[random_index]);
    }
    const payload = JSON.stringify({ text: reviews });
    const params = {
        headers: {
            'Content-Type': 'application/json',
        },
    };
    http.post('http://0.0.0.0:8080/predict', payload, params);
}

Run Test

K6_WEB_DASHBOARD=true k6 run ./load_test.js
The dashboard is served at http://localhost:5665 while the test runs.

Advanced Configuration

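To ramp load in stages and fail the run when thresholds are missed, replace the options object in load_test.js with:
export const options = {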
    stages: [
        { duration: '2m', target: 10 },  // Ramp up to 10 users
        { duration: '5m', target: 10 },  // Stay at 10 users
        { duration: '2m', target: 50 },  // Ramp up to 50 users
        { duration: '5m', target: 50 },  // Stay at 50 users
        { duration: '2m', target: 0 },   // Ramp down to 0 users
    ],
    thresholds: {
        http_req_duration: ['p(95)<500'], // 95% of requests must complete below 500ms
        http_req_failed: ['rate<0.1'],    // Error rate must be below 10%
    },
};
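
To see what load profile the stages above produce, the following sketch computes the total run time and the target number of virtual users at a given moment. It assumes the ramp between consecutive targets is linear and starts from zero users; STAGES, total_duration, and target_vus are illustrative names, and real k6 runs use whole numbers of virtual users, so the fractional values here are only approximate.

```python
# (duration in seconds, target VUs), transcribed from the stages above.
STAGES = [(120, 10), (300, 10), (120, 50), (300, 50), (120, 0)]


def total_duration(stages):
    """Total length of the staged run in seconds."""
    return sum(duration for duration, _ in stages)


def target_vus(t, stages, start_vus=0):
    """Target VU count at t seconds, assuming linear ramps between targets."""
    previous_target = start_vus
    elapsed = 0
    for duration, target in stages:
        if t <= elapsed + duration:
            fraction = (t - elapsed) / duration
            return previous_target + (target - previous_target) * fraction
        elapsed += duration
        previous_target = target
    return previous_target  # after the last stage


if __name__ == "__main__":
    print("total minutes:", total_duration(STAGES) / 60)
    for second in (60, 300, 480, 700, 900):
        print(second, "s ->", target_vus(second, STAGES), "VUs")
```

The total comes to 16 minutes, which is worth knowing before wiring the test into a CI job with a timeout.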

Vegeta Example

Installation

# macOS
brew install vegeta

# Or use Docker
docker pull peterevans/vegeta

Command Line Usage

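# payload.json must contain the JSON request body, e.g. {"text": ["..."]}
echo "POST http://localhost:8080/predict" | \
  vegeta attack -rate=100/s -duration=60s \
    -header "Content-Type: application/json" -body=payload.json | \
  vegeta report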
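
The command above reads the request body from payload.json, which has to exist before the attack starts. One possible way to generate it is sketched below: the script prints a JSON body in the same {"text": [...]} shape used by the Locust and k6 scripts. The script name make_payload.py, the function build_payload, the shortened review list, and the fixed batch size of 10 are all assumptions made for illustration.

```python
import json
import random

# A few of the sample reviews used elsewhere in this guide.
REVIEWS = [
    "Despite its high production values, the plot is predictable and lacks originality.",
    "An epic space opera that pulls you in with its intricate plot and complex characters.",
    "Too reliant on CGI, and the storyline feels disjointed and hard to follow.",
    "The action scenes are overdone and the storyline is paper thin.",
]


def build_payload(num_reviews, rng):
    """Return a request body with num_reviews reviews sampled with replacement."""
    return {"text": rng.choices(REVIEWS, k=num_reviews)}


if __name__ == "__main__":
    # A fixed seed keeps the payload identical between runs, which makes
    # results from repeated tests easier to compare.
    payload = build_payload(num_reviews=10, rng=random.Random(0))
    print(json.dumps(payload, indent=2))
```

Running python make_payload.py > payload.json produces the file. Unlike the Locust and k6 scripts, every request in the attack then carries the same body, so caching on the server side can make results look better than they would with varied input.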

Kubernetes Deployment

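Save the following manifest as vegeta-job.yaml. It defines a ConfigMap holding the Vegeta targets file and the request payload, plus a Job that runs the attack: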
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: vegeta-cfg
data:
  cfg: |
    POST http://app-fastapi.default.svc.cluster.local:8080/predict
    Content-Type: application/json
    @/var/vegeta/payload
  payload: |
    {
      "text": [
        "A rollercoaster of emotions with stunning visuals and remarkable performances. A must-see!",
        "Despite its high production values, the plot is predictable and lacks originality.",
        "An epic space opera that pulls you in with its intricate plot and complex characters.",
        "Too reliant on CGI, and the storyline feels disjointed and hard to follow.",
        "An extraordinary cinematic experience that beautifully captures the human spirit.",
        "The pacing is too slow, and it tends to feel more like a documentary than a feature film.",
        "A superb adaptation with a gripping plot and fascinating characters. Truly unforgettable.",
        "Though the scenery is beautiful, the characters feel flat and the storyline lacks depth.",
        "A touching story of love and loss, paired with phenomenal acting. It will leave you teary-eyed.",
        "The script is clichéd, and the chemistry between the lead actors feels forced."
      ]
    }
---
apiVersion: batch/v1
kind: Job
metadata:
  generateName: load-test-
spec:
  backoffLimit: 6
  parallelism: 1
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      restartPolicy: OnFailure
      containers:
      - name: vegeta
        image: peterevans/vegeta:latest
        imagePullPolicy: Always
        command:
        - sh
        - -c
        args:
        - vegeta -cpus=2 attack -duration=1m -rate=100/1s -targets=/var/vegeta/cfg | vegeta report -type=text
        volumeMounts:
        - name: vegeta-cfg
          mountPath: /var/vegeta
      volumes:
      - name: vegeta-cfg
        configMap:
          name: vegeta-cfg
          defaultMode: 420
Deploy:
kubectl create -f vegeta-job.yaml
View results (the Job name is generated, so look it up with kubectl get jobs):
kubectl logs -f job/load-test-<suffix>

Performance Metrics

Key Metrics to Track

Metric              | Description         | Good Target
--------------------|---------------------|----------------
Response Time (p50) | Median latency      | < 100ms
Response Time (p95) | 95th percentile     | < 500ms
Response Time (p99) | 99th percentile     | < 1000ms
Throughput          | Requests per second | Depends on model
Error Rate          | Failed requests     | < 1%
CPU Utilization     | CPU usage           | 60-80%
Memory Usage        | RAM consumption     | < 80%
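
If you collect raw latencies and status codes yourself, the latency and error-rate rows of the table can be checked with a few lines of code. The sketch below uses the nearest-rank percentile definition, which is one of several common definitions and may differ slightly from what Locust, k6, or Vegeta report. Treating any status of 400 or above as a failure is also an assumption, and the function names are illustrative.

```python
import math

# Latency targets in milliseconds, taken from the table above.
TARGETS_MS = {50: 100, 95: 500, 99: 1000}


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_latency(samples_ms):
    """Map each percentile to (measured value, whether it meets its target)."""
    results = {}
    for pct, limit in TARGETS_MS.items():
        value = percentile(samples_ms, pct)
        results[pct] = (value, value < limit)
    return results


def error_rate(status_codes):
    """Fraction of responses with status >= 400 (assumed failure rule)."""
    failed = sum(1 for code in status_codes if code >= 400)
    return failed / len(status_codes)


if __name__ == "__main__":
    latencies = [80] * 90 + [700] * 10  # synthetic data: 10% slow requests
    print(check_latency(latencies))
    print(error_rate([200] * 98 + [500, 503]))
```

The synthetic data shows why the median alone is misleading: p50 looks healthy while p95 misses its target.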

Interpreting Results

Symptoms: p95 or p99 latency is high
Possible Causes:
  • Model complexity
  • Insufficient resources
  • Cold starts
  • Network bottlenecks
Solutions:
  • Optimize model (quantization, pruning)
  • Increase resource limits
  • Implement request batching
  • Use model caching

Symptoms: RPS is lower than expected
Possible Causes:
  • Sequential request processing
  • GPU underutilization
  • I/O bottlenecks
Solutions:
  • Enable async inference
  • Implement dynamic batching
  • Scale horizontally
  • Use faster storage

Symptoms: Many 5xx errors
Possible Causes:
  • OOM errors
  • Timeout issues
  • Resource exhaustion
Solutions:
  • Increase memory limits
  • Adjust timeouts
  • Implement autoscaling
  • Add circuit breakers
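
The three symptom groups above can be turned into a simple automated check on a test summary. The sketch below is one possible encoding: the latency limits reuse the p95 and p99 targets from the metrics table, while the 1% cutoff for "many 5xx errors" and the RunSummary fields are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    p95_ms: float
    p99_ms: float
    rps: float                # measured requests per second
    expected_rps: float       # what you planned for
    server_error_rate: float  # fraction of responses with a 5xx status


def diagnose(summary, p95_limit_ms=500, p99_limit_ms=1000, max_error_rate=0.01):
    """Return the symptom groups that apply to this run."""
    findings = []
    if summary.p95_ms >= p95_limit_ms or summary.p99_ms >= p99_limit_ms:
        findings.append("high latency")
    if summary.rps < summary.expected_rps:
        findings.append("low throughput")
    if summary.server_error_rate >= max_error_rate:
        findings.append("many 5xx errors")
    return findings


if __name__ == "__main__":
    run = RunSummary(p95_ms=620, p99_ms=900, rps=40, expected_rps=60,
                     server_error_rate=0.002)
    print(diagnose(run))
```

A check like this only points to which list of causes to read first; working out the actual cause still requires looking at resource metrics and logs.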

Best Practices

Start Small

Begin with a low load and increase it gradually to find the breaking point.

Test Realistic Scenarios

Use production-like data and traffic patterns.

Monitor Resources

Track CPU, memory, and GPU utilization during tests.

Repeat Tests

Run multiple tests to account for variability.

Resource Limits: Ensure your load testing infrastructure doesn’t become the bottleneck. For high-load tests, use distributed testing.
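
The "start small" advice can be automated as a stepwise search: run a short test at each load level and stop at the first level that misses the latency target. In the sketch below, measure_p95_ms is a hypothetical callable that you would implement yourself, for example by launching a short run with one of the tools above and reading back the p95 latency. fake_system is a stand-in so the example runs on its own and says nothing about how a real service behaves.

```python
def find_breaking_point(measure_p95_ms, start_users=5, step=5,
                        max_users=200, p95_limit_ms=500):
    """Raise the user count step by step.

    Returns (last passing level, first failing level). Either element is
    None if no level passed or no level failed within max_users.
    """
    last_ok = None
    users = start_users
    while users <= max_users:
        if measure_p95_ms(users) >= p95_limit_ms:
            return last_ok, users
        last_ok = users
        users += step
    return last_ok, None


def fake_system(users):
    """Stand-in for a real test run: p95 latency grows linearly with load."""
    return 50 + 10 * users


if __name__ == "__main__":
    print(find_breaking_point(fake_system))
```

Repeating the search a few times and comparing the results follows the "repeat tests" advice, since a single run near the limit can land on either side of it.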

Additional Tools

Other load testing tools worth exploring:
  • Gatling: Scala-based load testing for high performance
  • ghz: gRPC load testing tool
  • Artillery: Modern load testing toolkit
  • JMeter: Java-based load testing (older but powerful)

Next Steps

Autoscaling

Learn how to automatically scale your ML services based on load
