This guide covers deploying CVAT on Kubernetes for production environments that require high availability, horizontal scaling, and enterprise-grade reliability.

Prerequisites

Kubernetes Cluster

  • Kubernetes 1.23.0 or higher
  • kubectl configured and connected to your cluster
  • Cluster with at least:
    • 3 nodes (for high availability)
    • 8 CPU cores total
    • 16GB RAM total
    • 200GB storage

Required Tools

# Install Helm 3
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

# Verify installation
helm version
kubectl version --client

Storage Provider

Your Kubernetes cluster must have a default StorageClass, or you must configure one:
# Check available storage classes
kubectl get storageclass
You need:
  • ReadWriteMany (RWX): For shared backend storage
  • ReadWriteOnce (RWO): For databases (PostgreSQL, ClickHouse, Kvrocks)
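If your cluster does not already provide an RWX-capable class, one common option is an NFS-backed StorageClass. A minimal sketch using the csi-driver-nfs provisioner (the driver must be installed separately; the server address and export path below are placeholders to replace with your own):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-rwx
provisioner: nfs.csi.k8s.io     # requires csi-driver-nfs installed in the cluster
parameters:
  server: nfs.example.com       # placeholder: your NFS server
  share: /exports               # placeholder: exported path on the server
reclaimPolicy: Retain
volumeBindingMode: Immediate
```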

Ingress Controller (Optional)

For external access, install an ingress controller:
# Example: Nginx Ingress
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install nginx-ingress ingress-nginx/ingress-nginx
Or enable the embedded Traefik ingress.

Installation

1. Add Helm Repository

Add the CVAT Helm chart repository:
helm repo add cvat https://cvat-ai.github.io/cvat/
helm repo update

2. Create Namespace

kubectl create namespace cvat

3. Basic Installation

Install CVAT with default configuration:
helm install cvat cvat/cvat -n cvat
This creates:
  • CVAT backend deployment (server + workers)
  • CVAT frontend deployment
  • PostgreSQL StatefulSet
  • Redis StatefulSet
  • Kvrocks StatefulSet
  • ClickHouse StatefulSet
  • Open Policy Agent deployment
  • Vector for log collection
  • Grafana for analytics
  • Required services and PVCs

4. Wait for Pods to Start

# Watch pod status
kubectl get pods -n cvat -w

# Check all resources
kubectl get all -n cvat
Initialization takes 2-5 minutes.

5. Create Superuser

After all pods are running:
# Find the server pod
kubectl get pods -n cvat | grep cvat-backend-server

# Create superuser
kubectl exec -it -n cvat <cvat-backend-server-pod> -- \
  python manage.py createsuperuser

6. Access CVAT

Port Forward (Testing):
kubectl port-forward -n cvat service/cvat-frontend 8000:8000
Then open http://localhost:8000. For production, configure Ingress instead (see below).

Configuration

Custom Values File

Create cvat-values.yaml to customize your deployment:
# cvat-values.yaml
cvat:
  backend:
    server:
      replicas: 2
      resources:
        requests:
          cpu: 500m
          memory: 1Gi
        limits:
          cpu: 2000m
          memory: 4Gi
      envs:
        ALLOWED_HOSTS: '*'
    worker:
      export:
        replicas: 3
      import:
        replicas: 3
      chunks:
        replicas: 3
    image: cvat/server
    tag: v2.10.0  # Use specific version
    defaultStorage:
      enabled: true
      size: 50Gi
      storageClassName: fast-ssd

  frontend:
    replicas: 2
    image: cvat/ui
    tag: v2.10.0
    resources:
      requests:
        cpu: 100m
        memory: 256Mi
      limits:
        cpu: 500m
        memory: 512Mi

  kvrocks:
    enabled: true
    defaultStorage:
      enabled: true
      size: 200Gi
      storageClassName: fast-ssd

postgresql:
  enabled: true
  auth:
    username: cvat
    database: cvat
    password: changeme123  # Use strong password
  primary:
    persistence:
      size: 20Gi
      storageClass: fast-ssd

redis:
  enabled: true
  auth:
    password: redis_secure_password
  master:
    persistence:
      size: 5Gi

clickhouse:
  enabled: true
  auth:
    username: user
    password: clickhouse_password
  shards: 1
  replicaCount: 1
  persistence:
    size: 50Gi

analytics:
  enabled: true
  clickhousePassword: clickhouse_password

ingress:
  enabled: true
  hostname: cvat.example.com
  className: nginx
  tls: true
  tlsSecretName: cvat-tls
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
Install with custom values:
helm install cvat cvat/cvat -n cvat -f cvat-values.yaml

Ingress Configuration

Using Nginx Ingress

ingress:
  enabled: true
  hostname: cvat.example.com
  className: nginx
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
    nginx.ingress.kubernetes.io/proxy-max-temp-file-size: "0"
    nginx.ingress.kubernetes.io/client-body-buffer-size: "128k"
  tls: true
  tlsSecretName: cvat-tls-secret

Using Embedded Traefik

traefik:
  enabled: true
  service:
    type: LoadBalancer
  ports:
    web:
      port: 80
    websecure:
      port: 443

External Database

Use an external PostgreSQL database:
postgresql:
  enabled: false
  external:
    host: postgres.example.com
    port: 5432
  auth:
    username: cvat
    database: cvat
    password: secure_password
    existingSecret: cvat-postgres-secret
Create secret:
kubectl create secret generic cvat-postgres-secret -n cvat \
  --from-literal=password='your-password'

External Redis

redis:
  enabled: false
  external:
    host: redis.example.com
  auth:
    password: redis_password
    existingSecret: cvat-redis-secret

Scaling Workers

Adjust worker replicas based on load:
cvat:
  backend:
    worker:
      export:
        replicas: 5
        resources:
          requests:
            cpu: 1000m
            memory: 2Gi
      import:
        replicas: 5
      chunks:
        replicas: 4
      annotation:
        replicas: 2

High Availability

For production HA setup:
cvat:
  backend:
    server:
      replicas: 3  # Multiple server replicas
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - cvat-backend-server
              topologyKey: kubernetes.io/hostname

  frontend:
    replicas: 3

postgresql:
  enabled: true
  architecture: replication
  replication:
    enabled: true
    numSynchronousReplicas: 1
  readReplicas:
    replicaCount: 2

Chart Structure

The CVAT Helm chart (v2.58.1) includes:

Dependencies

Automatically installed:
  • postgresql (v12.1.x): Primary database
  • redis (v19.6.4): Caching layer
  • clickhouse (v4.1.x): Analytics database
  • vector (v0.19.x): Log aggregation
  • grafana (v6.60.x): Analytics UI
  • traefik (v37.3.x): Optional ingress
  • nuclio (v0.21.x): Optional serverless functions

Templates

Key Kubernetes resources created:
  • Deployments: cvat-backend-server, cvat-frontend, cvat-opa
  • StatefulSets: PostgreSQL, Redis, Kvrocks, ClickHouse
  • Deployments (Workers): Export, Import, Annotation, Webhooks, Quality Reports, Chunks, Consensus, Utils
  • Services: Frontend, Backend, OPA, Databases
  • PersistentVolumeClaims: Backend storage, Kvrocks cache, database storage
  • ConfigMaps: Application config, Vector config, Grafana dashboards
  • Secrets: Database credentials, Redis passwords, ClickHouse auth
  • Jobs: Backend initializer (runs migrations)
  • Ingress: Optional external access

Operations

Upgrade CVAT

# Update repo
helm repo update

# Check current version
helm list -n cvat

# Upgrade to latest
helm upgrade cvat cvat/cvat -n cvat -f cvat-values.yaml

# Or upgrade to specific version
helm upgrade cvat cvat/cvat -n cvat -f cvat-values.yaml --version 2.58.1

Rollback

# View history
helm history cvat -n cvat

# Rollback to previous
helm rollback cvat -n cvat

# Rollback to specific revision
helm rollback cvat 2 -n cvat

Uninstall

# Remove release
helm uninstall cvat -n cvat

# Delete namespace
kubectl delete namespace cvat
Note: PVCs are not removed automatically and may need manual deletion.

Backup and Restore

Backup PostgreSQL:
kubectl exec -n cvat <postgresql-pod> -- \
  pg_dumpall -U cvat > cvat_backup.sql
Backup PVCs using your storage provider’s snapshot feature or:
# Example using kubectl cp
kubectl exec -n cvat <backend-pod> -- tar czf /tmp/data.tar.gz /home/django/data
kubectl cp cvat/<backend-pod>:/tmp/data.tar.gz ./data_backup.tar.gz
Restore:
kubectl exec -i -n cvat <postgresql-pod> -- psql -U cvat cvat < cvat_backup.sql
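The manual dump above can be automated with a CronJob that runs pg_dump against the in-cluster PostgreSQL service. This is a sketch under several assumptions: the service name `cvat-postgresql`, the secret name `cvat-postgres-secret`, and a pre-existing `cvat-backup` PVC are all placeholders to adjust for your release:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cvat-db-backup
  namespace: cvat
spec:
  schedule: "0 2 * * *"                      # nightly at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: pg-dump
            image: postgres:15
            command: ["/bin/sh", "-c"]
            args:
            - pg_dump -h cvat-postgresql -U cvat cvat > /backup/cvat-$(date +%F).sql
            env:
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: cvat-postgres-secret  # assumed secret holding the DB password
                  key: password
            volumeMounts:
            - name: backup
              mountPath: /backup
          volumes:
          - name: backup
            persistentVolumeClaim:
              claimName: cvat-backup          # assumed pre-provisioned backup PVC
```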

View Logs

# Backend server logs
kubectl logs -n cvat -l app=cvat-backend-server -f

# Worker logs
kubectl logs -n cvat -l app=cvat-backend-worker-export -f

# All backend logs
kubectl logs -n cvat -l component=backend -f

Exec into Pods

# Server pod
kubectl exec -it -n cvat <server-pod> -- bash

# Run Django commands
kubectl exec -n cvat <server-pod> -- python manage.py migrate
kubectl exec -n cvat <server-pod> -- python manage.py collectstatic --noinput

Monitoring

Pod Status:
kubectl get pods -n cvat
kubectl top pods -n cvat
Service Status:
kubectl get svc -n cvat
Events:
kubectl get events -n cvat --sort-by='.lastTimestamp'
Resource Usage:
kubectl top nodes
kubectl top pods -n cvat

Troubleshooting

Pods Not Starting

Check pod status:
kubectl describe pod -n cvat <pod-name>
Common issues:
  • ImagePullBackOff: Check image name and registry access
  • CrashLoopBackOff: Check logs for application errors
  • Pending: Check storage class and resource availability

Database Connection Issues

# Check PostgreSQL pod
kubectl logs -n cvat -l app.kubernetes.io/name=postgresql

# Test connection from server
kubectl exec -n cvat <server-pod> -- \
  python manage.py dbshell

Storage Issues

# Check PVCs
kubectl get pvc -n cvat

# Check PV status
kubectl get pv

# Describe problematic PVC
kubectl describe pvc -n cvat <pvc-name>

Worker Not Processing Jobs

# Check worker logs
kubectl logs -n cvat -l app=cvat-backend-worker-export

# Check Redis connection
kubectl exec -n cvat <redis-pod> -- redis-cli ping

# Restart workers
kubectl rollout restart deployment -n cvat -l component=backend

Ingress Not Working

# Check ingress status
kubectl get ingress -n cvat
kubectl describe ingress -n cvat cvat

# Check ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx

Advanced Configuration

Custom Storage Classes

cvat:
  backend:
    defaultStorage:
      enabled: true
      storageClassName: premium-rwo
      accessModes:
        - ReadWriteMany
      size: 100Gi

  kvrocks:
    defaultStorage:
      storageClassName: fast-ssd
      size: 200Gi
      volumeAttributesClass:
        create: true
        name: high-throughput
        provider: ebs.csi.aws.com
        parameters:
          type: gp3
          provisioned-throughput: "250"

Node Affinity and Tolerations

cvat:
  backend:
    server:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-type
                operator: In
                values:
                - compute
      tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "cvat"
        effect: "NoSchedule"

Additional Environment Variables

cvat:
  backend:
    additionalEnv:
    - name: DJANGO_LOG_LEVEL
      value: INFO
    - name: SMOKESCREEN_OPTS
      value: "--deny-address 169.254.169.254"
    server:
      additionalEnv:
      - name: CVAT_BASE_URL
        value: https://cvat.example.com

Custom Volumes

cvat:
  backend:
    additionalVolumes:
    - name: shared-data
      nfs:
        server: nfs.example.com
        path: /exports/cvat
    additionalVolumeMounts:
    - name: shared-data
      mountPath: /mnt/shared

Production Best Practices

  1. Use specific image tags: Don’t use dev or latest in production
  2. Enable resource limits: Prevent resource exhaustion
  3. Configure HPA: Auto-scale based on CPU/memory
  4. Use external databases: For better reliability and backups
  5. Enable monitoring: Use Prometheus/Grafana for metrics
  6. Regular backups: Automate database and volume backups
  7. TLS everywhere: Use cert-manager for automatic certificates
  8. Network policies: Restrict pod-to-pod communication
  9. Secrets management: Use external secret managers (Vault, AWS Secrets Manager)
  10. Multi-zone deployment: Spread pods across availability zones
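For point 3, a HorizontalPodAutoscaler can scale the server deployment on CPU utilization. This requires metrics-server in the cluster and resource requests on the pods (set in the values file above); the deployment name matches the `cvat-backend-server` resource listed under Templates:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cvat-backend-server
  namespace: cvat
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cvat-backend-server
  minReplicas: 2
  maxReplicas: 6
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # scale out when average CPU exceeds 70% of requests
```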

Performance Tuning

cvat:
  backend:
    server:
      replicas: 5
      resources:
        requests:
          cpu: 2000m
          memory: 4Gi
        limits:
          cpu: 4000m
          memory: 8Gi
    worker:
      chunks:
        replicas: 10
        resources:
          requests:
            cpu: 1000m
            memory: 2Gi

postgresql:
  primary:
    resources:
      requests:
        cpu: 2000m
        memory: 4Gi
    persistence:
      size: 100Gi
      storageClass: premium-ssd

redis:
  master:
    resources:
      requests:
        cpu: 1000m
        memory: 2Gi
