Overview

Cadence supports zero-downtime rolling upgrades across all services. This guide covers upgrade procedures, schema migrations, version compatibility, and rollback strategies.

Upgrade Strategy

Cadence follows semantic versioning (MAJOR.MINOR.PATCH):
  • PATCH: Bug fixes, safe to deploy without schema changes
  • MINOR: New features, backward compatible, may require schema updates
  • MAJOR: Breaking changes, requires careful migration planning

Release Channels

  • Stable: Production-ready releases (tagged versions)
  • Pre-release: Release candidates (vX.Y.Z-rc.N)
  • Master: Development branch (not recommended for production)
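
For production, pin an explicit stable tag rather than a moving one. A minimal example using the Docker images referenced later in this guide (the tag names here are illustrative):

# Pull a pinned stable release
docker pull ubercadence/server:1.0.0

# Release candidates carry an -rc suffix; use them only outside production
docker pull ubercadence/server:1.0.0-rc.1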

Pre-Upgrade Checklist

Always test upgrades in a non-production environment before deploying to production.

1. Review Release Notes

Check the release notes for:
  • Breaking changes
  • Schema updates required
  • Configuration changes
  • Feature deprecations
  • Known issues
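
To pull the latest release notes from the command line, assuming the uber/cadence GitHub repository and the jq tool:

# Fetch the tag and notes of the latest release
curl -s https://api.github.com/repos/uber/cadence/releases/latest \
  | jq -r '.tag_name, .body'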

2. Backup Database

Cassandra Backup

# Snapshot all keyspaces
nodetool snapshot cadence
nodetool snapshot cadence_visibility

# Verify snapshots
nodetool listsnapshots

# Copy snapshots to remote storage
for host in $CASSANDRA_HOSTS; do
  ssh $host "tar czf /backup/cassandra-$(date +%Y%m%d).tar.gz \
    /var/lib/cassandra/data/*/snapshots/"
done

MySQL Backup

# Logical backup
mysqldump --single-transaction \
  --routines \
  --triggers \
  --databases cadence cadence_visibility \
  > cadence_backup_$(date +%Y%m%d).sql

# Or use Percona XtraBackup for large databases
xtrabackup --backup \
  --target-dir=/backup/cadence_$(date +%Y%m%d)

PostgreSQL Backup

# Logical backup
pg_dump cadence > cadence_backup_$(date +%Y%m%d).sql

# Or physical backup
pg_basebackup -D /backup/cadence_$(date +%Y%m%d) -Ft -z -P

3. Check Cluster Health

# Verify all services are healthy
curl http://frontend:9090/health
curl http://history:9091/health
curl http://matching:9092/health

# Check for stuck workflows
cadence admin workflow list --open --domain <domain>

# Verify no ongoing domain failovers
cadence admin domain list

4. Review Current Configuration

# Backup current configuration
cp /etc/cadence/config.yaml /etc/cadence/config.yaml.backup

# Check for deprecated configuration options
grep -i deprecated /etc/cadence/config.yaml

Schema Migration

Schema Versioning

Cadence uses versioned schema files:
schema/
  cassandra/
    cadence/
      versioned/
        v0.1/
        v0.2/
        ...
        v1.0/
    visibility/
      versioned/
        v0.1/
        ...
  mysql/
    v8/
      cadence/
        versioned/
          v0.1/
          ...
  postgres/
    v12/
      cadence/
        versioned/
          v0.1/
          ...
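
Each versioned directory contains a manifest.json describing the change set and the schema files it applies, so you can inspect what an upgrade will run before applying it (the version directory name below is illustrative):

# List available schema versions in a checkout of the Cadence repository
ls schema/cassandra/cadence/versioned/

# Inspect the change set for one version
cat schema/cassandra/cadence/versioned/v0.30/manifest.json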

Schema Update Tools

Cassandra Schema Update

# Update cadence keyspace
cadence-cassandra-tool \
  --ep 127.0.0.1 \
  --keyspace cadence \
  update-schema \
  --version 1.0

# Update visibility keyspace
cadence-cassandra-tool \
  --ep 127.0.0.1 \
  --keyspace cadence_visibility \
  update-schema \
  --version 1.0

MySQL Schema Update

# Update cadence database
cadence-sql-tool \
  --ep 127.0.0.1:3306 \
  --db cadence \
  --plugin mysql \
  update-schema \
  --version 1.0

# Update visibility database
cadence-sql-tool \
  --ep 127.0.0.1:3306 \
  --db cadence_visibility \
  --plugin mysql \
  update-schema \
  --version 1.0

PostgreSQL Schema Update

# Update cadence database
cadence-sql-tool \
  --ep 127.0.0.1:5432 \
  --db cadence \
  --plugin postgres \
  update-schema \
  --version 1.0

Schema Version Verification

# Check current schema version
cadence-cassandra-tool \
  --ep 127.0.0.1 \
  --keyspace cadence \
  version

# Or for SQL
cadence-sql-tool \
  --ep 127.0.0.1:3306 \
  --db cadence \
  --plugin mysql \
  version
Schema updates are idempotent. Running the same update multiple times is safe.
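
Because updates are idempotent, a pre-flight guard can simply assert the schema is at the target version before any binaries roll out. A minimal sketch, assuming the version subcommand prints the bare version string:

# Abort the rollout if the schema is not at the target version
TARGET="1.0"
CURRENT=$(cadence-cassandra-tool --ep 127.0.0.1 --keyspace cadence version)
if [ "$CURRENT" != "$TARGET" ]; then
  echo "schema at ${CURRENT:-unknown}, expected $TARGET" >&2
  exit 1
fi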

Rolling Upgrade Procedure

Service Upgrade Order

Upgrade services in this order to maintain compatibility:
  1. Worker (optional, low risk)
  2. Matching
  3. History
  4. Frontend
Upgrade one service tier completely before moving to the next tier.
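
On Kubernetes, the tier ordering can be scripted so each rollout completes before the next begins. A sketch using the deployment names from the steps below:

# Upgrade tiers in order, gating on rollout completion
for tier in worker matching history frontend; do
  kubectl set image deployment/cadence-$tier \
    cadence=ubercadence/server:1.0.0
  kubectl rollout status deployment/cadence-$tier --timeout=15m
done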

Step-by-Step Upgrade

1. Upgrade Worker Service

# Worker service is stateless and can be upgraded immediately
kubectl set image deployment/cadence-worker \
  cadence=ubercadence/server:1.0.0

# Or for systemd
systemctl stop cadence-worker
cp /opt/cadence/bin/cadence-server /opt/cadence/bin/cadence-server.old
cp /tmp/cadence-server-new /opt/cadence/bin/cadence-server
systemctl start cadence-worker

2. Upgrade Matching Service

# Rolling upgrade; pace is governed by the deployment's rollingUpdate
# strategy (Kubernetes defaults to 25% max unavailable)
kubectl set image deployment/cadence-matching \
  cadence=ubercadence/server:1.0.0
kubectl rollout status deployment/cadence-matching

# Monitor for errors
kubectl logs -f deployment/cadence-matching --tail=100

3. Upgrade History Service

# History service requires careful shard migration
# Use small batches to minimize shard transfer impact
kubectl set image deployment/cadence-history \
  cadence=ubercadence/server:1.0.0

# Monitor shard ownership changes
watch -n 5 'cadence admin shard list --print-full-shard'
History service upgrades trigger shard ownership transfers. Expect temporary latency increase during upgrades.

4. Upgrade Frontend Service

# Frontend is stateless, safe to upgrade quickly
kubectl set image deployment/cadence-frontend \
  cadence=ubercadence/server:1.0.0

# Verify health
for i in {1..10}; do
  curl http://frontend:9090/health || echo "Failed"
  sleep 1
done

Kubernetes Rolling Update Configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cadence-history
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1      # Upgrade one pod at a time
      maxSurge: 1            # Allow one extra pod during upgrade
  template:
    spec:
      containers:
      - name: history
        image: ubercadence/server:1.0.0
        readinessProbe:
          httpGet:
            path: /health
            port: 9091
          initialDelaySeconds: 30
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 9091
          initialDelaySeconds: 60
          periodSeconds: 30
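
With 10 replicas, maxUnavailable: 1 takes roughly one pod (about 10% of capacity) out of service at a time. The same strategy can also be applied to an existing deployment in place; a sketch:

# Apply the conservative rolling-update strategy to a live deployment
kubectl patch deployment cadence-history --type=merge -p \
  '{"spec":{"strategy":{"rollingUpdate":{"maxUnavailable":1,"maxSurge":1}}}}'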

Version Compatibility

Service Version Skew

Cadence maintains backward compatibility:
  • N to N+1: Fully compatible (e.g., 0.24.0 → 0.25.0)
  • N to N+2: May work but not tested
  • N to N+3+: Not supported
Do not skip more than one minor version during upgrades. For major version upgrades (e.g., 0.x → 1.x), upgrade incrementally.
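
When you are more than one minor version behind, step through the intermediate releases rather than jumping. A sketch on Kubernetes (version tags are illustrative):

# Walk through intermediate versions one at a time
for v in 0.24.0 0.25.0 1.0.0; do
  kubectl set image deployment/cadence-history cadence=ubercadence/server:$v
  kubectl rollout status deployment/cadence-history --timeout=15m
  # Run schema updates and post-upgrade validation here before continuing
done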

Client SDK Compatibility

Client SDKs are forward and backward compatible:
SDK Version        Server Versions
Go SDK 1.x         Server 0.20+
Java SDK 1.x       Server 0.20+
Python SDK 1.x     Server 0.20+

Protocol Compatibility

  • TChannel: Legacy, deprecated but still supported
  • gRPC: Preferred, fully supported from 0.23.0+
  • HTTP/JSON: Frontend only, experimental

Configuration Migration

Deprecated Configuration Options

Check for deprecated options before upgrading:
# Old (deprecated)
clusterMetadata:
  masterClusterName: "primary"

# New
clusterGroupMetadata:
  primaryClusterName: "primary"

# Old (deprecated)
clusterMetadata:
  clusterInformation:
    cluster1: {...}

# New
clusterGroupMetadata:
  clusterGroup:
    cluster1: {...}
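
A quick pre-upgrade check for the deprecated keys shown above:

# Flag deprecated cluster metadata keys in the active config
grep -nE 'masterClusterName|clusterInformation' /etc/cadence/config.yaml \
  && echo "deprecated keys found; migrate to clusterGroupMetadata"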

Dynamic Config Migration

Dynamic config keys are generally backward compatible, but check release notes:
# Before 0.25.0
history.cacheSize: 1000

# After 0.25.0
history.historyCacheMaxSize: 1000
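
To locate renamed keys before upgrading (the dynamic config path is illustrative):

# Find the pre-0.25.0 key in the dynamic config file
grep -n 'history.cacheSize' /etc/cadence/dynamicconfig/*.yaml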

Post-Upgrade Validation

1. Verify Service Health

# Check all services (frontend 9090, history 9091, matching 9092; the worker
# port below is an assumption, adjust it to your deployment)
for svc in frontend:9090 history:9091 matching:9092 worker:9093; do
  name=${svc%%:*}
  echo "Checking $name..."
  curl -f http://$svc/health || echo "$name unhealthy!"
done

2. Verify Workflow Operations

# Start a test workflow
cadence workflow start \
  --domain test-domain \
  --tasklist test-tasklist \
  --workflow_type TestWorkflow \
  --execution_timeout 60

# List workflows
cadence workflow list --domain test-domain

# Describe workflow
cadence workflow describe \
  --domain test-domain \
  --workflow_id <wf-id>

3. Check Metrics

# Service restart count should increase by 1 per host
increase(cadence_restarts[10m])

# Error rate should remain low
rate(cadence_frontend_client_errors[5m]) < 0.01

# Latency should be normal (assumes a Prometheus histogram with _bucket series)
histogram_quantile(0.99, rate(cadence_history_client_latency_bucket[5m])) < 1.0

4. Verify Persistence

# Check schema version matches target
cadence-cassandra-tool --ep 127.0.0.1 --keyspace cadence version

# Verify no persistence errors
grep -i "persistence error" /var/log/cadence/*.log

Rollback Procedures

When to Rollback

Rollback if you observe:
  • High error rates (>5% for >5 minutes)
  • Service crashes (restart loops)
  • Data corruption (workflow state inconsistencies)
  • Schema migration failures
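
The error-rate criterion can be wired into an alert. An illustrative PromQL expression, reusing the error counter from "Check Metrics" above and assuming a matching request counter (cadence_frontend_client_requests is a placeholder name):

# Fire when frontend errors exceed 5% of requests over 5 minutes
rate(cadence_frontend_client_errors[5m])
  / rate(cadence_frontend_client_requests[5m]) > 0.05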

Application Rollback

Quick Rollback (Kubernetes)

# Rollback to previous version
kubectl rollout undo deployment/cadence-history

# Or to specific revision
kubectl rollout undo deployment/cadence-history --to-revision=5

# Verify rollback
kubectl rollout status deployment/cadence-history

Manual Rollback (Systemd)

# Restore previous binary
for host in $CADENCE_HOSTS; do
  ssh $host "systemctl stop cadence-history && \
    cp /opt/cadence/bin/cadence-server.old \
       /opt/cadence/bin/cadence-server && \
    systemctl start cadence-history"
done

Schema Rollback

Schema rollbacks are risky and should be avoided. Most schema changes are additive and backward compatible.
If schema rollback is necessary:

Cassandra

# Restore from snapshot: copy the snapshot SSTables back into the table's
# data directory on each node, then reload them
nodetool refresh cadence <table>

# Or restore from backup
for host in $CASSANDRA_HOSTS; do
  ssh $host "systemctl stop cassandra && \
    rm -rf /var/lib/cassandra/data/cadence/* && \
    tar xzf /backup/cassandra-backup.tar.gz -C / && \
    systemctl start cassandra"
done

MySQL

# Restore from backup (the pre-upgrade dump includes both databases)
mysql < cadence_backup.sql

# Verify restoration
mysql -e "SELECT curr_version FROM schema_version" cadence

Rollback Validation

# Verify service versions
for pod in $(kubectl get pods -l app=cadence-history -o name); do
  kubectl describe $pod | grep Image:
done

# Check schema versions
cadence-cassandra-tool --ep 127.0.0.1 --keyspace cadence version

# Test workflow operations
cadence workflow start --domain test-domain ...

Multi-Cluster Upgrades

For global domains with cross-DC replication:

Upgrade Sequence

  1. Upgrade standby clusters first
  2. Verify replication is working
  3. Upgrade active cluster
  4. Monitor cross-cluster traffic
# Upgrade standby cluster 1
kubectl config use-context standby-1
kubectl set image deployment/cadence-history ...

# Verify replication
cadence admin domain describe --domain global-domain

# Repeat for other standby clusters
# Finally upgrade active cluster
kubectl config use-context active
kubectl set image deployment/cadence-history ...
Cross-cluster RPC is version-tolerant. Standby clusters can run newer versions than active clusters temporarily.

Common Upgrade Issues

Schema Migration Timeouts

# Increase timeout for large tables
cadence-cassandra-tool \
  --ep 127.0.0.1 \
  --keyspace cadence \
  --timeout 600 \
  update-schema --version 1.0

Shard Ownership Churn

If shards keep transferring:
# Check for version skew: list the unique image(s) across history pods
kubectl get pods -l app=cadence-history \
  -o jsonpath='{.items[*].spec.containers[0].image}' | tr ' ' '\n' | sort -u

# A single line of output means all hosts run the same version

Persistence Version Mismatch

If service fails to start with schema version error:
# Check expected vs actual schema version
grep "schema version" /var/log/cadence/history.log

# Update schema to match
cadence-cassandra-tool update-schema --version X.Y

Automated Upgrade Testing

Canary Deployments

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: cadence-history
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cadence-history
  progressDeadlineSeconds: 3600
  service:
    port: 7934
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99
      interval: 1m

Blue-Green Deployments

For maximum safety:
  1. Deploy new version to separate cluster
  2. Run synthetic tests
  3. Switch DNS/load balancer (see the cutover sketch below)
  4. Keep old cluster as backup
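
A minimal cutover sketch for the load-balancer switch on Kubernetes, assuming both stacks sit behind a single Service and the deployments carry a distinguishing track label (label names are illustrative):

# Repoint the shared Service selector from the old stack to the new one
kubectl patch service cadence-frontend --type=merge -p \
  '{"spec":{"selector":{"app":"cadence-frontend","track":"green"}}}'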

Best Practices

1. Gradual Rollout

  • Start with 1-2 hosts per service
  • Monitor for 15-30 minutes
  • Continue if no issues

2. Automated Validation

  • Run synthetic workflows post-upgrade (see the example below)
  • Check metrics automatically
  • Alert on anomalies
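
A minimal synthetic check, reusing the CLI flags from "Verify Workflow Operations" above (domain, tasklist, and workflow names are placeholders):

# Start a known-good workflow and wait for its result
cadence workflow run \
  --domain test-domain \
  --tasklist test-tasklist \
  --workflow_type TestWorkflow \
  --execution_timeout 60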

3. Rollback Readiness

  • Keep previous binaries
  • Maintain database backups
  • Document rollback procedure

4. Communication

  • Notify stakeholders before upgrade
  • Provide status updates
  • Document issues and resolutions
