Overview

Cadence supports zero-downtime rolling upgrades across all services. This guide covers upgrade procedures, schema migrations, version compatibility, and rollback strategies.

Upgrade Strategy

Cadence follows semantic versioning (MAJOR.MINOR.PATCH):
  • PATCH: Bug fixes, safe to deploy without schema changes
  • MINOR: New features, backward compatible, may require schema updates
  • MAJOR: Breaking changes, requires careful migration planning

Release Channels

  • Stable: Production-ready releases (tagged versions)
  • Pre-release: Release candidates (vX.Y.Z-rc.N)
  • Master: Development branch (not recommended for production)
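
For production, pin an explicit stable tag rather than a moving one. A minimal example using the Docker images referenced later in this guide (the tag names here are illustrative):

# Pull a pinned stable release
docker pull ubercadence/server:1.0.0

# Release candidates carry an -rc suffix; use them only outside production
docker pull ubercadence/server:1.0.0-rc.1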

Pre-Upgrade Checklist

Always test upgrades in a non-production environment before deploying to production.

1. Review Release Notes

Check the release notes for:
  • Breaking changes
  • Schema updates required
  • Configuration changes
  • Feature deprecations
  • Known issues
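
To pull the latest release notes from the command line, assuming the uber/cadence GitHub repository and the jq tool:

# Fetch the tag and notes of the latest release
curl -s https://api.github.com/repos/uber/cadence/releases/latest \
  | jq -r '.tag_name, .body'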

2. Backup Database

Cassandra Backup

# Snapshot all keyspaces
nodetool snapshot cadence
nodetool snapshot cadence_visibility

# Verify snapshots
nodetool listsnapshots

# Copy snapshots to remote storage
for host in $CASSANDRA_HOSTS; do
  ssh $host "tar czf /backup/cassandra-$(date +%Y%m%d).tar.gz \
    /var/lib/cassandra/data/*/snapshots/"
done

MySQL Backup

# Logical backup
mysqldump --single-transaction \
  --routines \
  --triggers \
  --databases cadence cadence_visibility \
  > cadence_backup_$(date +%Y%m%d).sql

# Or use Percona XtraBackup for large databases
xtrabackup --backup \
  --target-dir=/backup/cadence_$(date +%Y%m%d)

PostgreSQL Backup

# Logical backup
pg_dump cadence > cadence_backup_$(date +%Y%m%d).sql

# Or physical backup
pg_basebackup -D /backup/cadence_$(date +%Y%m%d) -Ft -z -P

3. Check Cluster Health

# Verify all services are healthy
curl http://frontend:9090/health
curl http://history:9091/health
curl http://matching:9092/health

# Check for stuck workflows
cadence admin workflow list --open --domain <domain>

# Verify no ongoing domain failovers
cadence admin domain list

4. Review Current Configuration

# Backup current configuration
cp /etc/cadence/config.yaml /etc/cadence/config.yaml.backup

# Check for deprecated configuration options
grep -i deprecated /etc/cadence/config.yaml

Schema Migration

Schema Versioning

Cadence uses versioned schema files:
schema/
  cassandra/
    cadence/
      versioned/
        v0.1/
        v0.2/
        ...
        v1.0/
    visibility/
      versioned/
        v0.1/
        ...
  mysql/
    v8/
      cadence/
        versioned/
          v0.1/
          ...
  postgres/
    v12/
      cadence/
        versioned/
          v0.1/
          ...
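
Each versioned directory contains a manifest.json describing the change set and the schema files it applies, so you can inspect what an upgrade will run before applying it (the version directory name below is illustrative):

# List available schema versions in a checkout of the Cadence repository
ls schema/cassandra/cadence/versioned/

# Inspect the change set for one version
cat schema/cassandra/cadence/versioned/v0.30/manifest.json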

Schema Update Tools

Cassandra Schema Update

# Update cadence keyspace
cadence-cassandra-tool \
  --ep 127.0.0.1 \
  --keyspace cadence \
  update-schema \
  --version 1.0

# Update visibility keyspace
cadence-cassandra-tool \
  --ep 127.0.0.1 \
  --keyspace cadence_visibility \
  update-schema \
  --version 1.0

MySQL Schema Update

# Update cadence database
cadence-sql-tool \
  --ep 127.0.0.1:3306 \
  --db cadence \
  --plugin mysql \
  update-schema \
  --version 1.0

# Update visibility database
cadence-sql-tool \
  --ep 127.0.0.1:3306 \
  --db cadence_visibility \
  --plugin mysql \
  update-schema \
  --version 1.0

PostgreSQL Schema Update

# Update cadence database
cadence-sql-tool \
  --ep 127.0.0.1:5432 \
  --db cadence \
  --plugin postgres \
  update-schema \
  --version 1.0

Schema Version Verification

# Check current schema version
cadence-cassandra-tool \
  --ep 127.0.0.1 \
  --keyspace cadence \
  version

# Or for SQL
cadence-sql-tool \
  --ep 127.0.0.1:3306 \
  --db cadence \
  --plugin mysql \
  version
Schema updates are idempotent. Running the same update multiple times is safe.
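
Because updates are idempotent, a pre-flight guard can simply assert the schema is at the target version before any binaries roll out. A minimal sketch, assuming the version subcommand prints the bare version string:

# Abort the rollout if the schema is not at the target version
TARGET="1.0"
CURRENT=$(cadence-cassandra-tool --ep 127.0.0.1 --keyspace cadence version)
if [ "$CURRENT" != "$TARGET" ]; then
  echo "schema at ${CURRENT:-unknown}, expected $TARGET" >&2
  exit 1
fi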

Rolling Upgrade Procedure

Service Upgrade Order

Upgrade services in this order to maintain compatibility:
  1. Worker (optional, low risk)
  2. Matching
  3. History
  4. Frontend
Upgrade one service tier completely before moving to the next tier.
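
On Kubernetes, the tier ordering can be scripted so each rollout completes before the next begins. A sketch using the deployment names from the steps below:

# Upgrade tiers in order, gating on rollout completion
for tier in worker matching history frontend; do
  kubectl set image deployment/cadence-$tier \
    cadence=ubercadence/server:1.0.0
  kubectl rollout status deployment/cadence-$tier --timeout=15m
done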

Step-by-Step Upgrade

1. Upgrade Worker Service

# Worker service is stateless and can be upgraded immediately
kubectl set image deployment/cadence-worker \
  cadence=ubercadence/server:1.0.0

# Or for systemd
systemctl stop cadence-worker
cp /opt/cadence/bin/cadence-server /opt/cadence/bin/cadence-server.old
cp /tmp/cadence-server-new /opt/cadence/bin/cadence-server
systemctl start cadence-worker

2. Upgrade Matching Service

# Rolling upgrade; pace is governed by the deployment's rollingUpdate
# strategy (Kubernetes defaults to 25% max unavailable)
kubectl set image deployment/cadence-matching \
  cadence=ubercadence/server:1.0.0
kubectl rollout status deployment/cadence-matching

# Monitor for errors
kubectl logs -f deployment/cadence-matching --tail=100

3. Upgrade History Service

# History service requires careful shard migration
# Use small batches to minimize shard transfer impact
kubectl set image deployment/cadence-history \
  cadence=ubercadence/server:1.0.0

# Monitor shard ownership changes
watch -n 5 'cadence admin shard list --print-full-shard'
History service upgrades trigger shard ownership transfers. Expect temporary latency increase during upgrades.

4. Upgrade Frontend Service

# Frontend is stateless, safe to upgrade quickly
kubectl set image deployment/cadence-frontend \
  cadence=ubercadence/server:1.0.0

# Verify health
for i in {1..10}; do
  curl http://frontend:9090/health || echo "Failed"
  sleep 1
done

Kubernetes Rolling Update Configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cadence-history
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1      # Upgrade one pod at a time
      maxSurge: 1            # Allow one extra pod during upgrade
  template:
    spec:
      containers:
      - name: history
        image: ubercadence/server:1.0.0
        readinessProbe:
          httpGet:
            path: /health
            port: 9091
          initialDelaySeconds: 30
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 9091
          initialDelaySeconds: 60
          periodSeconds: 30
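
With 10 replicas, maxUnavailable: 1 takes roughly one pod (about 10% of capacity) out of service at a time. The same strategy can also be applied to an existing deployment in place; a sketch:

# Apply the conservative rolling-update strategy to a live deployment
kubectl patch deployment cadence-history --type=merge -p \
  '{"spec":{"strategy":{"rollingUpdate":{"maxUnavailable":1,"maxSurge":1}}}}'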

Version Compatibility

Service Version Skew

Cadence maintains backward compatibility:
  • N to N+1: Fully compatible (e.g., 0.24.0 → 0.25.0)
  • N to N+2: May work but not tested
  • N to N+3+: Not supported
Do not skip more than one minor version during upgrades. For major version upgrades (e.g., 0.x → 1.x), upgrade incrementally.
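
When you are more than one minor version behind, step through the intermediate releases rather than jumping. A sketch on Kubernetes (version tags are illustrative):

# Walk through intermediate versions one at a time
for v in 0.24.0 0.25.0 1.0.0; do
  kubectl set image deployment/cadence-history cadence=ubercadence/server:$v
  kubectl rollout status deployment/cadence-history --timeout=15m
  # Run schema updates and post-upgrade validation here before continuing
done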

Client SDK Compatibility

Client SDKs are forward and backward compatible:
SDK Version        Server Versions
Go SDK 1.x         Server 0.20+
Java SDK 1.x       Server 0.20+
Python SDK 1.x     Server 0.20+

Protocol Compatibility

  • TChannel: Legacy, deprecated but still supported
  • gRPC: Preferred, fully supported from 0.23.0+
  • HTTP/JSON: Frontend only, experimental

Configuration Migration

Deprecated Configuration Options

Check for deprecated options before upgrading:
# Old (deprecated)
clusterMetadata:
  masterClusterName: "primary"

# New
clusterGroupMetadata:
  primaryClusterName: "primary"

# Old (deprecated)
clusterMetadata:
  clusterInformation:
    cluster1: {...}

# New
clusterGroupMetadata:
  clusterGroup:
    cluster1: {...}
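
A quick pre-upgrade check for the deprecated keys shown above:

# Flag deprecated cluster metadata keys in the active config
grep -nE 'masterClusterName|clusterInformation' /etc/cadence/config.yaml \
  && echo "deprecated keys found; migrate to clusterGroupMetadata"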

Dynamic Config Migration

Dynamic config keys are generally backward compatible, but check release notes:
# Before 0.25.0
history.cacheSize: 1000

# After 0.25.0
history.historyCacheMaxSize: 1000
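
To locate renamed keys before upgrading (the dynamic config path is illustrative):

# Find the pre-0.25.0 key in the dynamic config file
grep -n 'history.cacheSize' /etc/cadence/dynamicconfig/*.yaml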

Post-Upgrade Validation

1. Verify Service Health

# Check all services (frontend 9090, history 9091, matching 9092; the worker
# port below is an assumption, adjust it to your deployment)
for svc in frontend:9090 history:9091 matching:9092 worker:9093; do
  name=${svc%%:*}
  echo "Checking $name..."
  curl -f http://$svc/health || echo "$name unhealthy!"
done

2. Verify Workflow Operations

# Start a test workflow
cadence workflow start \
  --domain test-domain \
  --tasklist test-tasklist \
  --workflow_type TestWorkflow \
  --execution_timeout 60

# List workflows
cadence workflow list --domain test-domain

# Describe workflow
cadence workflow describe \
  --domain test-domain \
  --workflow_id <wf-id>

3. Check Metrics

# Service restart count should increase by 1 per host
increase(cadence_restarts[10m])

# Error rate should remain low
rate(cadence_frontend_client_errors[5m]) < 0.01

# Latency should be normal (assumes a Prometheus histogram with _bucket series)
histogram_quantile(0.99, rate(cadence_history_client_latency_bucket[5m])) < 1.0

4. Verify Persistence

# Check schema version matches target
cadence-cassandra-tool --ep 127.0.0.1 --keyspace cadence version

# Verify no persistence errors
grep -i "persistence error" /var/log/cadence/*.log

Rollback Procedures

When to Rollback

Rollback if you observe:
  • High error rates (>5% for >5 minutes)
  • Service crashes (restart loops)
  • Data corruption (workflow state inconsistencies)
  • Schema migration failures
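
The error-rate criterion can be wired into an alert. An illustrative PromQL expression, reusing the error counter from "Check Metrics" above and assuming a matching request counter (cadence_frontend_client_requests is a placeholder name):

# Fire when frontend errors exceed 5% of requests over 5 minutes
rate(cadence_frontend_client_errors[5m])
  / rate(cadence_frontend_client_requests[5m]) > 0.05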

Application Rollback

Quick Rollback (Kubernetes)

# Rollback to previous version
kubectl rollout undo deployment/cadence-history

# Or to specific revision
kubectl rollout undo deployment/cadence-history --to-revision=5

# Verify rollback
kubectl rollout status deployment/cadence-history

Manual Rollback (Systemd)

# Restore previous binary
for host in $CADENCE_HOSTS; do
  ssh $host "systemctl stop cadence-history && \
    cp /opt/cadence/bin/cadence-server.old \
       /opt/cadence/bin/cadence-server && \
    systemctl start cadence-history"
done

Schema Rollback

Schema rollbacks are risky and should be avoided. Most schema changes are additive and backward compatible.
If schema rollback is necessary:

Cassandra

# Restore from snapshot: copy the snapshot SSTables back into the table's
# data directory on each node, then reload them
nodetool refresh cadence <table>

# Or restore from backup
for host in $CASSANDRA_HOSTS; do
  ssh $host "systemctl stop cassandra && \
    rm -rf /var/lib/cassandra/data/cadence/* && \
    tar xzf /backup/cassandra-backup.tar.gz -C / && \
    systemctl start cassandra"
done

MySQL

# Restore from backup (the pre-upgrade dump includes both databases)
mysql < cadence_backup.sql

# Verify restoration
mysql -e "SELECT curr_version FROM schema_version" cadence

Rollback Validation

# Verify service versions
for pod in $(kubectl get pods -l app=cadence-history -o name); do
  kubectl describe $pod | grep Image:
done

# Check schema versions
cadence-cassandra-tool --ep 127.0.0.1 --keyspace cadence version

# Test workflow operations
cadence workflow start --domain test-domain ...

Multi-Cluster Upgrades

For global domains with cross-DC replication:

Upgrade Sequence

  1. Upgrade standby clusters first
  2. Verify replication is working
  3. Upgrade active cluster
  4. Monitor cross-cluster traffic
# Upgrade standby cluster 1
kubectl config use-context standby-1
kubectl set image deployment/cadence-history ...

# Verify replication
cadence admin domain describe --domain global-domain

# Repeat for other standby clusters
# Finally upgrade active cluster
kubectl config use-context active
kubectl set image deployment/cadence-history ...
Cross-cluster RPC is version-tolerant. Standby clusters can run newer versions than active clusters temporarily.

Common Upgrade Issues

Schema Migration Timeouts

# Increase timeout for large tables
cadence-cassandra-tool \
  --ep 127.0.0.1 \
  --keyspace cadence \
  --timeout 600 \
  update-schema --version 1.0

Shard Ownership Churn

If shards keep transferring:
# Check for version skew: list the unique image(s) across history pods
kubectl get pods -l app=cadence-history \
  -o jsonpath='{.items[*].spec.containers[0].image}' | tr ' ' '\n' | sort -u

# A single line of output means all hosts run the same version

Persistence Version Mismatch

If service fails to start with schema version error:
# Check expected vs actual schema version
grep "schema version" /var/log/cadence/history.log

# Update schema to match
cadence-cassandra-tool update-schema --version X.Y

Automated Upgrade Testing

Canary Deployments

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: cadence-history
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cadence-history
  progressDeadlineSeconds: 3600
  service:
    port: 7934
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99
      interval: 1m

Blue-Green Deployments

For maximum safety:
  1. Deploy new version to separate cluster
  2. Run synthetic tests
  3. Switch DNS/load balancer (see the cutover sketch below)
  4. Keep old cluster as backup
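
A minimal cutover sketch for the load-balancer switch on Kubernetes, assuming both stacks sit behind a single Service and the deployments carry a distinguishing track label (label names are illustrative):

# Repoint the shared Service selector from the old stack to the new one
kubectl patch service cadence-frontend --type=merge -p \
  '{"spec":{"selector":{"app":"cadence-frontend","track":"green"}}}'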

Best Practices

1. Gradual Rollout

  • Start with 1-2 hosts per service
  • Monitor for 15-30 minutes
  • Continue if no issues

2. Automated Validation

  • Run synthetic workflows post-upgrade (see the example below)
  • Check metrics automatically
  • Alert on anomalies
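
A minimal synthetic check, reusing the CLI flags from "Verify Workflow Operations" above (domain, tasklist, and workflow names are placeholders):

# Start a known-good workflow and wait for its result
cadence workflow run \
  --domain test-domain \
  --tasklist test-tasklist \
  --workflow_type TestWorkflow \
  --execution_timeout 60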

3. Rollback Readiness

  • Keep previous binaries
  • Maintain database backups
  • Document rollback procedure

4. Communication

  • Notify stakeholders before upgrade
  • Provide status updates
  • Document issues and resolutions
