Overview
Cadence supports zero-downtime rolling upgrades across all services. This guide covers upgrade procedures, schema migrations, version compatibility, and rollback strategies.
Upgrade Strategy
Cadence follows semantic versioning (MAJOR.MINOR.PATCH):
- PATCH: Bug fixes, safe to deploy without schema changes
- MINOR: New features, backward compatible, may require schema updates
- MAJOR: Breaking changes, requires careful migration planning
Release Channels
- Stable: Production-ready releases (tagged versions)
- Pre-release: Release candidates (vX.Y.Z-rc.N)
- Master: Development branch (not recommended for production)
Pre-Upgrade Checklist
Always test upgrades in a non-production environment before deploying to production.
1. Review Release Notes
Check the release notes for:
- Breaking changes
- Schema updates required
- Configuration changes
- Feature deprecations
- Known issues
2. Backup Database
Cassandra Backup
# Snapshot all keyspaces
nodetool snapshot cadence
nodetool snapshot cadence_visibility
# Verify snapshots
nodetool listsnapshots
# Copy snapshots to remote storage
for host in $CASSANDRA_HOSTS; do
  ssh $host "tar czf /backup/cassandra-$(date +%Y%m%d).tar.gz \
    /var/lib/cassandra/data/*/snapshots/"
done
MySQL Backup
# Logical backup
mysqldump --single-transaction \
--routines \
--triggers \
--databases cadence cadence_visibility \
> cadence_backup_$(date +%Y%m%d).sql
# Or use Percona XtraBackup for large databases
xtrabackup --backup \
--target-dir=/backup/cadence_$(date +%Y%m%d)
PostgreSQL Backup
# Logical backup
pg_dump cadence > cadence_backup_$(date +%Y%m%d).sql
# Or physical backup
pg_basebackup -D /backup/cadence_$(date +%Y%m%d) -Ft -z -P
3. Check Cluster Health
# Verify all services are healthy
curl http://frontend:9090/health
curl http://history:9091/health
curl http://matching:9092/health
# Check for stuck workflows
cadence --domain <domain> workflow list --open
# Verify no ongoing domain failovers
cadence admin domain list
4. Review Current Configuration
# Backup current configuration
cp /etc/cadence/config.yaml /etc/cadence/config.yaml.backup
# Check for deprecated configuration options
grep -i deprecated /etc/cadence/config.yaml
Schema Migration
Schema Versioning
Cadence uses versioned schema files:
schema/
  cassandra/
    cadence/
      versioned/
        v0.1/
        v0.2/
        ...
        v1.0/
    visibility/
      versioned/
        v0.1/
        ...
  mysql/
    v8/
      cadence/
        versioned/
          v0.1/
          ...
  postgres/
    v12/
      cadence/
        versioned/
          v0.1/
          ...
Cassandra Schema Update
# Update cadence keyspace
cadence-cassandra-tool \
--ep 127.0.0.1 \
--keyspace cadence \
update-schema \
--version 1.0
# Update visibility keyspace
cadence-cassandra-tool \
--ep 127.0.0.1 \
--keyspace cadence_visibility \
update-schema \
--version 1.0
MySQL Schema Update
# Update cadence database
cadence-sql-tool \
--ep 127.0.0.1:3306 \
--db cadence \
--plugin mysql \
update-schema \
--version 1.0
# Update visibility database
cadence-sql-tool \
--ep 127.0.0.1:3306 \
--db cadence_visibility \
--plugin mysql \
update-schema \
--version 1.0
PostgreSQL Schema Update
# Update cadence database
cadence-sql-tool \
--ep 127.0.0.1:5432 \
--db cadence \
--plugin postgres \
update-schema \
--version 1.0
Schema Version Verification
# Check current schema version
cadence-cassandra-tool \
--ep 127.0.0.1 \
--keyspace cadence \
version
# Or for SQL
cadence-sql-tool \
--ep 127.0.0.1:3306 \
--db cadence \
--plugin mysql \
version
Schema updates are idempotent. Running the same update multiple times is safe.
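Because updates are idempotent, a pre-flight version check is optional, but it makes upgrade scripts explicit about what they expect. A minimal sketch, assuming the version subcommand prints a bare version string as in the verification step above:
# Skip update-schema when the schema is already at the target version.
# Re-running it anyway would be harmless, since updates are idempotent.
target="1.0"
current=$(cadence-cassandra-tool --ep 127.0.0.1 --keyspace cadence version)
if [ "$current" != "$target" ]; then
  cadence-cassandra-tool --ep 127.0.0.1 --keyspace cadence \
    update-schema --version "$target"
fi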
Rolling Upgrade Procedure
Service Upgrade Order
Upgrade services in this order to maintain compatibility:
- Worker (optional, low risk)
- Matching
- History
- Frontend
Upgrade one service tier completely before moving to the next tier.
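The order above can be driven by a small script that gates each tier on a successful rollout. A sketch, assuming Kubernetes deployments named cadence-<tier> with a container named cadence:
# Upgrade tiers in dependency order; stop at the first failed rollout.
IMAGE="ubercadence/server:1.0.0"
for tier in worker matching history frontend; do
  echo "Upgrading cadence-$tier to $IMAGE..."
  kubectl set image "deployment/cadence-$tier" "cadence=$IMAGE"
  kubectl rollout status "deployment/cadence-$tier" --timeout=15m || {
    echo "cadence-$tier rollout failed; aborting" >&2
    exit 1
  }
done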
Step-by-Step Upgrade
1. Upgrade Worker Service
# Worker service is stateless and can be upgraded immediately
kubectl set image deployment/cadence-worker \
cadence=ubercadence/server:1.0.0
# Or for systemd
systemctl stop cadence-worker
cp /opt/cadence/bin/cadence-server /opt/cadence/bin/cadence-server.old
cp /tmp/cadence-server-new /opt/cadence/bin/cadence-server
systemctl start cadence-worker
2. Upgrade Matching Service
# Rolling upgrade; pace is controlled by the deployment's RollingUpdate strategy
kubectl set image deployment/cadence-matching \
  cadence=ubercadence/server:1.0.0
kubectl rollout status deployment/cadence-matching
# Monitor for errors
kubectl logs -f deployment/cadence-matching --tail=100
3. Upgrade History Service
# History service requires careful shard migration
# Use small batches to minimize shard transfer impact
kubectl set image deployment/cadence-history \
cadence=ubercadence/server:1.0.0
# Monitor shard ownership changes
watch -n 5 'cadence admin shard list --print-full-shard'
History service upgrades trigger shard ownership transfers. Expect a temporary latency increase while shards rebalance.
4. Upgrade Frontend Service
# Frontend is stateless, safe to upgrade quickly
kubectl set image deployment/cadence-frontend \
cadence=ubercadence/server:1.0.0
# Verify health
for i in {1..10}; do
  curl http://frontend:9090/health || echo "Failed"
  sleep 1
done
Kubernetes Rolling Update Configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cadence-history
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1  # Upgrade one pod at a time
      maxSurge: 1        # Allow one extra pod during upgrade
  template:
    spec:
      containers:
        - name: history
          image: ubercadence/server:1.0.0
          readinessProbe:
            httpGet:
              path: /health
              port: 9091
            initialDelaySeconds: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 9091
            initialDelaySeconds: 60
            periodSeconds: 30
Version Compatibility
Service Version Skew
Cadence maintains backward compatibility:
- N to N+1: Fully compatible (e.g., 0.24.0 → 0.25.0)
- N to N+2: May work but not tested
- N to N+3+: Not supported
Do not skip more than one minor version during upgrades. For major version upgrades (e.g., 0.x → 1.x), upgrade incrementally.
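In script form, the incremental path is the tier loop from earlier repeated per release. A sketch for a single tier (the version list is illustrative; take the real sequence from the release history):
# Step through each intermediate minor release instead of jumping
# straight to the target; validate before taking the next step.
for version in 0.23.1 0.24.0 0.25.0; do
  kubectl set image deployment/cadence-history \
    "cadence=ubercadence/server:$version"
  kubectl rollout status deployment/cadence-history --timeout=15m
  # Run the post-upgrade validation checks below before continuing.
done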
Client SDK Compatibility
Client SDKs are forward and backward compatible:
| SDK Version | Server Versions |
|-------------|-----------------|
| Go SDK 1.x | Server 0.20+ |
| Java SDK 1.x | Server 0.20+ |
| Python SDK 1.x | Server 0.20+ |
Protocol Compatibility
- TChannel: Legacy, deprecated but still supported
- gRPC: Preferred, fully supported from 0.23.0+
- HTTP/JSON: Frontend only, experimental
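Both transports are configured per service in the static config. A fragment in the style of the sample development config (key names and ports reflect common defaults; verify against your server version):
services:
  frontend:
    rpc:
      port: 7933       # TChannel (legacy, deprecated)
      grpcPort: 7833   # gRPC (preferred, 0.23.0+)
      bindOnLocalHost: true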
Configuration Migration
Deprecated Configuration Options
Check for deprecated options before upgrading:
# Old (deprecated)
clusterMetadata:
  masterClusterName: "primary"

# New
clusterGroupMetadata:
  primaryClusterName: "primary"

# Old (deprecated)
clusterMetadata:
  clusterInformation:
    cluster1: {...}

# New
clusterGroupMetadata:
  clusterGroup:
    cluster1: {...}
Dynamic Config Migration
Dynamic config keys are generally backward compatible, but check release notes:
# Before 0.25.0
history.cacheSize: 1000
# After 0.25.0
history.historyCacheMaxSize: 1000
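With a file-based dynamic config client, a rename means carrying the old value over under the new key. A sketch of the file format (constraints left empty to apply globally):
history.historyCacheMaxSize:
  - value: 1000
    constraints: {}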
Post-Upgrade Validation
1. Verify Service Health
# Check all services
# Ports follow the conventions used earlier in this guide; adjust to your deployment
for svc in frontend:9090 history:9091 matching:9092 worker:9093; do
  name=${svc%%:*}; port=${svc#*:}
  echo "Checking $name..."
  curl -f "http://$name:$port/health" || echo "$name unhealthy!"
done
2. Verify Workflow Operations
# Start a test workflow
cadence --domain test-domain workflow start \
  --tasklist test-tasklist \
  --workflow_type TestWorkflow \
  --execution_timeout 60
# List workflows
cadence --domain test-domain workflow list
# Describe workflow
cadence --domain test-domain workflow describe \
  --workflow_id <wf-id>
3. Check Metrics
# Service restart count should increase by 1 per host
increase(cadence_restarts[10m])
# Error rate should remain low
rate(cadence_frontend_client_errors[5m]) < 0.01
# Latency should be normal
histogram_quantile(0.99, sum(rate(cadence_history_client_latency_bucket[5m])) by (le)) < 1.0
4. Verify Persistence
# Check schema version matches target
cadence-cassandra-tool --ep 127.0.0.1 --keyspace cadence version
# Verify no persistence errors
grep -i "persistence error" /var/log/cadence/*.log
Rollback Procedures
When to Rollback
Rollback if you observe:
- High error rates (>5% for >5 minutes)
- Service crashes (restart loops)
- Data corruption (workflow state inconsistencies)
- Schema migration failures
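These thresholds are easier to honor when encoded as alerts rather than judged ad hoc. A Prometheus alerting rule sketch (the request-total metric name is an assumption; adapt it to the metrics you actually emit):
groups:
  - name: cadence-upgrade
    rules:
      - alert: CadenceUpgradeHighErrorRate
        # >5% client errors sustained for 5 minutes, per the criteria above
        expr: |
          rate(cadence_frontend_client_errors[5m])
            / rate(cadence_frontend_client_requests[5m]) > 0.05
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Cadence error rate above 5% for 5m; consider rollback"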
Application Rollback
Quick Rollback (Kubernetes)
# Rollback to previous version
kubectl rollout undo deployment/cadence-history
# Or to specific revision
kubectl rollout undo deployment/cadence-history --to-revision=5
# Verify rollback
kubectl rollout status deployment/cadence-history
Manual Rollback (Systemd)
# Restore previous binary
for host in $CADENCE_HOSTS; do
  ssh $host "systemctl stop cadence-history && \
    cp /opt/cadence/bin/cadence-server.old \
      /opt/cadence/bin/cadence-server && \
    systemctl start cadence-history"
done
Schema Rollback
Schema rollbacks are risky and should be avoided. Most schema changes are additive and backward compatible.
If schema rollback is necessary:
Cassandra
# Restore from snapshot: copy the snapshot SSTables back into each
# table directory, then reload them into the running node
cp -r /var/lib/cassandra/data/cadence/<table>/snapshots/<snapshot>/* \
  /var/lib/cassandra/data/cadence/<table>/
nodetool refresh cadence <table>
# Or restore from backup
for host in $CASSANDRA_HOSTS; do
  ssh $host "systemctl stop cassandra && \
    rm -rf /var/lib/cassandra/data/cadence/* && \
    tar xzf /backup/cassandra-backup.tar.gz -C / && \
    systemctl start cassandra"
done
MySQL
# Restore from backup
mysql cadence < cadence_backup.sql
mysql cadence_visibility < visibility_backup.sql
# Verify restoration
mysql -e "SELECT curr_version FROM schema_version" cadence
Rollback Validation
# Verify service versions
for pod in $(kubectl get pods -l app=cadence-history -o name); do
  kubectl describe $pod | grep Image:
done
# Check schema versions
cadence-cassandra-tool --ep 127.0.0.1 --keyspace cadence version
# Test workflow operations
cadence --domain test-domain workflow start ...
Multi-Cluster Upgrades
For global domains with cross-DC replication:
Upgrade Sequence
- Upgrade standby clusters first
- Verify replication is working
- Upgrade active cluster
- Monitor cross-cluster traffic
# Upgrade standby cluster 1
kubectl config use-context standby-1
kubectl set image deployment/cadence-history ...
# Verify replication
cadence admin domain describe --domain global-domain
# Repeat for other standby clusters
# Finally upgrade active cluster
kubectl config use-context active
kubectl set image deployment/cadence-history ...
Cross-cluster RPC is version-tolerant. Standby clusters can run newer versions than active clusters temporarily.
Common Upgrade Issues
Schema Migration Timeouts
# Increase timeout for large tables
cadence-cassandra-tool \
--ep 127.0.0.1 \
--keyspace cadence \
--timeout 600 \
update-schema --version 1.0
Shard Ownership Churn
If shards keep transferring:
# Check for version skew
kubectl get pods -l app=cadence-history \
-o jsonpath='{.items[*].spec.containers[0].image}'
# Verify all hosts have same version
Persistence Version Mismatch
If service fails to start with schema version error:
# Check expected vs actual schema version
grep "schema version" /var/log/cadence/history.log
# Update schema to match
cadence-cassandra-tool update-schema --version X.Y
Automated Upgrade Testing
Canary Deployments
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: cadence-history
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cadence-history
  progressDeadlineSeconds: 3600
  service:
    port: 7934
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
Blue-Green Deployments
For maximum safety:
- Deploy new version to separate cluster
- Run synthetic tests
- Switch DNS/load balancer
- Keep old cluster as backup
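On Kubernetes, the traffic switch can be a Service selector flip between the two stacks. A sketch using a hypothetical track label to distinguish blue from green:
# Repoint the frontend Service at the green (new) stack; the blue stack
# keeps running as an instant rollback target.
kubectl patch service cadence-frontend \
  -p '{"spec":{"selector":{"app":"cadence-frontend","track":"green"}}}'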
Best Practices
1. Gradual Rollout
- Start with 1-2 hosts per service
- Monitor for 15-30 minutes
- Continue if no issues
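On Kubernetes, this bake-time pattern maps to pausing the rollout after the first pods update. A sketch; choose your own bake window:
# Begin the rollout, then freeze it while the first new pods bake
kubectl set image deployment/cadence-history cadence=ubercadence/server:1.0.0
kubectl rollout pause deployment/cadence-history
# ...monitor for 15-30 minutes...
kubectl rollout resume deployment/cadence-history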
2. Automated Validation
- Run synthetic workflows post-upgrade
- Check metrics automatically
- Alert on anomalies
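A synthetic check can be as simple as running a known workflow to completion; a sketch with placeholder domain, tasklist, and workflow type:
# 'workflow run' starts the workflow and waits for the result, so a
# non-zero exit code means the happy path is broken post-upgrade.
cadence --domain test-domain workflow run \
  --tasklist test-tasklist \
  --workflow_type TestWorkflow \
  --execution_timeout 60 || echo "Post-upgrade smoke test failed!"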
3. Rollback Readiness
- Keep previous binaries
- Maintain database backups
- Document rollback procedure
4. Communication
- Notify stakeholders before upgrade
- Provide status updates
- Document issues and resolutions
See Also