
Production Deployment Best Practices

Guidelines for deploying Aurora in production environments with security, reliability, and scalability.

Security

Secrets Management

Critical Security Requirements:
  1. Never commit secrets to version control
  2. Use strong, randomly generated passwords
  3. Rotate credentials regularly
  4. Use managed secrets services when available

Generate Strong Secrets

# Generate random secrets (32-byte base64)
openssl rand -base64 32

# Generate for all required secrets:
# - POSTGRES_PASSWORD
# - FLASK_SECRET_KEY
# - AUTH_SECRET
# - SEARXNG_SECRET
# - VAULT_TOKEN (from vault init)
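A small script like the following can generate them all in one pass. It is a sketch that appends one value per secret to a local .env file (the variable set is taken from the list above; VAULT_TOKEN is excluded because it comes from vault init, not random generation):

```shell
#!/bin/sh
# Generate one strong random value per required secret and append to .env.
set -eu

ENV_FILE=".env"
for name in POSTGRES_PASSWORD FLASK_SECRET_KEY AUTH_SECRET SEARXNG_SECRET; do
  secret="$(openssl rand -base64 32)"
  printf '%s=%s\n' "$name" "$secret" >> "$ENV_FILE"
done

# Restrict permissions, as recommended for Docker Compose deployments below
chmod 600 "$ENV_FILE"
```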

Kubernetes Secrets

For Kubernetes deployments, consider a secrets operator. External Secrets Operator:
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: aws-secrets-manager
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: aurora-secrets
spec:
  secretStoreRef:
    name: aws-secrets-manager
  target:
    name: aurora-app-secrets
  data:
    - secretKey: FLASK_SECRET_KEY
      remoteRef:
        key: aurora/flask-secret
Sealed Secrets:
# Encrypt secrets for git
kubeseal --format yaml < secret.yaml > sealed-secret.yaml
git add sealed-secret.yaml

Docker Compose Secrets

For Docker Compose, use a .env file with restricted permissions:
chmod 600 .env
chown root:root .env  # Or service account user
Or use Docker secrets:
secrets:
  postgres_password:
    file: ./secrets/postgres_password.txt

services:
  postgres:
    secrets:
      - postgres_password
    environment:
      POSTGRES_PASSWORD_FILE: /run/secrets/postgres_password

Vault Configuration

Auto-Unseal with Cloud KMS

For production, configure Vault auto-unseal with a cloud KMS. AWS KMS:
vault:
  seal:
    type: "awskms"
    awskms:
      region: "us-east-1"
      kms_key_id: "alias/aurora-vault-unseal"
GCP Cloud KMS:
vault:
  seal:
    type: "gcpckms"
    gcpckms:
      project: "your-project-id"
      region: "us-central1"
      key_ring: "vault-keyring"
      crypto_key: "vault-unseal-key"

Vault High Availability

For HA, run Vault with integrated Raft storage:
replicaCounts:
  vault: 3

vault:
  ha:
    enabled: true
    raft:
      enabled: true

Network Security

Kubernetes NetworkPolicies

Restrict pod-to-pod communication:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: aurora-server-policy
  namespace: aurora
spec:
  podSelector:
    matchLabels:
      app: aurora-server
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: aurora-frontend
      ports:
        - protocol: TCP
          port: 5080
  egress:
    # Allow DNS
    - to:
        - namespaceSelector:
            matchLabels:
              name: kube-system
      ports:
        - protocol: UDP
          port: 53
    # Allow database
    - to:
        - podSelector:
            matchLabels:
              app: postgres
      ports:
        - protocol: TCP
          port: 5432

Pod Isolation for Untrusted Code

Enable pod isolation for terminal commands:
config:
  ENABLE_POD_ISOLATION: "true"
  TERMINAL_NAMESPACE: "untrusted"
  TERMINAL_RUNTIME_CLASS: "gvisor"  # Sandbox runtime
The chart creates NetworkPolicies that:
  • Block terminal pods from accessing cluster services (Vault, DB, etc.)
  • Allow internet access for cloud API calls
  • Isolate untrusted workloads
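The generated policy is roughly equivalent to the sketch below (illustrative only; the chart's actual manifests may differ). It denies all traffic by default in the terminal namespace, then allows DNS and non-cluster egress:

```yaml
# Illustrative sketch of a terminal-isolation policy: deny everything,
# then allow DNS plus internet egress while blocking cluster-internal ranges
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: terminal-isolation
  namespace: untrusted
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  egress:
    # Allow DNS
    - to:
        - namespaceSelector:
            matchLabels:
              name: kube-system
      ports:
        - protocol: UDP
          port: 53
    # Allow internet access, but not private (cluster-internal) ranges
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 10.0.0.0/8
              - 172.16.0.0/12
              - 192.168.0.0/16
```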

TLS/HTTPS Configuration

Ingress TLS with cert-manager

ingress:
  enabled: true
  tls:
    enabled: true
    certManager:
      enabled: true
      issuer: "letsencrypt-prod"
      email: "[email protected]"
  
  hosts:
    frontend: "aurora.example.com"
    api: "api.aurora.example.com"
    ws: "ws.aurora.example.com"
Install cert-manager:
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.0/cert-manager.yaml

# Create ClusterIssuer
kubectl apply -f - <<EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: [email protected]
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
      - http01:
          ingress:
            class: nginx
EOF

Internal TLS (Service Mesh)

For encrypted internal traffic, use a service mesh. Istio:
istioctl install --set profile=default
kubectl label namespace aurora istio-injection=enabled
Linkerd:
linkerd install | kubectl apply -f -
kubectl annotate namespace aurora linkerd.io/inject=enabled
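With Istio, sidecar injection alone does not require mTLS; to enforce it for every workload in the namespace, add a PeerAuthentication policy (a standard Istio resource, shown here as a minimal example):

```yaml
# Require mutual TLS for all pods in the aurora namespace
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: aurora
spec:
  mtls:
    mode: STRICT
```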

Access Control

Kubernetes RBAC

Limit who can access Aurora resources:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: aurora-admin
  namespace: aurora
rules:
  - apiGroups: ["", "apps", "batch"]
    resources: ["*"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: aurora-admin-binding
  namespace: aurora
subjects:
  - kind: User
    name: [email protected]
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: aurora-admin
  apiGroup: rbac.authorization.k8s.io

Rate Limiting

Enable API rate limiting:
config:
  RATE_LIMITING_ENABLED: "true"
  RATE_LIMIT_HEADERS_ENABLED: "true"

secrets:
  app:
    RATE_LIMIT_BYPASS_TOKEN: "<secure-token-for-automation>"

Reliability

High Availability

Replica Configuration

replicaCounts:
  # Scalable services (3+ for HA)
  server: 3
  celeryWorker: 5
  chatbot: 2
  frontend: 2
  
  # Single instance (requires additional config for HA)
  celeryBeat: 1  # DO NOT scale (causes duplicate tasks)
  postgres: 1    # Use managed DB (RDS, Cloud SQL) for HA
  redis: 1       # Use managed Redis (ElastiCache) for HA
  vault: 1       # Configure Raft storage for HA
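When moving Postgres and Redis to managed services, disable the in-cluster instances and point Aurora at the external endpoints. The sketch below is illustrative only; the config key names are placeholders, so check the chart's values.yaml for the settings your version actually expects:

```yaml
# Sketch: disable in-cluster Postgres/Redis, use managed endpoints instead.
# Key names under `config` are illustrative placeholders.
replicaCounts:
  postgres: 0
  redis: 0

config:
  POSTGRES_HOST: "aurora.cluster-xyz.us-east-1.rds.amazonaws.com"
  POSTGRES_PORT: "5432"
  REDIS_HOST: "aurora-cache.xyz.use1.cache.amazonaws.com"
  REDIS_PORT: "6379"
```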

Pod Disruption Budgets

Prevent simultaneous pod evictions:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: aurora-server-pdb
  namespace: aurora
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: aurora-server

Health Checks

Ensure proper health check configuration:
# Kubernetes
livenessProbe:
  httpGet:
    path: /health
    port: 5080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /ready
    port: 5080
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 2

Resource Management

Resource Requests and Limits

Set appropriate resource limits:
resources:
  server:
    requests:
      cpu: "500m"
      memory: "1Gi"
    limits:
      cpu: "2000m"
      memory: "4Gi"
  
  celeryWorker:
    requests:
      cpu: "200m"
      memory: "2Gi"
    limits:
      cpu: "1000m"
      memory: "8Gi"
  
  postgres:
    requests:
      cpu: "1000m"
      memory: "2Gi"
    limits:
      cpu: "4000m"
      memory: "8Gi"

Horizontal Pod Autoscaling

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: aurora-server-hpa
  namespace: aurora
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: aurora-oss-server
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80

Backup and Recovery

PostgreSQL Backups

Automated backups with CronJob:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
  namespace: aurora
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: postgres:15-alpine
              env:
                - name: PGHOST
                  value: aurora-oss-postgres
                - name: PGUSER
                  value: aurora
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: aurora-db-secret
                      key: POSTGRES_PASSWORD
              command:
                - /bin/sh
                - -c
                - |
                  # NOTE: postgres:15-alpine does not include the AWS CLI;
                  # install it at runtime or bake a custom backup image
                  pg_dump -Fc aurora_db > /backup/aurora_$(date +%Y%m%d_%H%M%S).dump
                  aws s3 cp /backup/*.dump s3://aurora-backups/postgres/
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              emptyDir: {}
          restartPolicy: OnFailure
Managed Database Backups: if you run a managed database, rely on the provider's automated backups:
  • AWS RDS: Automated snapshots, point-in-time recovery
  • GCP Cloud SQL: Automated backups, replicas
  • Azure Database: Geo-redundant backups

Volume Snapshots

# Create VolumeSnapshot
kubectl apply -f - <<EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-snapshot-$(date +%Y%m%d)
  namespace: aurora
spec:
  volumeSnapshotClassName: csi-snapclass
  source:
    persistentVolumeClaimName: data-aurora-oss-postgres-0
EOF

Disaster Recovery Plan

  1. Regular backups: Daily PostgreSQL dumps, hourly volume snapshots
  2. Multi-region replication: Replicate backups to separate region
  3. Test restores: Monthly restore tests to staging environment
  4. Documentation: Maintain runbook for recovery procedures
  5. Monitoring: Alert on backup failures
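A restore test (step 3) can be scripted by provisioning a fresh PVC from an existing VolumeSnapshot; the CSI driver clones the snapshot data into the new volume. Storage class and snapshot names below are placeholders:

```yaml
# Restore test: provision a new PVC from an existing VolumeSnapshot
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-restore-test
  namespace: aurora
spec:
  storageClassName: csi-storage-class    # placeholder: your CSI storage class
  dataSource:
    name: postgres-snapshot-20240101     # placeholder: an existing snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```

Mount this PVC in a throwaway Postgres pod and run sanity queries to confirm the data is intact.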

Monitoring and Observability

Prometheus Metrics

Enable Prometheus monitoring:
config:
  OTEL_SERVICE_NAME: "aurora-production"
  OTEL_EXPORTER_OTLP_ENDPOINT: "http://prometheus:9090"
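Note that Prometheus only ingests OTLP directly when its OTLP receiver is enabled; a common alternative is to point the endpoint at an OpenTelemetry Collector that re-exposes the metrics for Prometheus to scrape. A minimal collector pipeline (standard otel-collector-contrib configuration; the port choices are illustrative) looks like:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"   # Prometheus scrapes this port

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```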

Logging

Centralized logging with ELK or Loki:
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: aurora
data:
  fluent-bit.conf: |
    [INPUT]
        Name              tail
        Path              /var/log/containers/aurora-*.log
        Parser            docker
        Tag               aurora.*
    
    [OUTPUT]
        Name              es
        Match             aurora.*
        Host              elasticsearch.logging.svc.cluster.local
        Port              9200
        Index             aurora
        Type              _doc

Alerting

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: aurora-alerts
  namespace: aurora
spec:
  groups:
    - name: aurora
      interval: 30s
      rules:
        - alert: AuroraPodDown
          expr: up{job="aurora-server"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Aurora server pod is down"
        
        - alert: HighMemoryUsage
          expr: container_memory_usage_bytes{pod=~"aurora-.*"} / container_spec_memory_limit_bytes > 0.9
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is using > 90% memory"

Operations

Deployment Strategy

Rolling Updates

apiVersion: apps/v1
kind: Deployment
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0

Blue-Green Deployment

# Deploy new version to separate namespace
helm install aurora-v2 ./deploy/helm/aurora \
  --namespace aurora-v2 --create-namespace \
  -f values.generated.yaml

# Switch traffic via ingress
kubectl patch ingress aurora-oss -n aurora -p '{"spec":{"rules":[{"host":"api.aurora.example.com","http":{"paths":[{"path":"/","pathType":"Prefix","backend":{"service":{"name":"aurora-v2-server","port":{"number":5080}}}}]}}]}}'

# Cleanup old version
helm uninstall aurora-oss -n aurora

Maintenance Windows

Database Migrations

# Run migrations before deployment
kubectl exec -it deployment/aurora-oss-server -n aurora -- \
  python -m flask db upgrade

# Verify schema version
kubectl exec -it statefulset/aurora-oss-postgres -n aurora -- \
  psql -U aurora -d aurora_db -c "SELECT version_num FROM alembic_version;"
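For repeatable rollouts, the migration can also run as a one-off Kubernetes Job before the new pods start. This is a hedged sketch: the image tag and secret name are illustrative and should match your actual release:

```yaml
# Sketch: run migrations as a pre-deploy Job (image/secret names illustrative)
apiVersion: batch/v1
kind: Job
metadata:
  name: aurora-db-migrate
  namespace: aurora
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: aurora/server:NEW_VERSION   # the image you are deploying
          command: ["python", "-m", "flask", "db", "upgrade"]
          envFrom:
            - secretRef:
                name: aurora-app-secrets     # DB credentials for the app
```

Wait for the Job to complete (`kubectl wait --for=condition=complete job/aurora-db-migrate -n aurora`) before rolling out the new Deployment.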

Scaling Down for Maintenance

# Scale to 0
kubectl scale deployment aurora-oss-server --replicas=0 -n aurora

# Perform maintenance
# ...

# Scale back up
kubectl scale deployment aurora-oss-server --replicas=3 -n aurora

Cost Optimization

Use Managed Services

Replace in-cluster stateful services with managed alternatives:
  • Database: RDS, Cloud SQL, Azure Database (automated backups, HA)
  • Redis: ElastiCache, Memorystore, Azure Cache (managed persistence)
  • Object Storage: S3, GCS, Azure Blob (eliminate SeaweedFS)
  • Secrets: AWS Secrets Manager, GCP Secret Manager, Azure Key Vault

Resource Right-Sizing

Monitor actual usage and adjust:
# Check resource usage
kubectl top pods -n aurora
kubectl top nodes

# Use VPA recommendations
kubectl get vpa -n aurora
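If no VPA objects exist yet, create one per workload in recommendation-only mode (this requires the Vertical Pod Autoscaler components to be installed in the cluster; the target name matches the HPA example above):

```yaml
# Recommendation-only VPA: reports suggested requests without evicting pods
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: aurora-server-vpa
  namespace: aurora
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: aurora-oss-server
  updatePolicy:
    updateMode: "Off"   # recommend only; do not apply changes automatically
```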

Node Autoscaling

# Cluster Autoscaler (cloud providers)
# Scales nodes based on pending pods

Checklist

Before going to production:
  • All secrets generated with openssl rand -base64 32
  • Vault configured with auto-unseal (cloud KMS)
  • TLS/HTTPS enabled with valid certificates
  • External object storage configured (S3, GCS, etc.)
  • Database backups configured and tested
  • Monitoring and alerting set up
  • Resource requests and limits configured
  • Replica counts set for HA (3+ for critical services)
  • NetworkPolicies applied
  • Pod isolation enabled (ENABLE_POD_ISOLATION=true)
  • Disaster recovery plan documented
  • Runbooks created for common operations
  • Rate limiting enabled
  • RBAC configured for team access
  • Log aggregation configured
  • Load testing performed

Next Steps

Scaling Guide

Scale Aurora for growing workloads

Monitoring Setup

Set up comprehensive monitoring

Backup & Recovery

Implement backup strategies

Troubleshooting

Common issues and solutions
