
Production Best Practices

Deploying Solace Agent Mesh in production requires careful attention to security, reliability, performance, and operational excellence. This guide covers essential best practices to ensure your deployment is production-ready.

Security Best Practices

Secrets Management

Never store sensitive information in code or configuration files; use a dedicated secret management solution instead.
Kubernetes Secrets:
apiVersion: v1
kind: Secret
metadata:
  name: sam-secrets
  namespace: solace-agent-mesh
type: Opaque
stringData:
  SOLACE_BROKER_PASSWORD: "your-password"
  LLM_SERVICE_API_KEY: "sk-..."
  SESSION_SECRET_KEY: "your-secret-key"
  DATABASE_URL: "postgresql://user:pass@host/db"
Best practices:
  • Use kubectl create secret instead of YAML files
  • Enable encryption at rest in your cluster
  • Use RBAC to restrict secret access
  • Consider Sealed Secrets or External Secrets Operator
AWS Secrets Manager:
import boto3
import json

def get_secret(secret_name):
    client = boto3.client('secretsmanager', region_name='us-east-1')
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response['SecretString'])

# In your deployment script
secrets = get_secret('solace-agent-mesh/prod')
Integration:
  • Use IAM roles for authentication
  • Enable automatic rotation
  • Use AWS Secrets and Configuration Provider (ASCP) for Kubernetes
HashiCorp Vault:
# Store secrets in Vault
vault kv put secret/sam/prod \
  broker_password="your-password" \
  llm_api_key="sk-..." \
  session_secret="your-secret"

# Use Vault Agent for injection
vault agent -config=agent-config.hcl
Integration:
  • Use Vault Agent for sidecar injection
  • Configure dynamic secrets for databases
  • Enable audit logging
Azure Key Vault:
# Store secrets in Key Vault
az keyvault secret set \
  --vault-name sam-prod-vault \
  --name llm-api-key \
  --value "sk-..."

# Use Azure Key Vault Provider for Secrets Store CSI Driver
kubectl apply -f secretproviderclass.yaml
Integration:
  • Use managed identities for authentication
  • Enable soft delete and purge protection
  • Use Azure Key Vault Provider for Kubernetes

TLS/SSL Configuration

Encrypt all communication channels.
Solace Event Broker:
broker:
  url: wss://your-broker.messaging.solace.cloud:443  # Use WSS, not WS
  # For self-hosted brokers with custom CA
  ssl_trust_store: /path/to/ca-cert.pem
Database Connections:
# PostgreSQL with TLS
DATABASE_URL="postgresql://user:pass@host:5432/sam?sslmode=require"

# With custom CA certificate
DATABASE_URL="postgresql://user:pass@host:5432/sam?sslmode=verify-full&sslrootcert=/path/to/ca.pem"
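To sanity-check that a DSN actually enforces TLS, a small helper can inspect the sslmode parameter (a minimal standard-library sketch; require, verify-ca, and verify-full are the modes that refuse plaintext connections):

```python
from urllib.parse import urlparse, parse_qs

def requires_tls(dsn: str) -> bool:
    """Return True if the PostgreSQL DSN enforces TLS (sslmode require or stricter)."""
    params = parse_qs(urlparse(dsn).query)
    # libpq's default sslmode is "prefer", which silently falls back to plaintext
    mode = params.get("sslmode", ["prefer"])[0]
    return mode in {"require", "verify-ca", "verify-full"}
```

This kind of check fits well in a deployment pre-flight script, failing fast before the application starts with an insecure connection string.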
Ingress/Load Balancer:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: solace-agent-mesh
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    nginx.ingress.kubernetes.io/ssl-protocols: "TLSv1.2 TLSv1.3"
spec:
  tls:
    - hosts:
        - agent-mesh.example.com
      secretName: agent-mesh-tls

Container Security

Run as a non-root user. Solace Agent Mesh containers run as UID 999 by default:
securityContext:
  runAsNonRoot: true
  runAsUser: 999
  fsGroup: 999
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL
Image Security:
# Scan images for vulnerabilities
trivy image solace/solace-agent-mesh:latest

# Use specific versions, not 'latest'
image: solace/solace-agent-mesh:1.2.3

# Verify image signatures (if available)
cosign verify gcr.io/gcp-maas-prod/solace-agent-mesh:1.2.3
Network Policies:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: sam-network-policy
spec:
  podSelector:
    matchLabels:
      app: solace-agent-mesh
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: ingress-nginx
      ports:
        - port: 5002
        - port: 8000
  egress:
    # Allow only necessary outbound connections
    - to:
        - namespaceSelector:
            matchLabels:
              name: kube-system
      ports:
        - port: 53
          protocol: UDP

Authentication and Authorization

Configure an Identity Provider (IdP):
# SAML/OIDC configuration
auth:
  provider: saml
  saml:
    entity_id: https://agent-mesh.example.com
    sso_url: https://idp.example.com/sso
    certificate: /path/to/idp-cert.pem
  
  # Or use OIDC
  oidc:
    issuer_url: https://accounts.google.com
    client_id: your-client-id
    client_secret: your-client-secret
API Authentication:
# Use API keys for programmatic access
API_KEY_HEADER="X-API-Key"
API_KEY_VALUE="your-secure-api-key"
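A client attaches the key as a request header; a minimal standard-library sketch (the endpoint URL is illustrative, and the key should come from your secrets manager, never from source):

```python
import urllib.request

API_KEY_HEADER = "X-API-Key"

def authed_request(url: str, api_key: str) -> urllib.request.Request:
    # Attach the API key header to an outbound request;
    # load api_key from your secrets manager at runtime.
    return urllib.request.Request(url, headers={API_KEY_HEADER: api_key})

req = authed_request("https://agent-mesh.example.com/api/v1/health", "your-secure-api-key")
```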

Data Protection

Encrypt Sensitive Data:
# Database encryption at rest (RDS example)
aws rds create-db-instance \
  --db-instance-identifier sam-prod \
  --storage-encrypted \
  --kms-key-id arn:aws:kms:region:account:key/key-id

# S3 bucket encryption
aws s3api put-bucket-encryption \
  --bucket your-bucket \
  --server-side-encryption-configuration '{
    "Rules": [{
      "ApplyServerSideEncryptionByDefault": {
        "SSEAlgorithm": "aws:kms",
        "KMSMasterKeyID": "arn:aws:kms:region:account:key/key-id"
      }
    }]
  }'
Backup and Recovery:
# Automated PostgreSQL backups
aws rds modify-db-instance \
  --db-instance-identifier sam-prod \
  --backup-retention-period 7 \
  --preferred-backup-window "03:00-04:00"

# S3 versioning and lifecycle
aws s3api put-bucket-versioning \
  --bucket your-bucket \
  --versioning-configuration Status=Enabled

High Availability and Reliability

Multi-Replica Deployment

Run multiple instances:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: solace-agent-mesh
spec:
  replicas: 3  # Minimum 3 for HA
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0  # Zero downtime
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: solace-agent-mesh
                topologyKey: kubernetes.io/hostname

Queue Configuration

Use durable queues for container environments:
broker:
  temporary_queue: false  # Use durable queues
Configure Queue Template in Solace Cloud:
  1. Navigate to Message VPNs → Queues → Templates
  2. Create template:
    • Queue Name Filter: sam/>
    • Respect TTL: true
    • Maximum TTL: 18000 seconds (5 hours)
    • Max Message Size: 10000000 bytes (10 MB)
This prevents message accumulation when agents restart.

Health Checks and Auto-Recovery

Configure comprehensive health checks:
containers:
  - name: agent-mesh
    # Startup probe - gives app time to initialize
    startupProbe:
      httpGet:
        path: /startup
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
      failureThreshold: 30  # 150s total
    
    # Readiness probe - removes from service when not ready
    readinessProbe:
      httpGet:
        path: /readyz
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
    
    # Liveness probe - restarts container when unhealthy
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 30
      failureThreshold: 3
    
# Restart policy is set at the pod level, as a sibling of `containers:`
restartPolicy: Always

Resource Quotas and Limits

Define resource boundaries:
resources:
  requests:
    cpu: 175m
    memory: 625Mi
  limits:
    cpu: 200m
    memory: 1Gi

# Namespace resource quota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: sam-quota
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 16Gi
    limits.cpu: "8"
    limits.memory: 32Gi
    persistentvolumeclaims: "10"

Database High Availability

Use managed database services with HA.
AWS RDS:
aws rds create-db-instance \
  --db-instance-identifier sam-prod \
  --multi-az \
  --db-instance-class db.r6g.large \
  --engine postgres \
  --engine-version 17.2
Azure Database:
az postgres flexible-server create \
  --name sam-prod \
  --resource-group sam-rg \
  --sku-name Standard_D4s_v3 \
  --high-availability Enabled \
  --tier GeneralPurpose
Google Cloud SQL:
gcloud sql instances create sam-prod \
  --database-version=POSTGRES_17 \
  --tier=db-custom-4-16384 \
  --availability-type=REGIONAL \
  --region=us-central1

Performance Optimization

Database Performance

Connection Pooling:
# Use PgBouncer for connection pooling
DATABASE_URL="postgresql://user:pass@pgbouncer:6432/sam?pool_size=20&max_overflow=10"
Indexing:
-- Add indexes for common queries
CREATE INDEX idx_session_user_id ON sessions(user_id);
CREATE INDEX idx_artifact_created_at ON artifacts(created_at DESC);
Database Tuning:
# PostgreSQL configuration
shared_buffers = 4GB
effective_cache_size = 12GB
maintenance_work_mem = 1GB
wal_buffers = 16MB
max_connections = 200
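The values above follow common PostgreSQL sizing rules of thumb: shared_buffers at roughly 25% of RAM and effective_cache_size at roughly 75%. A small helper makes the arithmetic explicit; for a 16 GB instance it reproduces the 4GB/12GB figures shown:

```python
def pg_memory_settings(total_ram_gb: int) -> dict:
    # Common rules of thumb, not hard requirements:
    # shared_buffers ~25% of RAM, effective_cache_size ~75%.
    return {
        "shared_buffers": f"{total_ram_gb // 4}GB",
        "effective_cache_size": f"{total_ram_gb * 3 // 4}GB",
    }
```

Treat the output as a starting point; tune against your actual workload and monitoring data.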

Object Storage Optimization

S3 Performance:
# Enable S3 Transfer Acceleration
aws s3api put-bucket-accelerate-configuration \
  --bucket your-bucket \
  --accelerate-configuration Status=Enabled

# Use intelligent tiering for cost optimization
aws s3api put-bucket-intelligent-tiering-configuration \
  --bucket your-bucket \
  --id sam-tiering \
  --intelligent-tiering-configuration '{
    "Id": "sam-tiering",
    "Status": "Enabled",
    "Tierings": [
      {"Days": 90, "AccessTier": "ARCHIVE_ACCESS"},
      {"Days": 180, "AccessTier": "DEEP_ARCHIVE_ACCESS"}
    ]
  }'

Autoscaling

Horizontal Pod Autoscaling:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sam-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: solace-agent-mesh
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Don't scale down too quickly
    scaleUp:
      stabilizationWindowSeconds: 60   # Scale up faster
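The HPA's core scaling decision follows the standard formula desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the min/max bounds; a sketch of the arithmetic (the real controller also applies tolerance and the stabilization windows configured above):

```python
import math

def desired_replicas(current: int, current_util: float, target_util: float,
                     min_replicas: int = 3, max_replicas: int = 10) -> int:
    # Core HPA formula: scale proportionally to how far the observed
    # metric is from its target, then clamp to the configured bounds.
    desired = math.ceil(current * current_util / target_util)
    return max(min_replicas, min(max_replicas, desired))
```

For example, 3 replicas at 140% of the 70% CPU target would scale to 6, while a quiet deployment is held at the 3-replica floor.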
Cluster Autoscaling: Enable cluster autoscaler in your Kubernetes platform to automatically add nodes when needed.

Caching Strategies

LLM Prompt Caching:
models:
  planning:
    model: openai/gpt-4
    cache_strategy: "5m"  # Cache prompts for 5 minutes
  
  general:
    model: openai/gpt-4
    cache_strategy: "1h"  # Cache for 1 hour
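Duration strings like "5m" and "1h" convert to seconds with a small parser; this is an illustration of the single-unit format used above, not the product's actual parsing code:

```python
def parse_duration(spec: str) -> int:
    """Convert a duration like '5m' or '1h' to seconds (single unit suffix only)."""
    units = {"s": 1, "m": 60, "h": 3600, "d": 86400}
    return int(spec[:-1]) * units[spec[-1]]
```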

Monitoring and Observability

Application Logging

Structured JSON Logging:
# logging_config.yaml
version: 1
formatters:
  jsonFormatter:
    "()": pythonjsonlogger.json.JsonFormatter
    format: "%(timestamp)s %(levelname)s %(name)s %(message)s"
    timestamp: "timestamp"

handlers:
  streamHandler:
    class: logging.StreamHandler
    formatter: jsonFormatter
    stream: ext://sys.stdout

root:
  level: INFO
  handlers: [streamHandler]
Centralized Logging:
# FluentBit DaemonSet for log collection
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
data:
  fluent-bit.conf: |
    [INPUT]
        Name              tail
        Path              /var/log/containers/*solace-agent-mesh*.log
        Parser            docker
        Tag               sam.*
    
    [OUTPUT]
        Name              es
        Match             sam.*
        Host              elasticsearch.logging.svc
        Port              9200
        Index             sam-logs
        Type              _doc

Metrics and Monitoring

Prometheus Metrics:
apiVersion: v1
kind: Service
metadata:
  name: solace-agent-mesh-metrics
  labels:
    app: solace-agent-mesh
spec:
  ports:
    - name: metrics
      port: 9090
      targetPort: 9090
  selector:
    app: solace-agent-mesh

---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: solace-agent-mesh
spec:
  selector:
    matchLabels:
      app: solace-agent-mesh
  endpoints:
    - port: metrics
      interval: 30s
Key Metrics to Monitor:
  • Request rate and latency
  • Error rates (4xx, 5xx)
  • LLM API call latency and costs
  • Database connection pool usage
  • Message queue depth
  • CPU and memory usage
  • Disk I/O and storage usage

Alerting

PrometheusRule for Alerts:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: sam-alerts
spec:
  groups:
    - name: solace-agent-mesh
      interval: 30s
      rules:
        - alert: HighErrorRate
          expr: |
            rate(http_requests_total{status=~"5.."}[5m]) > 0.05
          for: 5m
          annotations:
            summary: "High error rate detected"
          labels:
            severity: critical
        
        - alert: PodCrashLooping
          expr: |
            rate(kube_pod_container_status_restarts_total[15m]) > 0
          for: 5m
          annotations:
            summary: "Pod is crash looping"
          labels:
            severity: warning
        
        - alert: DatabaseConnectionFailure
          expr: |
            sam_database_connection_failures_total > 0
          for: 2m
          annotations:
            summary: "Database connection failures"
          labels:
            severity: critical
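The HighErrorRate expression uses rate(), which is the per-second increase of a counter over the window; the arithmetic behind the 0.05 threshold can be sketched as:

```python
def error_rate(count_now: float, count_before: float, window_seconds: int = 300) -> float:
    # rate(metric[5m]) ~= counter increase over the window, per second
    return (count_now - count_before) / window_seconds

def should_alert(rate_per_second: float, threshold: float = 0.05) -> bool:
    # Mirrors the alert condition: sustained 5xx rate above the threshold
    return rate_per_second > threshold
```

So 30 additional 5xx responses over a 5-minute window is a rate of 0.1/s, which trips the alert once it persists for the configured 5 minutes.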

Distributed Tracing

OpenTelemetry Integration:
environment:
  OTEL_EXPORTER_OTLP_ENDPOINT: http://jaeger-collector:4317
  OTEL_SERVICE_NAME: solace-agent-mesh
  OTEL_TRACES_SAMPLER: parentbased_traceidratio
  OTEL_TRACES_SAMPLER_ARG: "0.1"  # Sample 10% of traces
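A traceidratio sampler keeps a trace when the low 64 bits of its trace ID fall below ratio × 2^64, so every service makes the same keep/drop decision for a given trace. A simplified sketch of that decision (the real OpenTelemetry sampler also honors the parent span's decision, as parentbased_ implies):

```python
TRACE_ID_LIMIT = 1 << 64  # the sampler considers the low 64 bits of the 128-bit trace ID

def sampled(trace_id: int, ratio: float = 0.1) -> bool:
    # Deterministic: the same trace ID always yields the same decision,
    # so roughly `ratio` of all traces are kept end to end.
    bound = int(ratio * TRACE_ID_LIMIT)
    return (trace_id % TRACE_ID_LIMIT) < bound
```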

Operational Excellence

CI/CD Pipeline

Example GitLab CI:
stages:
  - test
  - build
  - deploy

variables:
  DOCKER_IMAGE: gcr.io/my-project/solace-agent-mesh

test:
  stage: test
  script:
    - pytest tests/
    - trivy image --severity HIGH,CRITICAL solace/solace-agent-mesh:latest

build:
  stage: build
  script:
    - docker build -t $DOCKER_IMAGE:$CI_COMMIT_SHA .
    - docker push $DOCKER_IMAGE:$CI_COMMIT_SHA
  only:
    - main

deploy-staging:
  stage: deploy
  script:
    - helm upgrade --install sam-staging ./charts/sam \
        --set image.tag=$CI_COMMIT_SHA \
        -f values-staging.yaml \
        -n staging
  only:
    - main

deploy-production:
  stage: deploy
  script:
    - helm upgrade --install sam-prod ./charts/sam \
        --set image.tag=$CI_COMMIT_SHA \
        -f values-production.yaml \
        -n production
  when: manual
  only:
    - main

Disaster Recovery

Backup Strategy:
#!/bin/bash
# backup-sam.sh - Daily backup script

# Backup PostgreSQL
pg_dump -h $DB_HOST -U $DB_USER -d sam | \
  gzip > /backups/sam-db-$(date +%Y%m%d).sql.gz

# Sync to S3
aws s3 sync /backups/ s3://sam-backups/$(date +%Y/%m)/

# Retain backups for 30 days
find /backups/ -name "*.sql.gz" -mtime +30 -delete
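The 30-day retention step can also be expressed as a pure function over the sam-db-YYYYMMDD.sql.gz naming scheme used above (a sketch; the find command in the script keys on file mtime rather than the filename date):

```python
from datetime import datetime, timedelta

def expired_backups(names, retention_days=30, now=None):
    """Select sam-db-YYYYMMDD.sql.gz backups older than the retention window."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=retention_days)
    expired = []
    for name in names:
        stamp = name.removesuffix(".sql.gz").rsplit("-", 1)[-1]
        try:
            taken = datetime.strptime(stamp, "%Y%m%d")
        except ValueError:
            continue  # skip files that don't match the naming scheme
        if taken < cutoff:
            expired.append(name)
    return expired
```

Date-stamped names make retention decisions auditable even when mtimes are disturbed by a restore or copy.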
Recovery Testing:
# Test database restore quarterly (backups are gzip-compressed)
gunzip -c sam-db-YYYYMMDD.sql.gz | psql -h $TEST_DB_HOST -U $DB_USER -d sam

# Verify artifact restoration
aws s3 sync s3://sam-backups/artifacts/ /tmp/test-restore/

Configuration Management

GitOps with ArgoCD:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: solace-agent-mesh
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/sam-configs
    targetRevision: HEAD
    path: kubernetes/production
    helm:
      valueFiles:
        - values-production.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: solace-agent-mesh
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true

Documentation

Maintain runbooks for common scenarios:
  • Deployment procedures
  • Rollback procedures
  • Incident response playbooks
  • Disaster recovery procedures
  • Capacity planning guidelines
  • Security incident response

Cost Optimization

Resource Right-Sizing

Monitor and adjust:
# Analyze resource usage
kubectl top pods -n solace-agent-mesh
kubectl top nodes

# Use Vertical Pod Autoscaler for recommendations
kubectl describe vpa sam-vpa -n solace-agent-mesh

Storage Optimization

Lifecycle Policies:
{
  "Rules": [
    {
      "Id": "MoveToIA",
      "Status": "Enabled",
      "Transitions": [
        {
          "Days": 90,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 180,
          "StorageClass": "GLACIER"
        }
      ]
    },
    {
      "Id": "DeleteOldArtifacts",
      "Status": "Enabled",
      "Expiration": {
        "Days": 365
      }
    }
  ]
}
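The lifecycle rules map an object's age to a storage class; the mapping can be made explicit (EXPIRED here is only an illustrative label for deleted objects, not an actual S3 storage class):

```python
def storage_class(age_days: int) -> str:
    # Mirrors the lifecycle rules above: IA at 90 days,
    # Glacier at 180 days, deletion at 365 days.
    if age_days >= 365:
        return "EXPIRED"
    if age_days >= 180:
        return "GLACIER"
    if age_days >= 90:
        return "STANDARD_IA"
    return "STANDARD"
```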

LLM Cost Management

Monitor and optimize:
# Use cheaper models for non-critical tasks
models:
  planning:
    model: openai/gpt-4  # Premium model
  general:
    model: openai/gpt-3.5-turbo  # Cost-effective
  
# Set token limits
max_tokens: 2000

# Enable caching
cache_strategy: "5m"

Compliance and Governance

Audit Logging

Enable comprehensive audit trails:
# Kubernetes audit policy
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: RequestResponse
    resources:
      - group: ""
        resources: ["secrets", "configmaps"]
    namespaces: ["solace-agent-mesh"]

Data Residency

Ensure data stays in required regions:
# Deploy in specific region
REGION=eu-west-1

# Use regional database
DATABASE_URL=postgresql://...:5432/sam?options=-c%20timezone=UTC

# Use regional S3 bucket
ARTIFACT_STORAGE_S3_REGION=eu-west-1

Access Control

Implement the principle of least privilege:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: sam-operator
  namespace: solace-agent-mesh
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "update", "patch"]
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]

Checklist

Before going to production, verify:
  • Secrets stored in secure vault (not in code)
  • TLS enabled for all connections
  • Containers run as non-root
  • Network policies configured
  • Image vulnerability scanning enabled
  • Authentication/authorization configured
  • Database encryption at rest enabled
  • S3 bucket encryption enabled
  • Regular security audits scheduled
  • Minimum 3 replicas configured
  • Health checks implemented
  • Resource limits defined
  • Durable queues configured
  • Database HA enabled
  • Backup and restore tested
  • Disaster recovery plan documented
  • Auto-scaling configured
  • Structured logging enabled
  • Centralized log aggregation
  • Metrics collection configured
  • Alerts defined and tested
  • Distributed tracing enabled
  • Dashboards created
  • On-call rotation established
  • CI/CD pipeline configured
  • GitOps workflow established
  • Runbooks documented
  • Incident response procedures
  • Regular backup testing
  • Capacity planning done
  • Cost monitoring enabled
  • Compliance requirements met

Next Steps

  • Health Checks: configure comprehensive health monitoring
  • Observability: set up monitoring and tracing
  • Logging: configure application logging
  • Configuration: complete configuration reference
