Overview
k8s-scheduler consists of two main components:
- Server: Go backend serving REST API and React SPA
- Operator: Kubernetes operator managing UserDeployment custom resources
Deployment Flow
Infrastructure Layers
| Layer | Components |
|---|---|
| 1-infrastructure/ | VPC, EKS, RDS PostgreSQL, Vault, Tailscale |
| 2-platform/ | ALB Controller, cert-manager, External Secrets Operator |
| 3-apps/ | Traefik, external-dns, ArgoCD, Vault secrets |
Prerequisites
Before deploying k8s-scheduler, ensure the following components are installed:
Required Components
- PostgreSQL - Database for users, orgs, teams, deployments
- AWS ALB Controller - Creates ALB from Ingress resources (on AWS)
- Traefik - Routes wildcard traffic to user deployments
- External DNS - Auto-creates DNS records from ingress
- cert-manager - TLS certificates for server and deployments
- Cloudflare - DNS provider for External DNS and cert-manager
Recommended Components
- HashiCorp Vault - Centralized secrets management
- External Secrets Operator - Syncs Vault secrets to K8s Secrets
- Vault Agent Injector - Injects secrets into server pod
Optional Components
- Stripe - Subscription billing
- SendGrid/SMTP - Team invitation emails
Step 1: Configure Terraform
Step 2: Setup Vault Policy
The setup script creates:
- Vault policy k8s-scheduler - grants access to user secret paths
- Kubernetes auth role k8s-scheduler - binds policy to service account
Create ~/.vault-secrets/vault.env with your Vault token:
~/.vault-secrets/vault.env
Vault Secret Paths
Terraform writes these paths automatically:
| Path | Keys | Required? |
|---|---|---|
| secret/k8s-scheduler/database | connection_string | Yes |
| secret/k8s-scheduler/google | client_id, client_secret | Yes (unless DEV_MODE) |
| secret/k8s-scheduler/email | provider, smtp_host, smtp_port, smtp_user, smtp_password, smtp_from | No |
| secret/k8s-scheduler/ai | anthropic_api_key | No |
| secret/k8s-scheduler/stripe | api_key, webhook_secret | No |
| secret/k8s-scheduler/secrets | keycloak_admin_password, grafana_cloud_prometheus_password, grafana_cloud_loki_password | No |
All paths must exist in Vault even if empty. The Vault Agent template will fail if a path is missing.
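The rule above (every path must exist, and only some keys are mandatory) can be sketched as a small pre-flight check. This is a minimal illustration, not part of k8s-scheduler: the function name `check_vault_paths` and the `found` argument (a mapping from path to the keys already read from Vault by whatever client you use) are hypothetical, and the check assumes that "Yes (unless DEV_MODE)" means the keys may be absent when dev mode is on.

```python
from typing import Dict, List, Mapping, Tuple

ALWAYS, UNLESS_DEV_MODE, OPTIONAL = "always", "unless_dev_mode", "optional"

# Paths, keys, and requirement levels copied from the table above.
EXPECTED: Dict[str, Tuple[Tuple[str, ...], str]] = {
    "secret/k8s-scheduler/database": (("connection_string",), ALWAYS),
    "secret/k8s-scheduler/google": (("client_id", "client_secret"), UNLESS_DEV_MODE),
    "secret/k8s-scheduler/email": (
        ("provider", "smtp_host", "smtp_port", "smtp_user", "smtp_password", "smtp_from"),
        OPTIONAL,
    ),
    "secret/k8s-scheduler/ai": (("anthropic_api_key",), OPTIONAL),
    "secret/k8s-scheduler/stripe": (("api_key", "webhook_secret"), OPTIONAL),
    "secret/k8s-scheduler/secrets": (
        (
            "keycloak_admin_password",
            "grafana_cloud_prometheus_password",
            "grafana_cloud_loki_password",
        ),
        OPTIONAL,
    ),
}


def check_vault_paths(
    found: Mapping[str, Mapping[str, str]], dev_mode: bool = False
) -> List[str]:
    """Return a list of problems; an empty list means the layout looks complete.

    `found` is hypothetical input: path -> keys read from Vault beforehand.
    """
    problems: List[str] = []
    for path, (keys, requirement) in EXPECTED.items():
        if path not in found:
            # Every path must exist, even if it holds no keys.
            problems.append(f"missing path: {path}")
            continue
        required = requirement == ALWAYS or (
            requirement == UNLESS_DEV_MODE and not dev_mode
        )
        if required:
            for key in keys:
                if not found[path].get(key):
                    problems.append(f"missing key: {path} -> {key}")
    return problems
```

Running a check like this before installing could surface a missing path early, rather than as a Vault Agent template failure in the server pod.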
Step 3: Deploy with Helm
Install
The Helm chart installs:
- Custom Resource Definitions (UserDeployment, AgentTask, Workflow)
- RBAC (ServiceAccounts, ClusterRole, ClusterRoleBinding)
- Server deployment with Vault Agent sidecar
- Operator deployment
- ClusterSecretStore for External Secrets Operator
- Ingress (ALB) for the server
- ConfigMaps for templates
Helm Values
View all available configuration options:
Key Helm Values
| Value | Description | Default |
|---|---|---|
| domain | Required. Base domain for the app | example.com |
| image.server.repository | Server container image | ghcr.io/opsnorth/k8s-scheduler-server |
| image.operator.repository | Operator container image | ghcr.io/opsnorth/k8s-scheduler-operator |
| image.server.tag | Server image tag | latest |
| server.replicas | Server pod count | 1 |
| operator.replicas | Operator pod count | 1 |
| operator.leaderElect | Enable leader election for HA | true |
| ingress.enabled | Create ALB Ingress | true |
| ingress.className | Ingress class | alb |
| vault.agentInject | Enable Vault Agent sidecar | true |
| vault.address | Vault server URL | http://vault.vault.svc.cluster.local:8200 |
| secretStore.enabled | Create ClusterSecretStore | true |
| secretStore.name | ClusterSecretStore name | vault-backend |
| session.backend | Session storage backend | postgres |
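The dotted names in the table map onto nested keys in a values file. As a rough sketch of that mapping, the hypothetical helper `build_values` below expands dotted overrides into a nested structure and rejects input without `domain`, which the table marks as required. The helper and the example domain are illustrative assumptions, not part of the chart.

```python
import json
from typing import Any, Dict, Mapping


def build_values(overrides: Mapping[str, Any]) -> Dict[str, Any]:
    """Expand dotted keys such as "server.replicas" into nested Helm values."""
    if not overrides.get("domain"):
        # The table marks `domain` as required.
        raise ValueError("domain is required")
    values: Dict[str, Any] = {}
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = values
        for part in parents:
            node = node.setdefault(part, {})
            if not isinstance(node, dict):
                raise ValueError(f"{dotted_key} conflicts with a scalar value")
        node[leaf] = value
    return values


if __name__ == "__main__":
    example = build_values(
        {
            "domain": "apps.example.org",  # hypothetical domain
            "server.replicas": 2,
            "operator.replicas": 2,
            "operator.leaderElect": True,
        }
    )
    # JSON is also valid YAML, so this output should be usable as a values file.
    print(json.dumps(example, indent=2))
```

Only the keys you override need to appear; anything omitted keeps the default listed in the table.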
Step 4: Deploy with Raw Manifests
Helm is the recommended deployment method. Use raw manifests only for advanced customization.
Step 5: Verify Deployment
Testing
Create a test deployment to verify the operator:
test-deployment.yaml
Lifecycle Management
Upgrade
Rollback
Uninstall
Production Considerations
High Availability
Increase replicas for server and operator:
Resource Limits
Adjust based on load:
values.yaml
Database
Use managed PostgreSQL:
- AWS RDS
- Google Cloud SQL
- Azure Database for PostgreSQL
Session Store
Use a persistent session backend:
Avoid the memory backend in production.
TLS Certificates
cert-manager auto-provisions TLS certificates for:
- Server ingress
- User deployment ingresses
Monitoring
Enable Prometheus metrics:
Requires Prometheus Operator.
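The high-availability and session-store guidance in this section can be checked mechanically before a release. The sketch below is an illustration under stated assumptions: `production_warnings` is a hypothetical helper, the threshold of two replicas is an assumed reading of "increase replicas", and the defaults mirror the Key Helm Values table.

```python
from typing import Any, List, Mapping


def production_warnings(values: Mapping[str, Any]) -> List[str]:
    """Flag nested Helm values that conflict with the production guidance."""
    warnings: List[str] = []
    server = values.get("server", {})
    operator = values.get("operator", {})
    session = values.get("session", {})

    # Defaults follow the Key Helm Values table (1 replica each).
    if server.get("replicas", 1) < 2:
        warnings.append("server.replicas is below 2")
    if operator.get("replicas", 1) < 2:
        warnings.append("operator.replicas is below 2")
    elif not operator.get("leaderElect", True):
        # Assumption: several operator replicas should use leader election.
        warnings.append(
            "operator.leaderElect is disabled with multiple operator replicas"
        )
    # The memory backend should be avoided in production.
    if session.get("backend", "postgres") == "memory":
        warnings.append("session.backend is memory")
    return warnings
```

An empty list means none of these particular checks fired; it says nothing about resource limits, TLS, or monitoring, which still need a manual review.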
Troubleshooting
Server pod won't start
Symptoms: Server pod stuck in Init:0/1 or CrashLoopBackOff
Causes:
- Vault secrets missing or incorrect
- Database connection failed
- Vault Agent can’t authenticate
Operator not creating resources
Symptoms: UserDeployment created but no pods/services appear
Causes:
- RBAC permissions missing
- Operator not running
- CRD not installed
Ingress not getting external IP
Symptoms: Ingress created but no ALB provisioned
Causes:
- AWS Load Balancer Controller not running
- Incorrect ingress annotations
- IAM permissions missing
External Secrets not syncing
Symptoms: ExternalSecret shows SecretSyncedError
Causes:
- ClusterSecretStore misconfigured
- Vault auth role missing
- Secret path doesn’t exist in Vault
Next Steps
- Configuration - Configure environment variables and settings
- Dependencies - Learn about platform dependencies