Deployment Guide

This guide covers deploying k8s-scheduler to a production Kubernetes cluster using Helm or raw manifests.

Overview

k8s-scheduler consists of two main components:

Server: Go backend serving REST API and React SPA
Operator: Kubernetes operator managing UserDeployment custom resources

Deployment Flow

Infrastructure Layers

Layer	Components
1-infrastructure/	VPC, EKS, RDS PostgreSQL, Vault, Tailscale
2-platform/	ALB Controller, cert-manager, External Secrets Operator
3-apps/	Traefik, external-dns, ArgoCD, Vault secrets

See the opsnorth/infra repository for full Terraform configuration.

Prerequisites

Before deploying k8s-scheduler, ensure the following components are installed:

Required Components

PostgreSQL - Database for users, orgs, teams, deployments
AWS ALB Controller - Creates ALB from Ingress resources (on AWS)
Traefik - Routes wildcard traffic to user deployments
External DNS - Auto-creates DNS records from ingress
cert-manager - TLS certificates for server and deployments
Cloudflare - DNS provider for External DNS and cert-manager

Recommended Components

HashiCorp Vault - Centralized secrets management
External Secrets Operator - Syncs Vault secrets to K8s Secrets
Vault Agent Injector - Injects secrets into server pod

Optional Components

Stripe - Subscription billing
SendGrid/SMTP - Team invitation emails

Verify all components are running:

# AWS Load Balancer Controller
kubectl get pods -n kube-system | grep aws-load-balancer

# External Secrets Operator
kubectl get pods -n external-secrets

# Vault + Agent Injector
kubectl get pods -n vault

# Traefik
kubectl get pods -n traefik

# cert-manager
kubectl get pods -n cert-manager

# external-dns
kubectl get pods -n external-dns
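
The per-namespace checks above can be folded into a single pass. The following is a minimal sketch, not part of the repository: `not_running_pods` and `check_prerequisites` are hypothetical helper names, and the namespace list is taken from the commands above. Note that it flags every unhealthy pod in kube-system, not only the ALB controller, and that an empty namespace passes silently.

```shell
#!/bin/sh
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# and prints "<pod> <status>" for every pod that is not Running or Completed.
not_running_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Hypothetical wrapper: checks each prerequisite namespace used in this guide.
# Returns non-zero if any namespace contains an unhealthy pod.
check_prerequisites() {
  rc=0
  for ns in kube-system external-secrets vault traefik cert-manager external-dns; do
    bad=$(kubectl get pods -n "$ns" --no-headers 2>/dev/null | not_running_pods)
    if [ -n "$bad" ]; then
      echo "namespace $ns has unhealthy pods:"
      echo "$bad"
      rc=1
    fi
  done
  return "$rc"
}
```

After sourcing the file, run `check_prerequisites`; a zero exit status means no pod in those namespaces reported a status other than Running or Completed.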

Step 1: Configure Terraform

Clone infrastructure repository

git clone https://github.com/opsnorth/infra.git
cd infra

Copy environment template

cp .env.example .env

The .env file holds all secrets and credentials and is gitignored.

Configure secrets

Edit .env with your credentials:

.env

# Tailscale VPN
export TF_VAR_tailscale_auth_key="tskey-auth-..."

# Cloudflare DNS
export TF_VAR_cloudflare_api_token="..."

# GitHub App (for ArgoCD)
export TF_VAR_github_app_id="..."
export TF_VAR_github_app_installation_id="..."
export TF_VAR_github_app_private_key_file="~/.github/github-app.pem"

# Google OAuth
export TF_VAR_google_client_id="your-client-id.apps.googleusercontent.com"
export TF_VAR_google_client_secret="your-client-secret"
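
Before deploying, it can help to confirm that none of these variables was left unset. The sketch below assumes the variable names shown in the template above are the complete required set, which may not hold for your copy of the repository; `missing_tf_vars` is a hypothetical helper name. It only detects unset or empty variables, so a placeholder such as "..." still counts as set.

```shell
#!/bin/sh
# Hypothetical pre-flight check: prints the name of every variable from the
# template above that is unset or empty, and returns non-zero if any is missing.
missing_tf_vars() {
  rc=0
  for name in \
    TF_VAR_tailscale_auth_key \
    TF_VAR_cloudflare_api_token \
    TF_VAR_github_app_id \
    TF_VAR_github_app_installation_id \
    TF_VAR_github_app_private_key_file \
    TF_VAR_google_client_id \
    TF_VAR_google_client_secret
  do
    # Indirect lookup via eval; names come only from the fixed list above.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name"
      rc=1
    fi
  done
  return "$rc"
}
```

One way to use it is `. ./.env && missing_tf_vars`, which prints nothing and exits zero when every variable has a value.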

Deploy infrastructure

./deploy.sh

This script sources .env and applies Terraform for all three layers.

Step 2: Setup Vault Policy

This is a one-time setup. Required before deploying the application.

The setup script creates:

Vault policy k8s-scheduler - grants access to user secret paths
Kubernetes auth role k8s-scheduler - binds policy to service account

./scripts/setup-vault.sh

Prerequisite: before running the script, create ~/.vault-secrets/vault.env containing your Vault token:

~/.vault-secrets/vault.env

VAULT_TOKEN=hvs.your-vault-root-token

Vault Secret Paths

Terraform writes these paths automatically:

Path	Keys	Required?
`secret/k8s-scheduler/database`	`connection_string`	Yes
`secret/k8s-scheduler/google`	`client_id`, `client_secret`	Yes (unless DEV_MODE)
`secret/k8s-scheduler/email`	`provider`, `smtp_host`, `smtp_port`, `smtp_user`, `smtp_password`, `smtp_from`	No
`secret/k8s-scheduler/ai`	`anthropic_api_key`	No
`secret/k8s-scheduler/stripe`	`api_key`, `webhook_secret`	No
`secret/k8s-scheduler/secrets`	`keycloak_admin_password`, `grafana_cloud_prometheus_password`, `grafana_cloud_loki_password`	No

Every path must exist in Vault, even if its keys are empty; the Vault Agent template fails if a path is missing. To create empty placeholders for optional paths such as ai and stripe:

vault kv put secret/k8s-scheduler/ai anthropic_api_key=""
vault kv put secret/k8s-scheduler/stripe api_key="" webhook_secret=""
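
Running `vault kv put` against a path that already holds values would overwrite them with empty strings. The sketch below guards against that by writing placeholders only when the path cannot be read. `ensure_vault_path` is a hypothetical helper, not something the repository ships, and it treats any `vault kv get` failure (including an authentication error) as "path missing", so check your Vault login first.

```shell
#!/bin/sh
# Hypothetical helper: writes empty placeholder keys to a Vault KV path only
# when the path cannot be read. Usage: ensure_vault_path <path> <key>...
ensure_vault_path() {
  path=$1
  shift
  if vault kv get "$path" >/dev/null 2>&1; then
    echo "exists: $path"
    return 0
  fi
  # Rotate the argument list: append "key=" for each key and drop the bare
  # key name, so "$@" ends up as key1= key2= ... (empty values).
  for key in "$@"; do
    set -- "$@" "$key="
    shift
  done
  vault kv put "$path" "$@"
}
```

For example, `ensure_vault_path secret/k8s-scheduler/stripe api_key webhook_secret` leaves an existing Stripe secret untouched and creates an empty one otherwise.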

Step 3: Deploy with Helm

Install

helm install k8s-scheduler ./charts/k8s-scheduler \
  -n scheduler-system --create-namespace \
  --set domain=yourdomain.com

The Helm chart deploys:

Custom Resource Definitions (UserDeployment, AgentTask, Workflow)
RBAC (ServiceAccounts, ClusterRole, ClusterRoleBinding)
Server deployment with Vault Agent sidecar
Operator deployment
ClusterSecretStore for External Secrets Operator
Ingress (ALB) for the server
ConfigMaps for templates

Helm Values

View all available configuration options:

helm show values ./charts/k8s-scheduler

Key Helm Values

Value	Description	Default
`domain`	Required. Base domain for the app	`example.com`
`image.server.repository`	Server container image	`ghcr.io/opsnorth/k8s-scheduler-server`
`image.operator.repository`	Operator container image	`ghcr.io/opsnorth/k8s-scheduler-operator`
`image.server.tag`	Server image tag	`latest`
`server.replicas`	Server pod count	`1`
`operator.replicas`	Operator pod count	`1`
`operator.leaderElect`	Enable leader election for HA	`true`
`ingress.enabled`	Create ALB Ingress	`true`
`ingress.className`	Ingress class	`alb`
`vault.agentInject`	Enable Vault Agent sidecar	`true`
`vault.address`	Vault server URL	`http://vault.vault.svc.cluster.local:8200`
`secretStore.enabled`	Create ClusterSecretStore	`true`
`secretStore.name`	ClusterSecretStore name	`vault-backend`
`session.backend`	Session storage backend	`postgres`

Step 4: Deploy with Raw Manifests

Helm is the recommended deployment method. Use raw manifests only for advanced customization.

kubectl apply -k manifests/

Manifests are organized by component:

manifests/
├── namespace.yaml
├── crds/
│   ├── userdeployment-crd.yaml
│   ├── agenttask-crd.yaml
│   └── workflow-crd.yaml
├── rbac/
│   ├── service-account.yaml
│   ├── server-service-account.yaml
│   ├── cluster-role.yaml
│   └── cluster-role-binding.yaml
├── configmaps/
│   ├── server-config.yaml
│   └── deployment-templates.yaml
├── secrets/
│   └── cluster-secret-store.yaml
├── deployments/
│   ├── server.yaml
│   └── operator.yaml
└── kustomization.yaml

Step 5: Verify Deployment

Check pod status

kubectl get pods -n scheduler-system

Expected output:

NAME                                      READY   STATUS    RESTARTS   AGE
k8s-scheduler-operator-xxxxx-xxxxx        1/1     Running   0          30s
k8s-scheduler-server-xxxxx-xxxxx          2/2     Running   0          30s

The server pod shows 2/2 because Vault Agent runs as a sidecar.
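
Instead of re-running `kubectl get pods` until both pods are ready, a script can block on the rollout. This is a sketch under one assumption: the Deployment names `k8s-scheduler-server` and `k8s-scheduler-operator` are inferred from the pod name prefixes in the expected output above and may differ in your release. `wait_for_scheduler` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: blocks until both Deployments finish rolling out.
# Deployment names are inferred from the pod name prefixes shown above.
# Usage: wait_for_scheduler [namespace] [timeout]
wait_for_scheduler() {
  ns=${1:-scheduler-system}
  timeout=${2:-120s}
  for deploy in k8s-scheduler-server k8s-scheduler-operator; do
    kubectl rollout status "deployment/$deploy" -n "$ns" --timeout="$timeout" || return 1
  done
}
```

Calling `wait_for_scheduler` with no arguments waits up to two minutes per Deployment in the scheduler-system namespace and returns non-zero if either rollout does not complete.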

Check server logs

kubectl logs -n scheduler-system -l app=k8s-scheduler-server -c server --tail=50

Check operator logs

kubectl logs -n scheduler-system -l app=k8s-scheduler-operator --tail=50

Access the application

The application is available at:

https://app.<your-domain>

Testing

Create a test deployment to verify the operator:

test-deployment.yaml

apiVersion: scheduler.opsnorth.io/v1alpha1
kind: UserDeployment
metadata:
  name: test-deployment
  namespace: scheduler-system
spec:
  userId: "test-user"
  template: "nginx"
  tier: "free"
  desiredState: "running"

# Create
kubectl apply -f test-deployment.yaml

# Watch status
kubectl get userdeployment test-deployment -n scheduler-system -w

# Cleanup
kubectl delete userdeployment test-deployment -n scheduler-system

Lifecycle Management

Upgrade

helm upgrade k8s-scheduler ./charts/k8s-scheduler \
  -n scheduler-system \
  --set domain=yourdomain.com

Rollback

# List revisions
helm history k8s-scheduler -n scheduler-system

# Rollback to specific revision
helm rollback k8s-scheduler <revision> -n scheduler-system

Uninstall

# Remove Helm release + CRDs + namespace
./scripts/uninstall.sh

# Full teardown including Vault policy
./scripts/uninstall.sh --all

Production Considerations

High Availability

Increase replicas for the server and operator by adding these flags to helm install or helm upgrade:

--set server.replicas=3 \
--set operator.replicas=2 \
--set operator.leaderElect=true
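
The flags above are fragments, so here is how they might look as a complete command. The sketch simply appends them to the upgrade command from the Lifecycle Management section; `upgrade_ha` is a hypothetical wrapper name. With leader election enabled, extra operator replicas typically act as standbys rather than sharing work.

```shell
#!/bin/sh
# Hypothetical wrapper: runs the upgrade command from this guide with the
# high-availability flags appended. Usage: upgrade_ha <domain>
upgrade_ha() {
  domain=${1:?usage: upgrade_ha <domain>}
  helm upgrade k8s-scheduler ./charts/k8s-scheduler \
    -n scheduler-system \
    --set domain="$domain" \
    --set server.replicas=3 \
    --set operator.replicas=2 \
    --set operator.leaderElect=true
}
```

For example, `upgrade_ha yourdomain.com` upgrades the existing release with three server replicas and two operator replicas.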

Resource Limits

Adjust based on load:

values.yaml

server:
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: 2000m
      memory: 2Gi

Database

Use a managed PostgreSQL service, for example:

AWS RDS
Google Cloud SQL
Azure Database for PostgreSQL

Enable SSL connections in production.
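
One common way to require SSL is the `sslmode` connection parameter. The sketch below assumes the `connection_string` stored at `secret/k8s-scheduler/database` is a PostgreSQL URL (`postgres://...`) and that the server's database driver honors `sslmode`; neither is confirmed by this guide. `with_sslmode` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: appends sslmode=require to a PostgreSQL URL unless an
# sslmode parameter is already present. Usage: with_sslmode <url>
with_sslmode() {
  dsn=$1
  case $dsn in
    *sslmode=*) printf '%s\n' "$dsn" ;;                    # already set: leave as-is
    *\?*)       printf '%s\n' "$dsn&sslmode=require" ;;    # has other query parameters
    *)          printf '%s\n' "$dsn?sslmode=require" ;;    # no query string yet
  esac
}
```

The result can then be written to Vault, for example with `vault kv put secret/k8s-scheduler/database connection_string="$(with_sslmode "$DSN")"`, where `DSN` is a variable you set yourself. Stricter modes such as `verify-full` also validate the server certificate.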

Session Store

Use a persistent session backend:

--set session.backend=postgres
# or
--set session.backend=redis

Avoid the memory backend in production.

TLS Certificates

cert-manager auto-provisions TLS certificates for:

Server ingress
User deployment ingresses

Configure a DNS01 challenge solver with Cloudflare.
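
A DNS01 solver is declared on a cert-manager issuer. The sketch below prints a ClusterIssuer manifest using cert-manager's Cloudflare API token solver. Everything named in it is an assumption for illustration: the issuer name, the Let's Encrypt production endpoint as the ACME server, and the Secret `cloudflare-api-token` with key `api-token`. The infra repository may already create an equivalent issuer, so check before applying. `render_cluster_issuer` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: prints a cert-manager ClusterIssuer that solves ACME
# DNS01 challenges through Cloudflare. Usage: render_cluster_issuer <email>
render_cluster_issuer() {
  email=${1:?usage: render_cluster_issuer <email>}
  cat <<EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod                   # assumed issuer name
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory   # assumed ACME CA
    email: ${email}
    privateKeySecretRef:
      name: letsencrypt-prod-account-key   # assumed Secret for the ACME account key
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token   # assumed Secret in the cert-manager namespace
              key: api-token               # assumed key inside that Secret
EOF
}
```

To apply it, pipe the output to kubectl: `render_cluster_issuer you@example.com | kubectl apply -f -`.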

Monitoring

Enable Prometheus metrics:

--set monitoring.enabled=true

Requires Prometheus Operator.

Troubleshooting

Server pod won't start

Symptoms: Server pod stuck in Init:0/1 or CrashLoopBackOff

Causes:

Vault secrets missing or incorrect
Database connection failed
Vault Agent can’t authenticate

Solutions:

# Check Vault Agent logs
kubectl logs -n scheduler-system -l app=k8s-scheduler-server -c vault-agent

# Verify Vault secrets exist
vault kv get secret/k8s-scheduler/database
vault kv get secret/k8s-scheduler/google

# Check database connectivity (DATABASE_DSN must be set in your local shell)
kubectl run -it --rm psql --image=postgres:15 -- psql "$DATABASE_DSN"

Operator not creating resources

Symptoms: UserDeployment created but no pods/services appear

Causes:

RBAC permissions missing
Operator not running
CRD not installed

Solutions:

# Check operator logs
kubectl logs -n scheduler-system -l app=k8s-scheduler-operator

# Verify CRD exists
kubectl get crd userdeployments.scheduler.opsnorth.io

# Check ClusterRole
kubectl get clusterrole k8s-scheduler-operator

Ingress not getting external IP

Symptoms: Ingress created but no ALB provisioned

Causes:

AWS Load Balancer Controller not running
Incorrect ingress annotations
IAM permissions missing

Solutions:

# Check ALB controller logs
kubectl logs -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller

# Verify ingress
kubectl describe ingress -n scheduler-system k8s-scheduler-server

# Check ALB controller IAM role
kubectl get sa -n kube-system aws-load-balancer-controller -o yaml

External Secrets not syncing

Symptoms: ExternalSecret shows SecretSyncedError

Causes:

ClusterSecretStore misconfigured
Vault auth role missing
Secret path doesn’t exist in Vault

Solutions:

# Check ClusterSecretStore
kubectl get clustersecretstore vault-backend -o yaml

# Verify Vault auth role
vault read auth/kubernetes/role/k8s-scheduler

# Check ESO logs
kubectl logs -n external-secrets -l app.kubernetes.io/name=external-secrets

Next Steps

Configuration

Dependencies
