Scaling Dimensions
Horizontal Scaling
Process Count: Number of concurrent processes running
Ideal for handling more requests and providing redundancy
Vertical Scaling
CPU: CPU allocation in units (1000 = 1 full CPU)
Memory: RAM allocation in megabytes
GPU: Number of GPUs per process
Ideal for resource-intensive operations
Initial Scale Configuration
Define default scale settings in your `convox.yml`:
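A minimal sketch of such defaults, assuming a hypothetical `web` service (values are illustrative):

```yaml
services:
  web:
    build: .
    port: 3000
    scale:
      count: 2      # processes on first deployment
      cpu: 500      # half a CPU core (1000 = 1 full CPU)
      memory: 512   # MB of RAM per process
```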
Static count values are only used on first deployment. Subsequent count changes must be made via the CLI to prevent configuration drift.
Scale Attributes
| Attribute | Unit | Description |
|---|---|---|
| count | number | Number of processes (1-n) |
| cpu | CPU units | 1000 units = 1 full CPU core |
| memory | MB | RAM allocation in megabytes |
| gpu | number | GPUs allocated per process (requires GPU-enabled rack) |
Manual Scaling
Manually adjust your service scale using the Convox CLI.
Check Current Scale
View the current scale for all services.
Scale Process Count
Change the number of processes for a service.
Scale CPU and Memory
To change CPU or memory allocation, update your `convox.yml` and deploy.
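The steps above might look like the following (hypothetical service name `web`; check `convox scale --help` for the exact flags on your CLI version):

```shell
# View the current scale of every service in the app
convox scale

# Change the process count for the web service
convox scale web --count 3

# After editing cpu/memory in convox.yml, roll out the change
convox deploy
```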
Autoscaling
Configure autoscaling to automatically adjust process count based on resource utilization.
Basic Autoscaling
Define a scale range and target metrics.
Autoscaling Configuration
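A configuration matching the attributes below might look like this (hypothetical `web` service):

```yaml
services:
  web:
    scale:
      count: 2-10     # autoscale between 2 and 10 processes
      targets:
        cpu: 70       # target 70% CPU utilization
        memory: 80    # target 80% memory utilization
```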
| Attribute | Example | Description |
|---|---|---|
| count | 2-10 | Min-Max range for process count |
| targets.cpu | 70 | Target CPU utilization percentage |
| targets.memory | 80 | Target memory utilization percentage |
How Autoscaling Works
Scaling Formula
Kubernetes uses this formula to calculate the desired replica count:
desiredReplicas = ceil[currentReplicas × (currentMetricValue / targetMetricValue)]
For example:
- Current replicas: 4
- Current CPU utilization: 85%
- Target CPU utilization: 70%
- Calculation: ceil[4 × (85 / 70)] = ceil[4.86] = 5 replicas
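The formula can be sketched in Python (a minimal illustration of the arithmetic, not Convox code):

```python
import math

def desired_replicas(current: int, current_metric: float, target_metric: float) -> int:
    """Kubernetes HPA formula: ceil[current * (currentMetric / targetMetric)]."""
    return math.ceil(current * (current_metric / target_metric))

# The worked example above: 4 replicas at 85% CPU with a 70% target
print(desired_replicas(4, 85, 70))  # 5
```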
Multiple Targets
You can set both CPU and memory targets.
GPU Scaling
For machine learning, video processing, and scientific computing workloads, Convox supports GPU allocation.
Prerequisites
Use GPU-capable instances
Your rack must run on GPU-enabled instances:
- AWS: p3, p4, g4, g5 families
- GCP: N1 with GPU accelerators
- Azure: NC, ND, NV series
Basic GPU Configuration
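A sketch of a GPU service, assuming a hypothetical `trainer` service on a GPU-enabled rack:

```yaml
services:
  trainer:
    build: .
    scale:
      count: 1
      gpu: 1        # one whole GPU per process
      memory: 8192
```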
GPU with Autoscaling
Combine GPU allocation with autoscaling.
GPU Considerations
Whole Units Only
GPUs are allocated as whole units. You cannot request fractional GPUs.
Node Affinity
Processes requesting GPUs will only be scheduled on nodes with available GPUs.
CUDA Support
Use base images that include the NVIDIA CUDA toolkit (e.g., `nvidia/cuda:11.8.0-runtime-ubuntu22.04`).
Resource Costs
GPU instances are significantly more expensive than standard instances.
Example GPU Configuration
- Runs 2 processes
- Each process gets 2 GPUs
- Total GPU requirement: 4 GPUs
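The configuration described by the bullets above, sketched for a hypothetical `inference` service:

```yaml
services:
  inference:
    scale:
      count: 2   # 2 processes...
      gpu: 2     # ...each with 2 GPUs = 4 GPUs total
```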
Advanced Autoscaling
Custom Metrics with Datadog
Autoscale based on custom metrics from Datadog.
Prerequisites
Install Datadog
Install Datadog and the Datadog Cluster Agent on your rack. Use `datadog-agent-all-features.yaml` for the complete setup.
Configure external metrics
Follow Datadog’s documentation to configure the external metrics provider.
Create DatadogMetric
Define a custom metric.
Configure Service
Reference the metric in your `convox.yml`.
Deploy Datadog Service
Create a service for your app to send metrics.
Test Autoscaling
Generate load to trigger scaling.
Scaling Strategies
High Availability
For mission-critical services, maintain minimum redundancy:
- Always at least 3 instances running
- Survives multiple instance failures
- Rolling updates maintain capacity
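A sketch of such a high-availability range (hypothetical `web` service; values are illustrative):

```yaml
services:
  web:
    scale:
      count: 3-10   # never fewer than 3 processes
      targets:
        cpu: 70
```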
Cost Optimization
For development or low-traffic environments:
- Minimal resource usage when idle
- Aggressive scaling targets reduce costs
- Suitable for non-critical workloads
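A cost-optimized sketch (hypothetical `worker` service; a high CPU target delays scale-up, keeping the count near the minimum):

```yaml
services:
  worker:
    scale:
      count: 1-3    # small range, scales to 1 when idle
      targets:
        cpu: 85     # aggressive target: only scale when heavily loaded
```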
Burst Capacity
For services with unpredictable traffic spikes:
- Low baseline for normal traffic
- Can scale rapidly for traffic spikes
- Conservative targets trigger scaling early
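A burst-capacity sketch (hypothetical `api` service; a low CPU target triggers scale-up early, before a spike saturates existing processes):

```yaml
services:
  api:
    scale:
      count: 2-20   # low baseline, wide headroom for spikes
      targets:
        cpu: 50     # conservative target: scale early
```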
Best Practices
Start Conservative
Begin with lower resource allocations and scale up based on actual usage patterns.
Monitor Performance
Use monitoring tools to understand your application’s resource utilization over time.
Test Scaling
Load test your application to verify autoscaling behavior before production.
Set Appropriate Ranges
Define min/max count ranges that prevent both resource starvation and excessive costs.
Use Multiple Services
Separate different workload types into different services with independent scaling.
Plan for Failure
Always run at least 2 processes for critical services to ensure availability.
Consider Costs
Balance performance needs with infrastructure costs, especially for GPU instances.
Document Decisions
Record why you chose specific scale settings for future reference.
Troubleshooting
Processes Not Scaling
Symptoms: Process count doesn’t change despite high load
Possible causes:
- Autoscaling not configured
- Insufficient rack resources
- Metrics not being collected
Solutions:
- Verify autoscaling config in `convox.yml`
- Check rack capacity: `convox rack`
- Review Kubernetes HPA status: `kubectl get hpa -n convox-myapp`
Autoscaling Too Aggressive
Symptoms: Constantly scaling up and down
Possible causes:
- Targets set too low
- Application has bursty workload patterns
- Not enough processes at minimum
Solutions:
- Increase target percentages (e.g., 70 → 80)
- Increase minimum process count
- Smooth out workload with queues
Out of Memory Errors
Symptoms: Processes crash with OOM errors
Possible causes:
- Memory allocation too low
- Memory leaks in application
- Unexpected traffic spikes
Solutions:
- Increase memory allocation in `convox.yml`
- Profile application for memory leaks
- Implement better autoscaling targets
- Add memory-based autoscaling
GPU Not Available
Symptoms: Processes requesting GPU stuck in pending
Possible causes:
- No GPU nodes available
- NVIDIA device plugin not installed
- All GPUs already allocated
Solutions:
- Verify GPU instances in rack
- Check NVIDIA plugin: `kubectl get pods -n kube-system | grep nvidia`
- Review node resources: `kubectl describe nodes`
- Scale down other GPU processes
Next Steps
Health Checks
Configure health monitoring for scaled services
Rolling Updates
Understand how scaling affects deployments
Monitoring
Set up comprehensive monitoring and alerting
Load Balancing
Learn how traffic is distributed across processes