Overview
The server-management skill teaches server management principles for production operations. It covers process management, monitoring strategy, log management, scaling decisions, and troubleshooting procedures.
What This Skill Provides
- Process Management: PM2, systemd, Docker, Kubernetes selection
- Monitoring Principles: What to monitor and alert strategies
- Log Management: Logging strategy and rotation
- Scaling Decisions: When and how to scale
- Health Checks: Implementing service health endpoints
- Security Principles: Server security best practices
- Troubleshooting: Systematic problem diagnosis
Philosophy
Learn to THINK, not memorize commands. Server management is about understanding principles and making informed decisions based on your specific context.
Process Management Principles
Tool Selection
| Scenario | Tool |
|---|---|
| Node.js app | PM2 (clustering, reload) |
| Any app | systemd (Linux native) |
| Containers | Docker/Podman |
| Orchestration | Kubernetes, Docker Swarm |
Process Management Goals
| Goal | What It Means |
|---|---|
| Restart on crash | Auto-recovery |
| Zero-downtime reload | No service interruption |
| Clustering | Use all CPU cores |
| Persistence | Survive server reboot |
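To make the "restart on crash" goal concrete, here is a minimal sketch of the loop a process manager runs on your behalf. It is not how PM2 or systemd are implemented, and in production you would use one of those tools rather than this. The names `supervise`, `max_restarts`, and `backoff_seconds` are hypothetical, chosen only for illustration.

```python
import subprocess
import time


def supervise(command, max_restarts=3, backoff_seconds=0.5):
    """Run `command` and restart it whenever it exits with a non-zero code.

    Returns the exit code of every attempt, in order.
    """
    exit_codes = []
    for attempt in range(max_restarts + 1):
        exit_code = subprocess.run(command).returncode
        exit_codes.append(exit_code)
        if exit_code == 0:
            break  # clean exit: nothing to recover from
        if attempt < max_restarts:
            # Wait longer after each crash so a broken service does not spin in a tight loop.
            time.sleep(backoff_seconds * (2 ** attempt))
    return exit_codes
```

Calling `supervise(["node", "server.js"])` (an assumed entry point) would start the app and bring it back up to three times after a crash. The growing delay is one common way to avoid hammering a dependency that is itself down.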
Monitoring Principles
What to Monitor
| Category | Key Metrics |
|---|---|
| Availability | Uptime, health checks |
| Performance | Response time, throughput |
| Errors | Error rate, types |
| Resources | CPU, memory, disk |
Alert Severity Strategy
| Level | Response |
|---|---|
| Critical | Immediate action |
| Warning | Investigate soon |
| Info | Review daily |
Monitoring Tool Selection
| Need | Options |
|---|---|
| Simple/Free | PM2 metrics, htop |
| Full observability | Grafana, Datadog |
| Error tracking | Sentry |
| Uptime | UptimeRobot, Pingdom |
Use Cases
When to Use This Skill
- Setting up server infrastructure
- Implementing monitoring and alerting
- Troubleshooting server issues
- Planning scaling strategies
- Securing server environments
- Managing logs and processes
Example Scenarios
- Setup: “Configure PM2 for a Node.js application”
- Monitoring: “Set up monitoring for this production server”
- Scaling: “Server CPU is at 90%, should I scale?”
- Troubleshooting: “Service keeps crashing, help diagnose”
Log Management Principles
Log Strategy
| Log Type | Purpose |
|---|---|
| Application logs | Debug, audit |
| Access logs | Traffic analysis |
| Error logs | Issue detection |
Log Principles
- Rotate logs to prevent disk fill
- Structured logging (JSON) for parsing
- Appropriate levels (error/warn/info/debug)
- No sensitive data in logs
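The four log principles above can be sketched together with Python's standard `logging` module. This is one possible setup, not a prescribed one: the logger name `app`, the `context` field, the size limits, and the `SENSITIVE_KEYS` list are all assumptions to adapt to your application.

```python
import json
import logging
from logging.handlers import RotatingFileHandler

# Assumed list of keys that must never reach the log file; extend for your app.
SENSITIVE_KEYS = {"password", "token", "secret"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line so tools can parse it."""

    def format(self, record):
        # Structured fields arrive via `extra={"context": {...}}` on the log call.
        context = getattr(record, "context", {})
        redacted = {
            key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
            for key, value in context.items()
        }
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            "context": redacted,
        })


def build_logger(path, max_bytes=10_000_000, backup_count=5):
    """Create a logger that writes JSON lines and rotates the file at `max_bytes`."""
    logger = logging.getLogger("app")
    logger.setLevel(logging.INFO)  # debug messages are dropped at this level
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backup_count)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]  # replace handlers so repeated calls do not duplicate output
    logger.propagate = False
    return logger
```

With this in place, `logger.info("login", extra={"context": {"user": "alice", "password": "..."}})` writes a single JSON line in which the password value is replaced by `[REDACTED]`.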
Scaling Decisions
When to Scale
| Symptom | Solution |
|---|---|
| High CPU | Add instances (horizontal) |
| High memory | Increase RAM or fix leak |
| Slow response | Profile first, then scale |
| Traffic spikes | Auto-scaling |
Scaling Strategy
| Type | When to Use |
|---|---|
| Vertical | Quick fix, single instance |
| Horizontal | Sustainable, distributed |
| Auto | Variable traffic |
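For the auto-scaling row, the sketch below shows one widely used proportional rule: multiply the current instance count by the ratio of observed to target utilization and round up. This is the same shape of formula the Kubernetes Horizontal Pod Autoscaler documents, swapped in here as an illustration; the source does not prescribe it. The 60% target and the instance limits are assumptions.

```python
import math


def desired_instances(current_instances, current_cpu_percent,
                      target_cpu_percent=60.0, min_instances=1, max_instances=10):
    """Return how many instances would bring average CPU back toward the target."""
    if current_instances < 1:
        raise ValueError("current_instances must be at least 1")
    if target_cpu_percent <= 0:
        raise ValueError("target_cpu_percent must be positive")
    ratio = current_cpu_percent / target_cpu_percent
    wanted = math.ceil(current_instances * ratio)
    # Clamp so a bad reading cannot scale to zero or run up an unbounded bill.
    return max(min_instances, min(max_instances, wanted))
```

For the "CPU is at 90%" scenario with two instances and a 60% target, `desired_instances(2, 90)` returns 3. Per the table above, a slow response should still be profiled first, since adding instances does not fix an inefficient code path.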
Health Check Principles
What Constitutes Healthy
| Check | Meaning |
|---|---|
| HTTP 200 | Service responding |
| Database connected | Data accessible |
| Dependencies OK | External services reachable |
| Resources OK | CPU/memory not exhausted |
Health Check Implementation
- Simple: Just return 200
- Deep: Check all dependencies
- Choose based on load balancer needs
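The difference between the simple and deep variants can be sketched as two small functions. The probe names and the choice of HTTP 503 for an unhealthy result are assumptions; check what status codes your load balancer expects.

```python
from typing import Callable, Dict, Tuple


def simple_health() -> Tuple[int, dict]:
    """Simple check: the process is up and able to answer."""
    return 200, {"status": "ok"}


def deep_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Deep check: run every dependency probe and report each result."""
    results = {}
    for name, probe in checks.items():
        try:
            results[name] = bool(probe())
        except Exception:
            results[name] = False  # a probe that raises counts as unhealthy
    healthy = all(results.values())
    status_code = 200 if healthy else 503
    return status_code, {"status": "ok" if healthy else "degraded", "checks": results}
```

A caller would pass its own probes, for example `deep_health({"database": ping_db, "payments_api": ping_payments})`, where both probe functions are hypothetical. A deep check costs more per request and can take an instance out of rotation when a shared dependency fails, which is why the choice depends on how the load balancer uses the result.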
Security Principles
| Area | Principle |
|---|---|
| Access | SSH keys only, no passwords |
| Firewall | Only needed ports open |
| Updates | Regular security patches |
| Secrets | Environment vars, not files |
| Audit | Log access and changes |
Troubleshooting Priority
When something’s wrong, work through these checks in order:
- Check if running (process status)
- Check logs (error messages)
- Check resources (disk, memory, CPU)
- Check network (ports, DNS)
- Check dependencies (database, APIs)
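Part of the checklist above can be automated with the standard library. The sketch below covers only the resource and network steps, plus a runner that stops at the first failing check; process status and log inspection depend on your process manager and are left out. All function names and the default thresholds are assumptions.

```python
import shutil
import socket


def check_disk(path="/", max_used_percent=85.0):
    """Resources: is disk usage under the limit?"""
    usage = shutil.disk_usage(path)
    used_percent = usage.used / usage.total * 100
    return used_percent < max_used_percent, f"{used_percent:.1f}% used on {path}"


def check_port(host, port, timeout=2.0):
    """Network: does anything accept TCP connections on host:port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, f"{host}:{port} accepts connections"
    except OSError as exc:
        return False, f"{host}:{port} unreachable: {exc}"


def check_dns(hostname):
    """Network: does the hostname resolve?"""
    try:
        socket.getaddrinfo(hostname, None)
        return True, f"{hostname} resolves"
    except socket.gaierror as exc:
        return False, f"{hostname} does not resolve: {exc}"


def first_failure(checks):
    """Run (name, check) pairs in priority order; return the first failure or None."""
    for name, check in checks:
        ok, detail = check()
        if not ok:
            return name, detail
    return None
```

Ordering the list passed to `first_failure` the same way as the checklist keeps the cheap, most likely causes at the front, so the first reported failure is usually the one to fix first.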
Anti-Patterns to Avoid
| ❌ Don’t | ✅ Do |
|---|---|
| Run as root | Use non-root user |
| Ignore logs | Set up log rotation |
| Skip monitoring | Monitor from day one |
| Manual restarts | Auto-restart config |
| No backups | Regular backup schedule |
Process Management Examples
PM2 (Node.js)
systemd (Linux)
Docker
Monitoring Setup
Basic Health Endpoint
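A minimal sketch of a health endpoint using only Python's standard library follows. The original example for this heading is not available, so this is an illustration rather than the skill's own code; the `/health` path, the JSON body, and port 8080 are assumptions.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answer GET /health with 200 and a small JSON body; everything else is 404."""

    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        body = json.dumps({"status": "ok"}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request output in this sketch; use a real access log in production


def make_server(host="127.0.0.1", port=8080):
    """Build the server without starting it, so the caller controls its lifetime."""
    return HTTPServer((host, port), HealthHandler)
```

Running `make_server().serve_forever()` starts the endpoint; an uptime monitor or load balancer can then poll `http://127.0.0.1:8080/health` and treat anything other than 200 as a failure.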
Related Skills
- deployment-procedures: Deployment workflows
- bash-linux: Linux server commands
- powershell-windows: Windows server management
- clean-code: Service code quality
Which Agents Use This Skill
- devops-engineer: Primary user for all server operations
Best Practices
- Automate everything possible
- Monitor from day one
- Secure by default
- Document your setup
- Test failure scenarios
- Backup regularly
Resource Monitoring
Key Metrics
| Metric | Threshold | Action |
|---|---|---|
| CPU | >80% sustained | Investigate/scale |
| Memory | >90% | Check for leaks |
| Disk | >85% | Clean/expand |
| Network | Saturated | Check traffic |
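The thresholds in the table can be sketched as a small evaluation function. The table does not define "sustained", so treating it as five consecutive samples above the limit is an assumption, as are the function names. Network saturation is omitted because the table gives no numeric threshold for it.

```python
# Percent thresholds and actions taken from the table above.
THRESHOLDS = {"cpu": 80.0, "memory": 90.0, "disk": 85.0}
ACTIONS = {"cpu": "Investigate/scale", "memory": "Check for leaks", "disk": "Clean/expand"}


def sustained_above(samples, threshold, min_samples=5):
    """True only if the last `min_samples` readings all exceed the threshold."""
    recent = samples[-min_samples:]
    return len(recent) == min_samples and all(value > threshold for value in recent)


def evaluate(metrics):
    """Map {"cpu": [...], "memory": [...], "disk": [...]} to the actions that apply."""
    actions = {}
    for name, threshold in THRESHOLDS.items():
        samples = metrics.get(name, [])
        if name == "cpu":
            # A single CPU spike is normal; only a sustained breach counts.
            breached = sustained_above(samples, threshold)
        else:
            breached = bool(samples) and samples[-1] > threshold
        if breached:
            actions[name] = ACTIONS[name]
    return actions
```

Feeding it recent percent readings, for example `evaluate({"cpu": [95, 40, 95, 95, 95], "memory": [50, 92], "disk": [60]})`, flags only memory: the CPU series dipped below the limit, so it is not treated as sustained.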
Tools Available
- Read, Write, Edit: For config files
- Glob, Grep: For searching logs
- Bash: For server commands
Remember: A well-managed server is boring. That’s the goal.
