
Overview

The server-management skill teaches server management principles for production operations. It covers process management, monitoring strategy, log management, scaling decisions, and troubleshooting procedures.

What This Skill Provides

  • Process Management: PM2, systemd, Docker, Kubernetes selection
  • Monitoring Principles: What to monitor and alert strategies
  • Log Management: Logging strategy and rotation
  • Scaling Decisions: When and how to scale
  • Health Checks: Implementing service health endpoints
  • Security Principles: Server security best practices
  • Troubleshooting: Systematic problem diagnosis

Philosophy

Learn to THINK, not memorize commands. Server management is about understanding principles and making informed decisions based on your specific context.

Process Management Principles

Tool Selection

Scenario | Tool
Node.js app | PM2 (clustering, reload)
Any app | systemd (Linux native)
Containers | Docker/Podman
Orchestration | Kubernetes, Docker Swarm

Process Management Goals

Goal | What It Means
Restart on crash | Auto-recovery
Zero-downtime reload | No service interruption
Clustering | Use all CPU cores
Persistence | Survive server reboot

Monitoring Principles

What to Monitor

Category | Key Metrics
Availability | Uptime, health checks
Performance | Response time, throughput
Errors | Error rate, types
Resources | CPU, memory, disk

Alert Severity Strategy

Level | Response
Critical | Immediate action
Warning | Investigate soon
Info | Review daily

Monitoring Tool Selection

Need | Options
Simple/Free | PM2 metrics, htop
Full observability | Grafana, Datadog
Error tracking | Sentry
Uptime | UptimeRobot, Pingdom

Use Cases

When to Use This Skill

  • Setting up server infrastructure
  • Implementing monitoring and alerting
  • Troubleshooting server issues
  • Planning scaling strategies
  • Securing server environments
  • Managing logs and processes

Example Scenarios

  1. Setup: “Configure PM2 for a Node.js application”
  2. Monitoring: “Set up monitoring for this production server”
  3. Scaling: “Server CPU is at 90%, should I scale?”
  4. Troubleshooting: “Service keeps crashing, help diagnose”

Log Management Principles

Log Strategy

Log Type | Purpose
Application logs | Debug, audit
Access logs | Traffic analysis
Error logs | Issue detection

Log Principles

  1. Rotate logs to prevent disk fill
  2. Structured logging (JSON) for parsing
  3. Appropriate levels (error/warn/info/debug)
  4. No sensitive data in logs
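
Principles 2 to 4 can be combined in a small logger. The sketch below is one possible shape, not a replacement for an established logging library: the field names, the `LOG_LEVEL` variable and the list of sensitive keys are all assumptions made for illustration, and redaction is shallow (top-level keys only).

```javascript
// Minimal structured logger sketch; key names and the redaction list are assumptions.
const LEVELS = { error: 0, warn: 1, info: 2, debug: 3 };
const SENSITIVE_KEYS = new Set(['password', 'token', 'authorization', 'secret']);

// Replace the values of sensitive top-level keys so they never reach the log output.
function redact(fields) {
  const clean = {};
  for (const [key, value] of Object.entries(fields)) {
    clean[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? '[REDACTED]' : value;
  }
  return clean;
}

// Build one JSON log line, or return null if the level is more verbose than minLevel.
function formatLog(level, message, fields = {}, minLevel = 'info') {
  if (LEVELS[level] > LEVELS[minLevel]) return null;
  return JSON.stringify({
    ...redact(fields),
    time: new Date().toISOString(),
    level,
    message,
  });
}

// Write to stdout and leave storage and rotation to the process manager or log shipper.
function log(level, message, fields) {
  const line = formatLog(level, message, fields, process.env.LOG_LEVEL || 'info');
  if (line !== null) console.log(line);
}

log('info', 'user login', { userId: 42, password: 'hunter2' });
```

Writing one JSON object per line keeps the output easy to parse with standard tools, and logging to stdout means rotation (principle 1) can be handled outside the application.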

Scaling Decisions

When to Scale

Symptom | Solution
High CPU | Add instances (horizontal)
High memory | Increase RAM or fix leak
Slow response | Profile first, then scale
Traffic spikes | Auto-scaling

Scaling Strategy

Type | When to Use
Vertical | Quick fix, single instance
Horizontal | Sustainable, distributed
Auto | Variable traffic
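
A scaling decision should rest on a sustained symptom rather than a single spike. The sketch below encodes the symptom table as a hypothetical helper: the function names, the five-sample window and the 80%/90% thresholds are illustrative assumptions, and real decisions need context that a function like this cannot see.

```javascript
// Hypothetical helper: window size and thresholds are illustrative, not prescriptive.

// True only if the most recent `minSamples` readings all exceed the threshold.
function isSustained(samples, threshold, minSamples = 5) {
  if (samples.length < minSamples) return false;
  return samples.slice(-minSamples).every((value) => value > threshold);
}

// Map observed symptoms (percent readings over time) to the suggested responses.
function scalingAdvice({ cpuSamples = [], memorySamples = [], slowResponses = false } = {}) {
  const advice = [];
  if (isSustained(cpuSamples, 80)) advice.push('Add instances (horizontal)');
  if (isSustained(memorySamples, 90)) advice.push('Increase RAM or fix leak');
  if (slowResponses) advice.push('Profile first, then scale');
  return advice;
}

// A brief spike does not trigger advice; five consecutive high readings do.
console.log(scalingAdvice({ cpuSamples: [20, 20, 95, 20, 20] }));
console.log(scalingAdvice({ cpuSamples: [85, 90, 95, 88, 91] }));
```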

Health Check Principles

What Constitutes Healthy

Check | Meaning
HTTP 200 | Service responding
Database connected | Data accessible
Dependencies OK | External services reachable
Resources OK | CPU/memory not exhausted

Health Check Implementation

  • Simple: Just return 200
  • Deep: Check all dependencies
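  • Choose based on what the load balancer needs: a simple check is cheap enough to poll frequently, while a deep check catches a broken dependency but costs more per probe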
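
A deep check can be sketched as a set of named dependency probes, each guarded by a timeout so that one hung dependency cannot stall the whole endpoint. The function names, the result shape and the 2-second default below are assumptions chosen for illustration.

```javascript
// Sketch of a deep health check; names, result shape and timeout are assumptions.

// Reject if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// `checks` maps a dependency name to a function that throws or rejects on failure.
async function runHealthChecks(checks, timeoutMs = 2000) {
  const results = {};
  await Promise.all(
    Object.entries(checks).map(async ([name, check]) => {
      try {
        // Promise.resolve().then(...) also converts synchronous throws into rejections.
        await withTimeout(Promise.resolve().then(() => check()), timeoutMs);
        results[name] = 'ok';
      } catch (error) {
        results[name] = `error: ${error.message}`;
      }
    })
  );
  const healthy = Object.values(results).every((status) => status === 'ok');
  return { status: healthy ? 'ok' : 'error', httpStatus: healthy ? 200 : 503, checks: results };
}
```

Inside a request handler, the result could be returned with `res.status(result.httpStatus).json(result)`. If the endpoint is publicly reachable, consider omitting the error messages from the response so it does not reveal internal details.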

Security Principles

Area | Principle
Access | SSH keys only, no passwords
Firewall | Only needed ports open
Updates | Regular security patches
Secrets | Environment vars, not files
Audit | Log access and changes
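
For the secrets principle, one common pattern is to read required values from the environment at startup and fail fast if any are missing. The sketch below assumes this approach; the function name `loadSecrets` and the variable names in the usage comment are hypothetical.

```javascript
// Sketch: read required secrets from the environment and fail fast if any are missing.
// `env` defaults to process.env and is a parameter only so the function is easy to test.
function loadSecrets(names, env = process.env) {
  const missing = names.filter((name) => !env[name]);
  if (missing.length > 0) {
    // Report variable names only; never print secret values.
    throw new Error(`Missing required environment variables: ${missing.join(', ')}`);
  }
  return Object.fromEntries(names.map((name) => [name, env[name]]));
}

// Hypothetical usage at startup:
// const { DATABASE_URL, API_KEY } = loadSecrets(['DATABASE_URL', 'API_KEY']);
```

Failing at startup surfaces a misconfigured deployment immediately instead of at the first request that needs the secret.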

Troubleshooting Priority

When something’s wrong:
  1. Check if running (process status)
  2. Check logs (error messages)
  3. Check resources (disk, memory, CPU)
  4. Check network (ports, DNS)
  5. Check dependencies (database, APIs)
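
The ordering matters because each layer depends on the ones before it, so there is little value in testing the database connection if the process is not running. A minimal sketch of that idea follows: `diagnose` runs caller-supplied checks in order and stops at the first failure, and `checkPort` is one example check built on Node's `net` module. Both function names are hypothetical.

```javascript
const net = require('node:net');

// Run checks in priority order and stop at the first failing layer.
// Each step is { name, check }, where `check` throws or rejects on failure.
async function diagnose(steps) {
  for (const { name, check } of steps) {
    try {
      await check();
    } catch (error) {
      return { ok: false, failedStep: name, reason: error.message };
    }
  }
  return { ok: true };
}

// Example check for step 4: resolves if a TCP connection to host:port succeeds.
function checkPort(host, port, timeoutMs = 1000) {
  return new Promise((resolve, reject) => {
    const socket = net.connect({ host, port });
    socket.setTimeout(timeoutMs);
    socket.once('connect', () => { socket.destroy(); resolve(); });
    socket.once('timeout', () => {
      socket.destroy();
      reject(new Error(`connection to ${host}:${port} timed out`));
    });
    socket.once('error', reject);
  });
}

// Hypothetical usage; the host and port are placeholders:
// diagnose([{ name: 'network', check: () => checkPort('127.0.0.1', 3000) }]).then(console.log);
```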

Anti-Patterns to Avoid

❌ Don’t | ✅ Do
Run as root | Use non-root user
Ignore logs | Review logs and set up log rotation
Skip monitoring | Monitor from day one
Manual restarts | Auto-restart config
No backups | Regular backup schedule

Process Management Examples

PM2 (Node.js)

# Start with clustering
pm2 start app.js -i max

# Zero-downtime reload
pm2 reload app

# Monitor
pm2 monit

systemd (Linux)

# Start service
systemctl start myapp

# Enable on boot
systemctl enable myapp

# Check status
systemctl status myapp

Docker

# Run with a restart policy; --name lets later commands refer to the container
docker run -d --name myapp --restart=always myapp

# View logs
docker logs -f myapp

Monitoring Setup

Basic Health Endpoint

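const express = require('express');
const app = express();

// `db` is assumed to be your database client; replace ping() with its connectivity check.
app.get('/health', async (req, res) => {
  try {
    await db.ping(); // rejects if the database is unreachable
    res.status(200).json({ status: 'ok' });
  } catch (error) {
    // 503 tells load balancers to stop routing traffic to this instance
    res.status(503).json({ status: 'error' });
  }
});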

Related Skills

  • deployment-procedures: Deployment workflows
  • bash-linux: Linux server commands
  • powershell-windows: Windows server management
  • clean-code: Service code quality

Which Agents Use This Skill

  • devops-engineer: Primary user for all server operations

Best Practices

  1. Automate everything possible
  2. Monitor from day one
  3. Secure by default
  4. Document your setup
  5. Test failure scenarios
  6. Backup regularly

Resource Monitoring

Key Metrics

Metric | Threshold | Action
CPU | >80% sustained | Investigate/scale
Memory | >90% | Check for leaks
Disk | >85% | Clean/expand
Network | Saturated | Check traffic
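
These thresholds can be checked programmatically. The sketch below compares percentage readings against the table's values and takes rough point-in-time CPU and memory readings from Node's `os` module. The function names are hypothetical, and disk usage is left out of `readSystem` because reading it portably needs more than this sketch covers; pass `diskPercent` from another source if you have one. A single reading also says nothing about whether a value is sustained.

```javascript
const os = require('node:os');

// Thresholds taken from the table above; tune them for your workload.
const THRESHOLDS = { cpuPercent: 80, memoryPercent: 90, diskPercent: 85 };

// Compare percentage readings against thresholds and return the suggested actions.
// Readings that are missing (undefined) are skipped.
function evaluate(readings, thresholds = THRESHOLDS) {
  const actions = [];
  if (readings.cpuPercent > thresholds.cpuPercent) actions.push('CPU: investigate/scale');
  if (readings.memoryPercent > thresholds.memoryPercent) actions.push('Memory: check for leaks');
  if (readings.diskPercent > thresholds.diskPercent) actions.push('Disk: clean/expand');
  return actions;
}

// Rough point-in-time readings; a real monitor would sample over time.
function readSystem() {
  const cores = Math.max(os.cpus().length, 1);
  // The 1-minute load average per core approximates CPU pressure (always 0 on Windows).
  const cpuPercent = (os.loadavg()[0] / cores) * 100;
  const memoryPercent = (1 - os.freemem() / os.totalmem()) * 100;
  return { cpuPercent, memoryPercent };
}

console.log(evaluate(readSystem()));
```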

Tools Available

  • Read, Write, Edit: For config files
  • Glob, Grep: For searching logs
  • Bash: For server commands

Remember: A well-managed server is boring. That’s the goal.
