Server Management - Antigravity Kit

Overview

The server-management skill teaches server management principles for production operations. It covers process management, monitoring strategy, log management, scaling decisions, and troubleshooting procedures.

What This Skill Provides

Process Management: PM2, systemd, Docker, Kubernetes selection
Monitoring Principles: What to monitor and how to alert
Log Management: Logging strategy and rotation
Scaling Decisions: When and how to scale
Health Checks: Implementing service health endpoints
Security Principles: Server security best practices
Troubleshooting: Systematic problem diagnosis

Philosophy

Learn to THINK, not memorize commands. Server management is about understanding principles and making informed decisions based on your specific context.

Process Management Principles

Tool Selection

Scenario	Tool
Node.js app	PM2 (clustering, reload)
Any app	systemd (Linux native)
Containers	Docker/Podman
Orchestration	Kubernetes, Docker Swarm

Process Management Goals

Goal	What It Means
Restart on crash	Auto-recovery
Zero-downtime reload	No service interruption
Clustering	Use all CPU cores
Persistence	Survive server reboot

Monitoring Principles

What to Monitor

Category	Key Metrics
Availability	Uptime, health checks
Performance	Response time, throughput
Errors	Error rate, types
Resources	CPU, memory, disk

Alert Severity Strategy

Level	Response
Critical	Immediate action
Warning	Investigate soon
Info	Review daily

Monitoring Tool Selection

Need	Options
Simple/Free	PM2 metrics, htop
Full observability	Grafana, Datadog
Error tracking	Sentry
Uptime	UptimeRobot, Pingdom

Use Cases

When to Use This Skill

Setting up server infrastructure
Implementing monitoring and alerting
Troubleshooting server issues
Planning scaling strategies
Securing server environments
Managing logs and processes

Example Scenarios

Setup: “Configure PM2 for a Node.js application”
Monitoring: “Set up monitoring for this production server”
Scaling: “Server CPU is at 90%, should I scale?”
Troubleshooting: “Service keeps crashing, help diagnose”

Log Management Principles

Log Strategy

Log Type	Purpose
Application logs	Debug, audit
Access logs	Traffic analysis
Error logs	Issue detection

Log Principles

Rotate logs to prevent disk fill
Structured logging (JSON) for parsing
Appropriate levels (error/warn/info/debug)
No sensitive data in logs
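
Three of these principles (structured output, levels, no sensitive data) can be sketched in a few lines of plain Node.js. The function names and the list of sensitive keys below are assumptions for illustration; a real service would more likely use an established logging library. Rotation is left to the process manager or the host and is not shown here.

```javascript
// Keys whose values should never reach the log output (illustrative list).
const SENSITIVE_KEYS = new Set(['password', 'token', 'authorization', 'secret']);

// Lower number = more severe; used to filter by a minimum level.
const LEVELS = { error: 0, warn: 1, info: 2, debug: 3 };

// Replace sensitive top-level values; nested objects are not inspected.
function redact(fields) {
  const safe = {};
  for (const [key, value] of Object.entries(fields)) {
    safe[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? '[REDACTED]' : value;
  }
  return safe;
}

// Returns one JSON line, or null when the entry is below the minimum level.
function formatLog(level, message, fields = {}, minLevel = 'info') {
  if (LEVELS[level] > LEVELS[minLevel]) return null;
  return JSON.stringify({
    ...redact(fields), // spread first so core keys below cannot be overwritten
    time: new Date().toISOString(),
    level,
    message,
  });
}

const line = formatLog('info', 'user login', { userId: 42, password: 'hunter2' });
console.log(line);
```

One JSON object per line keeps the output easy to parse with standard tools, and redacting at the formatting step means callers cannot forget to do it.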

Scaling Decisions

When to Scale

Symptom	Solution
High CPU	Add instances (horizontal)
High memory	Increase RAM or fix leak
Slow response	Profile first, then scale
Traffic spikes	Auto-scaling

Scaling Strategy

Type	When to Use
Vertical	Quick fix, single instance
Horizontal	Sustainable, distributed
Auto	Variable traffic

Health Check Principles

What Constitutes Healthy

Check	Meaning
HTTP 200	Service responding
Database connected	Data accessible
Dependencies OK	External services reachable
Resources OK	CPU/memory not exhausted

Health Check Implementation

Simple: Just return 200
Deep: Check all dependencies
Choose based on load balancer needs
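
The difference between the two styles can be sketched as follows. This is an illustration under assumptions: `simpleHealth`, `deepHealth`, and `withTimeout` are hypothetical helper names, and each dependency check is assumed to be an async function that throws on failure.

```javascript
// Wrap a promise so a hung dependency cannot stall the health endpoint.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Simple check: the process is up and able to answer.
function simpleHealth() {
  return { httpStatus: 200, status: 'ok' };
}

// Deep check: `checks` maps a dependency name to an async function
// that throws when the dependency is unavailable.
async function deepHealth(checks, timeoutMs = 2000) {
  const results = {};
  let healthy = true;
  await Promise.all(
    Object.entries(checks).map(async ([name, check]) => {
      try {
        await withTimeout(check(), timeoutMs);
        results[name] = 'ok';
      } catch (error) {
        healthy = false;
        results[name] = `error: ${error.message}`;
      }
    })
  );
  return {
    httpStatus: healthy ? 200 : 503,
    status: healthy ? 'ok' : 'error',
    checks: results,
  };
}
```

A load balancer that only needs to know whether to route traffic can poll the simple variant cheaply. The deep variant is more useful for readiness decisions, at the cost of extra load on each dependency per probe.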

Security Principles

Area	Principle
Access	SSH keys only, no passwords
Firewall	Only needed ports open
Updates	Regular security patches
Secrets	Environment variables, not hardcoded values or committed files
Audit	Log access and changes
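
For the secrets row, one common pattern is to read required values from the environment at startup and fail fast when one is missing. A minimal sketch, assuming a hypothetical helper named `requireEnv` and a hypothetical variable name in the usage comment:

```javascript
// Read a required setting from the environment and fail fast when it is missing.
function requireEnv(name, env = process.env) {
  const value = env[name];
  if (value === undefined || value === '') {
    // Report the variable name only; never echo secret values.
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Usage (hypothetical variable name):
//   const databaseUrl = requireEnv('DATABASE_URL');
```

Failing at startup surfaces a misconfigured deployment immediately, rather than at the first request that happens to need the secret.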

Troubleshooting Priority

When something’s wrong:

Check if running (process status)
Check logs (error messages)
Check resources (disk, memory, CPU)
Check network (ports, DNS)
Check dependencies (database, APIs)

Anti-Patterns to Avoid

❌ Don’t	✅ Do
Run as root	Use non-root user
Ignore logs	Review logs and set up log rotation
Skip monitoring	Monitor from day one
Manual restarts	Auto-restart config
No backups	Regular backup schedule

Process Management Examples

PM2 (Node.js)

# Start with clustering
pm2 start app.js -i max

# Zero-downtime reload
pm2 reload app

# Monitor
pm2 monit
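
A reload is only zero-downtime if the application lets in-flight requests finish before it exits. PM2's documentation describes sending SIGINT on stop or reload, followed by SIGKILL after a configurable timeout, so the app can listen for that signal. The sketch below assumes a hypothetical helper named `installGracefulShutdown`; the 1500 ms fallback is an illustrative value, and `exit` is injectable only to make the helper testable.

```javascript
// Close the server on SIGINT/SIGTERM so in-flight requests can complete.
function installGracefulShutdown(server, { timeoutMs = 1500, exit = process.exit } = {}) {
  const shutdown = () => {
    // Stop accepting new connections; exit once existing ones are done.
    server.close(() => exit(0));
    // Fallback: force an exit if connections do not drain in time.
    // unref() keeps this timer from holding the process open on its own.
    setTimeout(() => exit(1), timeoutMs).unref();
  };
  process.once('SIGINT', shutdown);
  process.once('SIGTERM', shutdown);
  return shutdown;
}

// Usage (sketch; `handler` is a placeholder request handler):
//   const http = require('node:http');
//   const server = http.createServer(handler).listen(3000);
//   installGracefulShutdown(server);
```

Handling SIGTERM as well means the same code behaves sensibly under systemd and Docker, which signal processes with SIGTERM by default.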

systemd (Linux)

# Start service
systemctl start myapp

# Enable on boot
systemctl enable myapp

# Check status
systemctl status myapp

Docker

# Run with restart policy (name the container so later commands can refer to it)
docker run -d --name myapp --restart=always myapp

# View logs
docker logs -f myapp

Monitoring Setup

Basic Health Endpoint

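// Assumes Express, and a `db` client that is already connected and exposes ping()
const express = require('express');
const app = express();

// Return 200 when the database answers, 503 otherwise
app.get('/health', async (req, res) => {
  try {
    await db.ping();
    res.status(200).json({ status: 'ok' });
  } catch (error) {
    res.status(503).json({ status: 'error' });
  }
});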

Related Skills

deployment-procedures: Deployment workflows
bash-linux: Linux server commands
powershell-windows: Windows server management
clean-code: Service code quality

Which Agents Use This Skill

devops-engineer: Primary user for all server operations

Best Practices

Automate everything possible
Monitor from day one
Secure by default
Document your setup
Test failure scenarios
Backup regularly

Resource Monitoring

Key Metrics

Metric	Threshold	Action
CPU	>80% sustained	Investigate/scale
Memory	>90%	Check for leaks
Disk	>85%	Clean/expand
Network	Saturated	Check traffic
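
The thresholds in this table can be turned into a small check. The sketch below is illustrative: the function name `evaluate` is hypothetical, readings are assumed to be percentages gathered elsewhere, and "sustained" is interpreted as every sample in a recent window exceeding the threshold, which is one reasonable reading rather than a fixed definition.

```javascript
// Thresholds taken from the table above (percent).
const THRESHOLDS = { cpu: 80, memory: 90, disk: 85 };

// `cpuSamples` is a list of recent CPU readings. "Sustained" is read here as
// every sample in the window exceeding the threshold (an assumption).
function evaluate({ cpuSamples, memory, disk }) {
  const actions = [];
  const cpuSustained =
    cpuSamples.length > 0 && cpuSamples.every((s) => s > THRESHOLDS.cpu);
  if (cpuSustained) actions.push('cpu: investigate or scale');
  if (memory > THRESHOLDS.memory) actions.push('memory: check for leaks');
  if (disk > THRESHOLDS.disk) actions.push('disk: clean up or expand');
  return actions;
}

console.log(evaluate({ cpuSamples: [85, 91, 88], memory: 72, disk: 90 }));
```

Requiring the whole window to exceed the CPU threshold avoids alerting on a single short spike.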

Tools Available

Read, Write, Edit: For config files
Glob, Grep: For searching logs
Bash: For server commands

Remember: A well-managed server is boring. That’s the goal.
