Monitoring & Observability

Overview

Duckling exposes health checks, system metrics, and query metrics over HTTP, and runs automated health monitoring with auto-restart.

Health Checks

Health Endpoint

Basic health check for database connectivity:

curl "http://localhost:3001/health?db=your-database-id" \
  -H "Authorization: Bearer ${DUCKLING_API_KEY}"

Response (Healthy):

{
  "status": "healthy",
  "timestamp": "2026-03-01T10:30:00.000Z",
  "duckdb": "connected",
  "mysql": "connected",
  "uptime": 86400
}

Response (Unhealthy):

{
  "status": "unhealthy",
  "timestamp": "2026-03-01T10:30:00.000Z",
  "duckdb": "connected",
  "mysql": "error",
  "error": "Connection timeout"
}
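
A client can use the per-component fields to report which dependency is failing. The following is a minimal sketch, assuming the response shape shown above; `failingComponents` and `describeHealth` are hypothetical helper names, not part of Duckling.

```javascript
// Hypothetical helper: list the components that are not "connected"
// in a /health response shaped like the examples above.
function failingComponents(health) {
  const components = ['duckdb', 'mysql'];
  return components.filter((name) => health[name] !== 'connected');
}

// Summarize a health response as a single log-friendly line.
function describeHealth(health) {
  const failing = failingComponents(health);
  if (health.status === 'healthy' && failing.length === 0) {
    return 'healthy';
  }
  const reason = health.error ? ` (${health.error})` : '';
  return `unhealthy: ${failing.join(', ') || 'unknown component'}${reason}`;
}

const sample = {
  status: 'unhealthy',
  duckdb: 'connected',
  mysql: 'error',
  error: 'Connection timeout'
};
console.log(describeHealth(sample)); // unhealthy: mysql (Connection timeout)
```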

CLI Health Check

docker exec duckling-server node packages/server/dist/cli.js health

Status Endpoint

Detailed system status with table counts and metrics:

curl "http://localhost:3001/status?db=your-database-id" \
  -H "Authorization: Bearer ${DUCKLING_API_KEY}"

Response:

{
  "status": "healthy",
  "timestamp": "2026-03-01T10:30:00.000Z",
  "databases": {
    "duckdb": "connected",
    "mysql": "connected"
  },
  "tables": {
    "total": 42,
    "synced": 42
  },
  "sync": {
    "lastSync": "2026-03-01T10:15:00.000Z",
    "nextSync": "2026-03-01T10:30:00.000Z",
    "mode": "incremental"
  },
  "automation": {
    "sync": true,
    "backup": true,
    "cleanup": true,
    "healthMonitoring": true
  }
}

Automatic Health Monitoring

Auto-Restart Service

Duckling includes automatic health monitoring and recovery:

Health Check Interval: Every 60 seconds
Auto-Restart: Enabled by default (AUTO_RESTART=true)
Max Restart Attempts: 3 attempts (MAX_RESTART_ATTEMPTS=3)
Recovery Strategy: Exponential backoff

The auto-restart service monitors DuckDB and MySQL connections and attempts to recover from failures automatically, up to the configured maximum number of attempts.

Health Monitoring Checks

The system performs three health checks every 60 seconds:

DuckDB Health

Executes SELECT 1 to verify DuckDB connectivity

MySQL Health

Tests MySQL connection pool

Sync Health

Verifies sync has run within the last 30 minutes

Recovery Process

When a health check fails:

Attempt 1: Test connections, trigger recovery sync
Wait: Exponential backoff (2s, 4s, 8s, max 60s)
Attempt 2: Retry connection tests and sync
Wait: Longer backoff
Attempt 3: Final retry attempt
Failure: Log critical error, manual intervention required

After 3 failed recovery attempts, the service stops attempting automatic recovery. Check logs and investigate the root cause.
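
The retry schedule above can be sketched as follows. This is an illustration under stated assumptions, not Duckling's actual implementation: the 2-second base, doubling factor, and 60-second cap are read off the list above, and `recover` is a hypothetical async function that resolves to `true` when recovery succeeds.

```javascript
// Assumed schedule: 2s after the first failed attempt, doubling each time, capped at 60s.
const BASE_DELAY_MS = 2000;
const MAX_DELAY_MS = 60000;

// attempt is 1-based: 1 -> 2000, 2 -> 4000, 3 -> 8000, ...
function backoffDelayMs(attempt) {
  return Math.min(BASE_DELAY_MS * 2 ** (attempt - 1), MAX_DELAY_MS);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Try `recover` up to maxAttempts times, waiting between failed attempts.
// `wait` is injectable so the loop can be exercised without real delays.
async function recoverWithBackoff(recover, maxAttempts = 3, wait = sleep) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      if (await recover(attempt)) {
        return { recovered: true, attempts: attempt };
      }
    } catch (err) {
      // A thrown error is treated like a failed attempt.
    }
    if (attempt < maxAttempts) {
      await wait(backoffDelayMs(attempt));
    }
  }
  // All attempts failed: the caller should log a critical error.
  return { recovered: false, attempts: maxAttempts };
}
```

With `maxAttempts = 3`, only the first two delays (2s and 4s) are ever used; the cap matters only if the attempt limit is raised.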

Disabling Auto-Restart

To disable automatic recovery:

AUTO_RESTART=false

System Metrics

Metrics Endpoint

Get comprehensive system and query metrics:

curl "http://localhost:3001/metrics?db=your-database-id" \
  -H "Authorization: Bearer ${DUCKLING_API_KEY}"

Response:

{
  "system": {
    "current": {
      "cpuPercent": 12.3,
      "rssMB": 456.7,
      "heapUsedMB": 234.5,
      "hostFreeMemMB": 8192.0,
      "hostTotalMemMB": 16384.0,
      "eventLoopLagMs": 2
    },
    "history": [
      {
        "ts": "2026-03-01T10:29:30.000Z",
        "cpuPercent": 11.8,
        "rssMB": 453.2,
        "eventLoopLagMs": 1
      }
    ]
  },
  "queries": {
    "active": [
      {
        "id": "abc123",
        "sql": "SELECT COUNT(*) FROM User WHERE createdAt > ?",
        "startedAt": "2026-03-01T10:30:00.000Z",
        "runningSec": 0.5,
        "databaseId": "lms"
      }
    ],
    "totalExecuted": 12543,
    "patterns": [
      {
        "pattern": "SELECT COUNT(*) FROM User WHERE createdAt > ?",
        "count": 234,
        "avgMs": 45,
        "minMs": 12,
        "maxMs": 234,
        "lastRun": "2026-03-01T10:30:00.000Z"
      }
    ]
  }
}

System Metrics

System metrics are collected every 30 seconds:

Metric	Description	Unit
`cpuPercent`	Node.js process CPU usage	Percentage
`rssMB`	Resident Set Size (total memory)	MB
`heapUsedMB`	V8 heap memory usage	MB
`hostFreeMemMB`	Available system memory	MB
`hostTotalMemMB`	Total system memory	MB
`eventLoopLagMs`	Event loop delay	Milliseconds

Metric History:

Retention: Last 61 samples (30.5 minutes)
Sample Interval: 30 seconds
Buffer: Rolling window, oldest samples dropped

The system metrics service starts automatically with the server. Historical data is kept in-memory for lightweight monitoring.
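
The rolling window described above behaves like a fixed-capacity buffer. The sketch below illustrates the idea with a plain array; the class name `RollingWindow` is hypothetical, and the actual service may store samples differently.

```javascript
// Fixed-capacity rolling window: once full, each push drops the oldest sample.
class RollingWindow {
  constructor(capacity) {
    this.capacity = capacity;
    this.samples = [];
  }

  push(sample) {
    this.samples.push(sample);
    if (this.samples.length > this.capacity) {
      this.samples.shift(); // discard the oldest sample
    }
  }

  toArray() {
    return [...this.samples];
  }
}

// Simulate 70 samples taken 30 seconds apart with a 61-sample window.
const history = new RollingWindow(61);
for (let i = 0; i < 70; i++) {
  history.push({ ts: i * 30, cpuPercent: 10 });
}
console.log(history.toArray().length); // 61
console.log(history.toArray()[0].ts);  // 270 (the first nine samples were dropped)
```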

Query Metrics

Query metrics track all DuckDB queries.

Active Queries:

Currently executing queries
SQL statement (truncated to 200 chars)
Start timestamp
Running duration
Database ID

Query Patterns:

Normalized SQL (literals replaced with ?)
Execution count
Average/min/max duration
Last execution timestamp
Top 100 patterns by frequency

Pattern Normalization:

-- Original queries:
SELECT * FROM User WHERE id = 123
SELECT * FROM User WHERE id = 456

-- Normalized pattern:
SELECT * FROM User WHERE id = ?

Query patterns use LRU (Least Recently Used) eviction with a maximum of 1,000 patterns tracked. This prevents memory growth on databases with high query variety.
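
To make the normalization and eviction concrete, here is a simplified sketch. It assumes a regex-based normalizer and a `Map`-backed LRU; `normalizeSql` and `PatternStats` are hypothetical names, and Duckling's own normalizer may handle more SQL syntax than this.

```javascript
// Replace string and numeric literals with "?" so similar queries share a pattern.
// A regex is a simplification; it does not understand comments or quoted identifiers.
function normalizeSql(sql) {
  return sql
    .replace(/'(?:[^']|'')*'/g, '?')     // single-quoted strings, including '' escapes
    .replace(/\b\d+(?:\.\d+)?\b/g, '?')  // integer and decimal literals
    .replace(/\s+/g, ' ')
    .trim();
}

class PatternStats {
  constructor(maxPatterns = 1000) {
    this.maxPatterns = maxPatterns;
    // Map preserves insertion order, so the first key is the least recently used.
    this.patterns = new Map();
  }

  record(sql, durationMs) {
    const key = normalizeSql(sql);
    const prev = this.patterns.get(key) ||
      { count: 0, totalMs: 0, minMs: Infinity, maxMs: 0 };
    const next = {
      count: prev.count + 1,
      totalMs: prev.totalMs + durationMs,
      minMs: Math.min(prev.minMs, durationMs),
      maxMs: Math.max(prev.maxMs, durationMs)
    };
    this.patterns.delete(key); // re-insert to mark as most recently used
    this.patterns.set(key, next);
    if (this.patterns.size > this.maxPatterns) {
      const oldest = this.patterns.keys().next().value;
      this.patterns.delete(oldest);
    }
    return key;
  }
}

console.log(normalizeSql('SELECT * FROM User WHERE id = 123'));
// SELECT * FROM User WHERE id = ?
```

Deleting and re-inserting a key on every update is what keeps the `Map` in recency order, so eviction only needs to remove the first key.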

Automation Status

Automation Endpoint

Get status of all automation services:

curl "http://localhost:3001/api/automation/status?db=your-database-id" \
  -H "Authorization: Bearer ${DUCKLING_API_KEY}"

Response:

{
  "isRunning": true,
  "autoCleanup": {
    "enabled": true,
    "intervalHours": 24,
    "retentionDays": 90
  },
  "autoBackup": {
    "enabled": true,
    "intervalHours": 24,
    "retentionDays": 7
  },
  "s3Backup": {
    "scheduled": true,
    "intervalHours": 24,
    "retentionDays": 30
  },
  "autoRestart": {
    "enabled": true,
    "restartAttempts": 0,
    "maxAttempts": 3,
    "lastSuccessfulSync": "2026-03-01T10:15:00.000Z"
  },
  "sync": {
    "enabled": true,
    "intervalMinutes": 15
  }
}

Automation Configuration

Variable	Default	Description
`AUTO_START_SYNC`	`true`	Enable automatic sync
`AUTO_CLEANUP`	`true`	Enable automatic cleanup
`AUTO_BACKUP`	`true`	Enable automatic backups
`AUTO_RESTART`	`true`	Enable health monitoring and auto-restart
`MAX_RESTART_ATTEMPTS`	`3`	Maximum recovery attempts

Log Monitoring

Server Logs

View real-time server logs:

# All logs
docker-compose logs -f duckdb-server

# Filter for sync operations
docker-compose logs -f duckdb-server | grep -i sync

# Filter for errors
docker-compose logs -f duckdb-server | grep -i error

# Filter for health checks
docker-compose logs -f duckdb-server | grep -i health

Log Levels

Configure log verbosity with LOG_LEVEL:

LOG_LEVEL=debug  # Verbose debugging
LOG_LEVEL=info   # Standard operations (default)
LOG_LEVEL=warn   # Warnings only
LOG_LEVEL=error  # Errors only

Structured Logging

Duckling uses Winston for structured logging:

{
  "level": "info",
  "message": "Incremental sync completed",
  "timestamp": "2026-03-01T10:15:00.000Z",
  "databaseId": "lms",
  "tables": 42,
  "records": 1523,
  "duration": 2340
}

Alerting

Health Check Monitoring

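Integrate health checks with monitoring tools.

Prometheus (note that /health returns JSON, as shown above, rather than the Prometheus text exposition format, so a scrape job like this one typically needs an exporter or HTTP probe in between):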

scrape_configs:
  - job_name: 'duckling'
    metrics_path: '/health'
    static_configs:
      - targets: ['localhost:3001']

Uptime Monitoring:

# Configure external monitoring
curl -f "http://localhost:3001/health?db=your-database-id" \
  -H "Authorization: Bearer ${DUCKLING_API_KEY}" || exit 1

Critical Alerts

Monitor these conditions:

Health endpoint returns unhealthy (5xx status)
Sync hasn’t run in 30+ minutes (check automation status)
Restart attempts >= 3 (check automation status)
Event loop lag > 100ms (check system metrics)
Memory usage > 80% (check system metrics)

Set up external monitoring for production deployments. The auto-restart service provides recovery but not alerting.
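
The conditions listed above can be evaluated from the health, automation status, and metrics responses. The following is a sketch only: `evaluateAlerts` is a hypothetical function, the thresholds are copied from the list, and "memory usage" is assumed to mean host memory in use, derived from `hostFreeMemMB` and `hostTotalMemMB`.

```javascript
const SYNC_MAX_AGE_MINUTES = 30;
const RESTART_ATTEMPT_LIMIT = 3;
const EVENT_LOOP_LAG_LIMIT_MS = 100;
const MEMORY_LIMIT_PERCENT = 80;

// health:     parsed body of /health
// automation: parsed body of /api/automation/status
// metrics:    parsed body of /metrics
function evaluateAlerts(health, automation, metrics, now = new Date()) {
  const alerts = [];

  if (health.status !== 'healthy') {
    alerts.push('health endpoint reports unhealthy');
  }

  const lastSync = new Date(automation.autoRestart.lastSuccessfulSync);
  const minutesSinceSync = (now.getTime() - lastSync.getTime()) / 60000;
  if (minutesSinceSync >= SYNC_MAX_AGE_MINUTES) {
    alerts.push(`no successful sync for ${Math.floor(minutesSinceSync)} minutes`);
  }

  if (automation.autoRestart.restartAttempts >= RESTART_ATTEMPT_LIMIT) {
    alerts.push('restart attempts exhausted');
  }

  const { eventLoopLagMs, hostFreeMemMB, hostTotalMemMB } = metrics.system.current;
  if (eventLoopLagMs > EVENT_LOOP_LAG_LIMIT_MS) {
    alerts.push(`event loop lag ${eventLoopLagMs}ms`);
  }

  // Assumption: "memory usage" means host memory in use.
  const memoryUsedPercent = (1 - hostFreeMemMB / hostTotalMemMB) * 100;
  if (memoryUsedPercent > MEMORY_LIMIT_PERCENT) {
    alerts.push(`host memory ${memoryUsedPercent.toFixed(1)}% used`);
  }

  return alerts;
}
```

An external scheduler could call this once a minute and forward any non-empty result to a notification channel of your choice.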

Performance Metrics

Sync Performance

Track sync operations from sync logs:

curl "http://localhost:3001/api/sync-logs?limit=100&db=your-database-id" \
  -H "Authorization: Bearer ${DUCKLING_API_KEY}"

Key metrics:

Records processed per sync
Duration per table
Success/error rate
Watermark progression

Query Performance

Monitor slow queries from query patterns:

curl -s "http://localhost:3001/metrics?db=your-database-id" \
  -H "Authorization: Bearer ${DUCKLING_API_KEY}" | \
  jq '.queries.patterns | sort_by(.avgMs) | reverse | .[0:10]'

Identify queries with:

High average duration
High max duration
High execution count
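
Average duration alone can hide cheap queries that run very often. One way to combine the three signals is to rank patterns by estimated total time (`count * avgMs`); this is a sketch of that idea, with `rankPatterns` as a hypothetical helper operating on the `queries.patterns` array from the metrics response.

```javascript
// Rank query patterns by estimated total time spent (count * avgMs), highest first.
function rankPatterns(patterns, limit = 10) {
  return patterns
    .map((p) => ({ ...p, totalMs: p.count * p.avgMs }))
    .sort((a, b) => b.totalMs - a.totalMs)
    .slice(0, limit);
}

// Illustrative data in the shape of queries.patterns.
const examplePatterns = [
  { pattern: 'SELECT COUNT(*) FROM User WHERE createdAt > ?', count: 234, avgMs: 45 },
  { pattern: 'SELECT * FROM User WHERE id = ?', count: 5000, avgMs: 3 },
  { pattern: 'SELECT * FROM Course', count: 2, avgMs: 900 }
];
for (const p of rankPatterns(examplePatterns)) {
  console.log(p.totalMs, p.pattern);
}
```

In this illustrative data the 3 ms lookup ranks first because it runs 5,000 times, even though its average duration is the lowest.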

Dashboard Integration

Web Dashboard

Access the built-in dashboard:

http://localhost:3000

Features:

Real-time system metrics graphs
Active queries view
Query pattern analysis
Sync log history
Automation status

API Integration

Build custom dashboards using API endpoints:

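// Fetch metrics every 10 seconds.
// Assumes Node.js for process.env; in a browser, supply the key another way.
// updateDashboard is your own rendering function.
const API_KEY = process.env.DUCKLING_API_KEY;

setInterval(async () => {
  try {
    const response = await fetch('http://localhost:3001/metrics?db=lms', {
      headers: { Authorization: `Bearer ${API_KEY}` }
    });
    if (!response.ok) {
      throw new Error(`Metrics request failed: HTTP ${response.status}`);
    }
    const metrics = await response.json();
    updateDashboard(metrics);
  } catch (err) {
    // Keep polling; a single failed request should not stop the dashboard.
    console.error(err);
  }
}, 10000);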

Multi-Database Monitoring

Monitor multiple databases:

# Get list of databases
curl http://localhost:3001/api/databases \
  -H "Authorization: Bearer ${DUCKLING_API_KEY}"

# Check health for each database
for db in lms analytics common; do
  echo "Database: $db"
  curl "http://localhost:3001/health?db=$db" \
    -H "Authorization: Bearer ${DUCKLING_API_KEY}"
done

Each database has independent health status, metrics, and automation services. Monitor them separately for accurate observability.
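
The same per-database check can be run concurrently from JavaScript. This is a minimal sketch, assuming Node.js 18+ (for the global `fetch`) and the `/health?db=` endpoint shown above; `healthUrl` and `checkAll` are hypothetical helpers, and `fetchImpl` is injectable so the function can be exercised without a running server.

```javascript
// Build the health URL for one database ID.
function healthUrl(baseUrl, databaseId) {
  const url = new URL('/health', baseUrl);
  url.searchParams.set('db', databaseId);
  return url.toString();
}

// Check every database concurrently; one failing request does not affect the others.
async function checkAll(baseUrl, databaseIds, apiKey, fetchImpl = fetch) {
  const results = await Promise.allSettled(
    databaseIds.map(async (db) => {
      const res = await fetchImpl(healthUrl(baseUrl, db), {
        headers: { Authorization: `Bearer ${apiKey}` }
      });
      const body = await res.json();
      return { db, status: body.status };
    })
  );
  // Rejected requests (network errors, invalid JSON) are reported as "unreachable".
  return results.map((result, i) =>
    result.status === 'fulfilled'
      ? result.value
      : { db: databaseIds[i], status: 'unreachable' }
  );
}

console.log(healthUrl('http://localhost:3001', 'lms'));
// http://localhost:3001/health?db=lms
```

A call such as `checkAll('http://localhost:3001', ['lms', 'analytics', 'common'], process.env.DUCKLING_API_KEY)` would resolve to one `{ db, status }` entry per database.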

Troubleshooting

High CPU Usage

Check active queries: GET /metrics
Review query patterns for inefficient queries
Reduce concurrent query load
Consider query optimization

High Memory Usage

Check system metrics: GET /metrics
Review batch sizes: BATCH_SIZE, INSERT_BATCH_SIZE
Reduce connection pool sizes
Monitor for memory leaks in logs

Event Loop Lag

Check system metrics for eventLoopLagMs
Reduce concurrent operations
Increase worker threads: WORKER_THREADS
Review long-running queries

Next Steps

Synchronization - Configure sync operations
Backups - Set up backup automation
Performance Tuning - Optimize performance
