
Overview

RaidBot includes a comprehensive monitoring system with Prometheus-compatible metrics and Discord-based alerting. Monitor bot health, performance, and usage patterns in real-time.
Metrics are automatically collected during bot operation with minimal performance overhead.

Metrics System

Available Metrics

RaidBot tracks three types of metrics:

Counters

Cumulative counts that only increase (commands executed, raids created, etc.)

Gauges

Current state values that can go up or down (active raids, memory usage)

Histograms

Distribution of values over time (command latency, query duration)
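The examples below call `metrics.incrementCounter`, `metrics.setGauge`, and `metrics.recordHistogram`. As a rough sketch of what such a helper could look like (the method names match this guide, but the actual `utils/metrics` implementation may differ):

```javascript
// Minimal sketch of a metrics helper with the API used in this guide.
// The real utils/metrics module may store and label things differently.
const store = {
    counters: new Map(),   // key: 'name{labels}' -> count
    gauges: new Map(),     // key: name -> current value
    histograms: new Map()  // key: 'name{labels}' -> [samples]
};

// Build a Prometheus-style key like 'commands_total{command="raid"}'
function labelKey(name, labels = {}) {
    const parts = Object.entries(labels).map(([k, v]) => `${k}="${v}"`);
    return parts.length ? `${name}{${parts.join(',')}}` : name;
}

const metrics = {
    incrementCounter(name, labels) {
        const key = labelKey(name, labels);
        store.counters.set(key, (store.counters.get(key) || 0) + 1);
    },
    setGauge(name, value) {
        store.gauges.set(name, value);
    },
    recordHistogram(name, value, labels) {
        const key = labelKey(name, labels);
        if (!store.histograms.has(key)) store.histograms.set(key, []);
        const samples = store.histograms.get(key);
        samples.push(value);
        if (samples.length > 1000) samples.shift(); // retention cap
    }
};
```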

Counter Metrics

// Track command executions
metrics.incrementCounter('commands_total', { command: 'raid' });

// Track reaction handling
metrics.incrementCounter('reactions_total', { action: 'add' });

// Track raid lifecycle
metrics.incrementCounter('raids_created_total');
metrics.incrementCounter('raids_closed_total');

// Track waitlist operations
metrics.incrementCounter('waitlist_promotions_total');

// Track failures
metrics.incrementCounter('dm_failures_total');

Gauge Metrics

// Update current state
metrics.setGauge('active_raids_gauge', activeRaidCount);
metrics.setGauge('participants_gauge', totalParticipants);

Histogram Metrics

// Record timing data
const start = Date.now();
// ... execute command ...
const duration = (Date.now() - start) / 1000;
metrics.recordHistogram('command_duration_seconds', duration, { command: 'raid' });

// Record database query times
const queryStart = Date.now();
const result = db.prepare('SELECT * FROM raids').all();
const queryDuration = (Date.now() - queryStart) / 1000;
metrics.recordHistogram('db_query_duration_seconds', queryDuration);
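The start/stop pattern above can be wrapped in a small helper so every command is timed the same way. `timed` is an illustrative name, not part of the documented API:

```javascript
// Illustrative helper (not part of the documented API): wraps an async
// function and records its duration in seconds into a histogram.
const samples = new Map(); // metricName -> [durations in seconds]

function recordHistogram(name, value) {
    if (!samples.has(name)) samples.set(name, []);
    samples.get(name).push(value);
}

async function timed(metricName, fn) {
    const start = Date.now();
    try {
        return await fn();           // result passes through unchanged
    } finally {
        // recorded even if fn throws, so failures still count toward latency
        recordHistogram(metricName, (Date.now() - start) / 1000);
    }
}

// Usage:
// const result = await timed('command_duration_seconds', () => handleCommand());
```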

Metric Storage

Metrics are stored in-memory with automatic retention limits:
const metrics = {
    // Counters
    reactions_total: new Map(), // key: 'add|remove' -> count
    commands_total: new Map(), // key: commandName -> count
    dm_failures_total: 0,
    raids_created_total: 0,
    raids_closed_total: 0,
    waitlist_promotions_total: 0,

    // Histograms (last 1000 samples per command)
    command_duration_seconds: new Map(), // key: commandName -> [durations]
    db_query_duration_seconds: [], // last 10,000 queries

    // Gauges
    active_raids_gauge: 0,
    participants_gauge: 0
};
Histogram samples are automatically pruned to prevent unbounded memory growth while maintaining statistical accuracy.
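The pruning step might look like the following sketch, which keeps only the most recent samples per command (the exact retention logic in the bot is an assumption here):

```javascript
// Sketch of the assumed pruning behavior: cap each command's sample
// array so histogram memory stays bounded.
const MAX_COMMAND_SAMPLES = 1000;

function recordCommandDuration(histMap, command, duration) {
    if (!histMap.has(command)) histMap.set(command, []);
    const samples = histMap.get(command);
    samples.push(duration);
    if (samples.length > MAX_COMMAND_SAMPLES) {
        // drop the oldest samples beyond the retention limit
        samples.splice(0, samples.length - MAX_COMMAND_SAMPLES);
    }
}
```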

Prometheus Export

Generate Metrics Output

Export metrics in Prometheus format:
const { generatePrometheusMetrics } = require('./utils/metrics');

const metricsText = generatePrometheusMetrics();
console.log(metricsText);

Example Output

# HELP wizbot_reactions_total Total number of reactions processed
# TYPE wizbot_reactions_total counter
wizbot_reactions_total{action="add"} 1523
wizbot_reactions_total{action="remove"} 342

# HELP wizbot_commands_total Total number of commands executed
# TYPE wizbot_commands_total counter
wizbot_commands_total{command="raid"} 245
wizbot_commands_total{command="signup"} 89
wizbot_commands_total{command="stats"} 67

# HELP wizbot_command_duration_seconds Command execution duration
# TYPE wizbot_command_duration_seconds summary
wizbot_command_duration_seconds_count{command="raid"} 245
wizbot_command_duration_seconds_sum{command="raid"} 12.456
wizbot_command_duration_seconds{command="raid",quantile="0.5"} 0.045
wizbot_command_duration_seconds{command="raid",quantile="0.95"} 0.123
wizbot_command_duration_seconds{command="raid",quantile="0.99"} 0.234

# HELP wizbot_active_raids Current number of active raids
# TYPE wizbot_active_raids gauge
wizbot_active_raids 12

Histogram Statistics

Histograms automatically calculate percentiles:
function calculateHistogramStats(values) {
    if (values.length === 0) {
        return { count: 0, sum: 0, p50: 0, p95: 0, p99: 0 };
    }

    const sorted = [...values].sort((a, b) => a - b);
    const count = sorted.length;
    const sum = sorted.reduce((a, b) => a + b, 0);

    return {
        count,
        sum,
        p50: sorted[Math.floor(count * 0.50)] || 0,  // Median
        p95: sorted[Math.floor(count * 0.95)] || 0,  // 95th percentile
        p99: sorted[Math.floor(count * 0.99)] || 0   // 99th percentile
    };
}

JSON Metrics

Get metrics as structured JSON for logging or custom integrations:
const { getMetricsJSON } = require('./utils/metrics');

const metrics = getMetricsJSON();
console.log(JSON.stringify(metrics, null, 2));

Example JSON Output

{
  "counters": {
    "reactions_total": {
      "action=\"add\"": 1523,
      "action=\"remove\"": 342
    },
    "commands_total": {
      "command=\"raid\"": 245,
      "command=\"signup\"": 89
    },
    "dm_failures_total": 5,
    "raids_created_total": 245,
    "raids_closed_total": 198,
    "waitlist_promotions_total": 34
  },
  "histograms": {
    "command_duration_seconds": [
      {
        "labels": "command=\"raid\"",
        "count": 245,
        "sum": 12.456,
        "p50": 0.045,
        "p95": 0.123,
        "p99": 0.234
      }
    ],
    "db_query_duration_seconds": {
      "count": 8932,
      "sum": 45.678,
      "p50": 0.003,
      "p95": 0.012,
      "p99": 0.028
    }
  },
  "gauges": {
    "active_raids_gauge": 12,
    "participants_gauge": 156
  },
  "timestamp": "2026-03-03T10:30:00.000Z"
}

Automatic Logging

Enable periodic metrics logging:
const { startMetricsLogging } = require('./utils/metrics');

// Log metrics every 5 minutes (default)
startMetricsLogging();

// Custom interval (every 10 minutes)
startMetricsLogging(10 * 60 * 1000);
Metrics will be logged to your configured logger with full statistics.

Alert System

Alert Configuration

Configure alerting thresholds and owner notifications:
const ALERT_CONFIG = {
    // Set via environment variable
    ownerId: process.env.BOT_OWNER_ID || null,

    // Alert thresholds
    thresholds: {
        commandLatencyP95: 2.0,        // Alert if p95 > 2 seconds
        dmFailureRate: 0.15,           // Alert if >15% DM failure rate
        activeRaidsHigh: 50,           // Alert if >50 active raids
        memoryUsageMB: 512,            // Alert if >512MB RAM
        circuitBreakerOpen: true       // Alert when circuit breaker opens
    },

    // Alert cooldowns (prevent spam)
    cooldowns: {
        commandLatency: 15 * 60 * 1000,    // 15 minutes
        dmFailureRate: 30 * 60 * 1000,     // 30 minutes
        activeRaidsHigh: 60 * 60 * 1000,   // 1 hour
        memoryUsage: 30 * 60 * 1000,       // 30 minutes
        circuitBreaker: 10 * 60 * 1000     // 10 minutes
    }
};

Initialize Alerts

const { initializeAlerts } = require('./utils/alerts');

// Initialize with Discord client and owner ID
initializeAlerts(client, process.env.BOT_OWNER_ID);
Set the BOT_OWNER_ID environment variable to receive Discord DM alerts. Without it, alerts will be logged but not sent.
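As a hypothetical sketch of how `sendAlert` could deliver those DMs (the real `utils/alerts` implementation may differ; the embed is simplified to a plain object here, and the client is assumed to expose a discord.js-style `users.fetch`):

```javascript
// Hypothetical sketch of the alert delivery path: DM the configured
// owner an embed-shaped payload, or fall back to logging when no
// owner ID was provided.
let ownerId = null;

function initializeAlerts(client, id) {
    ownerId = id || null;
}

async function sendAlert(client, { title, description, fields = [], color }) {
    if (!ownerId) {
        // no owner configured: log the alert instead of DMing it
        console.warn(`[alert, not DMed] ${title}: ${description}`);
        return false;
    }
    const owner = await client.users.fetch(ownerId);
    await owner.send({ embeds: [{ title, description, fields, color }] });
    return true;
}
```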

Alert Types

High Command Latency

// Triggered when p95 latency exceeds threshold
if (cmd.p95 > ALERT_CONFIG.thresholds.commandLatencyP95) {
    await sendAlert(client, {
        title: '⚠️ High Command Latency',
        description: 'Command execution is taking longer than expected.',
        fields: [
            { name: 'Command', value: cmd.labels, inline: true },
            { name: 'P95 Latency', value: `${cmd.p95.toFixed(2)}s`, inline: true },
            { name: 'Threshold', value: '2.0s', inline: true }
        ],
        color: 0xFEE75C // Yellow
    });
}

Circuit Breaker Alerts

// Alert when circuit breaker opens
if (state.state === 'OPEN' && previousState !== 'OPEN') {
    await sendAlert(client, {
        title: '⚠️ Circuit Breaker Opened',
        description: `The **${name}** circuit breaker has opened due to repeated failures.`,
        fields: [
            { name: 'State', value: state.state, inline: true },
            { name: 'Failure Count', value: state.failureCount.toString(), inline: true },
            { name: 'Next Retry', value: `<t:${Math.floor(state.nextAttempt / 1000)}:R>`, inline: true }
        ],
        color: 0xED4245 // Red
    });
}

DM Failure Rate

// Alert when too many DMs fail
const failureRate = waitlistPromotions > 0 ? dmFailures / waitlistPromotions : 0;

if (failureRate > ALERT_CONFIG.thresholds.dmFailureRate) {
    await sendAlert(client, {
        title: '⚠️ High DM Failure Rate',
        description: 'Many DMs are failing to deliver to users.',
        fields: [
            { name: 'Failure Rate', value: `${(failureRate * 100).toFixed(1)}%`, inline: true },
            { name: 'Failed DMs', value: dmFailures.toString(), inline: true }
        ],
        color: 0xFEE75C
    });
}

Memory Usage

// Alert on high memory consumption
if (heapUsedMB > ALERT_CONFIG.thresholds.memoryUsageMB) {
    await sendAlert(client, {
        title: '⚠️ High Memory Usage',
        description: 'Bot is using more memory than expected.',
        fields: [
            { name: 'Heap Used', value: `${heapUsedMB.toFixed(1)} MB`, inline: true },
            { name: 'Threshold', value: '512 MB', inline: true }
        ],
        color: 0xED4245 // Red
    });
}

Daily Health Report

Receive automatic daily health summaries at 9 AM:
// Sent automatically every day
await sendAlert(client, {
    title: '📊 Daily Health Report',
    description: `Bot health summary for ${now.toLocaleDateString()}`,
    fields: [
        { name: '⏱️ Uptime', value: `${uptimeDays}d ${uptimeHours}h`, inline: true },
        { name: '🎮 Active Raids', value: metrics.gauges.active_raids_gauge.toString(), inline: true },
        { name: '👥 Participants', value: metrics.gauges.participants_gauge.toString(), inline: true },
        { name: '📝 Commands', value: totalCommands.toString(), inline: true },
        { name: '💾 Memory', value: `${memoryMB.toFixed(1)} MB`, inline: true }
    ],
    color: 0x57F287 // Green
});
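One way to schedule the 9 AM report (the bot's actual scheduling code is not shown here, so this is an assumed approach): compute the delay until the next local 9 AM, then re-arm after each send.

```javascript
// Assumed scheduling approach for the daily report: a self-re-arming
// setTimeout aimed at the next local 9 AM.
function msUntilNextNineAM(now = new Date()) {
    const next = new Date(now);
    next.setHours(9, 0, 0, 0);
    if (next <= now) next.setDate(next.getDate() + 1); // 9 AM already passed today
    return next - now;
}

function scheduleDailyReport(sendReport) {
    setTimeout(async () => {
        await sendReport();
        scheduleDailyReport(sendReport); // re-arm for the next day
    }, msUntilNextNineAM());
}

// Usage:
// scheduleDailyReport(() => sendDailyHealthReport(client));
```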

Test Alerts

Verify alert delivery:
const { sendTestAlert } = require('./utils/alerts');

await sendTestAlert(client);

Cooldown System

Alerts respect cooldown periods to prevent spam:
function canSendAlert(key) {
    const lastTime = lastAlertTimes.get(key);
    if (!lastTime) return true;

    const cooldown = ALERT_CONFIG.cooldowns[key.split('_')[0]] || 15 * 60 * 1000;
    return Date.now() - lastTime > cooldown;
}
Cooldowns ensure you’re not overwhelmed with alerts while still being notified of ongoing issues.
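The other half of the bookkeeping is recording the send time after each alert; `recordAlert` below is an illustrative name, not from the source:

```javascript
// Self-contained sketch of the cooldown bookkeeping: check before
// sending, record after sending. recordAlert is an illustrative name.
const lastAlertTimes = new Map();
const DEFAULT_COOLDOWN = 15 * 60 * 1000; // 15 minutes

function canSendAlert(key, cooldown = DEFAULT_COOLDOWN) {
    const lastTime = lastAlertTimes.get(key);
    return !lastTime || Date.now() - lastTime > cooldown;
}

function recordAlert(key) {
    lastAlertTimes.set(key, Date.now());
}

// Usage:
// if (canSendAlert('memoryUsage')) {
//     await sendAlert(client, alert);
//     recordAlert('memoryUsage');
// }
```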

Integration Examples

HTTP Endpoint for Prometheus

Expose metrics via HTTP for Prometheus scraping:
const express = require('express');
const { generatePrometheusMetrics } = require('./utils/metrics');

const app = express();

app.get('/metrics', (req, res) => {
    res.set('Content-Type', 'text/plain');
    res.send(generatePrometheusMetrics());
});

app.listen(9090, () => {
    console.log('Metrics endpoint available at http://localhost:9090/metrics');
});

Grafana Dashboard

Point Prometheus at the bot's metrics endpoint, then build Grafana panels on top of the scraped data:
# prometheus.yml
scrape_configs:
  - job_name: 'wizbot'
    static_configs:
      - targets: ['localhost:9090']
    scrape_interval: 30s
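Once Prometheus is scraping, Grafana panels can query the exported series directly; for example, command throughput and p95 latency (illustrative PromQL, assuming the metric names shown above):

```promql
# Commands per second over the last 5 minutes
rate(wizbot_commands_total[5m])

# p95 command latency as exported by the bot
wizbot_command_duration_seconds{quantile="0.95"}
```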

Custom Logging Integration

const { getMetricsJSON } = require('./utils/metrics');

// Log to external service
setInterval(() => {
    const metrics = getMetricsJSON();
    
    // Send to logging service (e.g., Datadog, New Relic)
    loggingService.log({
        service: 'wizbot',
        type: 'metrics',
        data: metrics
    });
}, 60000); // Every minute

Troubleshooting

Alerts Not Received

Problem: Owner not receiving Discord DMs

Solutions:
  1. Verify BOT_OWNER_ID environment variable is set
  2. Ensure bot can DM the owner (shared server or friend)
  3. Check alert cooldowns haven’t suppressed recent alerts
  4. Review logs for alert delivery errors

High Memory Usage

Problem: Memory gauge exceeds threshold

Solutions:
  1. Check for memory leaks using heap snapshots
  2. Review histogram retention (currently 1,000 samples per command)
  3. Ensure closed raids are being cleaned up
  4. Consider reducing metric retention periods

Missing Metrics

Problem: Expected metrics not appearing

Solutions:
  1. Verify metrics are being incremented in code
  2. Check metric names match expected format
  3. Ensure metrics aren’t being reset unexpectedly
  4. Review logs for metric collection errors

Best Practices

  1. Set appropriate thresholds based on your bot’s normal behavior
  2. Use cooldowns to prevent alert fatigue
  3. Monitor trends, not just absolute values
  4. Test alerts before deploying to production
  5. Retain metrics only as long as needed for analysis
  6. Export to external systems for long-term storage
