
Overview

Monitoring the Predictive Maintenance System ensures reliable anomaly detection and early identification of infrastructure issues. This guide covers health checks, log watching, performance metrics, and cloud deployment monitoring.

Health Checks

Backend Health Endpoint

The /health endpoint provides a quick system status check:
curl http://localhost:8000/health
Response:
{
  "status": "healthy",
  "db_connected": true,
  "timestamp": "2026-03-02T14:30:00Z"
}
Field          Description
status         healthy or degraded
db_connected   InfluxDB connection status
timestamp      Server time (UTC)
Use this endpoint for uptime monitoring tools like UptimeRobot, Pingdom, or Datadog.
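Beyond hosted uptime tools, the same check is easy to script. A minimal polling sketch; the `parse_health` and `check_health` helper names are ours, not part of the API:

```python
import json
from urllib.request import urlopen  # stdlib only; requests works equally well

def parse_health(payload: dict) -> bool:
    """Return True when the backend reports a healthy, DB-connected state."""
    return payload.get("status") == "healthy" and payload.get("db_connected") is True

def check_health(base_url: str) -> bool:
    """Fetch /health and evaluate it; network errors propagate to the caller."""
    with urlopen(f"{base_url}/health", timeout=5) as resp:
        return parse_health(json.load(resp))

# The sample /health payload documented above parses as healthy:
sample = {"status": "healthy", "db_connected": True,
          "timestamp": "2026-03-02T14:30:00Z"}
print(parse_health(sample))  # True
print(parse_health({"status": "degraded", "db_connected": False}))  # False
```

Run `check_health("http://localhost:8000")` from a cron job and alert on a `False` result or a raised exception.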

Keep-Alive Ping

The /ping endpoint is a lightweight health check (no database queries):
curl http://localhost:8000/ping
Response:
{
  "status": "ok"
}
The frontend sends a 10-minute heartbeat to /ping to prevent Render free-tier cold starts (see Render Deployment below).

System State Monitoring

Check Current State

The /system/state endpoint shows operational mode:
curl http://localhost:8000/system/state
Response:
{
  "state": "MONITORING",
  "asset_id": "Motor-01",
  "monitoring_start_time": "2026-03-02T14:00:00Z",
  "calibration_status": "complete",
  "baseline_sample_count": 500,
  "model_version": "v3"
}
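A cron job or dashboard widget can condense this response into a one-line summary. A sketch using only the fields shown in the sample above; `state_summary` is an illustrative helper, not an API call:

```python
def state_summary(payload: dict) -> str:
    """One-line summary of a /system/state response for dashboards or alerts."""
    parts = [payload.get("state", "UNKNOWN")]
    if payload.get("calibration_status") != "complete":
        parts.append("calibration pending")
    if payload.get("state") == "MONITORING":
        parts.append(f"model {payload.get('model_version', '?')}, "
                     f"{payload.get('baseline_sample_count', 0)} baseline samples")
    return " | ".join(parts)

# Sample response from the docs above:
state = {"state": "MONITORING", "asset_id": "Motor-01",
         "monitoring_start_time": "2026-03-02T14:00:00Z",
         "calibration_status": "complete",
         "baseline_sample_count": 500, "model_version": "v3"}
print(state_summary(state))  # MONITORING | model v3, 500 baseline samples
```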

System States

  • No active monitoring; the system is waiting for calibration or data ingestion.
    Action: Start calibration or monitoring.
  • Collecting healthy baseline data; models are not yet trained.
    Action: Wait for calibration to complete (typically 5-10 minutes).
  • Active anomaly detection; models are trained and scoring incoming data.
    Action: Normal operations. Monitor health scores.
  • Fault simulation active; anomalies are intentional.
    Action: Demo/testing mode. Clear the fault to resume normal monitoring.

Log Watching

Backend Logs

# View live logs
docker-compose logs -f backend

# Last 100 lines
docker-compose logs --tail=100 backend

# Specific timestamp range
docker-compose logs --since="2026-03-02T14:00:00" backend

Key Log Patterns

  • Meaning: Model training completed.
    Example:
      [Calibration] Baseline built successfully for Motor-01: 500 samples
      [Calibration] Batch detector trained on 480 windows (16 features)
    Action: System ready for monitoring.
  • Meaning: Periodic health assessment logged.
    Example:
      [Health] Health score: 42 (HIGH) - Vibration variance elevated
    Action: If risk is HIGH or CRITICAL, investigate the explanations.
  • Meaning: InfluxDB query failed.
    Example:
      [Retrain] ERROR fetching data: unauthorized access
    Action: Check InfluxDB credentials and network connectivity (see Troubleshooting).
  • Meaning: Pre-trained model restored on startup.
    Example:
      [Detector] Model loaded from backend/models/Motor-01_batch_detector_v3.pkl
    Action: Normal. The model persists across restarts.
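These patterns can also be scanned programmatically. A sketch whose regexes are derived from the sample lines above; the backend's exact log formats may differ, so treat the patterns as assumptions to adjust:

```python
import re

# Matches lines like: [Health] Health score: 42 (HIGH) - Vibration variance elevated
HEALTH_LINE = re.compile(r"\[Health\] Health score: (\d+) \((\w+)\)")
# Matches lines like: [Retrain] ERROR fetching data: unauthorized access
ERROR_LINE = re.compile(r"\[(\w+)\] ERROR (.*)")

def scan_line(line: str):
    """Classify one backend log line; returns a dict, or None for ignored lines."""
    if m := HEALTH_LINE.search(line):
        return {"kind": "health", "score": int(m.group(1)), "risk": m.group(2)}
    if m := ERROR_LINE.search(line):
        return {"kind": "error", "module": m.group(1), "detail": m.group(2)}
    return None

print(scan_line("[Health] Health score: 42 (HIGH) - Vibration variance elevated"))
print(scan_line("[Retrain] ERROR fetching data: unauthorized access"))
```

Pipe `docker-compose logs backend` through a script built on `scan_line` to surface only health and error events.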

Performance Metrics

Real-Time Metrics Dashboard

The frontend Dashboard tab displays live performance indicators:

  • Health Score: 0-100 scale derived from the Degradation Index (DI)
  • Anomaly Score: 0.0-1.0 probability from the Isolation Forest
  • Damage Rate: DI increase per hour (% degradation/hour)
  • RUL (Remaining Useful Life): projected hours until maintenance is required

Historical Metrics Query

Retrieve time-series metrics via API:
curl -X GET "http://localhost:8000/integration/sensor-history?asset_id=Motor-01&seconds=3600" \
  | jq '.[] | {timestamp, health_score, anomaly_score}'
Response:
[
  {
    "timestamp": "2026-03-02T14:00:00Z",
    "health_score": 95,
    "anomaly_score": 0.08
  },
  {
    "timestamp": "2026-03-02T14:01:00Z",
    "health_score": 93,
    "anomaly_score": 0.12
  }
]
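The returned series makes it straightforward to estimate a degradation trend. A sketch over the sample payload above; `health_trend_per_hour` is an illustrative helper, not an API endpoint:

```python
from datetime import datetime

def health_trend_per_hour(samples: list) -> float:
    """Health-score change per hour across the window (negative = degrading)."""
    first, last = samples[0], samples[-1]
    t0 = datetime.fromisoformat(first["timestamp"].replace("Z", "+00:00"))
    t1 = datetime.fromisoformat(last["timestamp"].replace("Z", "+00:00"))
    seconds = (t1 - t0).total_seconds()
    return (last["health_score"] - first["health_score"]) * 3600 / seconds

# Sample response from the docs above: two points one minute apart.
history = [
    {"timestamp": "2026-03-02T14:00:00Z", "health_score": 95, "anomaly_score": 0.08},
    {"timestamp": "2026-03-02T14:01:00Z", "health_score": 93, "anomaly_score": 0.12},
]
print(health_trend_per_hour(history))  # -120.0
```

A steeply negative trend over a full hour of data is a stronger maintenance signal than any single low reading.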

Baseline Metrics

Check calibrated baseline targets:
curl "http://localhost:8000/integration/health-status?asset_id=Motor-01" \
  | jq '.baseline_targets'
Response:
{
  "voltage_v": 230.0,
  "current_a": 15.0,
  "power_factor": 0.92,
  "vibration_g": 0.15
}
These values are displayed on the frontend Status Cards for instant comparison with live readings.
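The same comparison can be automated. A sketch that flags channels drifting beyond a tolerance; the 10% threshold and the `live` readings are illustrative assumptions, and `deviations` is not part of the API:

```python
def deviations(live: dict, baseline: dict, tolerance: float = 0.10) -> dict:
    """Fractional drift of each live reading from its calibrated baseline target,
    keeping only channels that exceed `tolerance` (10% by default)."""
    out = {}
    for key, target in baseline.items():
        if key in live and target:
            drift = (live[key] - target) / target
            if abs(drift) > tolerance:
                out[key] = round(drift, 3)
    return out

# Baseline targets from the sample response above; live readings are hypothetical.
baseline = {"voltage_v": 230.0, "current_a": 15.0,
            "power_factor": 0.92, "vibration_g": 0.15}
live = {"voltage_v": 228.0, "current_a": 15.2,
        "power_factor": 0.91, "vibration_g": 0.21}
print(deviations(live, baseline))  # {'vibration_g': 0.4}
```

Here only vibration drifts past 10% (0.21g vs. a 0.15g target, i.e. +40%), which matches the high-vibration anomaly pattern shown in the logs.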

Deployment Status Monitoring

Render (Backend) Deployment Status

The system is deployed on Render Free Tier, which has specific monitoring considerations:

Cold Start Behavior

Render free-tier services spin down after 15 minutes of inactivity. The first request after spin-down takes 30-60 seconds to respond.
Cold Start Detection:
  • Frontend shows STATUS: OFFLINE badge
  • First API request returns HTTP 503 or timeout
  • Render logs show: Starting service...
Cold Start Mitigation: The frontend automatically sends a keep-alive heartbeat:
// Keep-alive ping every 10 minutes
setInterval(() => {
  fetch(`${API_URL}/ping`).catch(() => {});
}, 10 * 60 * 1000);
For production deployments, upgrade to Render Starter ($7/month) to eliminate cold starts.

Monitoring Render Health

  1. Via Dashboard:
    • Open Render Dashboard
    • Check service status (green = running, gray = spun down)
    • Review Events tab for deploy/restart logs
  2. Via API:
    curl -H "Authorization: Bearer $RENDER_API_KEY" \
      https://api.render.com/v1/services/$SERVICE_ID
    
  3. Via Application:
    • Frontend STATUS: LIVE badge (green = healthy, red = offline)
    • Check /health endpoint response time (<500ms = warm, >2s = cold start)
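Those response-time thresholds can drive an automated cold-start probe. A sketch; the cutoffs come from the bullet above, and `measure` is a plain wall-clock timer rather than a Render API call:

```python
import time
from urllib.request import urlopen

def classify_latency(seconds: float) -> str:
    """Rough warm/cold classification using the thresholds above."""
    if seconds > 2.0:
        return "cold"          # service likely spun down and is restarting
    if seconds < 0.5:
        return "warm"
    return "indeterminate"

def measure(url: str) -> float:
    """Wall-clock time of one GET; network errors propagate to the caller."""
    start = time.monotonic()
    with urlopen(url, timeout=90):  # generous timeout to survive a cold start
        pass
    return time.monotonic() - start

print(classify_latency(0.2))   # warm
print(classify_latency(35.0))  # cold
```

For example, `classify_latency(measure("http://localhost:8000/health"))` returning "cold" suggests the Render instance just spun up.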

Vercel (Frontend) Deployment Status

The frontend is deployed on Vercel Free Tier.

Monitoring Vercel Health

  1. Via Dashboard:
    • Open Vercel Dashboard
    • Check Deployments tab for build status
    • Review Analytics for traffic and error rates
  2. Via Application:
    • The Vercel Edge Network has no cold starts; the frontend loads instantly worldwide.

InfluxDB Cloud Status

Monitor InfluxDB connectivity:
curl -X GET "$INFLUX_URL/health" \
  -H "Authorization: Token $INFLUX_TOKEN"
Response:
{
  "name": "influxdb",
  "message": "ready for queries and writes",
  "status": "pass",
  "checks": [],
  "version": "v2.7.1",
  "commit": "..."
}
Interpreting the response:
  • status "pass": InfluxDB is healthy and accepting requests.
  • Status other than "pass", or no response: the database is down. Check the InfluxDB Status Page.
  • Unauthorized error: invalid token. Verify the INFLUX_TOKEN environment variable.

Alerting and Notifications

Event Webhook Integration

The /integration/events endpoint streams state transitions:
curl -X GET "http://localhost:8000/integration/events?asset_id=Motor-01&limit=10"
Response:
[
  {
    "event_type": "ANOMALY_DETECTED",
    "timestamp": "2026-03-02T14:30:00Z",
    "asset_id": "Motor-01",
    "message": "High vibration variance (mechanical jitter): σ=0.17g",
    "severity": "HIGH",
    "health_score": 42
  }
]

Setting Up Alerts

1. Choose a Notification Service

Options: Slack, PagerDuty, email (via SendGrid), SMS (via Twilio)
2. Poll the Events Endpoint

Set up a cron job or serverless function to check for new events:
import requests

response = requests.get(
    "http://your-backend.com/integration/events",
    params={"asset_id": "Motor-01", "limit": 5},
    timeout=10,
)
response.raise_for_status()  # surface HTTP errors instead of parsing bad JSON

for event in response.json():
    if event["severity"] in ("HIGH", "CRITICAL"):
        send_alert(event)  # Your notification logic
3. Filter by Severity

Only alert on actionable events:
  • CRITICAL: Immediate action required (health < 25)
  • HIGH: Schedule maintenance within 24-48 hours (health < 50)
  • MODERATE: Monitor closely (health < 75)
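The tiers map directly onto health-score cutoffs, which keeps the filter logic trivial. A sketch; `severity_for` is an illustrative helper mirroring the thresholds listed above:

```python
def severity_for(health_score: int):
    """Map a 0-100 health score to the alert tiers listed above."""
    if health_score < 25:
        return "CRITICAL"   # immediate action required
    if health_score < 50:
        return "HIGH"       # schedule maintenance within 24-48 hours
    if health_score < 75:
        return "MODERATE"   # monitor closely
    return None             # healthy range: no alert needed

print(severity_for(42))  # HIGH
print(severity_for(90))  # None
```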
For production, consider using InfluxDB alerting rules to trigger notifications directly from the time-series database.

Performance Benchmarks

Expected Response Times

Endpoint                      Cold Start   Warm
/ping                         100ms        50ms
/health                       500ms        200ms
/system/state                 300ms        100ms
/integration/health-status    800ms        400ms
/integration/sensor-history   2s           1s
If response times exceed 2x these values, investigate:
  • InfluxDB query performance (check for missing indexes)
  • Render instance CPU/memory limits (check Render metrics)
  • Network latency (run ping to backend domain)

Throughput Limits

Render Free Tier:
  • 512 MB RAM
  • 0.1 CPU core
  • Limit: ~10 requests/second
InfluxDB Cloud Free Tier:
  • 10,000 writes/5 minutes
  • Limit: ~33 writes/second
At 100Hz ingestion for 1 asset, you send 100 writes/second. This exceeds the InfluxDB free tier. Use batch writes (100 points per API call) to stay within limits.
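The batching strategy amounts to buffering points and flushing every 100. A minimal sketch; production code would typically use an InfluxDB client library's built-in batching, and `send` here stands in for a POST of line-protocol text to InfluxDB's write API:

```python
class BatchWriter:
    """Buffer line-protocol points and flush them in batches of `batch_size`.
    `send` is any callable that ships one newline-joined batch downstream."""

    def __init__(self, send, batch_size: int = 100):
        self.send = send
        self.batch_size = batch_size
        self.buffer = []

    def add(self, point: str):
        self.buffer.append(point)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Send whatever is buffered, including a final partial batch."""
        if self.buffer:
            self.send("\n".join(self.buffer))
            self.buffer = []

# At 100Hz, batches of 100 turn 100 writes/second into ~1 API call/second.
batches = []
writer = BatchWriter(batches.append, batch_size=100)
for i in range(250):
    writer.add(f"sensor,asset=Motor-01 vibration_g=0.15 {i}")
writer.flush()
print(len(batches))  # 3 batches: 100 + 100 + 50 points
```

Injecting `send` as a callable keeps the buffering logic testable without a live database.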

Monitoring Checklist

1. Daily: Check Health Dashboard

  • Verify STATUS: LIVE badge is green
  • Scan health scores for all assets
  • Review recent operator logs
2. Weekly: Review Logs

  • Check for InfluxDB connection errors
  • Verify model loading logs on restarts
  • Investigate any ERROR or WARNING entries
3. Monthly: Performance Audit

  • Run benchmark script to validate model accuracy
  • Review false positive/negative rates
  • Check Render/Vercel deployment metrics
4. Quarterly: Model Retraining

  • Retrain models with fresh healthy data
  • Update baselines if operating conditions changed
  • Document retraining events

  • Troubleshooting: Common issues and solutions
  • Model Retraining: How to retrain ML models
  • Health Assessment: Understanding health scores
  • API Reference: Complete API documentation
