
Overview

Monitoring the Predictive Maintenance System ensures reliable anomaly detection and early identification of infrastructure issues. This guide covers health checks, log watching, performance metrics, and cloud deployment monitoring.

Health Checks

Backend Health Endpoint

The /health endpoint provides a quick system status check:
curl http://localhost:8000/health
Response:
{
  "status": "healthy",
  "db_connected": true,
  "timestamp": "2026-03-02T14:30:00Z"
}
Field          Description
status         healthy or degraded
db_connected   InfluxDB connection status
timestamp      Server time (UTC)
Use this endpoint for uptime monitoring tools like UptimeRobot, Pingdom, or Datadog.
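Beyond hosted uptime tools, the same check is easy to script. A minimal polling sketch; the `parse_health` and `check_health` helper names are ours, not part of the API:

```python
import json
from urllib.request import urlopen  # stdlib only; requests works equally well

def parse_health(payload: dict) -> bool:
    """Return True when the backend reports a healthy, DB-connected state."""
    return payload.get("status") == "healthy" and payload.get("db_connected") is True

def check_health(base_url: str) -> bool:
    """Fetch /health and evaluate it; network errors propagate to the caller."""
    with urlopen(f"{base_url}/health", timeout=5) as resp:
        return parse_health(json.load(resp))

# The sample /health payload documented above parses as healthy:
sample = {"status": "healthy", "db_connected": True,
          "timestamp": "2026-03-02T14:30:00Z"}
print(parse_health(sample))  # True
print(parse_health({"status": "degraded", "db_connected": False}))  # False
```

Run `check_health("http://localhost:8000")` from a cron job and alert on a `False` result or a raised exception.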

Keep-Alive Ping

The /ping endpoint is a lightweight health check (no database queries):
curl http://localhost:8000/ping
Response:
{
  "status": "ok"
}
The frontend sends a 10-minute heartbeat to /ping to prevent Render free-tier cold starts (see Render Deployment below).

System State Monitoring

Check Current State

The /system/state endpoint shows operational mode:
curl http://localhost:8000/system/state
Response:
{
  "state": "MONITORING",
  "asset_id": "Motor-01",
  "monitoring_start_time": "2026-03-02T14:00:00Z",
  "calibration_status": "complete",
  "baseline_sample_count": 500,
  "model_version": "v3"
}
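A cron job or dashboard widget can condense this response into a one-line summary. A sketch using only the fields shown in the sample above; `state_summary` is an illustrative helper, not an API call:

```python
def state_summary(payload: dict) -> str:
    """One-line summary of a /system/state response for dashboards or alerts."""
    parts = [payload.get("state", "UNKNOWN")]
    if payload.get("calibration_status") != "complete":
        parts.append("calibration pending")
    if payload.get("state") == "MONITORING":
        parts.append(f"model {payload.get('model_version', '?')}, "
                     f"{payload.get('baseline_sample_count', 0)} baseline samples")
    return " | ".join(parts)

# Sample response from the docs above:
state = {"state": "MONITORING", "asset_id": "Motor-01",
         "monitoring_start_time": "2026-03-02T14:00:00Z",
         "calibration_status": "complete",
         "baseline_sample_count": 500, "model_version": "v3"}
print(state_summary(state))  # MONITORING | model v3, 500 baseline samples
```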

System States

  • No active monitoring; the system is waiting for calibration or data ingestion.
    Action: Start calibration or monitoring.
  • Collecting healthy baseline data; models are not yet trained.
    Action: Wait for calibration to complete (typically 5-10 minutes).
  • Active anomaly detection; models are trained and scoring incoming data.
    Action: Normal operations. Monitor health scores.
  • Fault simulation active; anomalies are intentional.
    Action: Demo/testing mode. Clear the fault to resume normal monitoring.

Log Watching

Backend Logs

# View live logs
docker-compose logs -f backend

# Last 100 lines
docker-compose logs --tail=100 backend

# Specific timestamp range
docker-compose logs --since="2026-03-02T14:00:00" backend

Key Log Patterns

  • Meaning: Model training completed.
    Example:
      [Calibration] Baseline built successfully for Motor-01: 500 samples
      [Calibration] Batch detector trained on 480 windows (16 features)
    Action: System ready for monitoring.
  • Meaning: Periodic health assessment logged.
    Example:
      [Health] Health score: 42 (HIGH) - Vibration variance elevated
    Action: If risk is HIGH or CRITICAL, investigate the explanations.
  • Meaning: InfluxDB query failed.
    Example:
      [Retrain] ERROR fetching data: unauthorized access
    Action: Check InfluxDB credentials and network connectivity (see Troubleshooting).
  • Meaning: Pre-trained model restored on startup.
    Example:
      [Detector] Model loaded from backend/models/Motor-01_batch_detector_v3.pkl
    Action: Normal. The model persists across restarts.
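These patterns can also be scanned programmatically. A sketch whose regexes are derived from the sample lines above; the backend's exact log formats may differ, so treat the patterns as assumptions to adjust:

```python
import re

# Matches lines like: [Health] Health score: 42 (HIGH) - Vibration variance elevated
HEALTH_LINE = re.compile(r"\[Health\] Health score: (\d+) \((\w+)\)")
# Matches lines like: [Retrain] ERROR fetching data: unauthorized access
ERROR_LINE = re.compile(r"\[(\w+)\] ERROR (.*)")

def scan_line(line: str):
    """Classify one backend log line; returns a dict, or None for ignored lines."""
    if m := HEALTH_LINE.search(line):
        return {"kind": "health", "score": int(m.group(1)), "risk": m.group(2)}
    if m := ERROR_LINE.search(line):
        return {"kind": "error", "module": m.group(1), "detail": m.group(2)}
    return None

print(scan_line("[Health] Health score: 42 (HIGH) - Vibration variance elevated"))
print(scan_line("[Retrain] ERROR fetching data: unauthorized access"))
```

Pipe `docker-compose logs backend` through a script built on `scan_line` to surface only health and error events.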

Performance Metrics

Real-Time Metrics Dashboard

The frontend Dashboard tab displays live performance indicators:

  • Health Score: 0-100 scale derived from the Degradation Index (DI)
  • Anomaly Score: 0.0-1.0 probability from the Isolation Forest
  • Damage Rate: DI increase per hour (% degradation/hour)
  • RUL (Remaining Useful Life): projected hours until maintenance is required

Historical Metrics Query

Retrieve time-series metrics via API:
curl -X GET "http://localhost:8000/integration/sensor-history?asset_id=Motor-01&seconds=3600" \
  | jq '.[] | {timestamp, health_score, anomaly_score}'
Response:
[
  {
    "timestamp": "2026-03-02T14:00:00Z",
    "health_score": 95,
    "anomaly_score": 0.08
  },
  {
    "timestamp": "2026-03-02T14:01:00Z",
    "health_score": 93,
    "anomaly_score": 0.12
  }
]
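The returned series makes it straightforward to estimate a degradation trend. A sketch over the sample payload above; `health_trend_per_hour` is an illustrative helper, not an API endpoint:

```python
from datetime import datetime

def health_trend_per_hour(samples: list) -> float:
    """Health-score change per hour across the window (negative = degrading)."""
    first, last = samples[0], samples[-1]
    t0 = datetime.fromisoformat(first["timestamp"].replace("Z", "+00:00"))
    t1 = datetime.fromisoformat(last["timestamp"].replace("Z", "+00:00"))
    seconds = (t1 - t0).total_seconds()
    return (last["health_score"] - first["health_score"]) * 3600 / seconds

# Sample response from the docs above: two points one minute apart.
history = [
    {"timestamp": "2026-03-02T14:00:00Z", "health_score": 95, "anomaly_score": 0.08},
    {"timestamp": "2026-03-02T14:01:00Z", "health_score": 93, "anomaly_score": 0.12},
]
print(health_trend_per_hour(history))  # -120.0
```

A steeply negative trend over a full hour of data is a stronger maintenance signal than any single low reading.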

Baseline Metrics

Check calibrated baseline targets:
curl "http://localhost:8000/integration/health-status?asset_id=Motor-01" \
  | jq '.baseline_targets'
Response:
{
  "voltage_v": 230.0,
  "current_a": 15.0,
  "power_factor": 0.92,
  "vibration_g": 0.15
}
These values are displayed on the frontend Status Cards for instant comparison with live readings.
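The same comparison can be automated. A sketch that flags channels drifting beyond a tolerance; the 10% threshold and the `live` readings are illustrative assumptions, and `deviations` is not part of the API:

```python
def deviations(live: dict, baseline: dict, tolerance: float = 0.10) -> dict:
    """Fractional drift of each live reading from its calibrated baseline target,
    keeping only channels that exceed `tolerance` (10% by default)."""
    out = {}
    for key, target in baseline.items():
        if key in live and target:
            drift = (live[key] - target) / target
            if abs(drift) > tolerance:
                out[key] = round(drift, 3)
    return out

# Baseline targets from the sample response above; live readings are hypothetical.
baseline = {"voltage_v": 230.0, "current_a": 15.0,
            "power_factor": 0.92, "vibration_g": 0.15}
live = {"voltage_v": 228.0, "current_a": 15.2,
        "power_factor": 0.91, "vibration_g": 0.21}
print(deviations(live, baseline))  # {'vibration_g': 0.4}
```

Here only vibration drifts past 10% (0.21g vs. a 0.15g target, i.e. +40%), which matches the high-vibration anomaly pattern shown in the logs.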

Deployment Status Monitoring

Render (Backend) Deployment Status

The system is deployed on Render Free Tier, which has specific monitoring considerations:

Cold Start Behavior

Render free-tier services spin down after 15 minutes of inactivity. The first request after spin-down takes 30-60 seconds to respond.
Cold Start Detection:
  • Frontend shows STATUS: OFFLINE badge
  • First API request returns HTTP 503 or timeout
  • Render logs show: Starting service...
Cold Start Mitigation: The frontend automatically sends a keep-alive heartbeat:
// Keep-alive ping every 10 minutes
setInterval(() => {
  fetch(`${API_URL}/ping`).catch(() => {});
}, 10 * 60 * 1000);
For production deployments, upgrade to Render Starter ($7/month) to eliminate cold starts.

Monitoring Render Health

  1. Via Dashboard:
    • Open Render Dashboard
    • Check service status (green = running, gray = spun down)
    • Review Events tab for deploy/restart logs
  2. Via API:
    curl -H "Authorization: Bearer $RENDER_API_KEY" \
      https://api.render.com/v1/services/$SERVICE_ID
    
  3. Via Application:
    • Frontend STATUS: LIVE badge (green = healthy, red = offline)
    • Check /health endpoint response time (<500ms = warm, >2s = cold start)
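Those response-time thresholds can drive an automated cold-start probe. A sketch; the cutoffs come from the bullet above, and `measure` is a plain wall-clock timer rather than a Render API call:

```python
import time
from urllib.request import urlopen

def classify_latency(seconds: float) -> str:
    """Rough warm/cold classification using the thresholds above."""
    if seconds > 2.0:
        return "cold"          # service likely spun down and is restarting
    if seconds < 0.5:
        return "warm"
    return "indeterminate"

def measure(url: str) -> float:
    """Wall-clock time of one GET; network errors propagate to the caller."""
    start = time.monotonic()
    with urlopen(url, timeout=90):  # generous timeout to survive a cold start
        pass
    return time.monotonic() - start

print(classify_latency(0.2))   # warm
print(classify_latency(35.0))  # cold
```

For example, `classify_latency(measure("http://localhost:8000/health"))` returning "cold" suggests the Render instance just spun up.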

Vercel (Frontend) Deployment Status

The frontend is deployed on Vercel Free Tier.

Monitoring Vercel Health

  1. Via Dashboard:
    • Open Vercel Dashboard
    • Check Deployments tab for build status
    • Review Analytics for traffic and error rates
  2. Via Application:
    • The Vercel Edge Network has no cold starts; the frontend loads instantly worldwide.

InfluxDB Cloud Status

Monitor InfluxDB connectivity:
curl -X GET "$INFLUX_URL/health" \
  -H "Authorization: Token $INFLUX_TOKEN"
Response:
{
  "name": "influxdb",
  "message": "ready for queries and writes",
  "status": "pass",
  "checks": [],
  "version": "v2.7.1",
  "commit": "..."
}
Interpreting the response:
  • status "pass": InfluxDB is healthy and accepting requests.
  • Status other than "pass", or no response: the database is down. Check the InfluxDB Status Page.
  • Unauthorized error: invalid token. Verify the INFLUX_TOKEN environment variable.

Alerting and Notifications

Event Webhook Integration

The /integration/events endpoint streams state transitions:
curl -X GET "http://localhost:8000/integration/events?asset_id=Motor-01&limit=10"
Response:
[
  {
    "event_type": "ANOMALY_DETECTED",
    "timestamp": "2026-03-02T14:30:00Z",
    "asset_id": "Motor-01",
    "message": "High vibration variance (mechanical jitter): σ=0.17g",
    "severity": "HIGH",
    "health_score": 42
  }
]

Setting Up Alerts

1. Choose a Notification Service

Options: Slack, PagerDuty, email (via SendGrid), SMS (via Twilio)
2. Poll the Events Endpoint

Set up a cron job or serverless function to check for new events:
import requests

response = requests.get(
    "http://your-backend.com/integration/events",
    params={"asset_id": "Motor-01", "limit": 5},
    timeout=10,
)
response.raise_for_status()  # surface HTTP errors instead of parsing bad JSON

for event in response.json():
    if event["severity"] in ("HIGH", "CRITICAL"):
        send_alert(event)  # Your notification logic
3. Filter by Severity

Only alert on actionable events:
  • CRITICAL: Immediate action required (health < 25)
  • HIGH: Schedule maintenance within 24-48 hours (health < 50)
  • MODERATE: Monitor closely (health < 75)
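The tiers map directly onto health-score cutoffs, which keeps the filter logic trivial. A sketch; `severity_for` is an illustrative helper mirroring the thresholds listed above:

```python
def severity_for(health_score: int):
    """Map a 0-100 health score to the alert tiers listed above."""
    if health_score < 25:
        return "CRITICAL"   # immediate action required
    if health_score < 50:
        return "HIGH"       # schedule maintenance within 24-48 hours
    if health_score < 75:
        return "MODERATE"   # monitor closely
    return None             # healthy range: no alert needed

print(severity_for(42))  # HIGH
print(severity_for(90))  # None
```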
For production, consider using InfluxDB alerting rules to trigger notifications directly from the time-series database.

Performance Benchmarks

Expected Response Times

Endpoint                      Cold Start   Warm
/ping                         100ms        50ms
/health                       500ms        200ms
/system/state                 300ms        100ms
/integration/health-status    800ms        400ms
/integration/sensor-history   2s           1s
If response times exceed 2x these values, investigate:
  • InfluxDB query performance (check for missing indexes)
  • Render instance CPU/memory limits (check Render metrics)
  • Network latency (run ping to backend domain)

Throughput Limits

Render Free Tier:
  • 512 MB RAM
  • 0.1 CPU core
  • Limit: ~10 requests/second
InfluxDB Cloud Free Tier:
  • 10,000 writes/5 minutes
  • Limit: ~33 writes/second
At 100Hz ingestion for 1 asset, you send 100 writes/second. This exceeds the InfluxDB free tier. Use batch writes (100 points per API call) to stay within limits.
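The batching strategy amounts to buffering points and flushing every 100. A minimal sketch; production code would typically use an InfluxDB client library's built-in batching, and `send` here stands in for a POST of line-protocol text to InfluxDB's write API:

```python
class BatchWriter:
    """Buffer line-protocol points and flush them in batches of `batch_size`.
    `send` is any callable that ships one newline-joined batch downstream."""

    def __init__(self, send, batch_size: int = 100):
        self.send = send
        self.batch_size = batch_size
        self.buffer = []

    def add(self, point: str):
        self.buffer.append(point)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Send whatever is buffered, including a final partial batch."""
        if self.buffer:
            self.send("\n".join(self.buffer))
            self.buffer = []

# At 100Hz, batches of 100 turn 100 writes/second into ~1 API call/second.
batches = []
writer = BatchWriter(batches.append, batch_size=100)
for i in range(250):
    writer.add(f"sensor,asset=Motor-01 vibration_g=0.15 {i}")
writer.flush()
print(len(batches))  # 3 batches: 100 + 100 + 50 points
```

Injecting `send` as a callable keeps the buffering logic testable without a live database.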

Monitoring Checklist

1. Daily: Check Health Dashboard

  • Verify STATUS: LIVE badge is green
  • Scan health scores for all assets
  • Review recent operator logs
2. Weekly: Review Logs

  • Check for InfluxDB connection errors
  • Verify model loading logs on restarts
  • Investigate any ERROR or WARNING entries
3. Monthly: Performance Audit

  • Run benchmark script to validate model accuracy
  • Review false positive/negative rates
  • Check Render/Vercel deployment metrics
4. Quarterly: Model Retraining

  • Retrain models with fresh healthy data
  • Update baselines if operating conditions changed
  • Document retraining events

  • Troubleshooting: Common issues and solutions
  • Model Retraining: How to retrain ML models
  • Health Assessment: Understanding health scores
  • API Reference: Complete API documentation
