Overview
Monitoring the Predictive Maintenance System ensures reliable anomaly detection and early identification of infrastructure issues. This guide covers health checks, log watching, performance metrics, and cloud deployment monitoring.
Health Checks
Backend Health Endpoint
The /health endpoint provides a quick system status check:
| Field | Description |
|---|---|
| status | healthy or degraded |
| db_connected | InfluxDB connection status |
| timestamp | Server time (UTC) |
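As a rough illustration of how the three fields in the table might be consumed, here is a minimal Python sketch. The base URL, the assumption that `db_connected` is a JSON boolean, and the helper names `summarize_health` and `fetch_health` are all hypothetical and not taken from the backend itself.

```python
import json
from urllib.request import urlopen

# Assumption: replace with the real backend URL for your deployment.
BASE_URL = "http://localhost:8000"


def summarize_health(payload: dict) -> str:
    """Turn a /health response into a one-line verdict.

    Assumes `status` is "healthy" or "degraded" and that
    `db_connected` is a boolean, as the field table suggests.
    """
    status = payload.get("status", "unknown")
    db_connected = payload.get("db_connected")
    timestamp = payload.get("timestamp", "n/a")
    if status == "healthy" and db_connected is True:
        verdict = "OK"
    elif db_connected is False:
        verdict = "DB DOWN"
    else:
        verdict = "CHECK"
    return f"{verdict}: status={status} db_connected={db_connected} timestamp={timestamp}"


def fetch_health(base_url: str = BASE_URL, timeout: float = 10.0) -> dict:
    """Fetch and decode the /health response (performs a network call)."""
    with urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return json.load(resp)


# Example (requires a running backend):
# print(summarize_health(fetch_health()))
```

Keeping the parsing separate from the network call makes the verdict logic easy to unit test without a live backend.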
Keep-Alive Ping
The /ping endpoint is a lightweight health check (no database queries):
The frontend sends a heartbeat to /ping every 10 minutes to prevent Render free-tier cold starts (see Render Deployment below).
System State Monitoring
Check Current State
The /system/state endpoint shows the current operational mode:
System States
IDLE
No active monitoring. System waiting for calibration or data ingestion.
Action: Start calibration or monitoring.
CALIBRATING
Collecting healthy baseline data. Models not yet trained.
Action: Wait for calibration to complete (typically 5-10 minutes).
MONITORING
Active anomaly detection. Models trained and scoring incoming data.
Action: Normal operations. Monitor health scores.
FAULT_INJECTED
Fault simulation active. Anomalies are intentional.
Action: Demo/testing mode. Clear fault to resume normal monitoring.
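The state-to-action mapping above can be encoded directly, for example in an operator script. The sketch below is only illustrative: it takes the state name as a plain string, because the exact JSON shape returned by /system/state is not shown here, and the names `SystemState`, `RECOMMENDED_ACTION`, and `action_for` are hypothetical.

```python
from enum import Enum


class SystemState(Enum):
    IDLE = "IDLE"
    CALIBRATING = "CALIBRATING"
    MONITORING = "MONITORING"
    FAULT_INJECTED = "FAULT_INJECTED"


# Recommended operator actions, copied from the state descriptions above.
RECOMMENDED_ACTION = {
    SystemState.IDLE: "Start calibration or monitoring.",
    SystemState.CALIBRATING: "Wait for calibration to complete (typically 5-10 minutes).",
    SystemState.MONITORING: "Normal operations. Monitor health scores.",
    SystemState.FAULT_INJECTED: "Demo/testing mode. Clear fault to resume normal monitoring.",
}


def action_for(state_name: str) -> str:
    """Return the recommended action for a state name reported by the backend."""
    try:
        state = SystemState(state_name.strip().upper())
    except ValueError:
        return f"Unknown state {state_name!r}; check the backend logs."
    return RECOMMENDED_ACTION[state]


print(action_for("CALIBRATING"))
```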
Log Watching
Backend Logs
- Docker
- Systemd (Linux)
- Render (Cloud)
Key Log Patterns
[Calibration] Baseline built successfully
Meaning: Model training completed.
Action: System ready for monitoring.
[Health] Health score: X (RISK_LEVEL)
Meaning: Periodic health assessment logged.
Action: If risk is HIGH or CRITICAL, investigate explanations.
ERROR fetching data
Meaning: InfluxDB query failed.
Action: Check InfluxDB credentials and network connectivity (see Troubleshooting).
Model loaded from backend/models/...
Meaning: Pre-trained model restored on startup.
Action: Normal. Model persists across restarts.
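To scan logs for these patterns automatically, something like the following Python sketch could work. The regular expressions are built only from the pattern headings above. The exact log line format (timestamps, prefixes, whether the score is an integer or a decimal) is an assumption, and `classify` is a hypothetical helper name.

```python
import re
from typing import Optional

# Patterns derived from the headings above; the real log format may differ.
PATTERNS = [
    ("calibration_done", re.compile(r"\[Calibration\] Baseline built successfully")),
    ("health_score", re.compile(
        r"\[Health\] Health score: (?P<score>\d+(?:\.\d+)?) \((?P<risk>[A-Z_]+)\)")),
    ("influx_error", re.compile(r"ERROR fetching data")),
    ("model_loaded", re.compile(r"Model loaded from (?P<path>\S+)")),
]


def classify(line: str) -> Optional[dict]:
    """Return the matched event and captured fields, or None if no pattern matches."""
    for name, pattern in PATTERNS:
        match = pattern.search(line)
        if match:
            event = {"event": name, **match.groupdict()}
            if name == "health_score":
                # HIGH and CRITICAL are the levels the guide says to investigate.
                event["needs_attention"] = event["risk"] in {"HIGH", "CRITICAL"}
            return event
    return None


print(classify("[Health] Health score: 42 (HIGH)"))
```

Lines piped from `docker logs` or `journalctl` can be passed through `classify` one at a time, and anything with `needs_attention` set or an `influx_error` event can be surfaced to the operator.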
Performance Metrics
Real-Time Metrics Dashboard
The frontend Dashboard tab displays live performance indicators:
Health Score
0-100 scale derived from Degradation Index (DI)
Anomaly Score
0.0-1.0 probability from Isolation Forest
Damage Rate
DI increase per hour (% degradation/hour)
RUL (Remaining Useful Life)
Projected hours until maintenance required
Historical Metrics Query
Retrieve time-series metrics via API:
Baseline Metrics
Check calibrated baseline targets:
Deployment Status Monitoring
Render (Backend) Deployment Status
The system is deployed on Render Free Tier, which has specific monitoring considerations:
Cold Start Behavior
Cold Start Detection:
- Frontend shows STATUS: OFFLINE badge
- First API request returns HTTP 503 or timeout
- Render logs show: Starting service...
Monitoring Render Health
- Via Dashboard:
  - Open Render Dashboard
  - Check service status (green = running, gray = spun down)
  - Review Events tab for deploy/restart logs
- Via API:
- Via Application:
  - Frontend STATUS: LIVE badge (green = healthy, red = offline)
  - Check /health endpoint response time (<500ms = warm, >2s = cold start)
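The response-time heuristic for /health can be scripted. This Python sketch reads the warm threshold as "under 500 ms" and the cold-start threshold as "over 2 s", and treats anything in between as indeterminate; that middle band is an assumption, since the guide does not say how to interpret it. The names `classify_latency` and `time_health` are hypothetical.

```python
import time
from urllib.request import urlopen

WARM_MAX_S = 0.5   # under 500 ms: service is warm
COLD_MIN_S = 2.0   # over 2 s: likely a cold start


def classify_latency(seconds: float) -> str:
    """Classify a /health round-trip time using the thresholds above."""
    if seconds < WARM_MAX_S:
        return "warm"
    if seconds > COLD_MIN_S:
        return "cold start"
    return "indeterminate"


def time_health(base_url: str, timeout: float = 60.0) -> float:
    """Measure one /health round trip in seconds (performs a network call)."""
    start = time.perf_counter()
    with urlopen(f"{base_url}/health", timeout=timeout) as resp:
        resp.read()
    return time.perf_counter() - start


# Example (requires a reachable backend; the URL is a placeholder):
# print(classify_latency(time_health("https://your-backend.example.com")))
```

A generous timeout is used because a request that arrives during a cold start may take much longer than a warm one.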
Vercel (Frontend) Deployment Status
The frontend is deployed on Vercel Free Tier.
Monitoring Vercel Health
- Via Dashboard:
  - Open Vercel Dashboard
  - Check Deployments tab for build status
  - Review Analytics for traffic and error rates
- Via Application:
  - Open https://predictive-maintenance-ten.vercel.app/
  - Check browser console for errors
  - Verify STATUS: LIVE badge appears within 5 seconds
Vercel Edge Network has no cold starts. The frontend loads instantly worldwide.
InfluxDB Cloud Status
Monitor InfluxDB connectivity:
status: pass
InfluxDB is healthy and accepting requests.
status: fail
Database is down. Check InfluxDB Status Page.
401 Unauthorized
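A small helper can map these three outcomes to operator messages. In the Python sketch below, the function name `interpret_influx_check` is hypothetical, and the message for 401 relies only on the general HTTP meaning of that status code (the request's credentials were rejected), not on anything specific to this deployment.

```python
def interpret_influx_check(http_status: int, body: dict) -> str:
    """Map an InfluxDB connectivity check result to a short operator message."""
    if http_status == 401:
        # Generic HTTP semantics: the supplied credentials were not accepted.
        return "unauthorized: verify the InfluxDB token and organization settings"
    status = body.get("status")
    if status == "pass":
        return "healthy: InfluxDB is accepting requests"
    if status == "fail":
        return "down: check the InfluxDB Status Page"
    return f"unknown: unexpected response (HTTP {http_status})"


print(interpret_influx_check(200, {"status": "pass"}))
```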
Alerting and Notifications
Event Webhook Integration
The /integration/events endpoint streams state transitions:
Setting Up Alerts
Performance Benchmarks
Expected Response Times
| Endpoint | Cold Start | Warm |
|---|---|---|
| /ping | 100ms | 50ms |
| /health | 500ms | 200ms |
| /system/state | 300ms | 100ms |
| /integration/health-status | 800ms | 400ms |
| /integration/sensor-history | 2s | 1s |
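The table can double as a set of latency budgets for a simple regression check. The following Python sketch hard-codes the table values in milliseconds. The `tolerance` multiplier and the `check_latency` helper are assumptions added for illustration, since the guide does not define how much deviation is acceptable.

```python
# Expected response times from the table above, in milliseconds.
EXPECTED_MS = {
    "/ping": {"cold": 100, "warm": 50},
    "/health": {"cold": 500, "warm": 200},
    "/system/state": {"cold": 300, "warm": 100},
    "/integration/health-status": {"cold": 800, "warm": 400},
    "/integration/sensor-history": {"cold": 2000, "warm": 1000},
}


def check_latency(endpoint: str, measured_ms: float,
                  cold: bool = False, tolerance: float = 1.5) -> bool:
    """Return True if measured_ms is within tolerance x the expected time.

    `tolerance` is a hypothetical slack factor; tune it to your environment.
    Raises KeyError for endpoints that are not in the table.
    """
    budget = EXPECTED_MS[endpoint]["cold" if cold else "warm"]
    return measured_ms <= budget * tolerance


print(check_latency("/health", 250))   # within 1.5 x 200 ms
print(check_latency("/health", 400))   # exceeds the warm budget
```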
Throughput Limits
Render Free Tier:
- 512 MB RAM
- 0.1 CPU core
- Limit: ~10 requests/second
InfluxDB Cloud Free Tier:
- 10,000 writes/5 minutes
- Limit: ~33 writes/second
At 100Hz ingestion for 1 asset, you send 100 writes/second. This exceeds the InfluxDB free tier. Use batch writes (100 points per API call) to stay within limits.
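The batching itself is simple to sketch. The Python helper below groups a stream of points into lists of at most 100, so that each list can be sent as one write call. How the points are serialized and sent is left out, and `batched` is a hypothetical helper name rather than part of any client library.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(points: Iterable[T], size: int = 100) -> Iterator[List[T]]:
    """Yield successive lists of up to `size` points from `points`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[T] = []
    for point in points:
        batch.append(point)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Flush the final, possibly smaller, batch.
        yield batch


# One second of 100 Hz samples for a single asset becomes one write call.
samples = list(range(100))
print(sum(1 for _ in batched(samples)))
```

Taking the limit above as counting write calls, 100 Hz with 100 points per call works out to 1 call per second, or 300 calls per 5 minutes, well under the 10,000 limit.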
Monitoring Checklist
Daily: Check Health Dashboard
- Verify STATUS: LIVE badge is green
- Scan health scores for all assets
- Review recent operator logs
Weekly: Review Logs
- Check for InfluxDB connection errors
- Verify model loading logs on restarts
- Investigate any ERROR or WARNING entries
Monthly: Performance Audit
- Run benchmark script to validate model accuracy
- Review false positive/negative rates
- Check Render/Vercel deployment metrics
Related Resources
Troubleshooting
Common issues and solutions
Model Retraining
How to retrain ML models
Health Assessment
Understanding health scores
API Reference
Complete API documentation