Overview
The Predictive Maintenance System uses Isolation Forest models to detect anomalies. Over time, as equipment behavior changes or new operating conditions emerge, you may need to retrain models to maintain accuracy.

When to Retrain
Retrain your models when you observe:

False Positive Rate Increases
If the system frequently flags normal operations as anomalies (>5% of healthy periods), the baseline may be outdated.

Solution: Retrain with fresh healthy data to recalibrate the normal operating envelope.
Missed Anomalies (False Negatives)
If known faults are not detected, the model may not have seen similar patterns during training.

Solution: Expand training data to include diverse operating conditions.
Operating Condition Changes
After equipment upgrades, load profile changes, or seasonal variations.

Example: A motor running at higher RPM after a gearbox replacement needs a new baseline.
Scheduled Retraining
Industry best practice: retrain every 30-90 days to prevent model drift.
Batch Model Retraining
The system uses a 16-feature batch model (v3) as the primary detector. It extracts statistical features from 100Hz raw sensor data.

Using the Retraining Script
The `retrain_batch_model.py` script fetches raw 100Hz data from InfluxDB, extracts batch features, and trains a new model.
Basic Usage
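The exact entry point depends on your checkout, but based on the script name and the defaults in the options table, a typical invocation looks like this (the script path is an assumption):

```shell
# Retrain the batch model for Motor-01 using the defaults
# (300 seconds of history, 100-point windows, saved to backend/models)
python retrain_batch_model.py --asset Motor-01
```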
Command Options
| Parameter | Default | Description |
|---|---|---|
| `--asset` | `Motor-01` | Asset ID to retrain |
| `--seconds` | `300` | Seconds of historical data to use |
| `--window` | `100` | Points per window (100Hz = 1 second) |
| `--save-dir` | `backend/models` | Directory to save the model |
Example: Retrain with 10 minutes of data
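Ten minutes is 600 seconds, so the run might look like this (script path assumed, as above):

```shell
# 10 minutes of history = 600 seconds at 100Hz (~60,000 raw points)
python retrain_batch_model.py --asset Motor-01 --seconds 600
```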
Programmatic Retraining
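As a sketch of what programmatic use might look like; the import path, function name, and signature here are assumptions based on the script name, not the project's confirmed API:

```python
# Hypothetical import; check retrain_batch_model.py for the real entry point.
from retrain_batch_model import retrain

model_path = retrain(
    asset_id="Motor-01",
    seconds=600,                # history window to train on
    save_dir="backend/models",  # where the versioned model file lands
)
print(f"New model saved to {model_path}")
```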
You can also import and call the retraining function from your own scripts.

Model Versioning
The system saves models with a version tag in the filename.

Version History
| Version | Features | Input | F1 Score | Notes |
|---|---|---|---|---|
| v1 | 4 | 1Hz raw signals | 62% | Legacy, deprecated |
| v2 | 6 | 1Hz derived features | 78% | Legacy fallback |
| v3 | 16 | 100Hz batch statistics | 99.6% | Current primary |
The v3 batch model detects jitter faults (normal means, abnormal variance) that v2 cannot detect.
Manual Version Management
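A timestamped copy of the current model file keeps history across retrains; the filename below is illustrative only, so adjust it to your `--save-dir` and version tag:

```shell
# Keep a dated copy of the current v3 model before overwriting it
cp backend/models/batch_model_v3.pkl \
   backend/models/batch_model_v3.$(date +%Y%m%d).bak.pkl
```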
To preserve model history, back up the current model file before retraining.

Performance Benchmarking
After retraining, validate model performance using the benchmark script.

Running the Benchmark
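This page does not name the benchmark script, so treat the invocation below as a placeholder for whatever script your checkout provides:

```shell
# Run the post-retraining benchmark (script name assumed)
python benchmark_model.py --asset Motor-01
```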
Benchmark Output
The script generates synthetic healthy and faulty data, then computes anomaly-score statistics for each set.

Success Criteria
Healthy Mean Score < 0.15
Average anomaly score for healthy data should be low to minimize false alarms.
Feature Importance
The batch model uses 16 statistical features extracted from 1-second windows:

| Signal | Features |
|---|---|
| Voltage | mean, std, peak_to_peak, rms |
| Current | mean, std, peak_to_peak, rms |
| Power Factor | mean, std, peak_to_peak, rms |
| Vibration | mean, std, peak_to_peak, rms |
Why These Features Matter
| Feature | What it captures |
|---|---|
| `mean` | Average level (e.g., voltage drift) |
| `std` (standard deviation) | Jitter/instability (e.g., erratic vibration) |
| `peak_to_peak` | Transient spikes (e.g., voltage surges) |
| `rms` (root mean square) | Signal energy (e.g., vibration intensity) |
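To make the 4 signals x 4 statistics layout concrete, here is a self-contained sketch of per-window extraction; it mirrors the tables above but is not the project's actual `extract_batch_features` implementation:

```python
import numpy as np

def window_features(window: dict) -> list:
    """Compute mean/std/peak-to-peak/RMS for each signal in a 1-second window.

    `window` maps signal name -> array of 100 samples (100Hz = 1 second).
    Illustrative sketch only; the project's real extractor may differ.
    """
    feats = []
    for name in ("voltage", "current", "power_factor", "vibration"):
        x = np.asarray(window[name], dtype=float)
        feats += [
            x.mean(),                 # average level (e.g., voltage drift)
            x.std(),                  # jitter/instability
            x.max() - x.min(),        # peak_to_peak: transient spikes
            np.sqrt(np.mean(x ** 2)), # rms: signal energy
        ]
    return feats  # 4 signals x 4 statistics = 16 features
```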
Retraining Workflow
Collect Healthy Data
Run the system in monitoring mode for at least 5 minutes during normal operations. Ensure no faults are injected.
Verify Data Quality
Check that InfluxDB has sufficient raw 100Hz points.

Target: At least 30,000 points (300 seconds × 100Hz).
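In the InfluxDB UI or the `influx` CLI, a Flux query along these lines can confirm the point volume; the bucket, measurement, tag, and field names here are assumptions for illustration:

```flux
from(bucket: "sensors")
  |> range(start: -300s)
  |> filter(fn: (r) => r._measurement == "raw" and r.asset == "Motor-01")
  |> filter(fn: (r) => r._field == "vibration")
  |> count()
```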
Troubleshooting
Error: Insufficient raw data
Symptom: The retraining script fails with the `Insufficient raw data` error.

Cause: InfluxDB query returned fewer points than required.

Solution:
- Increase the `--seconds` parameter
- Verify the data generator is running at 100Hz
- Check that the InfluxDB retention policy hasn't deleted old data
Error: Only X valid feature windows. Need >= 10 for training
Symptom: The script fails with `Only X valid feature windows. Need >= 10 for training`.

Cause: Most windows had invalid/NaN values due to cold-start or missing fields.

Solution:
- Use more recent data (cold-start windows have NaN features)
- Verify all 4 sensor fields (voltage, current, power_factor, vibration) are present
Model overfits: High train accuracy, low test accuracy
Symptom: Benchmark shows 100% accuracy but production has many false positives.

Cause: Training data doesn't represent the full operating diversity.

Solution:
- Increase training duration to 10-30 minutes
- Include data from different load conditions (startup, steady-state, shutdown)
- Adjust the Isolation Forest `contamination` parameter (default: 0.05)
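For the `contamination` adjustment, the sketch below shows the knob on scikit-learn's `IsolationForest` with synthetic stand-in features; it illustrates the parameter's effect, not the project's actual training code:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 16))  # synthetic stand-in for the 16 batch features

# contamination is the expected fraction of anomalies in the training data;
# it sets the score threshold, so raising it flags more windows as anomalous.
model = IsolationForest(contamination=0.05, random_state=42).fit(X)
labels = model.predict(X)   # +1 = normal, -1 = anomaly
flagged = (labels == -1).mean()
```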
Advanced: Custom Feature Engineering
To add custom features to the batch model:

1. Edit `backend/ml/batch_features.py`
2. Add your feature to `BATCH_FEATURE_NAMES`
3. Implement the extraction logic in `extract_batch_features()`
4. Retrain the model to incorporate the new feature
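The steps above might look like the following sketch; the module contents shown here (existing names, window format) are assumptions about `backend/ml/batch_features.py`, with a hypothetical crest-factor feature as the addition:

```python
import numpy as np

# Step 2: register the new feature name (existing names abbreviated here).
BATCH_FEATURE_NAMES = ["voltage_mean", "voltage_std", "voltage_crest_factor"]

# Step 3: implement the extraction logic alongside the existing features.
def extract_batch_features(window):
    x = np.asarray(window["voltage"], dtype=float)
    rms = np.sqrt(np.mean(x ** 2))
    return {
        "voltage_mean": x.mean(),
        "voltage_std": x.std(),
        # New feature: ratio of peak magnitude to RMS (spikiness of the signal).
        "voltage_crest_factor": np.max(np.abs(x)) / rms,
    }
```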
Best Practices
Use Recent Data
Train on data from the last 7-30 days. Older data may not reflect current operating conditions.
Validate Before Deploy
Always run benchmarks before deploying retrained models to production.
Document Retraining Events
Log why and when you retrained (e.g., “Retrained after gearbox replacement on 2026-03-02”).
Monitor Post-Deployment
Watch false positive rates for 24-48 hours after deploying a new model.
Related Resources
- Baseline Training: learn how baseline profiles are built
- Feature Engineering: deep dive into the 16 batch features
- Dual Model Architecture: understand v2 vs v3 model differences
- Monitoring: monitor model performance in production