
Overview

The Predictive Maintenance System uses Isolation Forest models to detect anomalies. Over time, as equipment behavior changes or new operating conditions emerge, you may need to retrain models to maintain accuracy.
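As a quick illustration of the underlying technique (a minimal scikit-learn sketch, not the system's actual detector), an Isolation Forest fit on healthy feature vectors flags points that fall outside the learned envelope:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Healthy baseline: 500 two-feature samples around normal operating values
healthy = rng.normal(loc=[230.0, 5.0], scale=[1.0, 0.2], size=(500, 2))

# Fit on healthy data only -- the forest learns the normal envelope
model = IsolationForest(contamination=0.05, random_state=42)
model.fit(healthy)

normal_point = np.array([[230.0, 5.0]])
drifted_point = np.array([[210.0, 8.0]])  # e.g., voltage sag plus overcurrent

print(model.predict(normal_point))   # 1 = inlier
print(model.predict(drifted_point))  # -1 = anomaly
```

Because the forest is trained only on healthy samples, anything it has not seen isolates quickly and scores as anomalous, which is why the quality of the healthy baseline matters so much.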

When to Retrain

Retrain your models when you observe:
  • Frequent false positives: the system flags normal operations as anomalies in more than 5% of healthy periods, suggesting the baseline is outdated. Solution: retrain with fresh healthy data to recalibrate the normal operating envelope.
  • Missed known faults: faults you know occurred go undetected, likely because the model never saw similar patterns during training. Solution: expand the training data to include more diverse operating conditions.
  • Operational changes: equipment upgrades, load profile changes, or seasonal variations. Example: a motor running at higher RPM after a gearbox replacement needs a new baseline.
Industry best practice: retrain every 30-90 days to prevent model drift.
Only retrain models using healthy data. Including faulty periods will corrupt the baseline and cause the model to accept abnormal behavior as normal.

Batch Model Retraining

The system uses a 16-feature batch model (v3) as the primary detector. It extracts statistical features from 100Hz raw sensor data.

Using the Retraining Script

The retrain_batch_model.py script fetches raw 100Hz data from InfluxDB, extracts batch features, and trains a new model.

Basic Usage

python -m scripts.retrain_batch_model --asset Motor-01 --seconds 300

Command Options

| Parameter | Default | Description |
|---|---|---|
| --asset | Motor-01 | Asset ID to retrain |
| --seconds | 300 | Seconds of historical data to use |
| --window | 100 | Points per window (100Hz = 1 second) |
| --save-dir | backend/models | Directory to save model |

Example: Retrain with 10 minutes of data

python -m scripts.retrain_batch_model \
  --asset Motor-01 \
  --seconds 600 \
  --save-dir backend/models
Use at least 300 seconds (5 minutes) of healthy data to ensure sufficient training samples (~300 windows).
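The window arithmetic is easy to sanity-check before running the script (assuming 100Hz sampling and non-overlapping windows, as the defaults imply):

```python
# Back-of-envelope check: raw samples and training windows for a capture,
# assuming 100Hz sampling and non-overlapping 100-point windows.
def expected_windows(seconds: int, rate_hz: int = 100, window: int = 100) -> int:
    return (seconds * rate_hz) // window

print(expected_windows(300))  # 300 windows from 5 minutes of data
print(expected_windows(600))  # 600 windows from 10 minutes of data
```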

Programmatic Retraining

You can also import and call the retraining function from your own scripts:
from scripts.retrain_batch_model import retrain_batch_model

# Retrain and get the detector instance
detector = retrain_batch_model(
    asset_id="Motor-01",
    range_seconds=600,
    window_size=100,
    save_dir="backend/models"
)

print(f"Model trained on {detector._training_sample_count} windows")

Model Versioning

The system saves models with a version tag in the filename:
backend/models/Motor-01_batch_detector_v3.pkl

Version History

| Version | Features | Input | F1 Score | Notes |
|---|---|---|---|---|
| v1 | 4 | 1Hz raw signals | 62% | Legacy, deprecated |
| v2 | 6 | 1Hz derived features | 78% | Legacy fallback |
| v3 | 16 | 100Hz batch statistics | 99.6% | Current primary |
The v3 batch model detects jitter faults (normal means, abnormal variance) that v2 cannot detect.
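A toy example of why variance-based features matter: two windows with identical mean levels but different jitter are indistinguishable by their means alone, while the std feature separates them clearly (illustrative NumPy sketch, not the system's extractor):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two 1-second (100-point) vibration windows with the same mean level:
steady = rng.normal(loc=2.0, scale=0.05, size=100)  # healthy
jitter = rng.normal(loc=2.0, scale=0.50, size=100)  # jitter fault

# A mean feature alone barely changes between the two windows...
print(round(steady.mean(), 2), round(jitter.mean(), 2))

# ...but the std feature separates them by roughly an order of magnitude
print(round(steady.std(), 2), round(jitter.std(), 2))
```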

Manual Version Management

To preserve model history, backup before retraining:
cp backend/models/Motor-01_batch_detector_v3.pkl \
   backend/models/Motor-01_batch_detector_v3_backup_2026-03-02.pkl

python -m scripts.retrain_batch_model --asset Motor-01 --seconds 600

Performance Benchmarking

After retraining, validate model performance using the benchmark script.

Running the Benchmark

python -m scripts.benchmark_model

Benchmark Output

The script generates synthetic healthy and faulty data, then computes:
=== RESULTS ===

--- Healthy Data (TARGET: mean < 0.15) ---
   Mean score: 0.082
   Std dev:    0.041

--- Faulty Data (TARGET: mean > 0.6) ---
   Mean score: 0.893
   Std dev:    0.124

--- Classification Metrics (threshold=0.3) ---
   ACCURACY:         98.7%
   PRECISION:        96.2%
   RECALL:           100.0%
   F1-SCORE:         98.0%
   HEALTHY STABILITY: 99.0%

Success Criteria

1. Healthy Stability ≥ 95%

At least 95% of healthy samples should score below threshold (0.3).
2. Precision ≥ 80%

When the model flags an anomaly, it should be correct at least 80% of the time.
3. Healthy Mean Score < 0.15

Average anomaly score for healthy data should be low to minimize false alarms.
4. Score Separation > 0.4

A gap of more than 0.4 between the faulty and healthy mean scores indicates a clear decision boundary.
If Healthy Stability < 95%, the model is too sensitive. Retrain with more diverse healthy data or adjust the contamination parameter.
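The four criteria can be expressed as a simple gate, sketched here with hypothetical metric names (the benchmark script's actual output format may differ):

```python
# Hypothetical gate over benchmark metrics; the dictionary keys here are
# assumptions for illustration, not the benchmark script's actual output.
def passes_criteria(m: dict) -> bool:
    return (
        m["healthy_stability"] >= 0.95                  # criterion 1
        and m["precision"] >= 0.80                      # criterion 2
        and m["healthy_mean"] < 0.15                    # criterion 3
        and m["faulty_mean"] - m["healthy_mean"] > 0.4  # criterion 4
    )

run = {"healthy_stability": 0.99, "precision": 0.962,
       "healthy_mean": 0.082, "faulty_mean": 0.893}
print(passes_criteria(run))  # True -- matches the sample benchmark run above
```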

Feature Importance

The batch model uses 16 statistical features extracted from 1-second windows:
| Signal | Features |
|---|---|
| Voltage | mean, std, peak_to_peak, rms |
| Current | mean, std, peak_to_peak, rms |
| Power Factor | mean, std, peak_to_peak, rms |
| Vibration | mean, std, peak_to_peak, rms |

Why These Features Matter

  • mean: captures the average level (e.g., voltage drift)
  • std (standard deviation): detects jitter/instability (e.g., erratic vibration)
  • peak_to_peak: identifies transient spikes (e.g., voltage surges)
  • rms (root mean square): measures signal energy (e.g., vibration intensity)
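The four statistics are straightforward to compute with NumPy; this is an illustrative sketch, and the actual extractor in backend/ml/batch_features.py may differ in detail:

```python
import numpy as np

# The four per-signal statistics, sketched with NumPy; the real extractor
# in backend/ml/batch_features.py may differ in detail.
def window_features(x: np.ndarray) -> dict:
    return {
        "mean": x.mean(),                   # average level (drift)
        "std": x.std(),                     # jitter / instability
        "peak_to_peak": x.max() - x.min(),  # transient spikes
        "rms": np.sqrt(np.mean(x ** 2)),    # signal energy
    }

window = np.array([229.0, 231.0, 230.0, 228.0, 232.0])
feats = window_features(window)
print(feats["mean"], feats["peak_to_peak"])  # 230.0 4.0
```

Applying this to each of the 4 signals yields the 16-element feature vector the batch model consumes.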

Retraining Workflow

1. Collect Healthy Data

Run the system in monitoring mode for at least 5 minutes during normal operations. Ensure no faults are injected.
# Start monitoring
curl -X POST http://localhost:8000/system/start-monitoring \
  -H "Content-Type: application/json" \
  -d '{"asset_id": "Motor-01", "duration_seconds": 600}'
2. Verify Data Quality

Check that InfluxDB has sufficient raw 100Hz points:
# Query point count
curl -X GET "http://localhost:8000/integration/sensor-history?asset_id=Motor-01&seconds=600" | jq 'length'
Target: at least 30,000 points (the 300-second minimum at 100Hz); a full 600-second capture yields 60,000.
3. Run Retraining Script

python -m scripts.retrain_batch_model --asset Motor-01 --seconds 600
Output:
[Retrain] Querying 600s of raw 100Hz data for Motor-01...
[Retrain] Fetched 60000 raw points
[Retrain] 60000 raw points → ~600 windows of 100
[Retrain] Extracted 600 feature vectors (16 features each)
[Retrain] Model trained successfully on 600 windows
[Retrain] Model saved to backend/models/Motor-01_batch_detector_v3.pkl
✅ Retraining complete. Model version: v3 (batch features)
4. Benchmark New Model

python -m scripts.benchmark_model
Ensure all success criteria pass before deploying to production.
5. Deploy to Production

Restart the backend to load the new model:
# Docker
docker-compose restart backend

# Systemd
sudo systemctl restart predictive-maintenance

Troubleshooting

Symptom:
ValueError: Insufficient raw data: got 50 points, need at least 100 (1 window)
Cause: The InfluxDB query returned fewer points than required.
Solution:
  • Increase --seconds parameter
  • Verify data generator is running at 100Hz
  • Check InfluxDB retention policy hasn’t deleted old data
Symptom:
ValueError: Only 3 valid feature windows. Need >= 10 for training.
Cause: Most windows had invalid/NaN values due to cold-start or missing fields.
Solution:
  • Use more recent data (cold-start windows have NaN features)
  • Verify all 4 sensor fields (voltage, current, power_factor, vibration) are present
Symptom: Benchmark shows 100% accuracy but production has many false positives.
Cause: The training data doesn’t represent the full range of operating conditions.
Solution:
  • Increase training duration to 10-30 minutes
  • Include data from different load conditions (startup, steady-state, shutdown)
  • Adjust Isolation Forest contamination parameter (default: 0.05)
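To see how contamination shifts sensitivity, the sketch below (plain scikit-learn with random stand-in data rather than real feature windows) fits models at several contamination values and reports the fraction of training data flagged:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
healthy = rng.normal(size=(1000, 16))  # stand-in for 16-feature windows

# contamination sets the expected anomaly fraction in the training data,
# which positions the decision threshold; lower values flag less often.
for c in (0.10, 0.05, 0.01):
    model = IsolationForest(contamination=c, random_state=0).fit(healthy)
    flagged = (model.predict(healthy) == -1).mean()
    print(f"contamination={c:.2f} -> {flagged:.1%} of training data flagged")
```

Lowering contamination below the default 0.05 trades sensitivity for fewer false alarms; re-run the benchmark after any change to confirm recall has not degraded.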

Advanced: Custom Feature Engineering

To add custom features to the batch model:
  1. Edit backend/ml/batch_features.py
  2. Add your feature to BATCH_FEATURE_NAMES:
    BATCH_FEATURE_NAMES = [
        # ... existing features ...
        "voltage_v_kurtosis",  # New feature
    ]
    
  3. Implement extraction logic in extract_batch_features():
    from scipy.stats import kurtosis
    
    features["voltage_v_kurtosis"] = kurtosis(window["voltage_v"])
    
  4. Retrain the model to incorporate the new feature
Changing feature definitions invalidates existing models. You must retrain from scratch.

Best Practices

Use Recent Data

Train on data from the last 7-30 days. Older data may not reflect current operating conditions.

Validate Before Deploy

Always run benchmarks before deploying retrained models to production.

Document Retraining Events

Log why and when you retrained (e.g., “Retrained after gearbox replacement on 2026-03-02”).

Monitor Post-Deployment

Watch false positive rates for 24-48 hours after deploying a new model.
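One lightweight way to watch false positives is a rolling rate over recent healthy-period scores; the threshold and window size below are illustrative assumptions, not part of the system:

```python
from collections import deque

# Illustrative post-deployment check: rolling false-positive rate over
# recent healthy-period scores (threshold and window size are assumptions).
THRESHOLD = 0.3
recent = deque(maxlen=1440)  # e.g., one score per minute for 24 hours

def record(score: float) -> float:
    """Record a score and return the current rolling FP rate."""
    recent.append(score >= THRESHOLD)
    return sum(recent) / len(recent)

for s in (0.05, 0.12, 0.41, 0.08, 0.09):
    rate = record(s)

print(f"rolling FP rate: {rate:.1%}")  # 1 of 5 windows above threshold
```

If the rolling rate stays above the level implied by the Healthy Stability criterion (5%), treat it as a signal to retrain with more representative healthy data.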

Baseline Training

Learn how baseline profiles are built

Feature Engineering

Deep dive into the 16 batch features

Dual Model Architecture

Understand v2 vs v3 model differences

Monitoring

Monitor model performance in production
