
Overview

This guide covers common issues encountered when operating the Hospital Data Analysis Platform, along with diagnostic strategies and solutions.

Common Failure Modes

1. Missing or Malformed Input Columns

Symptom: KeyError or AttributeError during feature extraction
KeyError: 'age'
AttributeError: 'DataFrame' object has no attribute 'bmi'
Root Cause: Input CSV files don't contain the expected feature columns defined in CONFIG.feature_columns.
Diagnosis:
import pandas as pd
from config import CONFIG

# Load your data
df = pd.read_csv(CONFIG.data_dir / "your_file.csv")

# Check for missing columns
expected = set(CONFIG.feature_columns + [CONFIG.target_risk, CONFIG.target_outcome])
actual = set(df.columns)
missing = expected - actual

if missing:
    print(f"Missing columns: {missing}")
else:
    print("All required columns present")
Solutions:
  1. Update configuration to match your data:
    CONFIG.feature_columns = ["age", "height", "weight"]  # Match your CSV
    CONFIG.target_risk = "diagnosis"
    CONFIG.target_outcome = "blood_test"
    
  2. Preprocess data to add missing columns:
    # Calculate BMI if missing
    if "bmi" not in df.columns:
        df["bmi"] = df["weight"] / (df["height"] / 100) ** 2
    
  3. Validate schema before processing:
    def validate_schema(df, config):
        required = set(config.feature_columns + [config.target_risk, config.target_outcome])
        missing = required - set(df.columns)
        if missing:
            raise ValueError(f"Missing required columns: {missing}")
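
The three fixes above compose naturally into a single pre-flight step: derive what can be computed, then enforce the schema. A minimal sketch, assuming a config-like object with the same attribute names as CONFIG (the `DemoConfig` and `prepare_frame` names are illustrative, not platform APIs):

```python
import pandas as pd

# Hypothetical stand-in mirroring the CONFIG attributes used above
class DemoConfig:
    feature_columns = ["age", "bmi"]
    target_risk = "diagnosis"
    target_outcome = "blood_test"

def prepare_frame(df: pd.DataFrame, config) -> pd.DataFrame:
    """Derive known-computable columns, then enforce the expected schema."""
    # Derive BMI when the raw height/weight columns are present
    if "bmi" not in df.columns and {"height", "weight"} <= set(df.columns):
        df = df.assign(bmi=df["weight"] / (df["height"] / 100) ** 2)

    required = set(config.feature_columns + [config.target_risk, config.target_outcome])
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    return df

df = pd.DataFrame({
    "age": [50], "height": [170.0], "weight": [70.0],
    "diagnosis": ["a"], "blood_test": [1.2],
})
prepared = prepare_frame(df, DemoConfig)
print(round(prepared.loc[0, "bmi"], 1))  # 70 / 1.70**2 -> 24.2
```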
    

2. Memory Limit Exceeded

Symptom: MemoryError or system becomes unresponsive
MemoryError: Unable to allocate array
Root Cause: Dataset size exceeds CONFIG.hardware_memory_limit_mb or system memory.
Diagnosis:
import pandas as pd
from config import CONFIG

# Check DataFrame memory usage
df = pd.read_csv(CONFIG.data_dir / "large_file.csv")
memory_mb = df.memory_usage(deep=True).sum() / (1024 ** 2)

print(f"DataFrame size: {memory_mb:.2f} MB")
print(f"Memory limit: {CONFIG.hardware_memory_limit_mb} MB")

if memory_mb > CONFIG.hardware_memory_limit_mb:
    print("WARNING: Dataset exceeds memory limit")
Solutions:
  1. Reduce chunk size for streaming:
    CONFIG.stream_chunk_size = 8  # Reduce from default 16
    
  2. Use chunked processing:
    chunks = pd.read_csv(CONFIG.data_dir / "large_file.csv", chunksize=1000)
    results = []
    for chunk in chunks:
        result = process_chunk(chunk)
        results.append(result)
    combined = pd.concat(results)  # stitch per-chunk results back together
    
  3. Optimize data types:
    # Convert float64 to float32
    for col in df.select_dtypes(include=['float64']).columns:
        df[col] = df[col].astype('float32')
    
    # Use categorical for low-cardinality columns
    df['diagnosis'] = df['diagnosis'].astype('category')
    
  4. Increase memory limit (if hardware allows):
    CONFIG.hardware_memory_limit_mb = 512
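
Before raising the limit, it is worth checking how much solution 3 alone recovers; downcasting often removes the need. A sketch of a reusable helper that applies both dtype optimizations and reports the savings (the `shrink` name and its cardinality threshold are illustrative):

```python
import numpy as np
import pandas as pd

def shrink(df: pd.DataFrame, categorical_threshold: int = 32) -> pd.DataFrame:
    """Downcast float64 columns and convert low-cardinality object columns."""
    out = df.copy()
    for col in out.select_dtypes(include=["float64"]).columns:
        out[col] = out[col].astype("float32")
    for col in out.select_dtypes(include=["object"]).columns:
        if out[col].nunique() <= categorical_threshold:
            out[col] = out[col].astype("category")
    return out

df = pd.DataFrame({
    "score": np.random.rand(10_000),
    "diagnosis": np.random.choice(["flu", "cold", "covid"], 10_000),
})
before = df.memory_usage(deep=True).sum() / (1024 ** 2)
after = shrink(df).memory_usage(deep=True).sum() / (1024 ** 2)
print(f"{before:.2f} MB -> {after:.2f} MB")
```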
    

3. ONNX Export Failures

Symptom: Error during model export to ONNX format
ONNXConversionError: Unsupported operator
AttributeError: 'CustomEstimator' object has no attribute 'coef_'
Root Cause: Model uses operators or architectures not supported by the ONNX converter.
Diagnosis:
from skl2onnx import to_onnx
import traceback

try:
    onnx_model = to_onnx(model, X_sample)
    print("Export successful")
except Exception as e:
    print(f"Export failed: {type(e).__name__}")
    print(f"Message: {str(e)}")
    traceback.print_exc()
Solutions:
  1. Use supported estimators: Stick to scikit-learn models with good ONNX support:
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    
    # Well-supported models
    model = RandomForestClassifier()
    # or
    model = LogisticRegression()
    
  2. Check model attributes before export:
    # Ensure model is fitted
    if not hasattr(model, 'classes_'):
        raise ValueError("Model not fitted")
    
  3. Simplify custom estimators:
    # Avoid complex custom transformers
    # Use sklearn.preprocessing.FunctionTransformer instead
    from sklearn.preprocessing import FunctionTransformer
    
    transformer = FunctionTransformer(lambda x: x ** 2)
    
  4. Fall back to pickle if ONNX export fails:
    import pickle
    
    try:
        onnx_model = to_onnx(model, X_sample)
        # Save ONNX
    except Exception:
        print("ONNX export failed, using pickle")
        with open(CONFIG.output_dir / "model.pkl", "wb") as f:
            pickle.dump(model, f)
    

4. Non-Reproducible Results

Symptom: Different results on repeated runs despite setting a seed
Root Cause: Random seed not propagated correctly, or non-deterministic operations.
Diagnosis:
from utils.reproducibility import reproducibility_context, set_global_seed
from config import CONFIG
import json

# Run 1
set_global_seed(CONFIG.random_seed)
context1 = reproducibility_context(CONFIG)
result1 = train_model()

# Run 2
set_global_seed(CONFIG.random_seed)
context2 = reproducibility_context(CONFIG)
result2 = train_model()

# Compare contexts
print("Context 1:", json.dumps(context1, indent=2))
print("Context 2:", json.dumps(context2, indent=2))
print(f"Results equal: {result1 == result2}")
Solutions:
  1. Set seed early:
    from utils.reproducibility import set_global_seed
    from config import CONFIG
    
    # Call this BEFORE any imports that use randomness
    set_global_seed(CONFIG.random_seed)
    
  2. Check environment variables:
    import os
    
    print("OMP_NUM_THREADS:", os.environ.get("OMP_NUM_THREADS"))
    print("PYTHONHASHSEED:", os.environ.get("PYTHONHASHSEED"))
    
    # OMP_NUM_THREADS should be "1"; PYTHONHASHSEED should match the seed value
    
  3. Verify scikit-learn random_state:
    from sklearn.ensemble import RandomForestClassifier
    
    # Always pass random_state
    model = RandomForestClassifier(random_state=CONFIG.random_seed)
    
  4. Disable parallel processing in scikit-learn:
    # Set n_jobs=1 for reproducibility
    model = RandomForestClassifier(n_jobs=1, random_state=CONFIG.random_seed)
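
The platform ships set_global_seed in utils.reproducibility; as a reference for what such a helper typically does, here is an illustrative sketch (not the platform's actual implementation) that seeds every RNG source in one place:

```python
import os
import random

import numpy as np

def seed_everything(seed: int) -> None:
    """Illustrative global seeding: Python, NumPy, hashing, thread counts."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)  # only affects child processes
    os.environ["OMP_NUM_THREADS"] = "1"       # single-threaded BLAS/OpenMP

seed_everything(42)
a = np.random.rand(3)
seed_everything(42)
b = np.random.rand(3)
print(np.array_equal(a, b))  # True
```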
    

5. Benchmark Variance Too High

Symptom: Wide confidence intervals, unstable benchmark results
Root Cause: System load, insufficient iterations, or inherent algorithm variance.
Diagnosis:
from evaluation.benchmark import run_repeated_benchmark
from config import CONFIG

result = run_repeated_benchmark(train_and_evaluate, "accuracy", runs=CONFIG.benchmark_runs)

coeff_of_variation = result.metric_std / result.metric_mean
relative_margin = result.metric_ci_margin / result.metric_mean

print(f"Coefficient of variation: {coeff_of_variation:.2%}")
print(f"Relative CI margin: {relative_margin:.2%}")

if relative_margin > 0.05:  # More than 5%
    print("WARNING: High variance detected")
Solutions:
  1. Increase benchmark runs:
    CONFIG.benchmark_runs = 20  # Up from default 5
    
  2. Run on idle system:
    # Linux: Check system load
    top
    
    # Kill unnecessary processes before benchmarking
    
  3. Use higher confidence level for critical measurements:
    CONFIG.confidence_level = 0.99  # 99% confidence
    
  4. Profile system resources:
    import psutil
    
    print(f"CPU usage: {psutil.cpu_percent()}%")
    print(f"Memory usage: {psutil.virtual_memory().percent}%")
    
    # Wait for system to stabilize
    if psutil.cpu_percent() > 50:
        print("WARNING: High CPU usage, benchmarks may be unreliable")
    

6. Alert Threshold Drift

Symptom: Excessive false positives or missed anomalies
Root Cause: Alert thresholds not calibrated for the current data distribution.
Diagnosis:
import pandas as pd
from config import CONFIG

# Load recent predictions
df = pd.read_csv(CONFIG.output_dir / "predictions.csv")

# Analyze distribution
print(df["prediction_score"].describe())
print(f"\n95th percentile: {df['prediction_score'].quantile(0.95)}")
print(f"99th percentile: {df['prediction_score'].quantile(0.99)}")

# Check alert rate
alert_threshold = 0.8
alert_rate = (df["prediction_score"] > alert_threshold).mean()
print(f"\nCurrent alert rate: {alert_rate:.2%}")
Solutions:
  1. Recalibrate thresholds based on recent data:
    # Set threshold to 95th percentile
    new_threshold = df["prediction_score"].quantile(0.95)
    print(f"Recommended threshold: {new_threshold:.3f}")
    
  2. Use adaptive thresholds:
    # Rolling window calibration
    window_size = 1000
    rolling_threshold = df["prediction_score"].rolling(window_size).quantile(0.95)
    
  3. Monitor threshold effectiveness:
    # Track false positive rate
    if "ground_truth" in df.columns:
        fps = ((df["prediction_score"] > alert_threshold) & (df["ground_truth"] == 0)).sum()
        fpr = fps / (df["ground_truth"] == 0).sum()
        print(f"False positive rate: {fpr:.2%}")
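
The three solutions above can be folded into one recalibration pass: pick the threshold that yields a target alert rate, then report the resulting false positive rate when labels are available. A sketch on synthetic scores (the `recalibrate` helper is illustrative):

```python
import numpy as np
import pandas as pd

def recalibrate(df: pd.DataFrame, target_alert_rate: float = 0.05) -> dict:
    """Pick the score threshold that yields the target alert rate."""
    threshold = df["prediction_score"].quantile(1 - target_alert_rate)
    alerts = df["prediction_score"] > threshold
    report = {"threshold": float(threshold), "alert_rate": float(alerts.mean())}
    if "ground_truth" in df.columns:
        negatives = df["ground_truth"] == 0
        report["fpr"] = float((alerts & negatives).sum() / negatives.sum())
    return report

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "prediction_score": rng.random(1000),
    "ground_truth": rng.integers(0, 2, 1000),
})
print(recalibrate(df))
```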
    

Diagnostic Strategies

Enable Verbose Logging

import logging

from config import CONFIG

logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler(CONFIG.output_dir / "debug.log"),
        logging.StreamHandler()
    ]
)

logger = logging.getLogger(__name__)
logger.debug("Verbose logging enabled")

Checkpoint Intermediate Results

import pickle

from config import CONFIG

def checkpoint(obj, name: str):
    """Save intermediate results for debugging."""
    checkpoint_dir = CONFIG.output_dir / "checkpoints"
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    
    path = checkpoint_dir / f"{name}.pkl"
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    print(f"Checkpoint saved: {path}")

# Usage
preprocessed_data = preprocess(raw_data)
checkpoint(preprocessed_data, "preprocessed_data")

features = extract_features(preprocessed_data)
checkpoint(features, "features")

Validate Data at Each Stage

import pandas as pd
import numpy as np

def validate_dataframe(df: pd.DataFrame, stage: str):
    """Comprehensive DataFrame validation."""
    print(f"\n=== Validation: {stage} ===")
    
    # Check shape
    print(f"Shape: {df.shape}")
    
    # Check for nulls
    null_counts = df.isnull().sum()
    if null_counts.any():
        print("WARNING: Null values found:")
        print(null_counts[null_counts > 0])
    
    # Check for infinite values
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        if np.isinf(df[col]).any():
            print(f"WARNING: Infinite values in column '{col}'")
    
    # Check for duplicates
    dup_count = df.duplicated().sum()
    if dup_count > 0:
        print(f"WARNING: {dup_count} duplicate rows")
    
    # Memory usage
    memory_mb = df.memory_usage(deep=True).sum() / (1024 ** 2)
    print(f"Memory usage: {memory_mb:.2f} MB")
    
    print("=== End Validation ===")

# Usage
validate_dataframe(raw_data, "Raw Input")
validate_dataframe(processed_data, "After Preprocessing")

Profile Performance Bottlenecks

import cProfile
import pstats
from io import StringIO

def profile_function(func, *args, **kwargs):
    """Profile a function and print statistics."""
    profiler = cProfile.Profile()
    profiler.enable()
    
    result = func(*args, **kwargs)
    
    profiler.disable()
    
    # Print statistics
    s = StringIO()
    stats = pstats.Stats(profiler, stream=s).sort_stats('cumulative')
    stats.print_stats(20)  # Top 20 functions
    print(s.getvalue())
    
    return result

# Usage
result = profile_function(train_model, X_train, y_train)
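
When the full cProfile breakdown is overkill, a lightweight timer gives a quick wall-clock reading with far less overhead. A small sketch (the `timed` helper is illustrative):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label: str):
    """Print the wall-clock duration of the enclosed block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.3f}s")

with timed("sum of squares"):
    total = sum(i * i for i in range(100_000))
```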

Emergency Recovery

Restore from Checkpoints

import pickle

from config import CONFIG

def load_checkpoint(name: str):
    """Load a saved checkpoint."""
    path = CONFIG.output_dir / "checkpoints" / f"{name}.pkl"
    if not path.exists():
        raise FileNotFoundError(f"Checkpoint not found: {path}")
    
    with open(path, "rb") as f:
        return pickle.load(f)

# Recovery
try:
    result = run_full_pipeline()
except Exception as e:
    print(f"Pipeline failed: {e}")
    print("Attempting recovery from checkpoint...")
    
    # Load last successful stage
    features = load_checkpoint("features")
    result = run_from_features(features)

Safe Mode Execution

from config import CONFIG, SystemConfig

def create_safe_config() -> SystemConfig:
    """Create a conservative configuration for debugging."""
    safe_config = SystemConfig(
        random_seed=42,
        test_size=0.25,
        stream_chunk_size=4,  # Very small chunks
        hardware_memory_limit_mb=64,  # Low memory
        benchmark_runs=2,  # Minimal runs
    )
    return safe_config

# Usage
DEBUG_MODE = True  # toggle for debugging sessions

if DEBUG_MODE:
    CONFIG = create_safe_config()
    print("Running in SAFE MODE with conservative settings")

Getting Help

If you encounter issues not covered in this guide:
  1. Check logs: Review CONFIG.output_dir / "debug.log"
  2. Validate environment: Use reproducibility_context(CONFIG) to capture system state
  3. Isolate the issue: Use checkpoints and validation to identify the failing stage
  4. Compare with baseline: Run against known-good data and configuration
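
For step 2, a minimal sketch of the kind of system state worth attaching to a bug report (illustrative; the platform's reproducibility_context may record more fields than this):

```python
import json
import platform
import sys

def environment_snapshot() -> dict:
    """Capture basic system state for bug reports."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "executable": sys.executable,
    }

print(json.dumps(environment_snapshot(), indent=2))
```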
