Overview
The AI Data Science Service implements a comprehensive MLOps architecture that ensures reproducibility, traceability, and production readiness for machine learning models. It bridges the gap between data science experimentation and production deployment.

MLOps Philosophy: "The difference between a notebook and a product is the engineering."

This architecture demonstrates how to structure data science projects according to industry best practices, breaking down the barrier between exploratory analysis and production software.
Core Components
Experiment Tracking
MLflow for tracking metrics, parameters, and artifacts
Model Versioning
Systematic versioning of models and configurations
CI/CD Integration
Automated pipelines for testing and deployment
Reproducibility
Deterministic environments and data versioning
Experiment Tracking
MLflow Integration
The service uses MLflow to track all aspects of model training, enabling complete experiment reproducibility and comparison; the tracking code lives in training/training.py.
Tracked Metrics
The architecture tracks comprehensive metrics at different stages:

Training Metrics
- train_loss: Binary cross-entropy loss per epoch
- train_accuracy: Training accuracy per epoch
- batch_size: Number of samples per batch
- learning_rate: Optimizer learning rate
Evaluation Metrics
- test_accuracy: Overall model accuracy on test set
- test_roc_auc: Area under ROC curve
- test_precision: Positive predictive value
- test_recall: Sensitivity/true positive rate
- test_f1_score: Harmonic mean of precision and recall
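As a concrete illustration, these evaluation metrics can be computed with scikit-learn; the labels and probabilities below are made up for the example.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true = [0, 1, 1, 0, 1, 0]                    # toy ground-truth labels
y_prob = [0.2, 0.8, 0.6, 0.4, 0.9, 0.1]        # toy predicted probabilities
y_pred = [int(p >= 0.5) for p in y_prob]       # threshold at 0.5

metrics = {
    "test_accuracy": accuracy_score(y_true, y_pred),
    "test_roc_auc": roc_auc_score(y_true, y_prob),   # uses raw probabilities
    "test_precision": precision_score(y_true, y_pred),
    "test_recall": recall_score(y_true, y_pred),
    "test_f1_score": f1_score(y_true, y_pred),
}
```

Note that ROC AUC is computed from the raw probabilities, while the thresholded predictions feed the remaining metrics.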
Visual Artifacts
- Confusion Matrix: Classification performance heatmap
- ROC Curve: True vs false positive rate visualization
- Precision-Recall Curve: Trade-off visualization
- Classification Report: Detailed per-class metrics
Accessing MLflow UI
Start the MLflow tracking server with `mlflow ui` to visualize experiments. The MLflow UI provides real-time visualization of training metrics, model comparisons, and artifact browsing. All experiments are stored in the mlruns/ directory.

Model Versioning
Configuration-Based Versioning
Models are versioned through YAML configuration files (e.g. config/models-configs/model_config_001.yaml), enabling systematic experimentation:

- model_config_000.yaml - Baseline configuration
- model_config_001.yaml - Production configuration
- model_config_002.yaml - Experimental variants
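A hypothetical sketch of what such a config file might contain; the field names below are illustrative, not the service's actual schema.

```yaml
# config/models-configs/model_config_001.yaml (illustrative fields)
model:
  name: binary-classifier
  version: "001"
training:
  epochs: 20
  batch_size: 32
  learning_rate: 0.001
  random_state: 42
data:
  test_size: 0.2
```

Because every hyperparameter lives in a versioned file rather than in code, re-running an old experiment is a matter of pointing the training entrypoint at the corresponding config.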
Model Artifacts
Each training run produces versioned artifacts, logged to MLflow alongside the run.

CI/CD Integration
Training Pipeline
The architecture supports automated training pipelines.

Docker-Based Deployment
Production deployment uses containerized environments, defined in docker-compose.yml.
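A minimal sketch of what that compose file might contain; the service names, ports, and images here are assumptions, not the service's actual file.

```yaml
# Illustrative docker-compose.yml; services and images are assumptions.
services:
  inference-api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - MODEL_CONFIG=config/models-configs/model_config_001.yaml
  mlflow:
    image: ghcr.io/mlflow/mlflow:latest
    command: mlflow server --host 0.0.0.0 --port 5000
    ports:
      - "5000:5000"
    volumes:
      - ./mlruns:/mlruns
```

Running the tracking server and the inference API as sibling containers keeps the runtime environment immutable and identical across machines.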
Continuous Integration
- Automated testing on code changes
- Linting and type checking
- Training smoke tests
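The checks above might be wired up as a GitHub Actions workflow along these lines; the file name, job layout, and tool choices (uv, ruff, mypy, pytest) are assumptions, and the `--smoke-test` flag is hypothetical.

```yaml
# .github/workflows/ci.yml (hypothetical)
name: CI
on: [push, pull_request]
jobs:
  checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v5
      - run: uv sync                     # reproduce the locked environment
      - run: uv run ruff check .         # linting
      - run: uv run mypy .               # type checking
      - run: uv run pytest tests/        # automated tests on code changes
      - run: uv run python training/training.py --smoke-test  # hypothetical flag
```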
Continuous Deployment
- Containerized deployments
- Blue-green deployment strategy
- Automated rollback capabilities
Reproducibility
Environment Management
The service ensures deterministic environments, anchored by pyproject.toml:

- UV Lock File: uv.lock ensures exact dependency versions
- Python Version: .python-version pins the Python runtime
- Random Seeds: random_state=42 for consistent data splits
- Docker Images: Immutable runtime environments
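The pinned seed is what makes data splits reproducible; a minimal sketch with scikit-learn's `train_test_split`, which is the call pattern the `random_state=42` convention implies:

```python
from sklearn.model_selection import train_test_split

X = list(range(100))
y = [i % 2 for i in range(100)]

# Two independent calls with the same seed...
split_a = train_test_split(X, y, test_size=0.2, random_state=42)
split_b = train_test_split(X, y, test_size=0.2, random_state=42)

# ...produce byte-for-byte identical train/test partitions.
assert split_a == split_b
```

Without the fixed seed, every run would train and evaluate on a different partition, making metric comparisons across experiments meaningless.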
Data Versioning
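DVC keeps large data files out of Git while versioning small pointer files in their place. A typical workflow with DVC's standard CLI looks roughly like this; the dataset path and remote URL are hypothetical.

```shell
dvc init                          # one-time setup inside the Git repo
dvc add data/raw/dataset.csv      # writes the data/raw/dataset.csv.dvc pointer
git add data/raw/dataset.csv.dvc .gitignore
git commit -m "Track dataset with DVC"
dvc remote add -d storage s3://example-bucket/dvc-store   # hypothetical remote
dvc push                          # upload the data itself to remote storage
```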
The service integrates with DVC (Data Version Control) for dataset versioning: dataset versions are tracked via .dvc files stored in Git, with the actual data kept in remote storage (S3, DagsHub, Azure Blob).

Inference Architecture
Singleton Predictor Pattern
The inference system uses a singleton pattern for efficient model loading; the implementation lives in inference/inference.py.
- Model loaded once at startup
- Reduced inference latency
- Memory efficient for high-throughput scenarios
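A minimal sketch of the pattern, with a hypothetical `_load_model` standing in for the service's actual deserialization in inference/inference.py:

```python
import threading

class Predictor:
    """Process-wide singleton that loads the model exactly once."""
    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        if cls._instance is None:
            with cls._lock:  # double-checked locking for thread safety
                if cls._instance is None:
                    instance = super().__new__(cls)
                    instance.model = cls._load_model()  # loaded once, at first use
                    cls._instance = instance
        return cls._instance

    @staticmethod
    def _load_model():
        # Hypothetical stand-in for e.g. joblib.load("model.pkl").
        return lambda features: 0.5

    def predict(self, features):
        return self.model(features)
```

Every call site gets the same instance, so the model's load cost is paid once at startup rather than on every request.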
Best Practices
Experiment Organization
- Use descriptive experiment names
- Tag runs with metadata (dataset version, git commit)
- Archive failed experiments for learning
Model Selection
- Define success metrics upfront
- Compare models systematically via MLflow
- Document model selection rationale
Artifact Management
- Log all training artifacts
- Version control configurations
- Maintain artifact lineage
Monitoring
- Track inference latency
- Monitor prediction distributions
- Alert on model degradation
Next Steps
Project Structure
Explore the modular project organization
Data Versioning
Learn about DVC and data management
