Overview
The AI Data Science Service follows a modular architecture that separates concerns across data, research, infrastructure, and production code. This structure enables teams to work independently while maintaining integration points for the complete ML lifecycle.

Design Philosophy: Extreme modularity where business logic, training, and inference live in separate, decoupled layers.
Repository Architecture
High-Level Structure
The repository is organized into four main domains:
- datasets/: Version-controlled data indexes with DVC
- container-images/: Production-ready Docker configurations
- notebooks-analysis/: Exploratory data analysis and research
- python-projects/: Production-grade applications and services
Directory Breakdown
1. datasets/ - Data Management
“The Source of Truth.” This directory acts as an intelligent index for datasets, not a storage location for raw data files.
- DVC Integration: Stores .dvc metadata files that point to remote storage
- Version Control: Tracks exact dataset versions used in experiments
- Efficient Downloads: Team members pull only required data versions
- Remote Storage: Actual data stored in S3, DagsHub, or Azure Blob
DVC files are small text files (~100 bytes) containing checksums and references. The actual datasets remain in cloud storage, keeping the Git repository lightweight.
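As an illustration, a DVC pointer file looks roughly like this (the checksum, size, and filename below are made-up values, not real project data):

```yaml
# credit_data.csv.dvc — a tiny text file tracked by Git; the actual
# dataset lives in remote storage (S3, DagsHub, or Azure Blob)
outs:
- md5: 22a1a2931c8370d3aeedd7183606fd7f   # checksum identifying the exact version
  size: 14445097                          # size of the tracked file in bytes
  path: credit_data.csv                   # filename, relative to the .dvc file
```

Running `dvc pull` resolves these checksums against the configured remote and downloads only the referenced version.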
2. container-images/ - Infrastructure
“Ready for Liftoff.” Contains immutable infrastructure definitions for production deployments.

Contents:
- Base Dockerfiles for different environments
- Optimized production configurations
- Multi-stage build definitions
- Security hardening configurations

Benefits:
- “Works on my machine” → “Works in production”
- Consistent environments across development and deployment
- Reproducible builds with version pinning
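A sketch of how a multi-stage, hardened build might look (the base image, paths, and uvicorn entrypoint are illustrative assumptions, not the repository's actual Dockerfile):

```dockerfile
# Stage 1: resolve and install pinned dependencies in an isolated layer
FROM python:3.12-slim AS builder
COPY pyproject.toml uv.lock ./
RUN pip install uv && uv sync --frozen --no-install-project  # exact versions from uv.lock

# Stage 2: copy only the runtime artifacts into a clean, minimal image
FROM python:3.12-slim
RUN useradd --create-home appuser   # run as non-root (security hardening)
USER appuser
WORKDIR /app
COPY --from=builder /.venv /app/.venv
COPY src/ /app/src/
ENV PATH="/app/.venv/bin:$PATH"
CMD ["uvicorn", "src.server.api:app", "--host", "0.0.0.0", "--port", "8000"]
```

The build stage never reaches production, so compilers and caches stay out of the final image, and the lockfile guarantees the same dependency versions everywhere.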
3. notebooks-analysis/ - Research Lab
“The Laboratory of Ideas.” Space for creativity, exploration, and statistical analysis.

Contents:
- Jupyter notebooks for Exploratory Data Analysis (EDA)
- Rapid prototyping experiments
- PDF exports for stakeholder communication
- Visualization and insight generation

Best practices:
- Keep notebooks focused on exploration
- Export production code to python-projects/
- Version control notebooks with outputs cleared
- Generate PDF reports for non-technical stakeholders
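One common way to enforce the outputs-cleared practice automatically, assuming the team uses pre-commit (the hook revision shown is illustrative):

```yaml
# .pre-commit-config.yaml — strip notebook outputs before every commit
repos:
  - repo: https://github.com/kynan/nbstripout
    rev: 0.7.1
    hooks:
      - id: nbstripout   # rewrites .ipynb files with outputs removed
```

This keeps diffs readable and prevents large rendered outputs from bloating Git history.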
4. python-projects/ - Production Engine
“Where Code Becomes Professional.” Structured applications following software engineering best practices.

Credit Score Project Structure
Complete Directory Tree
Module Responsibilities
config/ - Configuration Management
Centralizes all configuration files for reproducibility and easy experimentation. Key files include logs_configs/, models-configs/, and logging_config.py.

Supports:
- Hyperparameter versioning
- A/B testing configurations
- Production vs. experimental settings
Configuration files enable changing model behavior without code changes, crucial for MLOps workflows.
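A minimal sketch of the idea, assuming a JSON hyperparameter file (the file name, keys, and defaults are hypothetical, not the project's actual schema):

```python
import json
from dataclasses import dataclass
from pathlib import Path

@dataclass
class ModelConfig:
    """Hyperparameters with sane defaults, overridable from a versioned file."""
    hidden_units: int = 64
    dropout: float = 0.2
    learning_rate: float = 1e-3

def load_config(path: Path) -> ModelConfig:
    """Read overrides from JSON so behavior changes without code changes."""
    overrides = json.loads(path.read_text()) if path.exists() else {}
    return ModelConfig(**overrides)

# An experimental config overrides only what it needs; defaults fill the rest.
Path("experiment.json").write_text('{"dropout": 0.5}')
cfg = load_config(Path("experiment.json"))
print(cfg.hidden_units, cfg.dropout)  # → 64 0.5
```

Swapping between production and experimental settings then becomes a matter of pointing at a different file, which an MLOps pipeline can do per run.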
model/ - Neural Network Architecture
Defines the PyTorch model architecture and configuration classes in model/model.py.

Features:
- Dynamic architecture from configuration
- Configurable activation functions
- Dropout for regularization
- Probability prediction methods
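A sketch of how an architecture can be assembled dynamically from configuration (the layer sizes, keys, and sigmoid output head are illustrative assumptions, not the project's actual model):

```python
import torch
import torch.nn as nn

def build_model(config: dict) -> nn.Sequential:
    """Assemble an MLP from a config dict instead of hard-coded layers."""
    activations = {"relu": nn.ReLU, "tanh": nn.Tanh}
    layers = []
    in_features = config["input_dim"]
    for width in config["hidden_layers"]:
        layers.append(nn.Linear(in_features, width))
        layers.append(activations[config["activation"]]())  # configurable activation
        layers.append(nn.Dropout(config["dropout"]))        # regularization
        in_features = width
    layers.append(nn.Linear(in_features, 1))
    layers.append(nn.Sigmoid())  # squash to a probability for scoring
    return nn.Sequential(*layers)

config = {"input_dim": 10, "hidden_layers": [32, 16],
          "activation": "relu", "dropout": 0.2}
model = build_model(config)
probs = model(torch.randn(4, 10))  # shape (4, 1), values in (0, 1)
```

Because the topology lives in the config dict, deeper or wider variants can be A/B tested without touching model code.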
processing/ - Data Pipeline
Handles all data preprocessing and feature engineering in processing/preprocessor.py.

Saved Artifacts:
- preprocessor.joblib: Fitted sklearn pipeline
- Ensures training/inference consistency
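The consistency guarantee can be sketched like this (the pipeline steps and data are illustrative, not the project's actual features):

```python
import joblib
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Fit the pipeline once, during training...
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill missing values
    ("scale", StandardScaler()),                   # zero mean, unit variance
])
X_train = np.array([[1.0, 200.0], [3.0, 400.0], [np.nan, 600.0]])
pipeline.fit(X_train)
joblib.dump(pipeline, "preprocessor.joblib")

# ...then reload the *fitted* artifact at inference time, so the exact same
# medians and scaling statistics are applied to new data.
loaded = joblib.load("preprocessor.joblib")
X_new = np.array([[2.0, 300.0]])
assert np.allclose(loaded.transform(X_new), pipeline.transform(X_new))
```

Serializing the fitted object, rather than re-fitting at serve time, is what prevents training/serving skew.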
training/ - Training Orchestration
Orchestrates the complete training workflow with MLflow integration.

Execution: training/training.py
inference/ - Prediction Service
Singleton-based inference engine for efficient prediction serving, implemented in inference/inference.py.

Benefits:
- Model loaded once at startup
- Minimal inference latency
- Memory efficient
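The singleton pattern behind these benefits can be sketched as follows (the model loader and score value are placeholders, not the project's actual code):

```python
class InferenceEngine:
    """Loads the model once; every caller shares the same instance."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.model = cls._load_model()  # expensive work, runs once
        return cls._instance

    @staticmethod
    def _load_model():
        # Placeholder for deserializing weights + preprocessor from disk.
        return lambda features: 0.87

    def predict(self, features):
        return self.model(features)

engine_a = InferenceEngine()
engine_b = InferenceEngine()
assert engine_a is engine_b  # same object: the model was loaded only once
```

Because every request handler receives the same instance, startup cost is paid once and per-request latency stays minimal.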
server/ - API Layer
FastAPI-based REST API for model serving, implemented in server/api.py.

Features:
- Automatic OpenAPI documentation at /docs
- CORS middleware for web clients
- Pydantic validation for input/output
- Async request handling
examples/ - Client Applications
Demonstration interfaces for the ML service.

client_web/:
- Interactive web interface
- Real-time predictions
- User-friendly input forms
- Visualization of results

Use cases:
- Stakeholder demonstrations
- User acceptance testing
- Integration examples
- API usage documentation
Technology Stack Justification

PyTorch
Why: Dynamic computation graphs, easy debugging, excellent for research
Usage: Neural network architecture in model/

FastAPI
Why: High performance, async support, automatic documentation
Usage: REST API in server/

Pydantic
Why: Runtime type validation, prevents garbage input
Usage: Data schemas in server/schemas.py

UV
Why: Ultra-fast dependency management, deterministic installs
Usage: Environment management via uv.lock

MLflow
Why: Experiment tracking, model versioning, artifact management
Usage: MLOps backbone in training/

Scikit-learn
Why: Production-ready preprocessing pipelines, serializable
Usage: Feature engineering in processing/

Docker
Why: Environment consistency, deployment portability
Usage: Containerization via Dockerfile.*

DVC
Why: Data version control, large file management
Usage: Dataset versioning in datasets/

Best Practices
Separation of Concerns
Each module has a single responsibility:
- model/ knows about neural architecture
- processing/ handles data transformation
- training/ orchestrates experiments
- inference/ serves predictions
- server/ exposes HTTP endpoints
Import Patterns
Configuration Over Code
Quick Start Guide
Next Steps
MLOps Architecture
Learn about experiment tracking and CI/CD
Data Versioning
Understand DVC and data management
