Overview
The training script trains two ML pipelines:
- Category Classifier: Predicts support ticket categories (e.g., billing, technical, account)
- Priority Classifier: Predicts ticket priority levels (e.g., low, medium, high, urgent)
Prerequisites
- Training data: tickets_train.csv in the project root
- Required columns: subject, body, category_label, priority_label
- Python 3.12+ with dependencies installed via uv sync
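Before training, it can help to confirm that the CSV header contains the required columns. The following is a minimal sketch using only the standard library; the helper name missing_columns is hypothetical and is not part of the project.

```python
import csv
from pathlib import Path

# Column names taken from the prerequisites above.
REQUIRED_COLUMNS = ("subject", "body", "category_label", "priority_label")


def missing_columns(csv_path):
    """Return the required columns that are absent from the CSV header.

    Hypothetical helper for illustration; not part of the training script.
    """
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(f"{path} not found")
    with path.open(newline="", encoding="utf-8") as handle:
        # Read only the first row, which is assumed to be the header.
        header = next(csv.reader(handle), [])
    present = {name.strip() for name in header}
    return [name for name in REQUIRED_COLUMNS if name not in present]
```

Calling missing_columns("tickets_train.csv") from the project root returns an empty list when all four columns are present.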
Training Command
Prepare training data
Ensure tickets_train.csv exists in your project root with the required columns.
Training Process
The training pipeline performs the following steps (see src/ml/train.py:242):
- Load Dataset: Reads tickets_train.csv and normalizes text
- Train/Val Split: Creates 80/20 split with stratification
- Feature Extraction: Builds TF-IDF features from subject and body
- Model Training: Fits logistic regression with balanced class weights
- Validation: Computes metrics on held-out validation set
- Artifact Saving: Persists models and reports
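The steps above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the code in src/ml/train.py: the toy rows are made up, joining subject and body with a space and lowercasing them stands in for the real normalization, and the hyperparameters (default TfidfVectorizer, max_iter=1000, random_state=42) are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up rows (subject, body, category_label) purely for illustration.
rows = [
    ("Refund request", "I was charged twice this month", "billing"),
    ("Invoice missing", "Cannot find my invoice for March", "billing"),
    ("Payment failed", "My card was declined at checkout", "billing"),
    ("Wrong charge", "The invoice amount is incorrect", "billing"),
    ("Billing address", "Please update the address on my invoice", "billing"),
    ("App crashes", "The app crashes when I open settings", "technical"),
    ("Login error", "I get an error 500 after login", "technical"),
    ("Slow dashboard", "Dashboard takes minutes to load", "technical"),
    ("Sync broken", "Data does not sync between devices", "technical"),
    ("Upload fails", "File upload fails with a timeout error", "technical"),
]

# Assumed normalization: join subject and body, trim, lowercase.
texts = [f"{subject} {body}".strip().lower() for subject, body, _ in rows]
labels = [label for _, _, label in rows]

# 80/20 split that preserves the class proportions in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# TF-IDF features feeding a logistic regression with balanced class weights.
model = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X_train, y_train)

# Validation metric on the held-out 20%.
val_macro_f1 = f1_score(y_val, model.predict(X_val), average="macro")
print(f"validation macro F1: {val_macro_f1:.2f}")
```

Wrapping the vectorizer and classifier in one pipeline means the same text preprocessing is applied at training and prediction time, which is why a single object can be persisted per classifier.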
Output Artifacts
After training completes, the following artifacts are generated:
Model Files (artifacts/)
- category_model.joblib: Trained category classifier pipeline
- priority_model.joblib: Trained priority classifier pipeline
Reports (reports/)
- val_metrics.json: Validation metrics in JSON format
- category_confusion_matrix.png: Confusion matrix visualization for categories
- priority_confusion_matrix.png: Confusion matrix visualization for priorities
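A quick way to confirm that a training run produced everything listed above is to check for the files on disk. The sketch below assumes the artifacts/ and reports/ directories sit directly under the project root; the helper name missing_artifacts is hypothetical.

```python
from pathlib import Path

# File names as listed above; locations are assumed relative to the project root.
EXPECTED_ARTIFACTS = (
    "artifacts/category_model.joblib",
    "artifacts/priority_model.joblib",
    "reports/val_metrics.json",
    "reports/category_confusion_matrix.png",
    "reports/priority_confusion_matrix.png",
)


def missing_artifacts(project_root="."):
    """Return the expected artifact paths that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in EXPECTED_ARTIFACTS if not (root / rel).is_file()]
```

An empty return value means all five files are present.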
Validation Metrics
The training script computes three key metrics (see src/ml/train.py:174):
| Metric | Description |
|---|---|
| category_macro_f1 | Macro-averaged F1 score for category classification |
| priority_f1 | Weighted F1 score for priority classification |
| priority_recall | Weighted recall for priority classification |
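To make the difference between macro and weighted averaging concrete, here is a small pure-Python sketch. It is an illustration of the two averaging schemes, not the project's metric code, and the labels in the usage note are made up.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label that occurs in y_true or y_pred."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by each class's share of the true labels."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)
```

For eight true "low" tickets all predicted correctly and two true "urgent" tickets of which one is predicted "low", macro F1 is about 0.80 while weighted F1 is about 0.89. The macro average penalizes the weak minority class more heavily, which is why it is the stricter choice when classes are imbalanced.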
The metrics are written to reports/val_metrics.json.
Confusion Matrices
Confusion matrices are saved as PNG images in the reports/ directory. These visualizations help identify:
- Which categories are frequently confused
- Priority levels that are difficult to distinguish
- Class imbalance issues
The two plots use different colormaps:
- Blues colormap for categories
- Oranges colormap for priorities
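The same information that the PNG plots show can be inspected numerically. The sketch below counts (true, predicted) label pairs and lists the most frequent misclassifications; the function names are hypothetical and the approach is a plain-Python stand-in for reading the plotted matrix.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (true_label, predicted_label) pairs, the data behind a confusion matrix."""
    return Counter(zip(y_true, y_pred))


def most_confused(y_true, y_pred, top=3):
    """Return the most frequent off-diagonal (true, predicted) pairs with their counts."""
    errors = Counter(
        {
            pair: count
            for pair, count in confusion_counts(y_true, y_pred).items()
            if pair[0] != pair[1]
        }
    )
    return errors.most_common(top)
```

Off-diagonal pairs with high counts correspond to the dark cells away from the diagonal in the saved images.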
Edge Case Handling
The training script includes robust handling for edge cases (see src/ml/train.py:150):
- Single-class datasets: Falls back to ConstantPredictor when only one class exists
- Small datasets: Uses non-stratified split when stratification is impossible
- Class imbalance: Applies class_weight="balanced" in logistic regression
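The decision logic behind these fallbacks might look roughly like the sketch below. Everything here is an assumption for illustration: ConstantLabelPredictor is a hypothetical stand-in for the project's ConstantPredictor (whose real interface is not shown in this document), and the rule "stratification needs at least two samples per class" is a simplification.

```python
from collections import Counter


class ConstantLabelPredictor:
    """Hypothetical stand-in: always predicts the single label seen during fit."""

    def fit(self, texts, labels):
        distinct = set(labels)
        if len(distinct) != 1:
            raise ValueError("expected exactly one distinct label")
        self.label_ = distinct.pop()
        return self

    def predict(self, texts):
        return [self.label_ for _ in texts]


def choose_strategy(labels):
    """Pick a training strategy from the label distribution.

    Simplifying assumption: a stratified split needs at least two samples per class.
    """
    counts = Counter(labels)
    if not counts:
        raise ValueError("no labels provided")
    if len(counts) == 1:
        return "constant"          # only one class: nothing to learn
    if min(counts.values()) < 2:
        return "plain_split"       # some class too small to stratify
    return "stratified_split"
```

A classifier cannot be fit on a single class, so a constant predictor keeps the pipeline usable end to end instead of failing during training.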
Programmatic Usage
You can also call the training function programmatically, for example to pass a custom CSV path via the train_csv parameter.
Troubleshooting
FileNotFoundError: tickets_train.csv not found
Ensure tickets_train.csv exists in the project root or provide a custom path via the train_csv parameter.
ValueError: Dataset too small for stratified split
This warning appears when one or more classes have too few samples for a stratified split. Training continues with a non-stratified split.
Models predict constant values
Check whether your training data contains only one class. The system uses a ConstantPredictor fallback in this case.