Run a training job on your model. This command is similar to `cog predict`, but is specifically for models that support fine-tuning.
This command is currently in beta. The training interface may change in future versions.
Usage
```shell
cog train [image] [flags]
```
If `image` is provided, training runs on that Docker image (it must be a Cog-built image with training support). Otherwise, Cog builds the model in the current directory and runs training on the result.
Flags
- Training inputs in the form `name=value`. If the value is prefixed with `@`, it is read from a file. Can be specified multiple times. Example: `cog train -i dataset=@training_data.zip -i steps=1000`
- Output path for trained weights
- Environment variables in the form `name=value`. Can be specified multiple times.
- The name of the config file
- GPU devices to add to the container, in the same format as `docker run --gpus`
- Set type of build progress output: `auto`, `tty`, `plain`, or `quiet`
- Use pre-built Cog base image for faster cold boots
- Use Nvidia CUDA base image: `true`, `false`, or `auto`
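To make the `name=value` / `@file` convention concrete, here is a small parser sketching the semantics. This is an illustration only, not Cog's actual implementation:

```python
def parse_input(arg):
    """Split a name=value training input; a value prefixed with '@'
    is read from the named file (sketch of the semantics only,
    not Cog's actual implementation)."""
    name, _, value = arg.partition("=")
    if value.startswith("@"):
        # '@path' means: read the input's value from that file
        with open(value[1:]) as f:
            return name, f.read()
    return name, value
```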
Training Interface
To support training, your model must implement the `train` method in addition to `predict`.
Example predictor with training
```python
from cog import BasePredictor, Input, Path
import torch


class Predictor(BasePredictor):
    def setup(self):
        """Load the model"""
        self.model = load_model()

    def predict(
        self,
        image: Path = Input(description="Input image")
    ) -> Path:
        """Run a prediction"""
        return self.model(image)

    def train(
        self,
        dataset: Path = Input(description="Training dataset (zip file)"),
        steps: int = Input(description="Number of training steps", default=1000),
        learning_rate: float = Input(description="Learning rate", default=1e-4)
    ) -> Path:
        """Fine-tune the model"""
        # Unzip and load training data
        train_data = load_dataset(dataset)

        # Configure training
        optimizer = torch.optim.Adam(self.model.parameters(), lr=learning_rate)

        # Training loop
        for step in range(steps):
            loss = train_step(self.model, train_data, optimizer)
            if step % 100 == 0:
                print(f"Step {step}: loss = {loss}")

        # Save trained weights
        output_path = "/tmp/trained_weights.pth"
        torch.save(self.model.state_dict(), output_path)
        return Path(output_path)
```
Configuration
Update cog.yaml to specify the training interface:
```yaml
build:
  gpu: true
  python_version: "3.12"
  python_requirements: requirements.txt
predict: "predict.py:Predictor"
train: "predict.py:Predictor"  # Can be the same class
```
Examples
Basic training run
```shell
cog train -i dataset=@training_data.zip -i steps=1000
```
Output:
```
Building Docker image from environment in cog.yaml...
[+] Building 2.3s (12/12) FINISHED
Starting Docker image and running setup()...
Running training...
Step 0: loss = 2.456
Step 100: loss = 1.234
Step 200: loss = 0.892
...
Step 1000: loss = 0.234
Written output to: weights/trained_weights.pth
```
Training with custom output path
Saves the trained weights to ./models/v2/.
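The command for this example is not shown above. Assuming `cog train` accepts the same `-o`/`--output` flag as `cog predict` (the flag name here is an assumption), it would look something like:

```shell
# Assumes -o/--output sets the weights destination, as in cog predict
cog train -i dataset=@training_data.zip -o ./models/v2/
```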
Training with environment variables
Track training with Weights & Biases:
```shell
cog train \
  -e WANDB_API_KEY=$WANDB_API_KEY \
  -e WANDB_PROJECT=my-project \
  -i dataset=@data.zip \
  -i steps=5000
```
Training on pre-built image
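No command is shown for this example. Per the usage line `cog train [image] [flags]`, a sketch (the image name is a placeholder):

```shell
# Build once, then reuse the image for training runs
cog build -t my-model
cog train my-model -i dataset=@training_data.zip -i steps=1000
```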
Training with GPU control
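No command is shown here either; a sketch using the `--gpus` flag described above, which follows the `docker run --gpus` format:

```shell
# All available GPUs
cog train --gpus all -i dataset=@training_data.zip

# Specific devices, using docker run --gpus syntax
cog train --gpus '"device=0,1"' -i dataset=@training_data.zip
```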
Training inputs work the same as prediction inputs:
- Files
- Strings
- Numbers
- Booleans
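A sketch mixing all four input types in one run (the input names are hypothetical; they must match your `train()` signature):

```shell
# file (@ reads the value from disk), string, number, boolean
cog train \
  -i dataset=@training_data.zip \
  -i run_name=my-experiment \
  -i steps=2000 \
  -i use_augmentation=true
```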
Output Handling
The `train` method should return a `Path` to the trained weights:
```python
def train(self, dataset: Path, steps: int = 1000) -> Path:
    # ... training code ...
    output_path = "/tmp/weights.pth"
    torch.save(self.model.state_dict(), output_path)
    return Path(output_path)
```
Cog automatically:
- Receives the returned path
- Copies the file from the container
- Saves it to the specified output location
Output:
```
Written output to: ./my_weights.pth
```
Training Patterns
Progress reporting
Use `print()` to show training progress:
```python
def train(self, dataset: Path, steps: int = 1000) -> Path:
    for step in range(steps):
        loss = train_step(...)
        # Print progress every 100 steps
        if step % 100 == 0:
            print(f"Step {step}/{steps}: loss = {loss:.4f}")

    # Save and return the trained weights
    output_path = "/tmp/weights.pth"
    torch.save(self.model.state_dict(), output_path)
    return Path(output_path)
```
Checkpoint saving
Save intermediate checkpoints:
```python
def train(self, dataset: Path, steps: int = 1000) -> Path:
    for step in range(steps):
        loss = train_step(...)
        # Save checkpoint every 1000 steps
        if step % 1000 == 0:
            checkpoint_path = f"/tmp/checkpoint_{step}.pth"
            torch.save(self.model.state_dict(), checkpoint_path)
            print(f"Saved checkpoint: {checkpoint_path}")

    # Return final weights
    final_path = "/tmp/final_weights.pth"
    torch.save(self.model.state_dict(), final_path)
    return Path(final_path)
```
Logging and metrics
Integrate with experiment tracking:
```python
import os
import wandb

def train(self, dataset: Path, steps: int = 1000) -> Path:
    # Initialize W&B from environment variables
    wandb.init(project=os.environ.get("WANDB_PROJECT", "default"))

    for step in range(steps):
        loss = train_step(...)
        # Log metrics
        wandb.log({"loss": loss, "step": step})
        print(f"Step {step}: loss = {loss:.4f}")

    wandb.finish()

    # Save and return the trained weights
    output_path = "/tmp/weights.pth"
    torch.save(self.model.state_dict(), output_path)
    return Path(output_path)
```
Run with:
```shell
cog train -e WANDB_API_KEY=$WANDB_API_KEY -e WANDB_PROJECT=my-project -i dataset=@data.zip
```
Validation during training
```python
def train(
    self,
    train_data: Path,
    val_data: Path,
    steps: int = 1000
) -> Path:
    train_dataset = load_dataset(train_data)
    val_dataset = load_dataset(val_data)

    for step in range(steps):
        # Training step
        train_loss = train_step(self.model, train_dataset)

        # Validation every 100 steps
        if step % 100 == 0:
            val_loss = validate(self.model, val_dataset)
            print(f"Step {step}: train_loss={train_loss:.4f}, val_loss={val_loss:.4f}")

    # Save and return the trained weights
    output_path = "/tmp/weights.pth"
    torch.save(self.model.state_dict(), output_path)
    return Path(output_path)
```
GPU Requirements
Training typically requires GPUs. Configure in cog.yaml:
```yaml
build:
  gpu: true
  cuda: "12.1"
  python_version: "3.12"
  python_requirements: requirements.txt
train: "predict.py:Predictor"
```
Cog automatically:
- Uses CUDA base images
- Installs GPU-enabled packages
- Passes `--gpus all` to the container
Error Handling
Training failures
When training fails, Cog exits with a non-zero status:
Out of memory
```
RuntimeError: CUDA out of memory
```
Solutions:
- Reduce batch size
- Use gradient accumulation
- Use a smaller model
- Use more GPUs
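Gradient accumulation keeps the effective batch size while cutting peak memory: gradients are summed over several micro-batches, then applied in a single update. A framework-free sketch of the pattern (with PyTorch you would call `loss.backward()` per micro-batch and `optimizer.step()` once per group; `grad_fn` and `apply_update` stand in for those):

```python
def train_with_accumulation(micro_batches, grad_fn, apply_update, accum=4):
    """Sum gradients over `accum` micro-batches, then apply one
    averaged update, so memory scales with the micro-batch size."""
    grad_sum, count = 0.0, 0
    for batch in micro_batches:
        grad_sum += grad_fn(batch)   # accumulate instead of updating
        count += 1
        if count == accum:
            apply_update(grad_sum / accum)  # one step per group
            grad_sum, count = 0.0, 0
    if count:                        # flush any leftover partial group
        apply_update(grad_sum / count)
```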
Invalid dataset
```
Error: Failed to load dataset from data.zip
```
Validate your dataset format matches what your training code expects.
How It Works
1. Build phase (if no image specified):
   - Reads `cog.yaml`
   - Builds Docker image with training dependencies
   - Mounts current directory as a volume
2. Setup phase:
   - Starts Docker container with GPU access
   - Runs your model's `setup()` method
   - Loads model weights
3. Training phase:
   - Validates training inputs
   - Runs your model's `train()` method
   - Streams logs to terminal
4. Output phase:
   - Receives trained weights path
   - Copies weights from container
   - Saves to specified output location
5. Cleanup:
   - Stops container
   - Removes temporary resources
Comparison with Predict
| Feature | `cog predict` | `cog train` |
|---|---|---|
| Method called | `predict()` | `train()` |
| Default output | `output` | `weights` |
| Typical duration | Seconds to minutes | Minutes to hours |
| GPU usage | Optional | Usually required |
| Return type | Any | Usually `Path` |
See Also