Run a training job on your model. This command is similar to `cog predict`, but is specifically for models that support fine-tuning.
This command is currently in beta. The training interface may change in future versions.
Usage
```shell
cog train [image] [flags]
```
If `image` is provided, training runs on that Docker image (it must be a Cog-built image with training support). Otherwise, Cog builds the model in the current directory and runs training on the result.
Flags
- Training inputs in the form `name=value`. If the value is prefixed with `@`, it is read from a file. Can be specified multiple times. Example: `cog train -i dataset=@training_data.zip -i steps=1000`
- Output path for trained weights
- Environment variables in the form `name=value`. Can be specified multiple times.
- The name of the config file
- GPU devices to add to the container, in the same format as `docker run --gpus`
- Set type of build progress output: `auto`, `tty`, `plain`, or `quiet`
- Use pre-built Cog base image for faster cold boots
- Use Nvidia CUDA base image: `true`, `false`, or `auto`
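To make the `name=value` / `@file` convention concrete, here is a small parser sketching the semantics. This is an illustration only, not Cog's actual implementation:

```python
def parse_input(arg):
    """Split a name=value training input; a value prefixed with '@'
    is read from the named file (sketch of the semantics only,
    not Cog's actual implementation)."""
    name, _, value = arg.partition("=")
    if value.startswith("@"):
        # '@path' means: read the input's value from that file
        with open(value[1:]) as f:
            return name, f.read()
    return name, value
```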
Training Interface
To support training, your model must implement the `train` method in addition to `predict`.
Example predictor with training
```python
from cog import BasePredictor, Input, Path
import torch


class Predictor(BasePredictor):
    def setup(self):
        """Load the model"""
        self.model = load_model()

    def predict(
        self,
        image: Path = Input(description="Input image")
    ) -> Path:
        """Run a prediction"""
        return self.model(image)

    def train(
        self,
        dataset: Path = Input(description="Training dataset (zip file)"),
        steps: int = Input(description="Number of training steps", default=1000),
        learning_rate: float = Input(description="Learning rate", default=1e-4)
    ) -> Path:
        """Fine-tune the model"""
        # Unzip and load training data
        train_data = load_dataset(dataset)

        # Configure training
        optimizer = torch.optim.Adam(self.model.parameters(), lr=learning_rate)

        # Training loop
        for step in range(steps):
            loss = train_step(self.model, train_data, optimizer)
            if step % 100 == 0:
                print(f"Step {step}: loss = {loss}")

        # Save trained weights
        output_path = "/tmp/trained_weights.pth"
        torch.save(self.model.state_dict(), output_path)
        return Path(output_path)
```
Configuration
Update cog.yaml to specify the training interface:
```yaml
build:
  gpu: true
  python_version: "3.12"
  python_requirements: requirements.txt
predict: "predict.py:Predictor"
train: "predict.py:Predictor"  # Can be the same class
```
Examples
Basic training run
```shell
cog train -i dataset=@training_data.zip -i steps=1000
```
Output:
```
Building Docker image from environment in cog.yaml...
[+] Building 2.3s (12/12) FINISHED
Starting Docker image and running setup()...
Running training...
Step 0: loss = 2.456
Step 100: loss = 1.234
Step 200: loss = 0.892
...
Step 1000: loss = 0.234
Written output to: weights/trained_weights.pth
```
Training with custom output path
Saves the trained weights to ./models/v2/.
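The command for this example is not shown above. Assuming `cog train` accepts the same `-o`/`--output` flag as `cog predict` (the flag name here is an assumption), it would look something like:

```shell
# Assumes -o/--output sets the weights destination, as in cog predict
cog train -i dataset=@training_data.zip -o ./models/v2/
```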
Training with environment variables
Track training with Weights & Biases:
```shell
cog train \
  -e WANDB_API_KEY=$WANDB_API_KEY \
  -e WANDB_PROJECT=my-project \
  -i dataset=@data.zip \
  -i steps=5000
```
Training on pre-built image
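No command is shown for this example. Per the usage line `cog train [image] [flags]`, a sketch (the image name is a placeholder):

```shell
# Build once, then reuse the image for training runs
cog build -t my-model
cog train my-model -i dataset=@training_data.zip -i steps=1000
```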
Training with GPU control
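No command is shown here either; a sketch using the `--gpus` flag described above, which follows the `docker run --gpus` format:

```shell
# All available GPUs
cog train --gpus all -i dataset=@training_data.zip

# Specific devices, using docker run --gpus syntax
cog train --gpus '"device=0,1"' -i dataset=@training_data.zip
```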
Training inputs work the same as prediction inputs:
- Files
- Strings
- Numbers
- Booleans
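A sketch mixing all four input types in one run (the input names are hypothetical; they must match your `train()` signature):

```shell
# file (@ reads the value from disk), string, number, boolean
cog train \
  -i dataset=@training_data.zip \
  -i run_name=my-experiment \
  -i steps=2000 \
  -i use_augmentation=true
```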
Output Handling
The `train` method should return a `Path` to the trained weights:
```python
def train(self, dataset: Path, steps: int = 1000) -> Path:
    # ... training code ...
    output_path = "/tmp/weights.pth"
    torch.save(self.model.state_dict(), output_path)
    return Path(output_path)
```
Cog automatically:
- Receives the returned path
- Copies the file from the container
- Saves it to the specified output location
Output:
```
Written output to: ./my_weights.pth
```
Training Patterns
Progress reporting
Use `print()` to show training progress:
```python
def train(self, dataset: Path, steps: int = 1000) -> Path:
    for step in range(steps):
        loss = train_step(...)
        # Print progress every 100 steps
        if step % 100 == 0:
            print(f"Step {step}/{steps}: loss = {loss:.4f}")

    # Save and return the trained weights
    output_path = "/tmp/weights.pth"
    torch.save(self.model.state_dict(), output_path)
    return Path(output_path)
```
Checkpoint saving
Save intermediate checkpoints:
```python
def train(self, dataset: Path, steps: int = 1000) -> Path:
    for step in range(steps):
        loss = train_step(...)
        # Save checkpoint every 1000 steps
        if step % 1000 == 0:
            checkpoint_path = f"/tmp/checkpoint_{step}.pth"
            torch.save(self.model.state_dict(), checkpoint_path)
            print(f"Saved checkpoint: {checkpoint_path}")

    # Return final weights
    final_path = "/tmp/final_weights.pth"
    torch.save(self.model.state_dict(), final_path)
    return Path(final_path)
```
Logging and metrics
Integrate with experiment tracking:
```python
import os
import wandb

def train(self, dataset: Path, steps: int = 1000) -> Path:
    # Initialize W&B from environment variables
    wandb.init(project=os.environ.get("WANDB_PROJECT", "default"))

    for step in range(steps):
        loss = train_step(...)
        # Log metrics
        wandb.log({"loss": loss, "step": step})
        print(f"Step {step}: loss = {loss:.4f}")

    wandb.finish()

    # Save and return the trained weights
    output_path = "/tmp/weights.pth"
    torch.save(self.model.state_dict(), output_path)
    return Path(output_path)
```
Run with:
```shell
cog train -e WANDB_API_KEY=$WANDB_API_KEY -e WANDB_PROJECT=my-project -i dataset=@data.zip
```
Validation during training
```python
def train(
    self,
    train_data: Path,
    val_data: Path,
    steps: int = 1000
) -> Path:
    train_dataset = load_dataset(train_data)
    val_dataset = load_dataset(val_data)

    for step in range(steps):
        # Training step
        train_loss = train_step(self.model, train_dataset)

        # Validation every 100 steps
        if step % 100 == 0:
            val_loss = validate(self.model, val_dataset)
            print(f"Step {step}: train_loss={train_loss:.4f}, val_loss={val_loss:.4f}")

    # Save and return the trained weights
    output_path = "/tmp/weights.pth"
    torch.save(self.model.state_dict(), output_path)
    return Path(output_path)
```
GPU Requirements
Training typically requires GPUs. Configure in cog.yaml:
```yaml
build:
  gpu: true
  cuda: "12.1"
  python_version: "3.12"
  python_requirements: requirements.txt
train: "predict.py:Predictor"
```
Cog automatically:
- Uses CUDA base images
- Installs GPU-enabled packages
- Passes `--gpus all` to the container
Error Handling
Training failures
When training fails, Cog exits with a non-zero status:
Out of memory
```
RuntimeError: CUDA out of memory
```
Solutions:
- Reduce batch size
- Use gradient accumulation
- Use a smaller model
- Use more GPUs
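Gradient accumulation keeps the effective batch size while cutting peak memory: gradients are summed over several micro-batches, then applied in a single update. A framework-free sketch of the pattern (with PyTorch you would call `loss.backward()` per micro-batch and `optimizer.step()` once per group; `grad_fn` and `apply_update` stand in for those):

```python
def train_with_accumulation(micro_batches, grad_fn, apply_update, accum=4):
    """Sum gradients over `accum` micro-batches, then apply one
    averaged update, so memory scales with the micro-batch size."""
    grad_sum, count = 0.0, 0
    for batch in micro_batches:
        grad_sum += grad_fn(batch)   # accumulate instead of updating
        count += 1
        if count == accum:
            apply_update(grad_sum / accum)  # one step per group
            grad_sum, count = 0.0, 0
    if count:                        # flush any leftover partial group
        apply_update(grad_sum / count)
```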
Invalid dataset
```
Error: Failed to load dataset from data.zip
```
Validate your dataset format matches what your training code expects.
How It Works
1. Build phase (if no image specified):
   - Reads `cog.yaml`
   - Builds Docker image with training dependencies
   - Mounts current directory as a volume
2. Setup phase:
   - Starts Docker container with GPU access
   - Runs your model's `setup()` method
   - Loads model weights
3. Training phase:
   - Validates training inputs
   - Runs your model's `train()` method
   - Streams logs to terminal
4. Output phase:
   - Receives trained weights path
   - Copies weights from container
   - Saves to specified output location
5. Cleanup:
   - Stops container
   - Removes temporary resources
Comparison with Predict
| Feature | `cog predict` | `cog train` |
|---|---|---|
| Method called | `predict()` | `train()` |
| Default output | `output` | `weights` |
| Typical duration | Seconds to minutes | Minutes to hours |
| GPU usage | Optional | Usually required |
| Return type | Any | Usually `Path` |
See Also