Overview

This guide walks you through training your first agent with PufferLib. You’ll learn how to:
  • Use the command-line interface for instant training
  • Load and configure environments
  • Customize training with the Python API
  • Build custom policies
By the end, you’ll have a working agent trained on Breakout!

Train via command line

The fastest way to start training is with the puffer CLI:
puffer train puffer_breakout
That’s it! PufferLib will:
  1. Load the Breakout environment from the Ocean collection
  2. Initialize a default CNN policy
  3. Start training with PPO
  4. Display a live dashboard with metrics
  5. Save checkpoints to checkpoints/
Training runs until you stop it with Ctrl+C. Checkpoints are saved periodically so you can resume later.

Customize training parameters

Override defaults with command-line flags:
puffer train puffer_breakout \
  --train.total-timesteps 10000000 \
  --train.learning-rate 0.0003 \
  --env.num-envs 4096 \
  --vec.num-workers 8
See all options:
puffer train --help

Train with Python API

Simple training script

For more control, use the Python API. Here’s a minimal training loop:
import pufferlib.ocean
import pufferlib.vector
from pufferlib import pufferl

def simple_trainer(env_name='puffer_breakout'):
    # Load default config for the environment
    args = pufferl.load_config(env_name)

    # Customize configuration
    args['vec']['num_envs'] = 2
    args['env']['num_envs'] = 2048
    args['policy']['hidden_size'] = 256
    args['train']['total_timesteps'] = 10_000_000
    args['train']['learning_rate'] = 0.03

    # Create vectorized environment
    vecenv = pufferl.load_env(env_name, args)
    
    # Create policy
    policy = pufferl.load_policy(args, vecenv, env_name)

    # Initialize trainer
    trainer = pufferl.PuffeRL(args['train'], vecenv, policy)

    # Training loop
    while trainer.epoch < trainer.total_epochs:
        trainer.evaluate()
        logs = trainer.train()

    trainer.print_dashboard()
    trainer.close()

if __name__ == '__main__':
    simple_trainer()

Load configuration

pufferl.load_config() loads environment-specific defaults. You can customize any parameter:
args = pufferl.load_config('puffer_breakout')
args['train']['learning_rate'] = 0.001
args['train']['gamma'] = 0.99
args['train']['gae_lambda'] = 0.95
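To make the `gamma` and `gae_lambda` knobs concrete: PPO uses them in generalized advantage estimation (GAE). The sketch below is illustrative only (plain Python, ignoring episode boundaries), not PufferLib's internal implementation:

```python
# Illustrative GAE: gamma discounts future rewards, gae_lambda controls the
# bias/variance trade-off of the advantage estimate.
def compute_gae(rewards, values, last_value, gamma=0.99, gae_lambda=0.95):
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = last_value if t == len(rewards) - 1 else values[t + 1]
        # TD error at step t
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of TD errors
        gae = delta + gamma * gae_lambda * gae
        advantages[t] = gae
    return advantages

advs = compute_gae([1.0, 1.0, 1.0], [0.5, 0.5, 0.5], last_value=0.5)
```

Earlier steps accumulate more discounted TD errors, so their advantage estimates are larger here.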

Create vectorized environment

pufferl.load_env() creates a vectorized environment with parallel workers:
vecenv = pufferl.load_env(env_name, args)
This spawns multiple environment processes for faster data collection.

Initialize policy

pufferl.load_policy() creates a neural network policy:
policy = pufferl.load_policy(args, vecenv, env_name)
The default policy is a CNN for image observations or MLP for vector observations.
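The following sketch shows the kind of dispatch this implies; it is not PufferLib's actual code, just an illustration of choosing an encoder from the observation shape:

```python
import torch

# Hypothetical encoder selection: image-like (C, H, W) observations get a
# small CNN, flat vector observations get an MLP.
def make_encoder(obs_shape, hidden_size=128):
    if len(obs_shape) == 3:  # (C, H, W) image observation
        c, h, w = obs_shape
        return torch.nn.Sequential(
            torch.nn.Conv2d(c, 32, kernel_size=8, stride=4),
            torch.nn.ReLU(),
            torch.nn.Flatten(),
            torch.nn.LazyLinear(hidden_size),  # infers input size on first call
        )
    # 1-D vector observation
    return torch.nn.Sequential(
        torch.nn.Linear(obs_shape[0], hidden_size),
        torch.nn.ReLU(),
    )

mlp = make_encoder((4,))          # e.g. CartPole-style vector obs
cnn = make_encoder((3, 84, 84))   # e.g. Atari-style image obs
```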

Create trainer

PuffeRL manages the training loop, checkpoints, and logging:
trainer = pufferl.PuffeRL(args['train'], vecenv, policy)

Train

Call train() and evaluate() in a loop:
while trainer.epoch < trainer.total_epochs:
    trainer.evaluate()  # Run evaluation episodes
    logs = trainer.train()  # Train for one epoch

Custom policy

Define your own policy architecture:
import torch
import pufferlib.pytorch

class CustomPolicy(torch.nn.Module):
    def __init__(self, env):
        super().__init__()
        obs_size = env.single_observation_space.shape[0]
        num_actions = env.single_action_space.n
        
        self.net = torch.nn.Sequential(
            pufferlib.pytorch.layer_init(torch.nn.Linear(obs_size, 128)),
            torch.nn.ReLU(),
            pufferlib.pytorch.layer_init(torch.nn.Linear(128, 128)),
            torch.nn.ReLU(),
        )
        self.action_head = torch.nn.Linear(128, num_actions)
        self.value_head = torch.nn.Linear(128, 1)

    def forward(self, observations, state=None):
        hidden = self.net(observations)
        logits = self.action_head(hidden)
        values = self.value_head(hidden)
        return logits, values
Use your custom policy:
env_name = 'puffer_breakout'
env_creator = pufferlib.ocean.env_creator(env_name)
vecenv = pufferlib.vector.make(
    env_creator,
    num_envs=2,
    num_workers=2,
    batch_size=1,
    backend=pufferlib.vector.Multiprocessing,
    env_kwargs={'num_envs': 4096}
)

policy = CustomPolicy(vecenv.driver_env).cuda()
args = pufferl.load_config('default')
args['train']['env'] = env_name

trainer = pufferl.PuffeRL(args['train'], vecenv, policy)

for epoch in range(10):
    trainer.evaluate()
    logs = trainer.train()

trainer.print_dashboard()
trainer.close()
Your policy must implement forward() and return (logits, values). The state parameter is for recurrent policies (can be None).
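For recurrent policies, a common pattern is an LSTM cell whose state is passed in and returned. This is a sketch under the assumption that the trainer threads state through `forward()`; check PufferLib's own recurrent examples for the exact convention:

```python
import torch

# Hypothetical recurrent policy: same (logits, values) outputs, plus LSTM
# state threaded through forward(). None means a fresh, zero-initialized state.
class RecurrentPolicy(torch.nn.Module):
    def __init__(self, obs_size, num_actions, hidden_size=128):
        super().__init__()
        self.encoder = torch.nn.Linear(obs_size, hidden_size)
        self.lstm = torch.nn.LSTMCell(hidden_size, hidden_size)
        self.action_head = torch.nn.Linear(hidden_size, num_actions)
        self.value_head = torch.nn.Linear(hidden_size, 1)

    def forward(self, observations, state=None):
        x = torch.relu(self.encoder(observations))
        if state is None:
            h = torch.zeros(x.shape[0], self.lstm.hidden_size)
            state = (h, h.clone())
        h, c = self.lstm(x, state)
        return self.action_head(h), self.value_head(h), (h, c)

policy = RecurrentPolicy(obs_size=4, num_actions=2)
logits, values, state = policy(torch.zeros(3, 4))          # fresh state
logits, values, state = policy(torch.zeros(3, 4), state)   # carried state
```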

Vectorization

PufferLib uses vectorization to run multiple environments in parallel. Here’s how to configure it:
import gymnasium
import pufferlib.vector

# Create environment factory
def make_env():
    return gymnasium.make('CartPole-v1')

# Serial backend (single-threaded, for debugging)
vecenv = pufferlib.vector.make(
    make_env,
    num_envs=4,
    backend=pufferlib.vector.Serial
)

# Multiprocessing backend (parallel workers)
vecenv = pufferlib.vector.make(
    make_env,
    num_envs=8,
    num_workers=4,
    batch_size=2,
    backend=pufferlib.vector.Multiprocessing
)

Key parameters

  • num_envs: Total number of environments
  • num_workers: Number of parallel processes
  • batch_size: Number of environments processed per step
  • backend: Serial or Multiprocessing
Ensure num_workers divides num_envs, and that batch_size is a multiple of the envs per worker (num_envs / num_workers).
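A quick sanity check for these constraints, in plain Python with no PufferLib required (assuming the convention that num_workers divides num_envs and batch_size is a multiple of envs per worker):

```python
# Validate vectorization parameters before building a vecenv.
def check_vec_params(num_envs, num_workers, batch_size):
    assert num_envs % num_workers == 0, "num_workers must divide num_envs"
    envs_per_worker = num_envs // num_workers
    assert batch_size % envs_per_worker == 0, \
        "batch_size must be a multiple of envs per worker"
    return envs_per_worker

# Matches the Multiprocessing example above: 2 envs per worker
check_vec_params(num_envs=8, num_workers=4, batch_size=2)
```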

Using Gymnasium environments

Wrap any Gymnasium environment:
import gymnasium
import pufferlib.emulation
import pufferlib.vector
from pufferlib import pufferl

# Create Gymnasium environment
gym_env = gymnasium.make('CartPole-v1')

# Wrap as PufferEnv
puffer_env = pufferlib.emulation.GymnasiumPufferEnv(gym_env)

# Use with vectorization
def make_env():
    env = gymnasium.make('CartPole-v1')
    return pufferlib.emulation.GymnasiumPufferEnv(env)

vecenv = pufferlib.vector.make(
    make_env,
    num_envs=8,
    num_workers=4,
    backend=pufferlib.vector.Multiprocessing
)

# Train with PufferRL
args = pufferl.load_config('default')
policy = pufferl.load_policy(args, vecenv)
trainer = pufferl.PuffeRL(args['train'], vecenv, policy)

for epoch in range(100):
    trainer.evaluate()
    trainer.train()

trainer.close()

Evaluation

Evaluate a trained policy:
puffer eval puffer_breakout --policy checkpoints/latest.pt
Or in Python:
import torch
import pufferlib.ocean
import pufferlib.vector

# Load environment
env_creator = pufferlib.ocean.env_creator('puffer_breakout')
vecenv = pufferlib.vector.make(env_creator, num_envs=1)

# Load policy
policy = torch.load('checkpoints/latest.pt', weights_only=False)
policy.eval()

# Run episodes
obs, _ = vecenv.reset()
for _ in range(1000):
    with torch.no_grad():
        logits, _ = policy(torch.as_tensor(obs))
        actions = torch.argmax(logits, dim=-1)
    obs, rewards, dones, truncs, infos = vecenv.step(actions.numpy())
    
vecenv.close()
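Note that argmax gives a deterministic, greedy policy. To evaluate the stochastic policy the agent actually trained with, you can sample from the categorical distribution over the logits instead. A minimal sketch with placeholder logits:

```python
import torch

# Sample actions instead of taking the argmax; this matches the
# exploration behavior used during training.
logits = torch.tensor([[2.0, 0.5, 0.1]])  # placeholder logits for one env
dist = torch.distributions.Categorical(logits=logits)
actions = dist.sample()
```

In the loop above, you would replace the `torch.argmax` line with this sampling step.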

Distributed training

Scale training across multiple GPUs:
torchrun --standalone --nnodes=1 --nproc-per-node=4 \
  -m pufferlib.pufferl train puffer_breakout
This spawns 4 processes, each with its own GPU.
PufferRL uses PyTorch DDP for distributed training. Gradients are synchronized across all processes.
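Under the hood, torchrun sets up one process group member per GPU. The single-process sketch below (world size 1, CPU, gloo backend) only illustrates the standard PyTorch DDP mechanics, not PufferLib's internals:

```python
import os
import torch
import torch.distributed as dist

# torchrun normally sets these; hardcoded here for a standalone run.
os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
os.environ.setdefault('MASTER_PORT', '29500')
dist.init_process_group('gloo', rank=0, world_size=1)

model = torch.nn.Linear(4, 2)
ddp_model = torch.nn.parallel.DistributedDataParallel(model)

# backward() all-reduces (averages) gradients across every rank.
loss = ddp_model(torch.ones(1, 4)).sum()
loss.backward()

dist.destroy_process_group()
```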

Next steps

Core concepts

Understand PufferLib’s architecture

Training guide

Deep dive into PufferRL

Environment wrappers

Work with different environment types

API reference

Explore the full API

Example projects

Check out complete examples in the PufferLib GitHub repository.

Get help

Stuck? Reach out on the PufferLib Discord or open a GitHub issue.
