Overview
This guide walks you through training your first agent with PufferLib. You’ll learn how to:
- Use the command-line interface for instant training
- Load and configure environments
- Customize training with the Python API
- Build custom policies
By the end, you’ll have a working agent trained on Breakout!
Train via command line
The fastest way to start training is with the puffer CLI:
```bash
puffer train puffer_breakout
```
That’s it! PufferLib will:
- Load the Breakout environment from the Ocean collection
- Initialize a default CNN policy
- Start training with PPO
- Display a live dashboard with metrics
- Save checkpoints to `checkpoints/`
Training runs until you stop it with Ctrl+C. Checkpoints are saved periodically so you can resume later.
Customize training parameters
Override defaults with command-line flags:
```bash
puffer train puffer_breakout \
    --train.total-timesteps 10000000 \
    --train.learning-rate 0.0003 \
    --env.num-envs 4096 \
    --vec.num-workers 8
```
See all options with `puffer train --help`.
Train with Python API
Simple training script
For more control, use the Python API. Here’s a minimal training loop:
```python
import pufferlib.ocean
import pufferlib.vector
from pufferlib import pufferl

def simple_trainer(env_name='puffer_breakout'):
    # Load default config for the environment
    args = pufferl.load_config(env_name)

    # Customize configuration
    args['vec']['num_envs'] = 2
    args['env']['num_envs'] = 2048
    args['policy']['hidden_size'] = 256
    args['train']['total_timesteps'] = 10_000_000
    args['train']['learning_rate'] = 0.03

    # Create vectorized environment
    vecenv = pufferl.load_env(env_name, args)

    # Create policy
    policy = pufferl.load_policy(args, vecenv, env_name)

    # Initialize trainer
    trainer = pufferl.PuffeRL(args['train'], vecenv, policy)

    # Training loop
    while trainer.epoch < trainer.total_epochs:
        trainer.evaluate()
        logs = trainer.train()
        trainer.print_dashboard()

    trainer.close()

if __name__ == '__main__':
    simple_trainer()
```
Load configuration
`pufferl.load_config()` loads environment-specific defaults. You can customize any parameter:

```python
args = pufferl.load_config('puffer_breakout')
args['train']['learning_rate'] = 0.001
args['train']['gamma'] = 0.99
args['train']['gae_lambda'] = 0.95
```
Create vectorized environment
`pufferl.load_env()` creates a vectorized environment with parallel workers:

```python
vecenv = pufferl.load_env(env_name, args)
```

This spawns multiple environment processes for faster data collection.

Initialize policy

`pufferl.load_policy()` creates a neural network policy:

```python
policy = pufferl.load_policy(args, vecenv, env_name)
```

The default policy is a CNN for image observations or an MLP for vector observations.

Create trainer

PuffeRL manages the training loop, checkpoints, and logging:

```python
trainer = pufferl.PuffeRL(args['train'], vecenv, policy)
```
Train
Call `evaluate()` and `train()` in a loop:

```python
while trainer.epoch < trainer.total_epochs:
    trainer.evaluate()      # Run evaluation episodes
    logs = trainer.train()  # Train for one epoch
```
Custom policy
Define your own policy architecture:
```python
import torch
import pufferlib.pytorch

class CustomPolicy(torch.nn.Module):
    def __init__(self, env):
        super().__init__()
        obs_size = env.single_observation_space.shape[0]
        num_actions = env.single_action_space.n
        self.net = torch.nn.Sequential(
            pufferlib.pytorch.layer_init(torch.nn.Linear(obs_size, 128)),
            torch.nn.ReLU(),
            pufferlib.pytorch.layer_init(torch.nn.Linear(128, 128)),
            torch.nn.ReLU(),
        )
        self.action_head = torch.nn.Linear(128, num_actions)
        self.value_head = torch.nn.Linear(128, 1)

    def forward(self, observations, state=None):
        hidden = self.net(observations)
        logits = self.action_head(hidden)
        values = self.value_head(hidden)
        return logits, values
```
Use your custom policy:
```python
import pufferlib.ocean
import pufferlib.vector
from pufferlib import pufferl

env_name = 'puffer_breakout'
env_creator = pufferlib.ocean.env_creator(env_name)
vecenv = pufferlib.vector.make(
    env_creator,
    num_envs=2,
    num_workers=2,
    batch_size=1,
    backend=pufferlib.vector.Multiprocessing,
    env_kwargs={'num_envs': 4096},
)

policy = CustomPolicy(vecenv.driver_env).cuda()

args = pufferl.load_config('default')
args['train']['env'] = env_name
trainer = pufferl.PuffeRL(args['train'], vecenv, policy)

for epoch in range(10):
    trainer.evaluate()
    logs = trainer.train()
    trainer.print_dashboard()

trainer.close()
```
Your policy must implement forward() and return (logits, values). The state parameter is for recurrent policies (can be None).
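If you want a recurrent policy, here is a minimal sketch of what one could look like, assuming the trainer passes the previous state into forward() and accepts an updated state back as a third return value. That contract is an assumption of this sketch, not documented API; check PufferLib's recurrent examples for the exact interface.

```python
import torch

class RecurrentPolicy(torch.nn.Module):
    '''Hypothetical recurrent variant. The (logits, values, state) return
    shape is an assumption for illustration, not the documented contract.'''
    def __init__(self, env, hidden_size=128):
        super().__init__()
        obs_size = env.single_observation_space.shape[0]
        num_actions = env.single_action_space.n
        self.encoder = torch.nn.Linear(obs_size, hidden_size)
        self.lstm = torch.nn.LSTMCell(hidden_size, hidden_size)
        self.action_head = torch.nn.Linear(hidden_size, num_actions)
        self.value_head = torch.nn.Linear(hidden_size, 1)

    def forward(self, observations, state=None):
        x = torch.relu(self.encoder(observations))
        if state is None:  # First step: zero-initialize the LSTM state
            h = x.new_zeros(x.shape[0], self.lstm.hidden_size)
            state = (h, h.clone())
        h, c = self.lstm(x, state)
        return self.action_head(h), self.value_head(h), (h, c)
```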
Vectorization
PufferLib uses vectorization to run multiple environments in parallel. Here’s how to configure it:
```python
import gymnasium
import pufferlib.vector

# Create environment factory
def make_env():
    return gymnasium.make('CartPole-v1')

# Serial backend (single-threaded, for debugging)
vecenv = pufferlib.vector.make(
    make_env,
    num_envs=4,
    backend=pufferlib.vector.Serial,
)

# Multiprocessing backend (parallel workers)
vecenv = pufferlib.vector.make(
    make_env,
    num_envs=8,
    num_workers=4,
    batch_size=2,
    backend=pufferlib.vector.Multiprocessing,
)
```
Key parameters
- `num_envs`: total number of environments
- `num_workers`: number of parallel worker processes
- `batch_size`: number of environments processed per step
- `backend`: `Serial` or `Multiprocessing`

num_workers must evenly divide num_envs, and batch_size must be a multiple of num_envs // num_workers (the number of environments each worker owns); a concrete check is sketched below.
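To make these rules concrete: the multiprocessing example above uses num_envs=8, num_workers=4, batch_size=2, so each worker owns 8 // 4 = 2 environments and a batch of 2 is exactly one worker's worth. The helper below is purely illustrative (it is not part of pufferlib) and just asserts the divisibility rules:

```python
def check_vec_config(num_envs: int, num_workers: int, batch_size: int) -> None:
    """Illustrative helper (not a pufferlib API): verify the divisibility
    rules before constructing a vectorized environment."""
    assert num_envs % num_workers == 0, 'num_workers must divide num_envs'
    envs_per_worker = num_envs // num_workers
    assert batch_size % envs_per_worker == 0, \
        'batch_size must be a multiple of num_envs // num_workers'

check_vec_config(num_envs=8, num_workers=4, batch_size=2)  # OK
```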
Using Gymnasium environments
Wrap any Gymnasium environment:
```python
import gymnasium
import pufferlib.emulation
import pufferlib.vector
from pufferlib import pufferl

# Create Gymnasium environment
gym_env = gymnasium.make('CartPole-v1')

# Wrap as PufferEnv
puffer_env = pufferlib.emulation.GymnasiumPufferEnv(gym_env)

# Use with vectorization
def make_env():
    env = gymnasium.make('CartPole-v1')
    return pufferlib.emulation.GymnasiumPufferEnv(env)

vecenv = pufferlib.vector.make(
    make_env,
    num_envs=8,
    num_workers=4,
    backend=pufferlib.vector.Multiprocessing,
)

# Train with PuffeRL
args = pufferl.load_config('default')
policy = pufferl.load_policy(args, vecenv)
trainer = pufferl.PuffeRL(args['train'], vecenv, policy)

for epoch in range(100):
    trainer.evaluate()
    trainer.train()

trainer.close()
```
Evaluation
Evaluate a trained policy:
```bash
puffer eval puffer_breakout --policy checkpoints/latest.pt
```
Or in Python:
```python
import torch
import pufferlib.ocean
import pufferlib.vector

# Load environment
env_creator = pufferlib.ocean.env_creator('puffer_breakout')
vecenv = pufferlib.vector.make(env_creator, num_envs=1)

# Load policy (a full module checkpoint, so weights_only=False)
policy = torch.load('checkpoints/latest.pt', weights_only=False)
policy.eval()

# Run episodes
obs, _ = vecenv.reset()
for _ in range(1000):
    with torch.no_grad():
        logits, _ = policy(torch.as_tensor(obs))
    actions = torch.argmax(logits, dim=-1)
    obs, rewards, dones, truncs, infos = vecenv.step(actions.numpy())
vecenv.close()
```
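The argmax above plays greedily. To reproduce the stochastic behavior the policy had during training, sample actions from the logits instead; this uses only standard PyTorch:

```python
# Sample from the action distribution instead of taking the argmax
with torch.no_grad():
    logits, _ = policy(torch.as_tensor(obs))
    actions = torch.distributions.Categorical(logits=logits).sample()
```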
Distributed training
Scale training across multiple GPUs:
```bash
torchrun --standalone --nnodes=1 --nproc-per-node=4 \
    -m pufferlib.pufferl train puffer_breakout
```
This spawns 4 processes, each with its own GPU.
PuffeRL uses PyTorch DistributedDataParallel (DDP) for distributed training. Gradients are synchronized across all processes.
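torchrun sets environment variables such as LOCAL_RANK, RANK, and WORLD_SIZE in each spawned process. If you build your own distributed entry point, a standard PyTorch pattern (not PufferLib-specific) is to pin each process to its GPU:

```python
import os
import torch

# torchrun exports LOCAL_RANK for each spawned process
local_rank = int(os.environ.get('LOCAL_RANK', '0'))
torch.cuda.set_device(local_rank)
device = torch.device(f'cuda:{local_rank}')
```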
Next steps
- Core concepts: understand PufferLib's architecture
- Training guide: deep dive into PuffeRL
- Environment wrappers: work with different environment types
- API reference: explore the full API
Example projects
Check out complete examples.
Get help
Stuck? We’re here to help.