PuffeRL class

The main training class that handles PPO-based reinforcement learning with advanced features like V-trace, prioritized experience replay, and automatic mixed precision.
from pufferlib.pufferl import PuffeRL

pufferl = PuffeRL(config, vecenv, policy, logger=None)

Parameters

config
dict
required
Training configuration dictionary. See configuration parameters below.
vecenv
VectorEnv
required
Vectorized environment instance from pufferlib.vector.
policy
nn.Module
required
PyTorch policy network that returns actions and values.
logger
Logger
default:"None"
Optional logger instance (NeptuneLogger or WandbLogger). Uses NoLogger if not provided.
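
A minimal construction sketch using the loader helpers documented under Utility functions below. The environment name is a placeholder, and the assumption that the config returned by load_config can be passed straight to PuffeRL is untested here:
from pufferlib.pufferl import PuffeRL, load_config, load_env, load_policy

args = load_config('puffer_breakout')        # placeholder environment name
vecenv = load_env('puffer_breakout', args)
policy = load_policy(args, vecenv)
pufferl = PuffeRL(args, vecenv, policy)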

Configuration parameters

seed
int
default:"1"
Random seed for reproducibility.
device
str
default:"cuda"
Device to run training on (cuda or cpu).
batch_size
int | str
required
Total batch size for training. Set to "auto" to derive it from bptt_horizon.
bptt_horizon
int | str
required
Sequence length for backpropagation through time. Set to "auto" to derive it from batch_size.
minibatch_size
int
required
Size of minibatches for gradient updates.
max_minibatch_size
int
required
Maximum minibatch size for gradient accumulation.
update_epochs
int
default:"4"
Number of epochs to train on each batch of data.
total_timesteps
int
required
Total number of environment steps to train for.
learning_rate
float
default:"2.5e-4"
Initial learning rate for optimizer.
optimizer
str
default:"adam"
Optimizer to use (adam or muon).
adam_beta1
float
default:"0.9"
Beta1 parameter for Adam optimizer.
adam_beta2
float
default:"0.999"
Beta2 parameter for Adam optimizer.
adam_eps
float
default:"1e-8"
Epsilon parameter for Adam optimizer.
anneal_lr
bool
default:"true"
Whether to use a cosine annealing learning rate schedule.
min_lr_ratio
float
default:"0.0"
Minimum learning rate as a fraction of the initial learning rate.
gamma
float
default:"0.99"
Discount factor for returns.
gae_lambda
float
default:"0.95"
Lambda parameter for Generalized Advantage Estimation.
clip_coef
float
default:"0.1"
PPO clipping coefficient for policy loss.
vf_coef
float
default:"0.5"
Value function loss coefficient.
vf_clip_coef
float
default:"0.1"
Value function clipping coefficient.
ent_coef
float
default:"0.01"
Entropy bonus coefficient.
max_grad_norm
float
default:"0.5"
Maximum gradient norm for gradient clipping.
vtrace_rho_clip
float
default:"1.0"
V-trace rho clipping parameter.
vtrace_c_clip
float
default:"1.0"
V-trace c clipping parameter.
prio_alpha
float
default:"0.0"
Prioritized experience replay alpha parameter.
prio_beta0
float
default:"0.0"
Initial prioritized experience replay beta parameter.
use_rnn
bool
default:"false"
Whether the policy uses recurrent neural networks.
cpu_offload
bool
default:"false"
Whether to offload observations to CPU memory.
compile
bool
default:"false"
Whether to use torch.compile for optimization.
compile_mode
str
default:"default"
Compilation mode for torch.compile.
precision
str
default:"float32"
Training precision (float32 or bfloat16).
amp
bool
default:"true"
Whether to use automatic mixed precision.
torch_deterministic
bool
default:"false"
Whether to use deterministic CUDA operations.
checkpoint_interval
int
default:"1"
How often to save checkpoints (in epochs).
data_dir
str
default:"experiments"
Directory to save checkpoints and models.
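
For reference, a hand-rolled config using only keys documented above might look like the sketch below. All values are illustrative placeholders; in practice load_config (documented under Utility functions) fills these in from the environment's config file, and whether a plain dict with only these keys suffices is an assumption:
config = dict(
    seed=1,
    device='cuda',
    total_timesteps=10_000_000,
    batch_size=65536,
    bptt_horizon='auto',        # derived from batch_size
    minibatch_size=16384,
    max_minibatch_size=16384,
    update_epochs=4,
    learning_rate=2.5e-4,
    anneal_lr=True,
    gamma=0.99,
    gae_lambda=0.95,
    clip_coef=0.1,
    vf_coef=0.5,
    ent_coef=0.01,
    max_grad_norm=0.5,
)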

Methods

evaluate

Collect a batch of experience from the environment.
stats = pufferl.evaluate()
stats
dict
Dictionary of environment statistics collected during evaluation.

train

Train the policy on collected experience using PPO.
logs = pufferl.train()
logs
dict | None
Training logs including losses and metrics, or None if not logging this step.

save_checkpoint

Save model checkpoint to disk.
model_path = pufferl.save_checkpoint()
model_path
str
Path to saved model checkpoint.

close

Close the environment and save final checkpoint.
model_path = pufferl.close()
model_path
str
Path to final saved model.

print_dashboard

Print training dashboard with metrics and performance stats.
pufferl.print_dashboard(clear=False)
clear
bool
default:"False"
Whether to clear the console before printing.
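
Putting the methods together, a typical loop looks like the following sketch, built only from the methods and properties documented here; accessing the config as a dict is an assumption:
pufferl = PuffeRL(config, vecenv, policy)

while pufferl.global_step < config['total_timesteps']:
    pufferl.evaluate()                # collect a batch of experience
    logs = pufferl.train()            # run PPO updates on that batch
    pufferl.print_dashboard(clear=True)

model_path = pufferl.close()          # saves a final checkpoint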

Properties

uptime
float
Total training time in seconds since initialization.
sps
float
Steps per second (SPS) throughput.
epoch
int
Current training epoch number.
global_step
int
Total number of environment steps taken.

Training functions

train

High-level training function with built-in environment and policy loading.
from pufferlib.pufferl import train

all_logs = train(
    env_name,
    args=None,
    vecenv=None,
    policy=None,
    logger=None,
    early_stop_fn=None
)
env_name
str
required
Name of the environment to train on.
args
dict
default:"None"
Training configuration. Loads from config file if not provided.
vecenv
VectorEnv
default:"None"
Vectorized environment. Creates new environment if not provided.
policy
nn.Module
default:"None"
Policy network. Creates default policy if not provided.
logger
Logger
default:"None"
Logger instance. Creates from args if not provided.
early_stop_fn
callable
default:"None"
Function that takes logs and returns whether to stop early.
all_logs
list[dict]
List of all training logs from the run.
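
For example, stopping once a logged metric crosses a threshold. The metric key used here is hypothetical; substitute one that actually appears in your logs:
from pufferlib.pufferl import train

def stop_when_solved(logs):
    # 'environment/episode_return' is a hypothetical log key.
    return logs.get('environment/episode_return', 0.0) > 400.0

all_logs = train('puffer_breakout', early_stop_fn=stop_when_solved)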

eval

Run evaluation with a trained policy.
from pufferlib.pufferl import eval

eval(env_name, args=None, vecenv=None, policy=None)
env_name
str
required
Name of the environment to evaluate on.
args
dict
default:"None"
Evaluation configuration.
vecenv
VectorEnv
default:"None"
Vectorized environment.
policy
nn.Module
default:"None"
Policy network to evaluate.

sweep

Run hyperparameter sweep with PufferLib’s optimization methods.
from pufferlib.pufferl import sweep

sweep(args=None, env_name=None)
args
dict
default:"None"
Sweep configuration including method and parameters.
env_name
str
default:"None"
Name of the environment to sweep on.

profile

Profile training performance using PyTorch profiler.
from pufferlib.pufferl import profile

profile(args=None, env_name=None, vecenv=None, policy=None)
args
dict
default:"None"
Configuration dictionary.
env_name
str
default:"None"
Name of the environment.
vecenv
VectorEnv
default:"None"
Vectorized environment.
policy
nn.Module
default:"None"
Policy network.
Runs 10 training iterations with PyTorch profiler enabled and exports results to trace.json.

export

Export policy weights to binary file.
from pufferlib.pufferl import export

export(args=None, env_name=None, vecenv=None, policy=None)
args
dict
default:"None"
Configuration dictionary.
env_name
str
default:"None"
Name of the environment.
vecenv
VectorEnv
default:"None"
Vectorized environment.
policy
nn.Module
default:"None"
Policy network to export.
Saves flattened policy weights to {env_name}_weights.bin.
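
The binary can be inspected from Python. Reading it as raw float32 values with no header is an assumption about the export format; adjust the dtype if your weights differ:
import numpy as np

# Assumes the export wrote raw float32 values with no header.
weights = np.fromfile('puffer_breakout_weights.bin', dtype=np.float32)
print(weights.size, 'parameters')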

autotune

Automatically tune vectorization parameters for optimal performance.
from pufferlib.pufferl import autotune

autotune(args=None, env_name=None, vecenv=None, policy=None)
args
dict
default:"None"
Configuration dictionary with batch_size.
env_name
str
default:"None"
Name of the environment.
vecenv
VectorEnv
default:"None"
Vectorized environment.
policy
nn.Module
default:"None"
Policy network.
Automatically determines optimal number of environments and workers.

Utility functions

compute_puff_advantage

Compute PufferLib’s custom advantage function with V-trace.
from pufferlib.pufferl import compute_puff_advantage

advantages = compute_puff_advantage(
    values, rewards, terminals,
    ratio, advantages, gamma,
    gae_lambda, vtrace_rho_clip, vtrace_c_clip
)
values
torch.Tensor
required
Value predictions.
rewards
torch.Tensor
required
Rewards from environment.
terminals
torch.Tensor
required
Terminal flags.
ratio
torch.Tensor
required
Importance sampling ratios.
advantages
torch.Tensor
required
Output tensor for advantages.
gamma
float
required
Discount factor.
gae_lambda
float
required
GAE lambda parameter.
vtrace_rho_clip
float
required
V-trace rho clipping.
vtrace_c_clip
float
required
V-trace c clipping.
advantages
torch.Tensor
Computed advantages.
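
A shape-level sketch with dummy data. The positional argument order and in-place advantages output are read off the signature above; the [segments, horizon] tensor layout is an assumption:
import torch
from pufferlib.pufferl import compute_puff_advantage

B, T = 8, 64                          # assumed [segments, horizon] layout
values = torch.zeros(B, T)
rewards = torch.randn(B, T)
terminals = torch.zeros(B, T)         # 1.0 where an episode ended
ratio = torch.ones(B, T)              # on-policy: importance ratios are 1
advantages = torch.zeros(B, T)        # output tensor, filled in place

advantages = compute_puff_advantage(
    values, rewards, terminals, ratio, advantages,
    0.99,   # gamma
    0.95,   # gae_lambda
    1.0,    # vtrace_rho_clip
    1.0,    # vtrace_c_clip
)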

load_config

Load configuration from environment name.
from pufferlib.pufferl import load_config

config = load_config(env_name, parser=None)
env_name
str
required
Name of the environment.
parser
argparse.ArgumentParser
default:"None"
Optional argument parser to extend.
config
dict
Configuration dictionary.

load_config_file

Load configuration from a file path.
from pufferlib.pufferl import load_config_file

config = load_config_file(file_path, fill_in_default=True, parser=None)
file_path
str
required
Path to configuration file.
fill_in_default
bool
default:"True"
Whether to fill in default values.
parser
argparse.ArgumentParser
default:"None"
Optional argument parser to extend.
config
dict
Configuration dictionary.

load_env

Load vectorized environment from configuration.
from pufferlib.pufferl import load_env

vecenv = load_env(env_name, args)
env_name
str
required
Name of the environment.
args
dict
required
Configuration dictionary.
vecenv
VectorEnv
Vectorized environment instance.

load_policy

Load policy network from configuration.
from pufferlib.pufferl import load_policy

policy = load_policy(args, vecenv, env_name='')
args
dict
required
Configuration dictionary.
vecenv
VectorEnv
required
Vectorized environment.
env_name
str
default:"''"
Environment name for checkpoint loading.
policy
nn.Module
Policy network instance.

Loggers

NoLogger

Default no-op logger when no external logging is configured.
from pufferlib.pufferl import NoLogger

logger = NoLogger(args)

NeptuneLogger

Neptune.ai integration for experiment tracking.
from pufferlib.pufferl import NeptuneLogger

logger = NeptuneLogger(args, load_id=None, mode='async')

WandbLogger

Weights & Biases integration for experiment tracking.
from pufferlib.pufferl import WandbLogger

logger = WandbLogger(args, load_id=None, resume='allow')
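
Any of these can be passed through the logger argument of PuffeRL or train. For example, with Weights & Biases, where args is the same config dict used elsewhere in this reference:
from pufferlib.pufferl import WandbLogger, train

logger = WandbLogger(args)
all_logs = train('puffer_breakout', args=args, logger=logger)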
