PuffeRL class
The main training class that handles PPO-based reinforcement learning with advanced features like V-trace, prioritized experience replay, and automatic mixed precision.
Parameters
Training configuration dictionary. See configuration parameters below.
Vectorized environment instance from pufferlib.vector.
PyTorch policy network that returns actions and values.
Optional logger instance (Neptune or Wandb). Uses NoLogger if not provided.
Configuration parameters
Random seed for reproducibility.
Device to run training on (cuda or cpu).
Total batch size for training. Use "auto" to calculate from bptt_horizon.
Sequence length for backpropagation through time. Use "auto" to calculate from batch_size.
Size of minibatches for gradient updates.
Maximum minibatch size for gradient accumulation.
Number of epochs to train on each batch of data.
Total number of environment steps to train for.
Initial learning rate for optimizer.
Optimizer to use (adam or muon).
Beta1 parameter for Adam optimizer.
Beta2 parameter for Adam optimizer.
Epsilon parameter for Adam optimizer.
Whether to use cosine annealing learning rate schedule.
Minimum learning rate as ratio of initial learning rate.
Discount factor for returns.
Lambda parameter for Generalized Advantage Estimation.
PPO clipping coefficient for policy loss.
Value function loss coefficient.
Value function clipping coefficient.
Entropy bonus coefficient.
Maximum gradient norm for gradient clipping.
V-trace rho clipping parameter.
V-trace c clipping parameter.
Prioritized experience replay alpha parameter.
Initial prioritized experience replay beta parameter.
Whether the policy uses recurrent neural networks.
Whether to offload observations to CPU memory.
Whether to use torch.compile for optimization.
Compilation mode for torch.compile.
Training precision (float32 or bfloat16).
Whether to use automatic mixed precision.
Whether to use deterministic CUDA operations.
How often to save checkpoints (in epochs).
Directory to save checkpoints and models.
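The configuration parameters above are typically supplied as a flat dictionary. The sketch below is illustrative only: the key names are assumptions made for this example, not PufferLib's canonical spellings (the library's shipped config files are authoritative).

```python
# Hypothetical training config. Key names are illustrative assumptions;
# the descriptions match the documented parameters above.
config = {
    "seed": 42,                # random seed for reproducibility
    "device": "cuda",          # or "cpu"
    "batch_size": "auto",      # derived from bptt_horizon when "auto"
    "bptt_horizon": 16,        # BPTT sequence length
    "total_timesteps": 10_000_000,
    "learning_rate": 3e-4,
    "optimizer": "adam",       # or "muon"
    "gamma": 0.99,             # discount factor for returns
    "gae_lambda": 0.95,        # GAE lambda
    "clip_coef": 0.2,          # PPO policy clipping
    "ent_coef": 0.01,          # entropy bonus
    "max_grad_norm": 0.5,      # gradient clipping
    "precision": "float32",    # or "bfloat16"
}
```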
Methods
evaluate
Collect a batch of experience from the environment.
Dictionary of environment statistics collected during evaluation.
train
Train the policy on collected experience using PPO.
Training logs including losses and metrics, or None if not logging this step.
save_checkpoint
Save model checkpoint to disk.
Path to saved model checkpoint.
close
Close the environment and save final checkpoint.
Path to final saved model.
print_dashboard
Print training dashboard with metrics and performance stats.
Whether to clear the console before printing.
Properties
Total training time in seconds since initialization.
Steps per second (SPS) throughput.
Current training epoch number.
Total number of environment steps taken.
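The methods and properties above compose into a simple driver loop: collect experience with evaluate, update with train, and finish with close. The sketch below uses a self-contained stand-in class to show that pattern; it is not PuffeRL itself, and the attribute names (total_timesteps, steps_per_batch, global_step, logs) are illustrative assumptions.

```python
# Stand-in trainer illustrating the evaluate/train/close driver pattern.
# Only the method names mirror the documentation above; bodies are mocked.
class StubTrainer:
    def __init__(self, total_timesteps, steps_per_batch):
        self.total_timesteps = total_timesteps
        self.steps_per_batch = steps_per_batch
        self.global_step = 0
        self.logs = []

    def evaluate(self):
        # Collect a batch of experience and return environment stats (mocked).
        self.global_step += self.steps_per_batch
        return {"episode_return": 0.0}

    def train(self):
        # One PPO update on the collected batch (mocked).
        log = {"step": self.global_step, "loss": 0.0}
        self.logs.append(log)
        return log

    def close(self):
        # Save the final checkpoint (mocked) and return its path.
        return "checkpoint_final.pt"

trainer = StubTrainer(total_timesteps=1024, steps_per_batch=256)
while trainer.global_step < trainer.total_timesteps:
    stats = trainer.evaluate()   # roll out and store experience
    logs = trainer.train()       # PPO update on that experience
path = trainer.close()
```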
Training functions
train
High-level training function with built-in environment and policy loading.
Name of the environment to train on.
Training configuration. Loads from config file if not provided.
Vectorized environment. Creates new environment if not provided.
Policy network. Creates default policy if not provided.
Logger instance. Creates from args if not provided.
Function that takes logs and returns whether to stop early.
List of all training logs from the run.
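The early-stop hook is just a callable over the training logs. A minimal sketch, assuming the logs are a dictionary of metrics and that an episode_return key exists (both are assumptions for illustration):

```python
# Hypothetical early-stop callback: stop once the tracked return clears a
# target. The "episode_return" log key is an assumed field name.
def stop_when_solved(logs, target=0.95):
    # Return True to stop training early, False to continue.
    return logs.get("episode_return", float("-inf")) >= target
```

Passed to the training function, a callback like this is invoked on each step's logs and training halts the first time it returns True.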
eval
Run evaluation with a trained policy.
Name of the environment to evaluate on.
Evaluation configuration.
Vectorized environment.
Policy network to evaluate.
sweep
Run hyperparameter sweep with PufferLib’s optimization methods.
Sweep configuration including method and parameters.
Name of the environment to sweep on.
profile
Profile training performance using PyTorch profiler.
Configuration dictionary.
Name of the environment.
Vectorized environment.
Policy network.
Saves the profiler trace to trace.json.
export
Export policy weights to binary file.
Configuration dictionary.
Name of the environment.
Vectorized environment.
Policy network to export.
Weights are written to {env_name}_weights.bin.
autotune
Automatically tune vectorization parameters for optimal performance.
Configuration dictionary with batch_size.
Name of the environment.
Vectorized environment.
Policy network.
Utility functions
compute_puff_advantage
Compute PufferLib’s custom advantage function with V-trace.
Value predictions.
Rewards from environment.
Terminal flags.
Importance sampling ratios.
Output tensor for advantages.
Discount factor.
GAE lambda parameter.
V-trace rho clipping.
V-trace c clipping.
Computed advantages.
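The computation can be sketched as a GAE backward pass whose TD errors and trace decay are weighted by clipped importance ratios (the rho and c clips), in the spirit of V-trace. This is an illustration under those assumptions, not PufferLib's exact kernel:

```python
# V-trace-style clipped GAE over a single trajectory, matching the
# documented inputs (values, rewards, dones, importance ratios, gamma,
# GAE lambda, rho/c clips). Illustrative; the library's recurrence may differ.
def puff_advantage_sketch(values, rewards, dones, ratios,
                          gamma=0.99, gae_lambda=0.95,
                          rho_clip=1.0, c_clip=1.0):
    T = len(rewards)
    adv = [0.0] * T
    last = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        not_done = 1.0 - dones[t]
        rho = min(ratios[t], rho_clip)   # clipped IS weight on the TD error
        c = min(ratios[t], c_clip)       # clipped trace-decay weight
        delta = rho * (rewards[t] + gamma * not_done * next_value - values[t])
        last = delta + gamma * gae_lambda * c * not_done * last
        adv[t] = last
    return adv
```

With all ratios equal to 1 this reduces to ordinary GAE; off-policy ratios above the clips are capped, which bounds the variance of the correction.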
load_config
Load configuration from environment name.
Name of the environment.
Optional argument parser to extend.
Configuration dictionary.
load_config_file
Load configuration from a file path.
Path to configuration file.
Whether to fill in default values.
Optional argument parser to extend.
Configuration dictionary.
load_env
Load vectorized environment from configuration.
Name of the environment.
Configuration dictionary.
Vectorized environment instance.
load_policy
Load policy network from configuration.
Configuration dictionary.
Vectorized environment.
Environment name for checkpoint loading.
Policy network instance.