PuffeRL class
The main training class that handles PPO-based reinforcement learning with advanced features like V-trace, prioritized experience replay, and automatic mixed precision.
Parameters
Training configuration dictionary. See configuration parameters below.
Vectorized environment instance from pufferlib.vector.
PyTorch policy network that returns actions and values.
Optional logger instance (Neptune or Wandb). Uses NoLogger if not provided.
Configuration parameters
Random seed for reproducibility.
Device to run training on (cuda or cpu).
Total batch size for training. Use "auto" to calculate from bptt_horizon.
Sequence length for backpropagation through time. Use "auto" to calculate from batch_size.
Size of minibatches for gradient updates.
Maximum minibatch size for gradient accumulation.
Number of epochs to train on each batch of data.
Total number of environment steps to train for.
Initial learning rate for optimizer.
Optimizer to use (adam or muon).
Beta1 parameter for Adam optimizer.
Beta2 parameter for Adam optimizer.
Epsilon parameter for Adam optimizer.
Whether to use cosine annealing learning rate schedule.
Minimum learning rate as ratio of initial learning rate.
Discount factor for returns.
Lambda parameter for Generalized Advantage Estimation.
PPO clipping coefficient for policy loss.
Value function loss coefficient.
Value function clipping coefficient.
Entropy bonus coefficient.
Maximum gradient norm for gradient clipping.
V-trace rho clipping parameter.
V-trace c clipping parameter.
Prioritized experience replay alpha parameter.
Initial prioritized experience replay beta parameter.
Whether the policy uses recurrent neural networks.
Whether to offload observations to CPU memory.
Whether to use torch.compile for optimization.
Compilation mode for torch.compile.
Training precision (float32 or bfloat16).
Whether to use automatic mixed precision.
Whether to use deterministic CUDA operations.
How often to save checkpoints (in epochs).
Directory to save checkpoints and models.
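The configuration parameters above are typically supplied as a flat dictionary. The sketch below is illustrative only: the key names are assumptions made for this example, not PufferLib's canonical spellings (the library's shipped config files are authoritative).

```python
# Hypothetical training config. Key names are illustrative assumptions;
# the descriptions match the documented parameters above.
config = {
    "seed": 42,                # random seed for reproducibility
    "device": "cuda",          # or "cpu"
    "batch_size": "auto",      # derived from bptt_horizon when "auto"
    "bptt_horizon": 16,        # BPTT sequence length
    "total_timesteps": 10_000_000,
    "learning_rate": 3e-4,
    "optimizer": "adam",       # or "muon"
    "gamma": 0.99,             # discount factor for returns
    "gae_lambda": 0.95,        # GAE lambda
    "clip_coef": 0.2,          # PPO policy clipping
    "ent_coef": 0.01,          # entropy bonus
    "max_grad_norm": 0.5,      # gradient clipping
    "precision": "float32",    # or "bfloat16"
}
```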
Methods
evaluate
Collect a batch of experience from the environment.
Dictionary of environment statistics collected during evaluation.
train
Train the policy on collected experience using PPO.
Training logs including losses and metrics, or None if not logging this step.
save_checkpoint
Save model checkpoint to disk.
Path to saved model checkpoint.
close
Close the environment and save final checkpoint.
Path to final saved model.
print_dashboard
Print training dashboard with metrics and performance stats.
Whether to clear the console before printing.
Properties
Total training time in seconds since initialization.
Steps per second (SPS) throughput.
Current training epoch number.
Total number of environment steps taken.
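The methods and properties above compose into a simple driver loop: collect experience with evaluate, update with train, and finish with close. The sketch below uses a self-contained stand-in class to show that pattern; it is not PuffeRL itself, and the attribute names (total_timesteps, steps_per_batch, global_step, logs) are illustrative assumptions.

```python
# Stand-in trainer illustrating the evaluate/train/close driver pattern.
# Only the method names mirror the documentation above; bodies are mocked.
class StubTrainer:
    def __init__(self, total_timesteps, steps_per_batch):
        self.total_timesteps = total_timesteps
        self.steps_per_batch = steps_per_batch
        self.global_step = 0
        self.logs = []

    def evaluate(self):
        # Collect a batch of experience and return environment stats (mocked).
        self.global_step += self.steps_per_batch
        return {"episode_return": 0.0}

    def train(self):
        # One PPO update on the collected batch (mocked).
        log = {"step": self.global_step, "loss": 0.0}
        self.logs.append(log)
        return log

    def close(self):
        # Save the final checkpoint (mocked) and return its path.
        return "checkpoint_final.pt"

trainer = StubTrainer(total_timesteps=1024, steps_per_batch=256)
while trainer.global_step < trainer.total_timesteps:
    stats = trainer.evaluate()   # roll out and store experience
    logs = trainer.train()       # PPO update on that experience
path = trainer.close()
```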
Training functions
train
High-level training function with built-in environment and policy loading.
Name of the environment to train on.
Training configuration. Loads from config file if not provided.
Vectorized environment. Creates new environment if not provided.
Policy network. Creates default policy if not provided.
Logger instance. Creates from args if not provided.
Function that takes logs and returns whether to stop early.
List of all training logs from the run.
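The early-stop hook is just a callable over the training logs. A minimal sketch, assuming the logs are a dictionary of metrics and that an episode_return key exists (both are assumptions for illustration):

```python
# Hypothetical early-stop callback: stop once the tracked return clears a
# target. The "episode_return" log key is an assumed field name.
def stop_when_solved(logs, target=0.95):
    # Return True to stop training early, False to continue.
    return logs.get("episode_return", float("-inf")) >= target
```

Passed to the training function, a callback like this is invoked on each step's logs and training halts the first time it returns True.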
eval
Run evaluation with a trained policy.
Name of the environment to evaluate on.
Evaluation configuration.
Vectorized environment.
Policy network to evaluate.
sweep
Run hyperparameter sweep with PufferLib’s optimization methods.
Sweep configuration including method and parameters.
Name of the environment to sweep on.
profile
Profile training performance using PyTorch profiler.
Configuration dictionary.
Name of the environment.
Vectorized environment.
Policy network.
Saves the profiler trace to trace.json.
export
Export policy weights to binary file.
Configuration dictionary.
Name of the environment.
Vectorized environment.
Policy network to export.
Weights are written to {env_name}_weights.bin.
autotune
Automatically tune vectorization parameters for optimal performance.
Configuration dictionary with batch_size.
Name of the environment.
Vectorized environment.
Policy network.
Utility functions
compute_puff_advantage
Compute PufferLib’s custom advantage function with V-trace.
Value predictions.
Rewards from environment.
Terminal flags.
Importance sampling ratios.
Output tensor for advantages.
Discount factor.
GAE lambda parameter.
V-trace rho clipping.
V-trace c clipping.
Computed advantages.
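The computation can be sketched as a GAE backward pass whose TD errors and trace decay are weighted by clipped importance ratios (the rho and c clips), in the spirit of V-trace. This is an illustration under those assumptions, not PufferLib's exact kernel:

```python
# V-trace-style clipped GAE over a single trajectory, matching the
# documented inputs (values, rewards, dones, importance ratios, gamma,
# GAE lambda, rho/c clips). Illustrative; the library's recurrence may differ.
def puff_advantage_sketch(values, rewards, dones, ratios,
                          gamma=0.99, gae_lambda=0.95,
                          rho_clip=1.0, c_clip=1.0):
    T = len(rewards)
    adv = [0.0] * T
    last = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        not_done = 1.0 - dones[t]
        rho = min(ratios[t], rho_clip)   # clipped IS weight on the TD error
        c = min(ratios[t], c_clip)       # clipped trace-decay weight
        delta = rho * (rewards[t] + gamma * not_done * next_value - values[t])
        last = delta + gamma * gae_lambda * c * not_done * last
        adv[t] = last
    return adv
```

With all ratios equal to 1 this reduces to ordinary GAE; off-policy ratios above the clips are capped, which bounds the variance of the correction.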
load_config
Load configuration from environment name.
Name of the environment.
Optional argument parser to extend.
Configuration dictionary.
load_config_file
Load configuration from a file path.
Path to configuration file.
Whether to fill in default values.
Optional argument parser to extend.
Configuration dictionary.
load_env
Load vectorized environment from configuration.
Name of the environment.
Configuration dictionary.
Vectorized environment instance.
load_policy
Load policy network from configuration.
Configuration dictionary.
Vectorized environment.
Environment name for checkpoint loading.
Policy network instance.