PufferLib provides extensive performance optimization options to maximize training throughput. This guide covers vectorization strategies, memory management, and computational optimizations.
Vectorization backends
PufferLib supports multiple vectorization backends for parallel environment execution:
Serial backend
Runs environments sequentially in a single process. Use for debugging or small-scale experiments:
import pufferlib.vector
vecenv = pufferlib.vector.make(
env_creator,
num_envs=8,
backend='Serial'
)
From pufferlib/vector.py:52-170
Multiprocessing backend
Runs environments in parallel across multiple CPU cores. Recommended for most use cases:
vecenv = pufferlib.vector.make(
env_creator,
num_envs=128,
num_workers=16,
batch_size=64,
backend='Multiprocessing'
)
For optimal performance, use 1 worker per physical CPU core (not logical/hyperthreaded cores).
From pufferlib/vector.py:226-488
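Physical core counts are not always exposed directly; a minimal sketch of picking a worker count when only the logical count is available (the `suggest_workers` helper is hypothetical, assuming 2-way SMT):

```python
import os

def suggest_workers(logical_cores=None, smt_ratio=2):
    """Rough worker count: one per physical core, assuming
    logical cores = physical cores * smt_ratio (hypothetical helper)."""
    if logical_cores is None:
        logical_cores = os.cpu_count() or 1
    return max(1, logical_cores // smt_ratio)

# e.g. a 32-thread / 16-core machine
print(suggest_workers(32))  # -> 16
```

On machines without SMT, pass `smt_ratio=1` or use a tool that reports physical cores directly.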
Ray backend
Distributes environments across a cluster using Ray. Use for distributed simulation:
vecenv = pufferlib.vector.make(
env_creator,
num_envs=512,
num_workers=64,
backend='Ray'
)
From pufferlib/vector.py:490-615
CPU offloading
Offload observations to CPU memory to reduce GPU VRAM usage:
config = {
'device': 'cuda',
'cpu_offload': True, # Store observations on CPU
'batch_size': 32768,
}
pufferl = pufferlib.PuffeRL(config, vecenv, policy)
Implementation
Observation buffers are allocated in pinned (page-locked) CPU memory for fast host-to-device transfer:
self.observations = torch.zeros(
segments, horizon, *obs_space.shape,
dtype=pufferlib.pytorch.numpy_to_torch_dtype_dict[obs_space.dtype],
pin_memory=device == 'cuda' and config['cpu_offload'],
device='cpu' if config['cpu_offload'] else device
)
From pufferlib/pufferl.py:95-98
Copy operations
Data is transferred asynchronously during evaluation:
if config['cpu_offload']:
self.observations[batch_rows, l] = o # Store on CPU
else:
self.observations[batch_rows, l] = o_device # Store on GPU
From pufferlib/pufferl.py:287-290
CPU offloading adds transfer overhead. Only enable for large observation spaces that don’t fit in GPU memory.
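One way to decide is to estimate the observation buffer's footprint before allocation. A back-of-the-envelope sketch (the helper name and the example shape are illustrative, not the library's API):

```python
import math

def obs_buffer_bytes(segments, horizon, obs_shape, dtype_bytes=4):
    """Size of a (segments, horizon, *obs_shape) observation buffer."""
    return segments * horizon * math.prod(obs_shape) * dtype_bytes

# 2048 segments x 16 steps of 84x84x4 uint8 frames
size = obs_buffer_bytes(2048, 16, (84, 84, 4), dtype_bytes=1)
print(f"{size / 1024**3:.2f} GiB")  # -> 0.86 GiB
```

If the estimate is a small fraction of VRAM, keeping observations on the GPU avoids the transfer overhead entirely.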
Memory optimization
Batch size configuration
Balance memory usage and training efficiency:
config = {
'batch_size': 32768, # Total experience per update
'bptt_horizon': 16, # Sequence length
'minibatch_size': 4096, # Gradient update batch size
'max_minibatch_size': 4096, # Maximum for memory
}
Auto-sizing
Automatically calculate batch size or horizon:
config = {
'batch_size': 'auto', # Auto-calculate from bptt_horizon
'bptt_horizon': 16,
# batch_size = total_agents * bptt_horizon
}
From pufferlib/pufferl.py:78-83
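The 'auto' rule above reduces to simple arithmetic; a sketch of the calculation (the helper name is illustrative):

```python
def auto_batch_size(total_agents, bptt_horizon):
    """batch_size = total_agents * bptt_horizon, per the 'auto' rule above."""
    return total_agents * bptt_horizon

# 2048 agents x horizon 16
print(auto_batch_size(2048, 16))  # -> 32768
```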
Gradient accumulation
Process large minibatches in smaller chunks:
config = {
'minibatch_size': 8192, # Logical minibatch size
'max_minibatch_size': 2048, # Physical minibatch (fits in memory)
}
# Accumulates 4 physical minibatches (8192 / 2048)
From pufferlib/pufferl.py:120-138
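The number of accumulation steps follows directly from the two sizes; a minimal sketch (illustrative, not the library's internals):

```python
def accumulation_steps(minibatch_size, max_minibatch_size):
    """How many physical chunks make up one logical minibatch."""
    assert minibatch_size % max_minibatch_size == 0
    return minibatch_size // max_minibatch_size

print(accumulation_steps(8192, 2048))  # -> 4

# Gradients are averaged across the chunks, so each backward pass
# would scale its loss by 1 / steps before the optimizer step.
```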
Computational optimizations
Torch compile
Enable JIT compilation for faster model execution:
config = {
'compile': True,
'compile_mode': 'default', # or 'reduce-overhead', 'max-autotune'
}
Implementation:
if config['compile']:
self.policy = torch.compile(policy, mode=config['compile_mode'])
self.policy.forward_eval = torch.compile(policy, mode=config['compile_mode'])
pufferlib.pytorch.sample_logits = torch.compile(
pufferlib.pytorch.sample_logits, mode=config['compile_mode']
)
From pufferlib/pufferl.py:143-146
Mixed precision training
Use automatic mixed precision for faster computation:
config = {
'device': 'cuda',
'amp': True, # Enable automatic mixed precision
'precision': 'bfloat16', # or 'float32'
}
Use bfloat16 on modern GPUs (Ampere/Ada) for better numerical stability than float16.
From pufferlib/pufferl.py:193-198
Backend optimizations
Configure PyTorch backend settings:
# Set in PuffeRL.__init__
torch.set_float32_matmul_precision('high')
torch.backends.cudnn.deterministic = config['torch_deterministic']
torch.backends.cudnn.benchmark = True
From pufferlib/pufferl.py:60-62
Vectorization tuning
Zero-copy optimization
Enable zero-copy for contiguous worker blocks:
vecenv = pufferlib.vector.make(
env_creator,
num_envs=128,
num_workers=16,
batch_size=64,
zero_copy=True, # Requires batch_size divides num_envs
backend='Multiprocessing'
)
With zero_copy=True, num_envs must be divisible by batch_size.
From pufferlib/vector.py:256-259
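A quick validity check for the divisibility constraint (a standalone sketch, not part of the library):

```python
def check_zero_copy(num_envs, batch_size):
    """Per the constraint above: num_envs must be divisible by batch_size."""
    return num_envs % batch_size == 0

print(check_zero_copy(128, 64))  # -> True
print(check_zero_copy(128, 48))  # -> False
```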
Synchronous trajectories
Force trajectory synchronization for consistent experience:
vecenv = pufferlib.vector.make(
env_creator,
num_envs=128,
num_workers=16,
sync_traj=True, # Wait for all workers
backend='Multiprocessing'
)
From pufferlib/vector.py:240
Worker oversubscription
Allow more workers than physical cores:
vecenv = pufferlib.vector.make(
env_creator,
num_envs=256,
num_workers=32, # More than physical cores
overwork=True, # Bypass the safety check
backend='Multiprocessing'
)
Oversubscription usually decreases performance. Only use for I/O-bound environments.
From pufferlib/vector.py:246-253
Autotuning
Automatically find optimal vectorization parameters:
import pufferlib.vector
def make_env():
return your_env
pufferlib.vector.autotune(
make_env,
batch_size=4096,
max_envs=256,
time_per_test=5
)
From pufferlib/vector.py:740-925
Autotune output
Autotune profiles different configurations:
Profiling single-core performance for ~5 seconds
Profile complete
SPS: 1234.567
STD: 5.234%
Reset: 2.156%
RAM: 128.456 MB/env
Bandwidth: 0.234 GB/s
Throughput: 3.744 GB/s (16 cores)
SPS: 15234.567
num_envs: 128
num_workers: 16
batch_size: 64
backend: Multiprocessing
SPS: 14123.456
num_envs: 256
num_workers: 16
batch_size: 64
zero_copy: False
backend: Multiprocessing
Built-in profiling
PufferLib tracks execution time for each training phase:
class Profile:
def __init__(self, frequency=5):
self.profiles = defaultdict(lambda: defaultdict(float))
self.frequency = frequency
def __call__(self, name, epoch, nest=False):
if (epoch + 1) % self.frequency != 0:
return
if torch.cuda.is_available():
torch.cuda.synchronize()
tick = time.time()
# ... profiling logic
From pufferlib/pufferl.py:722-747
Dashboard metrics
The live dashboard shows performance breakdown:
Performance    Time    %
Evaluate       1.23s   45%
  Forward      0.45s   16%
  Env          0.67s   24%
  Copy         0.08s   3%
  Misc         0.03s   1%
Train          1.51s   55%
  Forward      0.89s   32%
  Learn        0.52s   19%
  Copy         0.05s   2%
  Misc         0.05s   2%
Optimized configuration example
import pufferlib
import pufferlib.vector
# Environment vectorization
vecenv = pufferlib.vector.make(
env_creator,
num_envs=256, # 256 parallel environments
num_workers=16, # 16 physical CPU cores
batch_size=128, # Process 128 at a time
zero_copy=True, # Enable zero-copy optimization
backend='Multiprocessing'
)
# Training configuration
config = {
# Device settings
'device': 'cuda',
'cpu_offload': False, # Keep observations on GPU if they fit
# Batch configuration
'batch_size': 32768,
'bptt_horizon': 16,
'minibatch_size': 4096,
'max_minibatch_size': 4096,
# Optimization
'compile': True,
'compile_mode': 'default',
'amp': True,
'precision': 'bfloat16',
# Backend
'torch_deterministic': False,
# Learning
'learning_rate': 3e-4,
'update_epochs': 4,
'total_timesteps': 1_000_000_000,
}
pufferl = pufferlib.PuffeRL(config, vecenv, policy)
Profile different configurations with autotune before committing to long training runs.
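Before a long run, it can also be worth sanity-checking the arithmetic relationships between these settings. A standalone sketch (the constraints are inferred from the relationships documented above; the helper is not part of PufferLib):

```python
def validate_config(num_envs, vec_batch_size, batch_size,
                    bptt_horizon, minibatch_size, max_minibatch_size):
    """Check the divisibility constraints implied by the settings above."""
    errors = []
    if num_envs % vec_batch_size != 0:
        errors.append("num_envs must be divisible by the vector batch_size")
    if batch_size % bptt_horizon != 0:
        errors.append("batch_size must be divisible by bptt_horizon")
    if batch_size % minibatch_size != 0:
        errors.append("batch_size must be divisible by minibatch_size")
    if minibatch_size % max_minibatch_size != 0:
        errors.append("minibatch_size must be divisible by max_minibatch_size")
    return errors

# Matches the optimized example above
print(validate_config(256, 128, 32768, 16, 4096, 4096))  # -> []
```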
Bottleneck identification
Environment bottleneck
If “Env” time dominates:
- Increase num_workers
- Optimize the environment implementation
- Use a faster simulation backend
Forward pass bottleneck
If “Forward” time dominates:
- Enable compile=True
- Use mixed precision (amp=True)
- Reduce model size
- Increase batch size
Copy bottleneck
If “Copy” time dominates:
- Enable zero_copy=True
- Reduce observation size
- Disable cpu_offload if possible
Memory bottleneck
If running out of memory:
- Enable cpu_offload=True
- Reduce batch_size
- Reduce max_minibatch_size
- Use gradient accumulation
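The decision logic above can be sketched as a small lookup over the dashboard's timing breakdown (names and hint text are illustrative):

```python
def dominant_phase(timings):
    """Return the phase consuming the largest share of wall time."""
    return max(timings, key=timings.get)

# Remedies paraphrased from the bottleneck guidance above
HINTS = {
    'env': 'increase num_workers or optimize the environment',
    'forward': 'enable compile=True and amp=True, or shrink the model',
    'copy': 'enable zero_copy=True or reduce observation size',
}

timings = {'env': 0.67, 'forward': 0.45, 'copy': 0.08}
phase = dominant_phase(timings)
print(phase, '->', HINTS[phase])
```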