PufferLib provides extensive performance optimization options to maximize training throughput. This guide covers vectorization strategies, memory management, and computational optimizations.

Vectorization backends

PufferLib supports multiple vectorization backends for parallel environment execution:

Serial backend

Runs environments sequentially on a single process. Use for debugging or small-scale experiments:
import pufferlib.vector

vecenv = pufferlib.vector.make(
    env_creator,
    num_envs=8,
    backend='Serial'
)
From pufferlib/vector.py:52-170

Multiprocessing backend

Runs environments in parallel across multiple CPU cores. Recommended for most use cases:
vecenv = pufferlib.vector.make(
    env_creator,
    num_envs=128,
    num_workers=16,
    batch_size=64,
    backend='Multiprocessing'
)
For optimal performance, use 1 worker per physical CPU core (not logical/hyperthreaded cores).
From pufferlib/vector.py:226-488
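One way to derive a worker count from the core layout (a sketch: the halving heuristic assumes two-way hyperthreading, and `psutil` is an optional dependency, not part of PufferLib):

```python
import os

# os.cpu_count() reports logical cores; on two-way hyperthreaded CPUs
# the physical core count is roughly half of that.
logical_cores = os.cpu_count() or 1
num_workers = max(1, logical_cores // 2)

# If psutil is installed, it reports physical cores directly:
# import psutil; num_workers = psutil.cpu_count(logical=False)
```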

Ray backend

Distributes environments across a cluster using Ray. Use for distributed simulation:
vecenv = pufferlib.vector.make(
    env_creator,
    num_envs=512,
    num_workers=64,
    backend='Ray'
)
From pufferlib/vector.py:490-615

CPU offloading

Offload observations to CPU memory to reduce GPU VRAM usage:
config = {
    'device': 'cuda',
    'cpu_offload': True,  # Store observations on CPU
    'batch_size': 32768,
}

pufferl = pufferlib.PuffeRL(config, vecenv, policy)

Implementation

With cpu_offload enabled on a CUDA device, the observation buffer is allocated in pinned CPU memory for fast host-to-device transfers:
self.observations = torch.zeros(
    segments, horizon, *obs_space.shape,
    dtype=pufferlib.pytorch.numpy_to_torch_dtype_dict[obs_space.dtype],
    pin_memory=device == 'cuda' and config['cpu_offload'],
    device='cpu' if config['cpu_offload'] else device
)
From pufferlib/pufferl.py:95-98

Copy operations

Data is transferred asynchronously during evaluation:
if config['cpu_offload']:
    self.observations[batch_rows, l] = o  # Store on CPU
else:
    self.observations[batch_rows, l] = o_device  # Store on GPU
From pufferlib/pufferl.py:287-290
CPU offloading adds transfer overhead. Only enable for large observation spaces that don’t fit in GPU memory.

Memory optimization

Batch size configuration

Balance memory usage and training efficiency:
config = {
    'batch_size': 32768,      # Total experience per update
    'bptt_horizon': 16,       # Sequence length
    'minibatch_size': 4096,   # Gradient update batch size
    'max_minibatch_size': 4096,  # Largest chunk that fits in memory
}
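The arithmetic behind these settings, sketched with the values above (the derived names are illustrative, not PufferLib variables):

```python
batch_size = 32768      # total experience per update
bptt_horizon = 16       # sequence length
minibatch_size = 4096   # gradient update batch size

segments = batch_size // bptt_horizon       # sequences stored per update
minibatches = batch_size // minibatch_size  # gradient steps per epoch
```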

Auto-sizing

Automatically calculate batch size or horizon:
config = {
    'batch_size': 'auto',  # Auto-calculate from bptt_horizon
    'bptt_horizon': 16,
    # batch_size = total_agents * bptt_horizon
}
From pufferlib/pufferl.py:78-83
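The auto-sizing rule from the comment above, as a sketch (hypothetical helper, not part of the PufferLib API):

```python
def auto_batch_size(total_agents: int, bptt_horizon: int) -> int:
    # batch_size = total_agents * bptt_horizon, per the rule above
    return total_agents * bptt_horizon

# e.g. 2048 agents with a 16-step horizon yields a batch of 32768
```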

Gradient accumulation

Process large minibatches in smaller chunks:
config = {
    'minibatch_size': 8192,      # Logical minibatch size
    'max_minibatch_size': 2048,  # Physical minibatch (fits in memory)
}

# Accumulates 4 physical minibatches (8192 / 2048)
From pufferlib/pufferl.py:120-138
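How the chunk count falls out of the two settings (an illustrative sketch; the actual accumulation loop lives inside PuffeRL's training step):

```python
def accumulation_steps(minibatch_size: int, max_minibatch_size: int) -> int:
    # The logical minibatch is split into equal physical chunks, so the
    # logical size must be a multiple of the physical size.
    assert minibatch_size % max_minibatch_size == 0
    return minibatch_size // max_minibatch_size

# Gradients are summed over each chunk and applied once per logical minibatch.
```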

Computational optimizations

Torch compile

Enable JIT compilation for faster model execution:
config = {
    'compile': True,
    'compile_mode': 'default',  # or 'reduce-overhead', 'max-autotune'
}
Implementation:
if config['compile']:
    self.policy = torch.compile(policy, mode=config['compile_mode'])
    self.policy.forward_eval = torch.compile(policy, mode=config['compile_mode'])
    pufferlib.pytorch.sample_logits = torch.compile(
        pufferlib.pytorch.sample_logits, mode=config['compile_mode']
    )
From pufferlib/pufferl.py:143-146

Mixed precision training

Use automatic mixed precision for faster computation:
config = {
    'device': 'cuda',
    'amp': True,  # Enable automatic mixed precision
    'precision': 'bfloat16',  # or 'float32'
}
Use bfloat16 on modern GPUs (Ampere/Ada) for better numerical stability than float16.
From pufferlib/pufferl.py:193-198

Backend optimizations

Configure PyTorch backend settings:
# Set in PuffeRL.__init__
torch.set_float32_matmul_precision('high')
torch.backends.cudnn.deterministic = config['torch_deterministic']
torch.backends.cudnn.benchmark = True
From pufferlib/pufferl.py:60-62

Vectorization tuning

Zero-copy optimization

Enable zero-copy for contiguous worker blocks:
vecenv = pufferlib.vector.make(
    env_creator,
    num_envs=128,
    num_workers=16,
    batch_size=64,
    zero_copy=True,  # Requires batch_size divides num_envs
    backend='Multiprocessing'
)
With zero_copy=True, num_envs must be divisible by batch_size.
From pufferlib/vector.py:256-259
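The divisibility constraint can be checked up front (hypothetical validation helper, not a PufferLib function):

```python
def zero_copy_compatible(num_envs: int, batch_size: int) -> bool:
    # zero_copy requires contiguous worker blocks, which only line up
    # when batch_size evenly divides num_envs.
    return num_envs % batch_size == 0
```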

Synchronous trajectories

Force trajectory synchronization for consistent experience:
vecenv = pufferlib.vector.make(
    env_creator,
    num_envs=128,
    num_workers=16,
    sync_traj=True,  # Wait for all workers
    backend='Multiprocessing'
)
From pufferlib/vector.py:240

Worker oversubscription

Allow more workers than physical cores:
vecenv = pufferlib.vector.make(
    env_creator,
    num_envs=256,
    num_workers=32,  # More than physical cores
    overwork=True,   # Bypass the safety check
    backend='Multiprocessing'
)
Oversubscription usually decreases performance. Only use for I/O-bound environments.
From pufferlib/vector.py:246-253

Autotuning

Automatically find optimal vectorization parameters:
import pufferlib.vector

def make_env():
    return your_env

pufferlib.vector.autotune(
    make_env,
    batch_size=4096,
    max_envs=256,
    time_per_test=5
)
From pufferlib/vector.py:740-925

Autotune output

Autotune profiles different configurations:
Profiling single-core performance for ~ 5 seconds
Profile complete
    SPS: 1234.567
    STD: 5.234%
    Reset: 2.156%
    RAM: 128.456 MB/env
    Bandwidth: 0.234 GB/s
    Throughput: 3.744 GB/s (16 cores)

SPS: 15234.567
    num_envs: 128
    num_workers: 16
    batch_size: 64
    backend: Multiprocessing

SPS: 14123.456
    num_envs: 256
    num_workers: 16
    batch_size: 64
    zero_copy: False
    backend: Multiprocessing

Performance monitoring

Built-in profiling

PufferLib tracks execution time for each training phase:
class Profile:
    def __init__(self, frequency=5):
        self.profiles = defaultdict(lambda: defaultdict(float))
        self.frequency = frequency
        
    def __call__(self, name, epoch, nest=False):
        if (epoch + 1) % self.frequency != 0:
            return
        
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        
        tick = time.time()
        # ... profiling logic
From pufferlib/pufferl.py:722-747
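A self-contained toy that mirrors the frequency gating above (names are illustrative; the real Profile also synchronizes CUDA and supports nested timings):

```python
import time
from collections import defaultdict

class MiniProfile:
    def __init__(self, frequency=5):
        self.times = defaultdict(float)
        self.frequency = frequency

    def should_profile(self, epoch):
        # Matches the gate above: only every `frequency`-th epoch is timed.
        return (epoch + 1) % self.frequency == 0

    def record(self, name, epoch, fn):
        if not self.should_profile(epoch):
            return fn()
        tick = time.perf_counter()
        result = fn()
        self.times[name] += time.perf_counter() - tick
        return result
```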

Dashboard metrics

The live dashboard shows performance breakdown:
Performance    Time      %
Evaluate      1.23s    45%
  Forward     0.45s    16%
  Env         0.67s    24%
  Copy        0.08s     3%
  Misc        0.03s     1%
Train         1.51s    55%
  Forward     0.89s    32%
  Learn       0.52s    19%
  Copy        0.05s     2%
  Misc        0.05s     2%

Optimized configuration example

import pufferlib
import pufferlib.vector

# Environment vectorization
vecenv = pufferlib.vector.make(
    env_creator,
    num_envs=256,           # 256 parallel environments
    num_workers=16,         # 16 physical CPU cores
    batch_size=128,         # Process 128 at a time
    zero_copy=True,         # Enable zero-copy optimization
    backend='Multiprocessing'
)

# Training configuration
config = {
    # Device settings
    'device': 'cuda',
    'cpu_offload': False,   # Keep in GPU if fits
    
    # Batch configuration
    'batch_size': 32768,
    'bptt_horizon': 16,
    'minibatch_size': 4096,
    'max_minibatch_size': 4096,
    
    # Optimization
    'compile': True,
    'compile_mode': 'default',
    'amp': True,
    'precision': 'bfloat16',
    
    # Backend
    'torch_deterministic': False,
    
    # Learning
    'learning_rate': 3e-4,
    'update_epochs': 4,
    'total_timesteps': 1_000_000_000,
}

pufferl = pufferlib.PuffeRL(config, vecenv, policy)
Profile different configurations with autotune before committing to long training runs.

Bottleneck identification

Environment bottleneck

If “Env” time dominates:
  • Increase num_workers
  • Optimize environment implementation
  • Use faster simulation backend

Forward pass bottleneck

If “Forward” time dominates:
  • Enable compile=True
  • Use mixed precision (amp=True)
  • Reduce model size
  • Increase batch size

Copy bottleneck

If “Copy” time dominates:
  • Enable zero_copy=True
  • Reduce observation size
  • Disable cpu_offload if possible

Memory bottleneck

If running out of memory:
  • Enable cpu_offload=True
  • Reduce batch_size
  • Reduce max_minibatch_size
  • Use gradient accumulation
