PufferLib provides extensive performance optimization options to maximize training throughput. This guide covers vectorization strategies, memory management, and computational optimizations.
Vectorization backends
PufferLib supports multiple vectorization backends for parallel environment execution:
Serial backend
Runs environments sequentially in a single process. Use for debugging or small-scale experiments:
import pufferlib.vector
vecenv = pufferlib.vector.make(
env_creator,
num_envs=8,
backend='Serial'
)
From pufferlib/vector.py:52-170
Multiprocessing backend
Runs environments in parallel across multiple CPU cores. Recommended for most use cases:
vecenv = pufferlib.vector.make(
env_creator,
num_envs=128,
num_workers=16,
batch_size=64,
backend='Multiprocessing'
)
For optimal performance, use 1 worker per physical CPU core (not logical/hyperthreaded cores).
From pufferlib/vector.py:226-488
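Physical core counts are not always exposed directly; a minimal sketch of picking a worker count when only the logical count is available (the `suggest_workers` helper is hypothetical, assuming 2-way SMT):

```python
import os

def suggest_workers(logical_cores=None, smt_ratio=2):
    """Rough worker count: one per physical core, assuming
    logical cores = physical cores * smt_ratio (hypothetical helper)."""
    if logical_cores is None:
        logical_cores = os.cpu_count() or 1
    return max(1, logical_cores // smt_ratio)

# e.g. a 32-thread / 16-core machine
print(suggest_workers(32))  # -> 16
```

On machines without SMT, pass `smt_ratio=1` or use a tool that reports physical cores directly.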
Ray backend
Distributes environments across a cluster using Ray. Use for distributed simulation:
vecenv = pufferlib.vector.make(
env_creator,
num_envs=512,
num_workers=64,
backend='Ray'
)
From pufferlib/vector.py:490-615
CPU offloading
Offload observations to CPU memory to reduce GPU VRAM usage:
config = {
'device': 'cuda',
'cpu_offload': True, # Store observations on CPU
'batch_size': 32768,
}
pufferl = pufferlib.PuffeRL(config, vecenv, policy)
Implementation
Observation buffers are allocated in pinned (page-locked) CPU memory for fast host-to-device transfer:
self.observations = torch.zeros(
segments, horizon, *obs_space.shape,
dtype=pufferlib.pytorch.numpy_to_torch_dtype_dict[obs_space.dtype],
pin_memory=device == 'cuda' and config['cpu_offload'],
device='cpu' if config['cpu_offload'] else device
)
From pufferlib/pufferl.py:95-98
Copy operations
Data is transferred asynchronously during evaluation:
if config['cpu_offload']:
self.observations[batch_rows, l] = o # Store on CPU
else:
self.observations[batch_rows, l] = o_device # Store on GPU
From pufferlib/pufferl.py:287-290
CPU offloading adds transfer overhead. Only enable for large observation spaces that don’t fit in GPU memory.
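One way to decide is to estimate the observation buffer's footprint before allocation. A back-of-the-envelope sketch (the helper name and the example shape are illustrative, not the library's API):

```python
import math

def obs_buffer_bytes(segments, horizon, obs_shape, dtype_bytes=4):
    """Size of a (segments, horizon, *obs_shape) observation buffer."""
    return segments * horizon * math.prod(obs_shape) * dtype_bytes

# 2048 segments x 16 steps of 84x84x4 uint8 frames
size = obs_buffer_bytes(2048, 16, (84, 84, 4), dtype_bytes=1)
print(f"{size / 1024**3:.2f} GiB")  # -> 0.86 GiB
```

If the estimate is a small fraction of VRAM, keeping observations on the GPU avoids the transfer overhead entirely.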
Memory optimization
Batch size configuration
Balance memory usage and training efficiency:
config = {
'batch_size': 32768, # Total experience per update
'bptt_horizon': 16, # Sequence length
'minibatch_size': 4096, # Gradient update batch size
'max_minibatch_size': 4096, # Maximum for memory
}
Auto-sizing
Automatically calculate batch size or horizon:
config = {
'batch_size': 'auto', # Auto-calculate from bptt_horizon
'bptt_horizon': 16,
# batch_size = total_agents * bptt_horizon
}
From pufferlib/pufferl.py:78-83
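The 'auto' rule above reduces to simple arithmetic; a sketch of the calculation (the helper name is illustrative):

```python
def auto_batch_size(total_agents, bptt_horizon):
    """batch_size = total_agents * bptt_horizon, per the 'auto' rule above."""
    return total_agents * bptt_horizon

# 2048 agents x horizon 16
print(auto_batch_size(2048, 16))  # -> 32768
```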
Gradient accumulation
Process large minibatches in smaller chunks:
config = {
'minibatch_size': 8192, # Logical minibatch size
'max_minibatch_size': 2048, # Physical minibatch (fits in memory)
}
# Accumulates 4 physical minibatches (8192 / 2048)
From pufferlib/pufferl.py:120-138
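The number of accumulation steps follows directly from the two sizes; a minimal sketch (illustrative, not the library's internals):

```python
def accumulation_steps(minibatch_size, max_minibatch_size):
    """How many physical chunks make up one logical minibatch."""
    assert minibatch_size % max_minibatch_size == 0
    return minibatch_size // max_minibatch_size

print(accumulation_steps(8192, 2048))  # -> 4

# Gradients are averaged across the chunks, so each backward pass
# would scale its loss by 1 / steps before the optimizer step.
```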
Computational optimizations
Torch compile
Enable JIT compilation for faster model execution:
config = {
'compile': True,
'compile_mode': 'default', # or 'reduce-overhead', 'max-autotune'
}
Implementation:
if config['compile']:
self.policy = torch.compile(policy, mode=config['compile_mode'])
self.policy.forward_eval = torch.compile(policy, mode=config['compile_mode'])
pufferlib.pytorch.sample_logits = torch.compile(
pufferlib.pytorch.sample_logits, mode=config['compile_mode']
)
From pufferlib/pufferl.py:143-146
Mixed precision training
Use automatic mixed precision for faster computation:
config = {
'device': 'cuda',
'amp': True, # Enable automatic mixed precision
'precision': 'bfloat16', # or 'float32'
}
Use bfloat16 on modern GPUs (Ampere/Ada) for better numerical stability than float16.
From pufferlib/pufferl.py:193-198
Backend optimizations
Configure PyTorch backend settings:
# Set in PuffeRL.__init__
torch.set_float32_matmul_precision('high')
torch.backends.cudnn.deterministic = config['torch_deterministic']
torch.backends.cudnn.benchmark = True
From pufferlib/pufferl.py:60-62
Vectorization tuning
Zero-copy optimization
Enable zero-copy for contiguous worker blocks:
vecenv = pufferlib.vector.make(
env_creator,
num_envs=128,
num_workers=16,
batch_size=64,
zero_copy=True, # Requires batch_size divides num_envs
backend='Multiprocessing'
)
With zero_copy=True, num_envs must be divisible by batch_size.
From pufferlib/vector.py:256-259
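A quick validity check for the divisibility constraint (a standalone sketch, not part of the library):

```python
def check_zero_copy(num_envs, batch_size):
    """Per the constraint above: num_envs must be divisible by batch_size."""
    return num_envs % batch_size == 0

print(check_zero_copy(128, 64))  # -> True
print(check_zero_copy(128, 48))  # -> False
```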
Synchronous trajectories
Force trajectory synchronization for consistent experience:
vecenv = pufferlib.vector.make(
env_creator,
num_envs=128,
num_workers=16,
sync_traj=True, # Wait for all workers
backend='Multiprocessing'
)
From pufferlib/vector.py:240
Worker oversubscription
Allow more workers than physical cores:
vecenv = pufferlib.vector.make(
env_creator,
num_envs=256,
num_workers=32, # More than physical cores
overwork=True, # Bypass the safety check
backend='Multiprocessing'
)
Oversubscription usually decreases performance. Only use for I/O-bound environments.
From pufferlib/vector.py:246-253
Autotuning
Automatically find optimal vectorization parameters:
import pufferlib.vector
def make_env():
return your_env
pufferlib.vector.autotune(
make_env,
batch_size=4096,
max_envs=256,
time_per_test=5
)
From pufferlib/vector.py:740-925
Autotune output
Autotune profiles different configurations:
Profiling single-core performance for ~5 seconds
Profile complete
SPS: 1234.567
STD: 5.234%
Reset: 2.156%
RAM: 128.456 MB/env
Bandwidth: 0.234 GB/s
Throughput: 3.744 GB/s (16 cores)
SPS: 15234.567
num_envs: 128
num_workers: 16
batch_size: 64
backend: Multiprocessing
SPS: 14123.456
num_envs: 256
num_workers: 16
batch_size: 64
zero_copy: False
backend: Multiprocessing
Built-in profiling
PufferLib tracks execution time for each training phase:
class Profile:
def __init__(self, frequency=5):
self.profiles = defaultdict(lambda: defaultdict(float))
self.frequency = frequency
def __call__(self, name, epoch, nest=False):
if (epoch + 1) % self.frequency != 0:
return
if torch.cuda.is_available():
torch.cuda.synchronize()
tick = time.time()
# ... profiling logic
From pufferlib/pufferl.py:722-747
Dashboard metrics
The live dashboard shows performance breakdown:
Performance    Time    %
Evaluate       1.23s   45%
  Forward      0.45s   16%
  Env          0.67s   24%
  Copy         0.08s   3%
  Misc         0.03s   1%
Train          1.51s   55%
  Forward      0.89s   32%
  Learn        0.52s   19%
  Copy         0.05s   2%
  Misc         0.05s   2%
Optimized configuration example
import pufferlib
import pufferlib.vector
# Environment vectorization
vecenv = pufferlib.vector.make(
env_creator,
num_envs=256, # 256 parallel environments
num_workers=16, # 16 physical CPU cores
batch_size=128, # Process 128 at a time
zero_copy=True, # Enable zero-copy optimization
backend='Multiprocessing'
)
# Training configuration
config = {
# Device settings
'device': 'cuda',
'cpu_offload': False, # Keep observations on GPU if they fit
# Batch configuration
'batch_size': 32768,
'bptt_horizon': 16,
'minibatch_size': 4096,
'max_minibatch_size': 4096,
# Optimization
'compile': True,
'compile_mode': 'default',
'amp': True,
'precision': 'bfloat16',
# Backend
'torch_deterministic': False,
# Learning
'learning_rate': 3e-4,
'update_epochs': 4,
'total_timesteps': 1_000_000_000,
}
pufferl = pufferlib.PuffeRL(config, vecenv, policy)
Profile different configurations with autotune before committing to long training runs.
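Before a long run, it can also be worth sanity-checking the arithmetic relationships between these settings. A standalone sketch (the constraints are inferred from the relationships documented above; the helper is not part of PufferLib):

```python
def validate_config(num_envs, vec_batch_size, batch_size,
                    bptt_horizon, minibatch_size, max_minibatch_size):
    """Check the divisibility constraints implied by the settings above."""
    errors = []
    if num_envs % vec_batch_size != 0:
        errors.append("num_envs must be divisible by the vector batch_size")
    if batch_size % bptt_horizon != 0:
        errors.append("batch_size must be divisible by bptt_horizon")
    if batch_size % minibatch_size != 0:
        errors.append("batch_size must be divisible by minibatch_size")
    if minibatch_size % max_minibatch_size != 0:
        errors.append("minibatch_size must be divisible by max_minibatch_size")
    return errors

# Matches the optimized example above
print(validate_config(256, 128, 32768, 16, 4096, 4096))  # -> []
```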
Bottleneck identification
Environment bottleneck
If “Env” time dominates:
- Increase num_workers
- Optimize the environment implementation
- Use a faster simulation backend
Forward pass bottleneck
If “Forward” time dominates:
- Enable compile=True
- Use mixed precision (amp=True)
- Reduce model size
- Increase batch size
Copy bottleneck
If “Copy” time dominates:
- Enable zero_copy=True
- Reduce observation size
- Disable cpu_offload if possible
Memory bottleneck
If running out of memory:
- Enable cpu_offload=True
- Reduce batch_size
- Reduce max_minibatch_size
- Use gradient accumulation
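The decision logic above can be sketched as a small lookup over the dashboard's timing breakdown (names and hint text are illustrative):

```python
def dominant_phase(timings):
    """Return the phase consuming the largest share of wall time."""
    return max(timings, key=timings.get)

# Remedies paraphrased from the bottleneck guidance above
HINTS = {
    'env': 'increase num_workers or optimize the environment',
    'forward': 'enable compile=True and amp=True, or shrink the model',
    'copy': 'enable zero_copy=True or reduce observation size',
}

timings = {'env': 0.67, 'forward': 0.45, 'copy': 0.08}
phase = dominant_phase(timings)
print(phase, '->', HINTS[phase])
```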