PufferLib is a high-performance reinforcement learning framework built around three core architectural layers: environments, vectorization, and training. This design enables efficient parallel simulation and training at over 1M steps per second.

Design philosophy

PufferLib was created to solve common pain points in RL development:
  • Compatibility: Work seamlessly with any environment standard (Gymnasium, PettingZoo, or native)
  • Performance: Achieve maximum throughput through optimized vectorization and zero-copy operations
  • Simplicity: Provide a clean API that abstracts complexity while maintaining flexibility
  • Scalability: Support everything from single-process training to distributed clusters

Three-layer architecture

PufferLib’s architecture consists of three distinct layers that work together:

1. Environment layer

The foundation of PufferLib is the PufferEnv base class, which defines a standardized interface for all environments. This layer handles:
  • Observation and action spaces: Defines what agents can see and do
  • Agent management: Supports single and multi-agent environments
  • State updates: Manages environment state transitions
  • Shared memory buffers: Enables zero-copy vectorization
Environments can be:
  • Native PufferEnvs: Built directly with PufferLib for maximum performance
  • Emulated environments: Wrapped Gymnasium or PettingZoo environments
Native PufferEnvs handle all agents in a single Python instance and write directly to shared memory buffers, eliminating serialization overhead.
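For example, a native environment only declares its per-agent spaces and agent count, then writes into the buffers the base class binds. A minimal sketch (the CountingEnv name and its toy dynamics are invented for illustration):
# Minimal native PufferEnv sketch. The PufferEnv base class and its
# buffer attributes are real; the dynamics are placeholders.
import gymnasium
import numpy as np
import pufferlib

class CountingEnv(pufferlib.PufferEnv):
    def __init__(self, buf=None, seed=0):
        # Spaces and agent count must be set before super().__init__
        self.single_observation_space = gymnasium.spaces.Box(
            low=0, high=1, shape=(4,), dtype=np.float32)
        self.single_action_space = gymnasium.spaces.Discrete(2)
        self.num_agents = 8
        super().__init__(buf)  # binds self.observations, self.rewards, ...

    def reset(self, seed=0):
        self.observations[:] = 0  # write directly into the shared buffer
        return self.observations, []

    def step(self, actions):
        # Toy dynamics: random observations, rewards copied from actions
        self.observations[:] = np.random.rand(self.num_agents, 4).astype(np.float32)
        self.rewards[:] = actions.astype(np.float32)
        self.terminals[:] = False
        self.truncations[:] = False
        return self.observations, self.rewards, self.terminals, self.truncations, []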

2. Vectorization layer

The vectorization layer parallelizes environment execution across multiple processes or machines. PufferLib provides three backends, illustrated in the sketch after the lists below:
  • Serial: Single-process execution for debugging and simple use cases
  • Multiprocessing: Parallel execution across CPU cores with zero-copy shared memory
  • Ray: Distributed execution across machines
This layer is responsible for:
  • Managing worker processes
  • Coordinating data transfer between workers
  • Batching observations and actions
  • Handling asynchronous execution
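As a quick illustration, switching backends is a one-argument change. A minimal sketch, assuming env_creator is a placeholder for any environment constructor (as in the data flow example below):
# Backend selection sketch: Serial for debugging, Multiprocessing for
# throughput. env_creator is a placeholder.
import pufferlib.vector

# Single-process execution, easiest to debug and profile
debug_vecenv = pufferlib.vector.make(
    env_creator,
    num_envs=1,
    backend=pufferlib.vector.Serial,
)

# Parallel execution across CPU cores with zero-copy shared memory
fast_vecenv = pufferlib.vector.make(
    env_creator,
    num_envs=128,
    num_workers=8,
    backend=pufferlib.vector.Multiprocessing,
)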

3. Training layer

The training layer (PufferTank) sits on top of vectorization and handles:
  • Policy networks and value functions
  • Learning algorithms (PPO, etc.)
  • Experience collection and replay buffers
  • Optimization and gradient updates
  • Logging and checkpointing
While this documentation focuses on the environment and vectorization layers, PufferTank completes the stack by providing production-ready training implementations.
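As a rough illustration of the first two items above, a policy is just a network that maps a batch of observations to actions and values. The sketch below is illustrative only, not PufferTank's API:
# Illustrative actor-critic policy sketch (not PufferTank's API),
# matching the policy(observations) call in the data flow example.
import torch
import torch.nn as nn

class Policy(nn.Module):
    def __init__(self, obs_dim=4, num_actions=3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU())
        self.actor = nn.Linear(64, num_actions)  # action logits
        self.critic = nn.Linear(64, 1)           # value function

    def forward(self, observations):
        hidden = self.encoder(torch.as_tensor(observations, dtype=torch.float32))
        logits = self.actor(hidden)
        # Sample one discrete action per environment in the batch
        return torch.distributions.Categorical(logits=logits).sample().numpy()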

Data flow

Here’s how data flows through the architecture:
  1. Policy inference: The policy network generates actions for a batch of observations
  2. Action distribution: Actions are sent to worker processes via shared memory or pipes
  3. Environment execution: Each worker steps its environments with the provided actions
  4. Observation collection: New observations are written to shared buffers
  5. Batch assembly: The vectorization layer assembles observations into training batches
  6. Experience processing: The training layer processes experiences and updates the policy
# Simplified data flow example. env_creator, policy, train, and the
# training flag are placeholders for your own environment and trainer.
import pufferlib
import pufferlib.vector

# Create vectorized environments
vecenv = pufferlib.vector.make(
    env_creator,
    num_envs=128,
    num_workers=8,
    backend=pufferlib.vector.Multiprocessing
)

# Reset to get initial observations
observations, infos = vecenv.reset()

# Training loop
while training:
    # 1. Policy generates actions from observations
    actions = policy(observations)
    
    # 2. Step environments
    observations, rewards, terminals, truncations, infos = vecenv.step(actions)
    
    # 3. Process experiences for learning
    train(observations, rewards, terminals, truncations)

Memory management

PufferLib uses shared memory buffers to minimize data copying:
  • Pre-allocated buffers: All observation, action, reward, terminal, and truncation arrays are pre-allocated
  • In-place updates: Environments write directly to shared buffers
  • Zero-copy transfer: Data moves between processes without serialization
This design is critical for achieving high throughput:
class MyEnv(pufferlib.PufferEnv):
    def step(self, actions):
        # Write directly to pre-allocated buffers
        # (the compute_* calls are placeholders for your environment logic)
        self.observations[:] = compute_observations()
        self.rewards[:] = compute_rewards()
        self.terminals[:] = compute_terminals()

        infos = []  # per-agent info dicts, empty in this sketch
        return self.observations, self.rewards, self.terminals, self.truncations, infos

Space handling

PufferLib standardizes how observation and action spaces work:
  • single_observation_space: The observation space for one agent
  • single_action_space: The action space for one agent
  • observation_space: Joint space for all agents (automatically computed)
  • action_space: Joint action space for all agents (automatically computed)
This dual-space system makes it easy to work with both single and multi-agent environments:
import gymnasium
import pufferlib

class MultiAgentEnv(pufferlib.PufferEnv):
    def __init__(self, buf=None, seed=0):
        # Define spaces for a single agent
        self.single_observation_space = gymnasium.spaces.Box(low=0, high=1, shape=(4,))
        self.single_action_space = gymnasium.spaces.Discrete(3)
        self.num_agents = 16  # Environment has 16 agents
        
        # PufferEnv automatically creates joint spaces:
        # observation_space: Box(shape=(16, 4))
        # action_space: MultiDiscrete([3] * 16)
        super().__init__(buf)
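Continuing the sketch above, the joint spaces can be inspected directly once the environment is constructed (with buf=None, the base class allocates its own buffers); the expected shapes follow the comments in the class:
# Quick check of the automatically computed joint spaces
env = MultiAgentEnv()
print(env.single_observation_space.shape)  # (4,)
print(env.observation_space.shape)         # (16, 4)
print(env.action_space.sample().shape)     # (16,) -- one action per agent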

Extension points

The architecture provides several extension points:
  • Custom environments: Subclass PufferEnv for native environments
  • Custom vectorization: Implement new backends for specialized hardware
  • Emulation layers: Add support for new environment standards
  • Wrappers: Modify environment behavior without changing the core (see the sketch below)
This modularity allows you to customize any layer while maintaining compatibility with the rest of the stack.
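For instance, a wrapper can rescale rewards without touching the environment itself. The RewardScale class below is a hypothetical sketch, not a PufferLib API; it delegates everything else to the wrapped environment:
# Hypothetical reward-scaling wrapper sketch (not a PufferLib API)
class RewardScale:
    def __init__(self, env, scale=0.1):
        self.env = env
        self.scale = scale

    def __getattr__(self, name):
        # Forward spaces, num_agents, etc. to the wrapped environment
        return getattr(self.env, name)

    def reset(self, seed=0):
        return self.env.reset(seed)

    def step(self, actions):
        obs, rewards, terminals, truncations, infos = self.env.step(actions)
        rewards *= self.scale  # in place: rewards live in the shared buffer
        return obs, rewards, terminals, truncations, infos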
