PufferLib is a high-performance reinforcement learning framework built around three core architectural layers: environments, vectorization, and training. This design enables efficient parallel simulation and training at over 1M steps per second.

Design philosophy

PufferLib was created to solve common pain points in RL development:
  • Compatibility: Work seamlessly with any environment standard (Gymnasium, PettingZoo, or native)
  • Performance: Achieve maximum throughput through optimized vectorization and zero-copy operations
  • Simplicity: Provide a clean API that abstracts complexity while maintaining flexibility
  • Scalability: Support everything from single-process training to distributed clusters

Three-layer architecture

PufferLib’s architecture consists of three distinct layers that work together:

1. Environment layer

The foundation of PufferLib is the PufferEnv base class, which defines a standardized interface for all environments. This layer handles:
  • Observation and action spaces: Defines what agents can see and do
  • Agent management: Supports single and multi-agent environments
  • State updates: Manages environment state transitions
  • Shared memory buffers: Enables zero-copy vectorization
Environments can be:
  • Native PufferEnvs: Built directly with PufferLib for maximum performance
  • Emulated environments: Wrapped Gymnasium or PettingZoo environments
Native PufferEnvs handle all agents in a single Python instance and write directly to shared memory buffers, eliminating serialization overhead.
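For example, a native environment only declares its per-agent spaces and agent count, then writes into the buffers the base class binds. A minimal sketch (the CountingEnv name and its toy dynamics are invented for illustration):
# Minimal native PufferEnv sketch. The PufferEnv base class and its
# buffer attributes are real; the dynamics are placeholders.
import gymnasium
import numpy as np
import pufferlib

class CountingEnv(pufferlib.PufferEnv):
    def __init__(self, buf=None, seed=0):
        # Spaces and agent count must be set before super().__init__
        self.single_observation_space = gymnasium.spaces.Box(
            low=0, high=1, shape=(4,), dtype=np.float32)
        self.single_action_space = gymnasium.spaces.Discrete(2)
        self.num_agents = 8
        super().__init__(buf)  # binds self.observations, self.rewards, ...

    def reset(self, seed=0):
        self.observations[:] = 0  # write directly into the shared buffer
        return self.observations, []

    def step(self, actions):
        # Toy dynamics: random observations, rewards copied from actions
        self.observations[:] = np.random.rand(self.num_agents, 4).astype(np.float32)
        self.rewards[:] = actions.astype(np.float32)
        self.terminals[:] = False
        self.truncations[:] = False
        return self.observations, self.rewards, self.terminals, self.truncations, []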

2. Vectorization layer

The vectorization layer parallelizes environment execution across multiple processes or machines. PufferLib provides three backends, illustrated in the sketch after the lists below:
  • Serial: Single-process execution for debugging and simple use cases
  • Multiprocessing: Parallel execution across CPU cores with zero-copy shared memory
  • Ray: Distributed execution across machines
This layer is responsible for:
  • Managing worker processes
  • Coordinating data transfer between workers
  • Batching observations and actions
  • Handling asynchronous execution
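As a quick illustration, switching backends is a one-argument change. A minimal sketch, assuming env_creator is a placeholder for any environment constructor (as in the data flow example below):
# Backend selection sketch: Serial for debugging, Multiprocessing for
# throughput. env_creator is a placeholder.
import pufferlib.vector

# Single-process execution, easiest to debug and profile
debug_vecenv = pufferlib.vector.make(
    env_creator,
    num_envs=1,
    backend=pufferlib.vector.Serial,
)

# Parallel execution across CPU cores with zero-copy shared memory
fast_vecenv = pufferlib.vector.make(
    env_creator,
    num_envs=128,
    num_workers=8,
    backend=pufferlib.vector.Multiprocessing,
)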

3. Training layer

The training layer (PufferTank) sits on top of vectorization and handles:
  • Policy networks and value functions
  • Learning algorithms (PPO, etc.)
  • Experience collection and replay buffers
  • Optimization and gradient updates
  • Logging and checkpointing
While this documentation focuses on the environment and vectorization layers, PufferTank completes the stack by providing production-ready training implementations.
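As a rough illustration of the first two items above, a policy is just a network that maps a batch of observations to actions and values. The sketch below is illustrative only, not PufferTank's API:
# Illustrative actor-critic policy sketch (not PufferTank's API),
# matching the policy(observations) call in the data flow example.
import torch
import torch.nn as nn

class Policy(nn.Module):
    def __init__(self, obs_dim=4, num_actions=3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU())
        self.actor = nn.Linear(64, num_actions)  # action logits
        self.critic = nn.Linear(64, 1)           # value function

    def forward(self, observations):
        hidden = self.encoder(torch.as_tensor(observations, dtype=torch.float32))
        logits = self.actor(hidden)
        # Sample one discrete action per environment in the batch
        return torch.distributions.Categorical(logits=logits).sample().numpy()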

Data flow

Here’s how data flows through the architecture:
  1. Policy inference: The policy network generates actions for a batch of observations
  2. Action distribution: Actions are sent to worker processes via shared memory or pipes
  3. Environment execution: Each worker steps its environments with the provided actions
  4. Observation collection: New observations are written to shared buffers
  5. Batch assembly: The vectorization layer assembles observations into training batches
  6. Experience processing: The training layer processes experiences and updates the policy
# Simplified data flow example. env_creator, policy, train, and the
# training flag are placeholders for your own environment and trainer.
import pufferlib
import pufferlib.vector

# Create vectorized environments
vecenv = pufferlib.vector.make(
    env_creator,
    num_envs=128,
    num_workers=8,
    backend=pufferlib.vector.Multiprocessing
)

# Reset to get initial observations
observations, infos = vecenv.reset()

# Training loop
while training:
    # 1. Policy generates actions from observations
    actions = policy(observations)
    
    # 2. Step environments
    observations, rewards, terminals, truncations, infos = vecenv.step(actions)
    
    # 3. Process experiences for learning
    train(observations, rewards, terminals, truncations)

Memory management

PufferLib uses shared memory buffers to minimize data copying:
  • Pre-allocated buffers: All observation, action, reward, terminal, and truncation arrays are pre-allocated
  • In-place updates: Environments write directly to shared buffers
  • Zero-copy transfer: Data moves between processes without serialization
This design is critical for achieving high throughput:
class MyEnv(pufferlib.PufferEnv):
    def step(self, actions):
        # Write directly to pre-allocated buffers
        # (the compute_* calls are placeholders for your environment logic)
        self.observations[:] = compute_observations()
        self.rewards[:] = compute_rewards()
        self.terminals[:] = compute_terminals()

        infos = []  # per-agent info dicts, empty in this sketch
        return self.observations, self.rewards, self.terminals, self.truncations, infos

Space handling

PufferLib standardizes how observation and action spaces work:
  • single_observation_space: The observation space for one agent
  • single_action_space: The action space for one agent
  • observation_space: Joint space for all agents (automatically computed)
  • action_space: Joint action space for all agents (automatically computed)
This dual-space system makes it easy to work with both single and multi-agent environments:
import gymnasium
import pufferlib

class MultiAgentEnv(pufferlib.PufferEnv):
    def __init__(self, buf=None, seed=0):
        # Define spaces for a single agent
        self.single_observation_space = gymnasium.spaces.Box(low=0, high=1, shape=(4,))
        self.single_action_space = gymnasium.spaces.Discrete(3)
        self.num_agents = 16  # Environment has 16 agents
        
        # PufferEnv automatically creates joint spaces:
        # observation_space: Box(shape=(16, 4))
        # action_space: MultiDiscrete([3] * 16)
        super().__init__(buf)
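Continuing the sketch above, the joint spaces can be inspected directly once the environment is constructed (with buf=None, the base class allocates its own buffers); the expected shapes follow the comments in the class:
# Quick check of the automatically computed joint spaces
env = MultiAgentEnv()
print(env.single_observation_space.shape)  # (4,)
print(env.observation_space.shape)         # (16, 4)
print(env.action_space.sample().shape)     # (16,) -- one action per agent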

Extension points

The architecture provides several extension points:
  • Custom environments: Subclass PufferEnv for native environments
  • Custom vectorization: Implement new backends for specialized hardware
  • Emulation layers: Add support for new environment standards
  • Wrappers: Modify environment behavior without changing the core (see the sketch below)
This modularity allows you to customize any layer while maintaining compatibility with the rest of the stack.
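For instance, a wrapper can rescale rewards without touching the environment itself. The RewardScale class below is a hypothetical sketch, not a PufferLib API; it delegates everything else to the wrapped environment:
# Hypothetical reward-scaling wrapper sketch (not a PufferLib API)
class RewardScale:
    def __init__(self, env, scale=0.1):
        self.env = env
        self.scale = scale

    def __getattr__(self, name):
        # Forward spaces, num_agents, etc. to the wrapped environment
        return getattr(self.env, name)

    def reset(self, seed=0):
        return self.env.reset(seed)

    def step(self, actions):
        obs, rewards, terminals, truncations, infos = self.env.step(actions)
        rewards *= self.scale  # in place: rewards live in the shared buffer
        return obs, rewards, terminals, truncations, infos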
