Design philosophy
PufferLib was created to solve common pain points in RL development:
- Compatibility: Work seamlessly with any environment standard (Gymnasium, PettingZoo, or native)
- Performance: Achieve maximum throughput through optimized vectorization and zero-copy operations
- Simplicity: Provide a clean API that abstracts complexity while maintaining flexibility
- Scalability: Support everything from single-process training to distributed clusters
Three-layer architecture
PufferLib’s architecture consists of three distinct layers that work together.
1. Environment layer
The foundation of PufferLib is the PufferEnv base class, which defines a standardized interface for all environments. This layer handles:
- Observation and action spaces: Defines what agents can see and do
- Agent management: Supports single and multi-agent environments
- State updates: Manages environment state transitions
- Shared memory buffers: Enables zero-copy vectorization
PufferLib supports two kinds of environments:
- Native PufferEnvs: Built directly with PufferLib for maximum performance
- Emulated environments: Wrapped Gymnasium or PettingZoo environments
Native PufferEnvs handle all agents in a single Python instance and write directly to shared memory buffers, eliminating serialization overhead.
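A minimal sketch of the native-environment idea described above (the names here are illustrative, not the PufferLib API): one Python instance handles all agents and writes observations in place into a caller-provided buffer, so stepping never allocates or copies per-agent data.

```python
# Illustrative sketch, not the PufferLib API: a native-style environment
# that writes observations directly into pre-allocated shared storage.
class ToyNativeEnv:
    def __init__(self, num_agents, obs_buffer):
        self.num_agents = num_agents
        self.obs = obs_buffer              # pre-allocated, shared with caller

    def reset(self):
        for i in range(self.num_agents):
            self.obs[i] = 0.0              # in-place write, no new array
        return self.obs

    def step(self, actions):
        rewards = []
        for i, action in enumerate(actions):
            self.obs[i] += action          # state update lands in the buffer
            rewards.append(float(action))
        return self.obs, rewards

buf = [0.0] * 4                            # stands in for a shared-memory array
env = ToyNativeEnv(4, buf)
env.reset()
obs, rewards = env.step([1, 0, 1, 1])
assert obs is buf                          # same storage end to end: zero copy
```

Because the caller owns the buffer, a vectorizer can hand each environment a slice of one large shared-memory array and read results back without serialization.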
2. Vectorization layer
The vectorization layer parallelizes environment execution across multiple processes or threads. PufferLib provides three backends:
- Serial: Single-process execution for debugging and simple use cases
- Multiprocessing: Parallel execution across CPU cores with zero-copy shared memory
- Ray: Distributed execution across machines
Regardless of backend, this layer is responsible for:
- Managing worker processes
- Coordinating data transfer between workers
- Batching observations and actions
- Handling asynchronous execution
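The responsibilities above can be sketched with a toy serial backend (our own code, not PufferLib's Serial implementation): it steps several environments, auto-resets finished ones, and batches the results.

```python
# Illustrative serial vectorization sketch, not PufferLib's backend.
class CountEnv:
    """Toy env: a counter that ends its episode at 3."""
    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += action
        done = self.t >= 3
        return self.t, float(action), done

class SerialVecEnv:
    def __init__(self, env_fns):
        self.envs = [fn() for fn in env_fns]

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, batched_actions):
        obs, rews, dones = [], [], []
        for env, action in zip(self.envs, batched_actions):
            o, r, d = env.step(action)
            if d:                          # auto-reset, as vector APIs do
                o = env.reset()
            obs.append(o)
            rews.append(r)
            dones.append(d)
        return obs, rews, dones

vec = SerialVecEnv([CountEnv for _ in range(3)])
vec.reset()
obs, rews, dones = vec.step([1, 2, 3])
```

The Multiprocessing and Ray backends play the same role, but move the per-environment stepping into workers and the batching into shared memory or network transfers.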
3. Training layer
The training layer (PufferTank) sits on top of vectorization and handles:
- Policy networks and value functions
- Learning algorithms (PPO, etc.)
- Experience collection and replay buffers
- Optimization and gradient updates
- Logging and checkpointing
While this documentation focuses on the environment and vectorization layers, PufferTank completes the stack by providing production-ready training implementations.
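The core cycle this layer runs can be sketched with stand-in code (our own toy functions, not PufferTank's implementation): collect a batch of experience from the vectorized environments, then nudge the policy toward rewarded actions.

```python
# Toy collect-then-update skeleton, not PufferTank code.
import random

random.seed(0)                             # deterministic toy run

def collect(weight, num_steps):
    """Toy rollout: act '1' with probability `weight`; action '1' is rewarded."""
    actions = (1 if random.random() < weight else 0 for _ in range(num_steps))
    return [(a, float(a)) for a in actions]

def update(weight, batch, lr=0.1):
    """Toy update: move the action probability toward the mean reward."""
    mean_reward = sum(r for _, r in batch) / len(batch)
    return min(1.0, weight + lr * mean_reward)

w = 0.5
for _ in range(20):
    w = update(w, collect(w, num_steps=32))
# After training, w has drifted toward the rewarded action.
```

Real implementations replace `collect` with batched rollouts through the vectorization layer and `update` with PPO-style gradient steps, but the alternation is the same.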
Data flow
Here’s how data flows through the architecture:
- Policy inference: The policy network generates actions for a batch of observations
- Action distribution: Actions are sent to worker processes via shared memory or pipes
- Environment execution: Each worker steps its environments with the provided actions
- Observation collection: New observations are written to shared buffers
- Batch assembly: The vectorization layer assembles observations into training batches
- Experience processing: The training layer processes experiences and updates the policy
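The six steps above, compressed into a toy synchronous loop (all names are illustrative, not PufferLib's API):

```python
# One data-flow iteration in miniature, not PufferLib code.
def policy(obs_batch):
    # 1. Policy inference: one action per observation in the batch
    return [1 if o < 5 else 0 for o in obs_batch]

states = [0, 0, 0, 0]                  # toy per-env state: one counter each

for _ in range(10):
    actions = policy(states)           # 2. actions handed to the envs
    for i, a in enumerate(actions):    # 3. each env steps with its action
        states[i] += a                 # 4. new observations written back
    obs_batch = list(states)           # 5. batch assembled for training
    # 6. experience processing / policy update would happen here
```

In PufferLib, steps 2-5 cross process boundaries via shared memory or pipes rather than a Python loop, but the cycle is identical.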
Memory management
PufferLib uses shared memory buffers to minimize data copying:
- Pre-allocated buffers: All observation, action, reward, and done arrays are pre-allocated
- In-place updates: Environments write directly to shared buffers
- Zero-copy transfer: Data moves between processes without serialization
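A stdlib sketch of the zero-copy pattern (our own toy code, not PufferLib internals): a worker process writes float observations directly into a SharedMemory block, and the parent reads them back with no pickling or copying in between.

```python
# Zero-copy transfer via the stdlib, illustrating the idea only.
import struct
from multiprocessing import get_context, shared_memory

NUM_OBS = 4
ITEM = struct.calcsize('d')            # one float64 per observation

def worker(shm_name):
    shm = shared_memory.SharedMemory(name=shm_name)
    for i in range(NUM_OBS):
        # in-place write into the shared buffer: no serialization at all
        struct.pack_into('d', shm.buf, i * ITEM, i * 0.5)
    shm.close()

shm = shared_memory.SharedMemory(create=True, size=NUM_OBS * ITEM)
ctx = get_context('fork')              # fork keeps this POSIX sketch simple
p = ctx.Process(target=worker, args=(shm.name,))
p.start()
p.join()
obs = [struct.unpack_from('d', shm.buf, i * ITEM)[0] for i in range(NUM_OBS)]
shm.close()
shm.unlink()
```

PufferLib applies the same principle with array-shaped views over pre-allocated buffers, so an environment step in a worker is immediately visible to the trainer.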
Space handling
PufferLib standardizes how observation and action spaces work:
- single_observation_space: The observation space for one agent
- single_action_space: The action space for one agent
- observation_space: Joint space for all agents (automatically computed)
- action_space: Joint action space for all agents (automatically computed)
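The joint-space computation can be illustrated with a small helper (the helper name is ours; PufferLib derives these automatically): the per-agent shape is stacked along a leading agent dimension.

```python
# Illustrative joint-space derivation; not a PufferLib function.
def joint_shape(single_shape, num_agents):
    """Stack a per-agent observation shape along a leading agent axis."""
    return (num_agents,) + tuple(single_shape)

single_observation_space = (3,)        # one agent sees 3 features
observation_space = joint_shape(single_observation_space, num_agents=8)
assert observation_space == (8, 3)     # batched across all 8 agents
```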
Extension points
The architecture provides several extension points:
- Custom environments: Subclass PufferEnv for native environments
- Custom vectorization: Implement new backends for specialized hardware
- Emulation layers: Add support for new environment standards
- Wrappers: Modify environment behavior without changing the core
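The wrapper extension point can be sketched as follows (class names are ours, not PufferLib's): behavior is modified from the outside, so the wrapped environment's code never changes.

```python
# Illustrative wrapper sketch, not a PufferLib class.
class ToyEnv:
    def reset(self):
        return 0

    def step(self, action):
        return action, 1.0, False      # obs, reward, done

class RewardScaleWrapper:
    """Scales rewards without touching the underlying environment."""
    def __init__(self, env, scale):
        self.env = env
        self.scale = scale

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, reward, done = self.env.step(action)
        return obs, reward * self.scale, done

env = RewardScaleWrapper(ToyEnv(), scale=0.1)
env.reset()
obs, reward, done = env.step(1)
```

Because the wrapper exposes the same reset/step interface, it can be dropped anywhere the bare environment was used, including under the vectorization layer.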