PufferLib’s vectorization layer runs multiple environment instances in parallel, dramatically increasing training throughput. The pufferlib.vector.make() function creates vectorized environments with different backends optimized for various use cases.
Vector backends
PufferLib provides three vectorization backends: Serial, Multiprocessing, and Ray. The simplest is Serial:
# Single-process execution
vecenv = pufferlib.vector.make(
    env_creator,
    num_envs=4,
    backend=pufferlib.vector.Serial
)
Serial backend
The Serial backend runs all environments sequentially in a single process. It’s useful for:
- Debugging environment implementations
- Development on machines with limited cores
- Environments with low computational cost
- Testing before scaling to multiprocessing
examples/vectorization.py
import pufferlib.vector

serial_vecenv = pufferlib.vector.make(
    SamplePufferEnv,
    num_envs=2,
    backend=pufferlib.vector.Serial
)

observations, infos = serial_vecenv.reset()
actions = serial_vecenv.action_space.sample()
o, r, d, t, i = serial_vecenv.step(actions)
Serial implementation details
The Serial backend creates a list of environment instances and steps them sequentially:
class Serial:
    def __init__(self, env_creators, env_args, env_kwargs, num_envs, buf=None, seed=0, **kwargs):
        self.driver_env = env_creators[0](*env_args[0], **env_kwargs[0])
        self.agents_per_batch = self.driver_env.num_agents * num_envs

        # Pre-allocate shared buffers
        set_buffers(self, buf)

        # Create environments with buffer slices
        self.envs = []
        ptr = 0
        for i in range(num_envs):
            end = ptr + self.driver_env.num_agents
            buf_i = dict(
                observations=self.observations[ptr:end],
                rewards=self.rewards[ptr:end],
                terminals=self.terminals[ptr:end],
                truncations=self.truncations[ptr:end],
                masks=self.masks[ptr:end],
                actions=self.actions[ptr:end]
            )
            env = env_creators[i](*env_args[i], buf=buf_i, **env_kwargs[i])
            self.envs.append(env)
            ptr = end
Even in Serial mode, environments write to shared buffers, making it easy to switch backends without code changes.
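The buffer-slice pattern can be illustrated with plain NumPy (a standalone sketch, not PufferLib code): each "environment" receives a view into one shared parent array, so whatever it writes appears in the parent buffer without any copying.

```python
import numpy as np

num_envs, agents_per_env, obs_dim = 2, 3, 4

# Parent-owned buffer covering every agent across all environments
observations = np.zeros((num_envs * agents_per_env, obs_dim), dtype=np.float32)

# Hand each environment a slice (a view, not a copy) of the parent buffer
env_views = [
    observations[i * agents_per_env:(i + 1) * agents_per_env]
    for i in range(num_envs)
]

# An environment writing into its view updates the shared buffer directly
env_views[1][:] = 1.0
print(observations[agents_per_env:].sum())  # 12.0: all of env 1's slots were set
```

Because each slice is a view, the parent never copies observations out of the environments; this is what makes switching from Serial to Multiprocessing transparent.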
Multiprocessing backend
The Multiprocessing backend is the workhorse of PufferLib. It runs environments in parallel across CPU cores with zero-copy shared memory:
vecenv = pufferlib.vector.make(
    env_creator,
    num_envs=128,    # Total number of environments
    num_workers=8,   # Number of worker processes
    batch_size=32,   # Environments per training batch
    backend=pufferlib.vector.Multiprocessing
)
Key parameters
- num_envs: Total number of environment instances to run
- num_workers: Number of parallel worker processes (typically equal to the number of CPU cores)
- batch_size: Number of environments to collect before returning data
- zero_copy: Enable zero-copy mode (requires num_envs % batch_size == 0)
- overwork: Allow num_workers > cpu_cores (disabled by default)
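The divisibility constraint on zero_copy is easy to check before constructing the vecenv. A small sketch with a hypothetical helper (not part of PufferLib):

```python
def zero_copy_ok(num_envs: int, batch_size: int) -> bool:
    """Zero-copy mode requires batches to map onto contiguous blocks of
    environments, which means num_envs must divide evenly into batches."""
    return num_envs % batch_size == 0

print(zero_copy_ok(128, 32))  # True: 128 = 4 * 32
print(zero_copy_ok(128, 48))  # False: 128 is not divisible by 48
```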
Shared memory architecture
Multiprocessing uses shared memory buffers to avoid data serialization:
from multiprocessing import RawArray

self.shm = dict(
    observations=RawArray(obs_ctype, num_agents * int(np.prod(obs_shape))),
    actions=RawArray(atn_ctype, num_agents * int(np.prod(atn_shape))),
    rewards=RawArray('f', num_agents),
    terminals=RawArray('b', num_agents),
    truncateds=RawArray('b', num_agents),
    masks=RawArray('b', num_agents),
    semaphores=RawArray('c', num_workers),
    notify=RawArray('b', num_workers),
)
Worker processes access these buffers directly:
buf = dict(
    observations=np.ndarray((*shape, *obs_shape),
        dtype=obs_dtype, buffer=shm['observations'])[worker_idx],
    rewards=np.ndarray(shape, dtype=np.float32, buffer=shm['rewards'])[worker_idx],
    # ...
)
Shared memory eliminates serialization overhead. Data written by workers is instantly visible to the main process without copying.
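The mechanism can be seen in a single process: a NumPy array built over a multiprocessing.RawArray is a view of the raw memory, so a write through one view is immediately visible through any other. A minimal sketch of the idea (not PufferLib's actual buffer layout):

```python
from multiprocessing import RawArray

import numpy as np

num_agents = 4
shm = RawArray('f', num_agents)  # raw float32 block, no locks

# Two independent NumPy views over the same underlying memory
writer = np.ndarray((num_agents,), dtype=np.float32, buffer=shm)
reader = np.ndarray((num_agents,), dtype=np.float32, buffer=shm)

writer[:] = [1.0, 2.0, 3.0, 4.0]
print(reader.sum())  # 10.0 -- the write is visible without any copy
```

When the RawArray is passed to a child process at fork time, the same property holds across process boundaries, which is what removes the serialization step.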
Synchronization modes
PufferLib supports three synchronization strategies:
1. Full sync (batch_size = num_envs)
Wait for all workers before returning data:
vecenv = pufferlib.vector.make(
    env_creator,
    num_envs=128,
    num_workers=8,
    batch_size=128,  # Same as num_envs
    backend=pufferlib.vector.Multiprocessing
)
Pros: Predictable timing, easy to reason about
Cons: Slowest worker determines throughput
2. Partial sync (zero_copy=True)
Wait for contiguous blocks of workers:
vecenv = pufferlib.vector.make(
    env_creator,
    num_envs=128,
    num_workers=8,
    batch_size=32,
    zero_copy=True,  # Requires num_envs % batch_size == 0
    backend=pufferlib.vector.Multiprocessing
)
Pros: Lower latency than full sync, zero-copy efficiency
Cons: Still waits for contiguous worker blocks
3. Full async (zero_copy=False)
Return data from any available workers:
vecenv = pufferlib.vector.make(
    env_creator,
    num_envs=128,
    num_workers=8,
    batch_size=32,
    zero_copy=False,  # Allow non-contiguous workers
    backend=pufferlib.vector.Multiprocessing
)
Pros: Minimum latency, maximum throughput
Cons: Small copy overhead for non-contiguous data
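Why full sync pays a straggler penalty while async does not can be sketched with concurrent.futures (a standalone illustration with made-up worker step times, not PufferLib internals):

```python
import concurrent.futures as cf
import time

def worker_step(env_id, step_time):
    time.sleep(step_time)  # stand-in for an environment step
    return env_id

step_times = [0.005, 0.005, 0.005, 0.05]  # one straggler worker

# Full sync: wait for every worker -> bounded by the slowest (~0.05s)
with cf.ThreadPoolExecutor(max_workers=4) as pool:
    start = time.perf_counter()
    futures = [pool.submit(worker_step, i, t) for i, t in enumerate(step_times)]
    cf.wait(futures, return_when=cf.ALL_COMPLETED)
    full_sync = time.perf_counter() - start

# Async: take whichever worker finishes first (~0.005s)
with cf.ThreadPoolExecutor(max_workers=4) as pool:
    start = time.perf_counter()
    futures = [pool.submit(worker_step, i, t) for i, t in enumerate(step_times)]
    first = next(cf.as_completed(futures))
    async_latency = time.perf_counter() - start

print(full_sync > async_latency)  # True: full sync pays for the straggler
```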
Async API
Multiprocessing supports an async API for maximum control:
examples/vectorization.py
vecenv = pufferlib.vector.make(
    SamplePufferEnv,
    num_envs=2,
    num_workers=2,
    batch_size=1,
    backend=pufferlib.vector.Multiprocessing
)

# Async reset
vecenv.async_reset()
o, r, d, t, i, env_ids, masks = vecenv.recv()

# Async step
actions = vecenv.action_space.sample()
vecenv.send(actions)

# Do other work here (e.g., policy inference)
# while environments run in the background

# Get results when ready
o, r, d, t, i, env_ids, masks = vecenv.recv()
The async API returns additional data:
- env_ids: Which environments produced this batch
- masks: Which agents are active (for variable-agent environments)
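For variable-agent environments, masks lets you drop inactive agent slots before computing statistics or losses. A minimal NumPy sketch of how a returned mask might be applied (the shapes and values are illustrative):

```python
import numpy as np

# Example batch from recv(): 4 agent slots, one currently inactive
rewards = np.array([1.0, 0.5, -1.0, 2.0], dtype=np.float32)
masks = np.array([1, 1, 0, 1], dtype=bool)

# Keep only transitions from active agents
active_rewards = rewards[masks]
print(active_rewards)        # only active agents' rewards remain
print(active_rewards.mean()) # statistics over active agents only
```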
Here’s when to use each backend:

Serial

Use when:
- Debugging environment code
- Environments are very fast (< 0.1 ms per step)
- Single-core machines
- Development and testing

Performance:
- No parallelism overhead
- Easy to profile and debug
- Step time grows linearly with num_envs

Multiprocessing

Use when:
- Environments take > 1 ms per step
- A multi-core CPU is available
- Maximum throughput is needed
- Production training

Performance:
- Near-linear scaling up to the number of physical cores
- Roughly 10-100x speedup on 8-16 core machines
- Zero-copy mode minimizes overhead
- Best for CPU-bound environments

Ray

Use when:
- Distributed training across machines
- Very expensive environments (> 100 ms per step)
- Cluster resources are available
- Horizontal scaling is needed

Performance:
- Scales across machines
- Higher overhead than Multiprocessing
- Best for expensive simulations
Passing arguments to environments
You can pass arguments to environment constructors in several ways:
Same arguments for all environments
examples/vectorization.py
vecenv = pufferlib.vector.make(
    SamplePufferEnv,
    num_envs=2,
    backend=pufferlib.vector.Serial,
    env_args=[3],           # Positional args
    env_kwargs={'bar': 4}   # Keyword args
)
Different arguments per environment
examples/vectorization.py
vecenv = pufferlib.vector.make(
    [SamplePufferEnv, SamplePufferEnv],  # List of creators
    num_envs=2,
    backend=pufferlib.vector.Serial,
    env_args=[[3], [4]],                 # Different args per env
    env_kwargs=[{'bar': 4}, {'bar': 5}]  # Different kwargs per env
)
Autotune
PufferLib includes an autotune function to find optimal vectorization parameters:
configs = pufferlib.vector.autotune(
    env_creator,
    batch_size=128,
    max_envs=256,
    time_per_test=5
)
Autotune profiles your environment and tests different configurations to find:
- Optimal num_envs
- Best num_workers setting
- Whether zero_copy helps
- Expected throughput (steps per second)
Run autotune once per environment to determine the best configuration for your hardware. Results vary based on environment complexity and CPU architecture.
Common patterns
Maximizing throughput
import psutil

vecenv = pufferlib.vector.make(
    env_creator,
    num_envs=128,
    num_workers=psutil.cpu_count(logical=False),  # Physical cores
    batch_size=128,
    zero_copy=True,
    backend=pufferlib.vector.Multiprocessing
)
Minimizing latency
vecenv = pufferlib.vector.make(
    env_creator,
    num_envs=32,
    num_workers=8,
    batch_size=8,     # Small batches
    zero_copy=False,  # Full async
    backend=pufferlib.vector.Multiprocessing
)
Development and debugging
vecenv = pufferlib.vector.make(
    env_creator,
    num_envs=1,
    backend=pufferlib.vector.Serial  # Easy to debug
)