PufferLib’s vectorization layer runs multiple environment instances in parallel, dramatically increasing training throughput. The pufferlib.vector.make() function creates vectorized environments with different backends optimized for various use cases.
Vector backends
PufferLib provides three vectorization backends: Serial, Multiprocessing, and Ray. The simplest is Serial:
# Single-process execution
vecenv = pufferlib.vector.make(
    env_creator,
    num_envs=4,
    backend=pufferlib.vector.Serial
)
Serial backend
The Serial backend runs all environments sequentially in a single process. It’s useful for:
- Debugging environment implementations
- Development on machines with limited cores
- Environments with low computational cost
- Testing before scaling to multiprocessing
examples/vectorization.py
import pufferlib.vector

serial_vecenv = pufferlib.vector.make(
    SamplePufferEnv,
    num_envs=2,
    backend=pufferlib.vector.Serial
)

observations, infos = serial_vecenv.reset()
actions = serial_vecenv.action_space.sample()
o, r, d, t, i = serial_vecenv.step(actions)
Serial implementation details
The Serial backend creates a list of environment instances and steps them sequentially:
class Serial:
    def __init__(self, env_creators, env_args, env_kwargs, num_envs, buf=None, seed=0, **kwargs):
        self.driver_env = env_creators[0](*env_args[0], **env_kwargs[0])
        self.agents_per_batch = self.driver_env.num_agents * num_envs

        # Pre-allocate shared buffers
        set_buffers(self, buf)

        # Create environments with buffer slices
        self.envs = []
        ptr = 0
        for i in range(num_envs):
            end = ptr + self.driver_env.num_agents
            buf_i = dict(
                observations=self.observations[ptr:end],
                rewards=self.rewards[ptr:end],
                terminals=self.terminals[ptr:end],
                truncations=self.truncations[ptr:end],
                masks=self.masks[ptr:end],
                actions=self.actions[ptr:end]
            )
            env = env_creators[i](*env_args[i], buf=buf_i, **env_kwargs[i])
            self.envs.append(env)
            ptr = end
Even in Serial mode, environments write to shared buffers, making it easy to switch backends without code changes.
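The buffer-slice pattern can be illustrated with plain NumPy (a standalone sketch, not PufferLib code): each "environment" receives a view into one shared parent array, so whatever it writes appears in the parent buffer without any copying.

```python
import numpy as np

num_envs, agents_per_env, obs_dim = 2, 3, 4

# Parent-owned buffer covering every agent across all environments
observations = np.zeros((num_envs * agents_per_env, obs_dim), dtype=np.float32)

# Hand each environment a slice (a view, not a copy) of the parent buffer
env_views = [
    observations[i * agents_per_env:(i + 1) * agents_per_env]
    for i in range(num_envs)
]

# An environment writing into its view updates the shared buffer directly
env_views[1][:] = 1.0
print(observations[agents_per_env:].sum())  # 12.0: all of env 1's slots were set
```

Because each slice is a view, the parent never copies observations out of the environments; this is what makes switching from Serial to Multiprocessing transparent.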
Multiprocessing backend
The Multiprocessing backend is the workhorse of PufferLib. It runs environments in parallel across CPU cores with zero-copy shared memory:
vecenv = pufferlib.vector.make(
    env_creator,
    num_envs=128,    # Total number of environments
    num_workers=8,   # Number of worker processes
    batch_size=32,   # Environments per training batch
    backend=pufferlib.vector.Multiprocessing
)
Key parameters
- num_envs: Total number of environment instances to run
- num_workers: Number of parallel worker processes (typically equal to the number of CPU cores)
- batch_size: Number of environments to collect before returning data
- zero_copy: Enable zero-copy mode (requires num_envs % batch_size == 0)
- overwork: Allow num_workers > cpu_cores (disabled by default)
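The divisibility constraint on zero_copy is easy to check before constructing the vecenv. A small sketch with a hypothetical helper (not part of PufferLib):

```python
def zero_copy_ok(num_envs: int, batch_size: int) -> bool:
    """Zero-copy mode requires batches to map onto contiguous blocks of
    environments, which means num_envs must divide evenly into batches."""
    return num_envs % batch_size == 0

print(zero_copy_ok(128, 32))  # True: 128 = 4 * 32
print(zero_copy_ok(128, 48))  # False: 128 is not divisible by 48
```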
Shared memory architecture
Multiprocessing uses shared memory buffers to avoid data serialization:
from multiprocessing import RawArray

self.shm = dict(
    observations=RawArray(obs_ctype, num_agents * int(np.prod(obs_shape))),
    actions=RawArray(atn_ctype, num_agents * int(np.prod(atn_shape))),
    rewards=RawArray('f', num_agents),
    terminals=RawArray('b', num_agents),
    truncateds=RawArray('b', num_agents),
    masks=RawArray('b', num_agents),
    semaphores=RawArray('c', num_workers),
    notify=RawArray('b', num_workers),
)
Worker processes access these buffers directly:
buf = dict(
    observations=np.ndarray((*shape, *obs_shape),
        dtype=obs_dtype, buffer=shm['observations'])[worker_idx],
    rewards=np.ndarray(shape, dtype=np.float32, buffer=shm['rewards'])[worker_idx],
    # ...
)
Shared memory eliminates serialization overhead. Data written by workers is instantly visible to the main process without copying.
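The mechanism can be seen in a single process: a NumPy array built over a multiprocessing.RawArray is a view of the raw memory, so a write through one view is immediately visible through any other. A minimal sketch of the idea (not PufferLib's actual buffer layout):

```python
from multiprocessing import RawArray

import numpy as np

num_agents = 4
shm = RawArray('f', num_agents)  # raw float32 block, no locks

# Two independent NumPy views over the same underlying memory
writer = np.ndarray((num_agents,), dtype=np.float32, buffer=shm)
reader = np.ndarray((num_agents,), dtype=np.float32, buffer=shm)

writer[:] = [1.0, 2.0, 3.0, 4.0]
print(reader.sum())  # 10.0 -- the write is visible without any copy
```

When the RawArray is passed to a child process at fork time, the same property holds across process boundaries, which is what removes the serialization step.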
Synchronization modes
PufferLib supports three synchronization strategies:
1. Full sync (batch_size = num_envs)
Wait for all workers before returning data:
vecenv = pufferlib.vector.make(
    env_creator,
    num_envs=128,
    num_workers=8,
    batch_size=128,  # Same as num_envs
    backend=pufferlib.vector.Multiprocessing
)
Pros: Predictable timing, easy to reason about
Cons: Slowest worker determines throughput
2. Partial sync (zero_copy=True)
Wait for contiguous blocks of workers:
vecenv = pufferlib.vector.make(
    env_creator,
    num_envs=128,
    num_workers=8,
    batch_size=32,
    zero_copy=True,  # Requires num_envs % batch_size == 0
    backend=pufferlib.vector.Multiprocessing
)
Pros: Lower latency than full sync, zero-copy efficiency
Cons: Still waits for contiguous worker blocks
3. Full async (zero_copy=False)
Return data from any available workers:
vecenv = pufferlib.vector.make(
    env_creator,
    num_envs=128,
    num_workers=8,
    batch_size=32,
    zero_copy=False,  # Allow non-contiguous workers
    backend=pufferlib.vector.Multiprocessing
)
Pros: Minimum latency, maximum throughput
Cons: Small copy overhead for non-contiguous data
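Why full sync pays a straggler penalty while async does not can be sketched with concurrent.futures (a standalone illustration with made-up worker step times, not PufferLib internals):

```python
import concurrent.futures as cf
import time

def worker_step(env_id, step_time):
    time.sleep(step_time)  # stand-in for an environment step
    return env_id

step_times = [0.005, 0.005, 0.005, 0.05]  # one straggler worker

# Full sync: wait for every worker -> bounded by the slowest (~0.05s)
with cf.ThreadPoolExecutor(max_workers=4) as pool:
    start = time.perf_counter()
    futures = [pool.submit(worker_step, i, t) for i, t in enumerate(step_times)]
    cf.wait(futures, return_when=cf.ALL_COMPLETED)
    full_sync = time.perf_counter() - start

# Async: take whichever worker finishes first (~0.005s)
with cf.ThreadPoolExecutor(max_workers=4) as pool:
    start = time.perf_counter()
    futures = [pool.submit(worker_step, i, t) for i, t in enumerate(step_times)]
    first = next(cf.as_completed(futures))
    async_latency = time.perf_counter() - start

print(full_sync > async_latency)  # True: full sync pays for the straggler
```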
Async API
Multiprocessing supports an async API for maximum control:
examples/vectorization.py
vecenv = pufferlib.vector.make(
    SamplePufferEnv,
    num_envs=2,
    num_workers=2,
    batch_size=1,
    backend=pufferlib.vector.Multiprocessing
)

# Async reset
vecenv.async_reset()
o, r, d, t, i, env_ids, masks = vecenv.recv()

# Async step
actions = vecenv.action_space.sample()
vecenv.send(actions)

# Do other work here (e.g., policy inference)
# while environments run in the background

# Get results when ready
o, r, d, t, i, env_ids, masks = vecenv.recv()
The async API returns additional data:
- env_ids: Which environments produced this batch
- masks: Which agents are active (for variable-agent environments)
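For variable-agent environments, masks lets you drop inactive agent slots before computing statistics or losses. A minimal NumPy sketch of how a returned mask might be applied (the shapes and values are illustrative):

```python
import numpy as np

# Example batch from recv(): 4 agent slots, one currently inactive
rewards = np.array([1.0, 0.5, -1.0, 2.0], dtype=np.float32)
masks = np.array([1, 1, 0, 1], dtype=bool)

# Keep only transitions from active agents
active_rewards = rewards[masks]
print(active_rewards)        # only active agents' rewards remain
print(active_rewards.mean()) # statistics over active agents only
```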
Here’s when to use each backend:

Serial

Use when:
- Debugging environment code
- Environments are very fast (< 0.1 ms per step)
- Single-core machines
- Development and testing

Performance:
- No parallelism overhead
- Easy to profile and debug
- Step time grows linearly with num_envs

Multiprocessing

Use when:
- Environments take > 1 ms per step
- A multi-core CPU is available
- Maximum throughput is needed
- Production training

Performance:
- Near-linear scaling up to the number of physical cores
- Roughly 10-100x speedup on 8-16 core machines
- Zero-copy mode minimizes overhead
- Best for CPU-bound environments

Ray

Use when:
- Distributed training across machines
- Very expensive environments (> 100 ms per step)
- Cluster resources are available
- Horizontal scaling is needed

Performance:
- Scales across machines
- Higher overhead than Multiprocessing
- Best for expensive simulations
Passing arguments to environments
You can pass arguments to environment constructors in several ways:
Same arguments for all environments
examples/vectorization.py
vecenv = pufferlib.vector.make(
    SamplePufferEnv,
    num_envs=2,
    backend=pufferlib.vector.Serial,
    env_args=[3],           # Positional args
    env_kwargs={'bar': 4}   # Keyword args
)
Different arguments per environment
examples/vectorization.py
vecenv = pufferlib.vector.make(
    [SamplePufferEnv, SamplePufferEnv],  # List of creators
    num_envs=2,
    backend=pufferlib.vector.Serial,
    env_args=[[3], [4]],                 # Different args per env
    env_kwargs=[{'bar': 4}, {'bar': 5}]  # Different kwargs per env
)
Autotune
PufferLib includes an autotune function to find optimal vectorization parameters:
configs = pufferlib.vector.autotune(
    env_creator,
    batch_size=128,
    max_envs=256,
    time_per_test=5
)
Autotune profiles your environment and tests different configurations to find:
- Optimal num_envs
- Best num_workers setting
- Whether zero_copy helps
- Expected throughput (steps per second)
Run autotune once per environment to determine the best configuration for your hardware. Results vary based on environment complexity and CPU architecture.
Common patterns
Maximizing throughput
import psutil

vecenv = pufferlib.vector.make(
    env_creator,
    num_envs=128,
    num_workers=psutil.cpu_count(logical=False),  # Physical cores
    batch_size=128,
    zero_copy=True,
    backend=pufferlib.vector.Multiprocessing
)
Minimizing latency
vecenv = pufferlib.vector.make(
    env_creator,
    num_envs=32,
    num_workers=8,
    batch_size=8,     # Small batches
    zero_copy=False,  # Full async
    backend=pufferlib.vector.Multiprocessing
)
Development and debugging
vecenv = pufferlib.vector.make(
    env_creator,
    num_envs=1,
    backend=pufferlib.vector.Serial  # Easy to debug
)