Configuration structure
Configuration is organized into sections:

- [base] - Package and environment selection
- [vec] - Vectorization settings
- [env] - Environment-specific parameters
- [policy] - Policy network architecture
- [rnn] - Recurrent network wrapper settings
- [train] - Training hyperparameters
- [sweep] - Hyperparameter sweep configuration
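A minimal .ini skeleton illustrating this layout (the section names come from the list above; the keys and values shown are placeholders for illustration, not the actual defaults):

```ini
[base]
; Package and environment selection
package = atari

[vec]
; Vectorization settings
backend = Multiprocessing

[train]
; Training hyperparameters
learning_rate = 0.00025
```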
Loading configuration
From environment name
From custom file
Pass fill_in_default=False to use only your config without merging defaults.
Modifying in code
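In-code modification can be sketched with Python's standard configparser (a generic illustration of working with .ini configs; the section and key shown are examples, and the library's own loader may expose a different API):

```python
import configparser

# An illustrative config; in practice you would call
# config.read("my_config.ini") on a file instead
config = configparser.ConfigParser()
config.read_string("""
[train]
learning_rate = 0.00025
""")

# Override a training hyperparameter in code
config.set("train", "learning_rate", "0.001")
print(config.getfloat("train", "learning_rate"))
```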
Default configuration
Here’s the default configuration from config/default.ini:
Configuration options
Base section
Environment package name (e.g., ‘atari’, ‘procgen’, ‘ocean’).
Specific environment name within the package.
Policy class name to use (e.g., ‘Policy’, ‘Convolutional’, ‘ProcgenResnet’).
RNN wrapper class name (e.g., ‘LSTMWrapper’). Set to None to disable recurrence.
Vectorization section
Vectorization backend. Options: Serial, Multiprocessing, Ray, PufferEnv.
Number of parallel environment processes.
Number of worker threads. Set to auto to match num_envs.
Batch size for vectorized environments. Set to auto for automatic sizing.
Enable zero-copy shared memory for faster data transfer.
Random seed for environment initialization.
Training section
Core settings
Device for training. Options: cuda, cpu.
Total number of environment steps to train for.
Total batch size for training. Auto-calculated from num_envs * bptt_horizon.
Backpropagation through time horizon (sequence length).
Size of minibatches for gradient updates.
Maximum minibatch size before gradient accumulation kicks in.
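The batch-sizing relationships above can be checked with a little arithmetic (the numbers here are made up for illustration, not defaults):

```python
# Illustrative values
num_envs = 256
bptt_horizon = 64

# batch_size = auto resolves to num_envs * bptt_horizon
batch_size = num_envs * bptt_horizon
print(batch_size)  # 16384

# If batch_size exceeds max_minibatch_size, gradients are
# accumulated over several smaller minibatches
max_minibatch_size = 4096
accumulation_steps = batch_size // max_minibatch_size
print(accumulation_steps)  # 4
```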
Optimizer settings
Optimizer to use. Options: adam, muon.
Learning rate for optimizer.
Enable cosine learning rate annealing.
Minimum learning rate as a ratio of the initial learning rate.
Adam beta1 parameter (momentum).
Adam beta2 parameter (RMSprop).
Adam epsilon for numerical stability.
Maximum gradient norm for clipping.
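Cosine annealing down to a minimum-LR ratio, as described above, might look like the following (a generic sketch; the function and argument names are illustrative, not the actual config keys):

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr_ratio):
    """Anneal from base_lr down to base_lr * min_lr_ratio over total_steps."""
    min_lr = base_lr * min_lr_ratio
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(cosine_lr(0, 1000, 3e-4, 0.1))     # starts at base_lr
print(cosine_lr(1000, 1000, 3e-4, 0.1))  # ends at base_lr * min_lr_ratio
```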
PPO hyperparameters
Discount factor for rewards.
Lambda parameter for Generalized Advantage Estimation.
Number of epochs to update the policy per batch.
Clipping coefficient for PPO surrogate loss.
Coefficient for value function loss.
Clipping coefficient for value function.
Entropy coefficient for exploration.
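The discount factor and lambda combine in Generalized Advantage Estimation; a minimal reference implementation of the standard GAE recursion (not the library's actual code):

```python
def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Compute GAE advantages for a single trajectory."""
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error, then exponentially-weighted accumulation
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

adv = gae([1.0, 1.0], [0.5, 0.5], last_value=0.5)
print(adv)
```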
V-trace parameters
Clipping threshold for V-trace importance sampling ratio.
Clipping threshold for V-trace trace coefficient.
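These two thresholds clip the importance ratios that enter V-trace's correction terms; schematically (a simplified single-step sketch, not the full V-trace recursion):

```python
def vtrace_clipped_ratios(pi_prob, mu_prob, rho_clip=1.0, c_clip=1.0):
    """Clip the importance sampling ratio pi/mu at the two V-trace thresholds."""
    ratio = pi_prob / mu_prob
    rho = min(ratio, rho_clip)  # used in the TD-error term
    c = min(ratio, c_clip)      # used in the trace (bootstrapping) term
    return rho, c

rho, c = vtrace_clipped_ratios(0.9, 0.3)  # ratio 3.0 is clipped to 1.0
print(rho, c)
```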
Prioritization parameters
Prioritization exponent (0 = uniform, 1 = full prioritization).
Initial importance sampling correction factor.
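The exponent and importance-sampling correction interact as in standard prioritized sampling; a minimal sketch (the helper names are illustrative):

```python
def priority_probs(priorities, alpha):
    """alpha=0 gives uniform sampling; alpha=1 samples proportionally to priority."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]

def is_weights(probs, beta):
    """Importance-sampling correction, normalized by the largest weight."""
    n = len(probs)
    w = [(1.0 / (n * p)) ** beta for p in probs]
    max_w = max(w)
    return [x / max_w for x in w]

print(priority_probs([1.0, 3.0], alpha=0.0))  # uniform
print(priority_probs([1.0, 3.0], alpha=1.0))  # proportional
```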
Performance settings
Training precision. Options: float32, bfloat16.
Enable torch.compile for policy and sampling functions.
Torch compile mode. Options: default, reduce-overhead, max-autotune, max-autotune-no-cudagraphs.
Offload observation buffers to CPU to save GPU memory.
Checkpointing
Directory for saving checkpoints and logs.
Save checkpoint every N epochs.
Miscellaneous
Random seed for reproducibility.
Enable deterministic CUDA operations.
Command-line arguments
All configuration options can be overridden via command-line arguments:

- Section options: --section.key value
- Base options: --key value
- Underscores become hyphens: learning_rate → --train.learning-rate
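The naming rule above can be expressed as a small helper (the helper itself is illustrative; only the underscore-to-hyphen rule comes from the docs):

```python
def to_flag(section, key):
    """Turn a config entry into its command-line flag."""
    return "--%s.%s" % (section, key.replace("_", "-"))

print(to_flag("train", "learning_rate"))  # --train.learning-rate
```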
Creating custom configs
Create a custom .ini file:
my_config.ini
Auto values
Some parameters support auto for automatic configuration:

- batch_size = auto - Calculated as num_envs * bptt_horizon
- bptt_horizon = auto - Calculated as batch_size / num_envs
- num_workers = auto - Set to match num_envs
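The auto rules above can be summarized as a small resolver sketch (a hypothetical helper, not library code):

```python
def resolve_auto(num_envs, batch_size="auto", bptt_horizon="auto", num_workers="auto"):
    """Resolve 'auto' values using the rules listed above."""
    if batch_size == "auto" and bptt_horizon == "auto":
        raise ValueError("batch_size and bptt_horizon cannot both be auto")
    if batch_size == "auto":
        batch_size = num_envs * bptt_horizon
    if bptt_horizon == "auto":
        bptt_horizon = batch_size // num_envs
    if num_workers == "auto":
        num_workers = num_envs
    return batch_size, bptt_horizon, num_workers

print(resolve_auto(num_envs=128, bptt_horizon=16))  # (2048, 16, 128)
```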