Overview
The PPOConfig dataclass defines all hyperparameters for Proximal Policy Optimization (PPO) training. These parameters control the learning dynamics, advantage estimation, clipping behavior, and training batch configuration.
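The parameters described in the sections below can be pictured as one dataclass. This is a sketch rather than the actual definition: field names other than value_loss_coef, entropy_coef, num_envs, and steps_per_update are assumptions, and all defaults are illustrative.

```python
from dataclasses import dataclass

@dataclass
class PPOConfig:
    # Field names marked "assumed" are not confirmed by the docs; defaults are illustrative.
    learning_rate: float = 3e-4      # Adam step size for all networks (assumed name)
    gamma: float = 0.99              # reward discount factor (assumed name)
    gae_lambda: float = 0.95         # GAE bias-variance tradeoff (assumed name)
    clip_epsilon: float = 0.2        # PPO policy-ratio clip range (assumed name)
    value_loss_coef: float = 0.5     # weight of value loss in the total loss
    entropy_coef: float = 0.01       # weight of the entropy bonus
    max_grad_norm: float = 0.5       # global gradient-norm clip (assumed name)
    normalize_returns: bool = True   # advantage/return normalization (assumed name)
    num_envs: int = 1                # parallel environments for data collection
    steps_per_update: int = 2048     # env steps collected per PPO update
    minibatch_size: int = 64         # SGD minibatch size (assumed name)
    ppo_epochs: int = 4              # optimization epochs per update (assumed name)
    max_episodes: int = 10_000       # training termination bound (assumed name)
```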
Core PPO Hyperparameters
Learning Rate
Adam optimizer learning rate for all network components (encoder, decoder, value network). Controls the step size for gradient descent updates. Lower values provide more stable but slower learning.
Discount Factor (Gamma)
Reward discount factor for future rewards. Determines how much the agent values future rewards versus immediate rewards. Values closer to 1.0 prioritize long-term planning.
In training_server.py, this is increased to 0.997 for longer-horizon planning in DOOM survival scenarios.
Generalized Advantage Estimation (GAE)
Lambda parameter for GAE advantage estimation. Controls the bias-variance tradeoff in advantage estimation:
1.0 = high variance, low bias (uses full Monte Carlo returns)
0.0 = low variance, high bias (uses only 1-step TD)
0.95 = recommended balanced setting
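The lambda tradeoff above follows from how GAE accumulates TD errors backwards through a trajectory. The sketch below shows the standard recurrence for a single trajectory; it ignores episode-termination (done) flags, which a full implementation must handle.

```python
def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.

    rewards: r_t for each step; values: critic predictions V(s_t);
    last_value: bootstrap value V(s_T) for the state after the last step.
    lam=1.0 recovers full Monte Carlo returns; lam=0.0 recovers 1-step TD.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        # One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future deltas, decayed by gamma * lam.
        gae = delta + gamma * lam * gae
        advantages[t] = gae
        next_value = values[t]
    # Returns used as value-function targets: A_t + V(s_t).
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```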
Clipping and Loss Coefficients
PPO policy clipping parameter. Limits the size of policy updates to prevent destructively large changes. The policy ratio is clipped to [1 - epsilon, 1 + epsilon].
Coefficient for the value function loss in the total loss calculation. Total loss = policy_loss + value_loss_coef * value_loss + entropy_coef * entropy_loss
Entropy bonus coefficient for policy exploration. Encourages exploration by penalizing overly deterministic policies. Higher values increase randomness in action selection.
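The clipping and loss terms above combine as in the following per-sample sketch (scalar arithmetic for clarity; real code operates on batched tensors, and the function name is illustrative):

```python
import math

def ppo_losses(logp_new, logp_old, advantage, value_pred, ret,
               clip_epsilon=0.2, value_loss_coef=0.5, entropy_coef=0.01,
               entropy=0.0):
    """Per-sample PPO loss terms for one transition."""
    # Probability ratio pi_new(a|s) / pi_old(a|s), computed from log-probs.
    ratio = math.exp(logp_new - logp_old)
    # Clip the ratio to [1 - epsilon, 1 + epsilon].
    clipped = max(min(ratio, 1.0 + clip_epsilon), 1.0 - clip_epsilon)
    # Pessimistic (min) of clipped vs unclipped surrogate, negated for descent.
    policy_loss = -min(ratio * advantage, clipped * advantage)
    # Squared error between value prediction and return target.
    value_loss = (value_pred - ret) ** 2
    # entropy_loss is negative entropy, so minimizing it encourages exploration.
    entropy_loss = -entropy
    total = policy_loss + value_loss_coef * value_loss + entropy_coef * entropy_loss
    return total, policy_loss, value_loss
```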
Gradient Clipping
Maximum gradient norm for gradient clipping. Prevents exploding gradients by clipping the global norm of the gradients to this value.
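Global-norm clipping rescales all gradients jointly rather than clipping each element. A minimal sketch over a flat list of gradient values, mirroring what torch.nn.utils.clip_grad_norm_ does across all parameters:

```python
import math

def clip_grad_norm(grads, max_norm):
    """Rescale gradients so their global L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the pre-clip norm.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        # All gradients are scaled by the same factor, preserving direction.
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm
```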
Return Normalization
Whether to normalize advantage estimates and returns. Normalizes advantages to have zero mean and unit variance, which can stabilize training.
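The per-batch normalization can be sketched as follows (function name illustrative; the epsilon guards against division by zero for constant advantages):

```python
def normalize_advantages(advantages, eps=1e-8):
    """Shift and scale a batch of advantages to zero mean, unit variance."""
    n = len(advantages)
    mean = sum(advantages) / n
    var = sum((a - mean) ** 2 for a in advantages) / n
    std = var ** 0.5
    return [(a - mean) / (std + eps) for a in advantages]
```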
In ppo_doom.py: “Leave this on for the most part, stabilizes the critic, maybe a running norm would be better?” In training_server.py: set to False as part of DOOM Initial Report tuning.
Training Configuration
Batch and Episode Settings
Number of parallel environments for data collection. Currently only a single environment is supported due to hardware constraints.
Number of environment steps collected before each PPO update. Total samples per update = num_envs × steps_per_update
Minibatch size for SGD updates during PPO epochs. The collected steps_per_update samples are divided into minibatches of this size for optimization.
Number of optimization epochs per PPO update. How many times to iterate over the collected batch of experience. More epochs can improve sample efficiency but risk overfitting to old data.
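The epoch/minibatch interaction above amounts to reshuffling the collected batch each epoch and slicing it into minibatches. A sketch of the index bookkeeping (parameter names other than steps_per_update follow this page's terminology but are assumptions about the actual field names):

```python
import random

def minibatch_indices(steps_per_update, minibatch_size, ppo_epochs, seed=0):
    """Yield index lists: ppo_epochs shuffled passes over the collected batch."""
    rng = random.Random(seed)
    indices = list(range(steps_per_update))
    for _ in range(ppo_epochs):
        # Reshuffle each epoch so minibatches differ between passes.
        rng.shuffle(indices)
        for start in range(0, steps_per_update, minibatch_size):
            yield indices[start:start + minibatch_size]
```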
Maximum number of training episodes before termination.
Example Configurations
Conservative Training
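A conservative setup follows the guidance above: a lower learning rate for slower but more stable learning, and a smaller clip range for smaller policy updates. Field names and values here are illustrative assumptions, not taken from the codebase.

```python
# Illustrative conservative overrides (assumed field names and values).
conservative = dict(
    learning_rate=1e-4,   # lower step size: more stable, slower learning
    clip_epsilon=0.1,     # tighter clipping: smaller policy updates
    ppo_epochs=3,         # fewer passes: less overfitting to stale data
)
```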
Aggressive Exploration
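For exploration-heavy runs, the main lever per the description above is the entropy bonus: higher values increase randomness in action selection. Again, names and values are illustrative assumptions.

```python
# Illustrative exploration-heavy overrides (assumed field names and values).
aggressive = dict(
    entropy_coef=0.05,   # stronger entropy bonus: more random action selection
    clip_epsilon=0.3,    # looser clipping: allows larger policy updates
)
```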
Long-Horizon Survival (DOOM)
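The two values this page attributes to the DOOM setup are a discount factor raised to 0.997 (for longer-horizon planning, per training_server.py) and normalization disabled (per the DOOM Initial Report tuning). The field names are assumptions:

```python
# DOOM survival overrides; values are from this page, field names assumed.
doom_survival = dict(
    gamma=0.997,              # longer-horizon planning (training_server.py)
    normalize_returns=False,  # disabled in DOOM Initial Report tuning
)
```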
Related Configuration
Encoder/Decoder
Network architecture settings
Action Spaces
Action space configuration