PPOConfig
The PPOConfig dataclass defines all configuration parameters for PPO reinforcement learning training and for the CL1 neural hardware interface.
Environment Configuration
Path to the VizDoom configuration file that defines the game scenario
Screen resolution for the DOOM game buffer. Valid values are VizDoom resolution constants such as RES_320X240 and RES_640X480
Whether to enable the screen buffer for visual observations
Maximum absolute degrees for TURN_LEFT_RIGHT_DELTA action
Discrete turn step size in degrees when using turn buttons
Initial standard deviation (in degrees) for camera delta distribution
Toggle for single categorical action space vs. combinatorial action space
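The environment options above can be pictured as a small config fragment. This is a minimal sketch; the field names and default values are assumptions for illustration, not the actual PPOConfig attribute names.

```python
from dataclasses import dataclass

@dataclass
class EnvConfig:
    # Hypothetical names mirroring the environment options described above.
    scenario_path: str = "scenarios/basic.cfg"  # VizDoom .cfg scenario file
    screen_resolution: str = "RES_320X240"      # VizDoom resolution constant
    use_screen_buffer: bool = True              # visual observations on/off
    max_turn_delta_deg: float = 45.0            # cap for TURN_LEFT_RIGHT_DELTA
    turn_step_deg: float = 15.0                 # discrete turn step size
    camera_delta_std_deg: float = 10.0          # initial camera-delta std
    single_action_space: bool = True            # categorical vs. combinatorial

cfg = EnvConfig()
```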
Neural Interface - Channel Configuration
Total number of channels available on the CL1 hardware
Channel indices used for encoding game state into stimulation patterns
Channel indices that decode to forward movement actions
Channel indices that decode to backward movement actions
Channel indices that decode to strafe left actions
Channel indices that decode to strafe right actions
Channel indices that decode to turn left actions
Channel indices that decode to turn right actions
Channel indices that decode to attack/fire actions
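A channel layout like the one above partitions the hardware into encoding channels and per-action decoding groups. The sketch below uses a hypothetical 64-channel split with illustrative indices (not the actual CL1 mapping) and shows the sanity checks such a layout should pass: groups disjoint, indices in range, no encode/decode overlap.

```python
# Hypothetical channel assignment; indices and group sizes are illustrative.
num_channels = 64
encode_channels = list(range(0, 32))  # game state -> stimulation patterns
decode_forward  = [32, 33, 34, 35]
decode_backward = [36, 37, 38, 39]
decode_strafe_l = [40, 41, 42, 43]
decode_strafe_r = [44, 45, 46, 47]
decode_turn_l   = [48, 49, 50, 51]
decode_turn_r   = [52, 53, 54, 55]
decode_attack   = [56, 57, 58, 59]

decode_groups = [decode_forward, decode_backward, decode_strafe_l,
                 decode_strafe_r, decode_turn_l, decode_turn_r, decode_attack]

# Sanity checks: action groups must be disjoint, in range, and must not
# overlap the encoding channels.
all_decode = [ch for group in decode_groups for ch in group]
assert len(all_decode) == len(set(all_decode)), "decode groups overlap"
assert all(0 <= ch < num_channels for ch in all_decode + encode_channels)
assert not set(all_decode) & set(encode_channels), "encode/decode overlap"
```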
Stimulation Design Parameters
Parameters used to construct cl.StimDesign objects for biphasic electrical stimulation.
Duration of the first (negative) phase in microseconds (μs)
Duration of the second (positive) phase in microseconds (μs)
Minimum stimulation amplitude in microamps (μA). Used as the magnitude for phase1 (negative)
Maximum stimulation amplitude in microamps (μA). Used as the magnitude for phase2 (positive)
Burst Design Parameters
Parameters used to construct cl.BurstDesign objects that define stimulation frequency and duration.
Minimum burst frequency in Hertz (Hz)
Maximum burst frequency in Hertz (Hz)
Number of pulses per burst. Set to 500 so that stimulation spans the full interval between game ticks
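The stim and burst parameters above are related by a simple duration check: a burst of N pulses at f Hz lasts roughly N / f seconds, and with 500 pulses it comfortably outlasts the interval between game ticks. The numeric values below are illustrative, not the shipped defaults; only the relationships come from the descriptions above.

```python
# Illustrative stimulation parameters (not the actual defaults).
phase1_duration_us = 100.0  # first (negative) phase, microseconds
phase2_duration_us = 100.0  # second (positive) phase, microseconds
stim_amp_min_ua = 1.0       # magnitude used for phase 1 (negative)
stim_amp_max_ua = 2.0       # magnitude used for phase 2 (positive)

burst_freq_hz = 20.0        # pulses per second within a burst
pulses_per_burst = 500

# A burst of N pulses at f Hz lasts about N / f seconds, so with 500 pulses
# the burst outlives the gap between game ticks and stimulation is
# effectively continuous until the next tick issues a new burst.
burst_duration_s = pulses_per_burst / burst_freq_hz
tick_interval_s = 1.0 / 35.0  # VizDoom runs at 35 tics per second
assert burst_duration_s > tick_interval_s
```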
PPO Hyperparameters
Learning rate for the Adam optimizer
Discount factor for future rewards
Lambda parameter for Generalized Advantage Estimation (GAE)
Clipping parameter for PPO policy updates
Coefficient for value function loss in the total loss
Coefficient for entropy bonus to encourage exploration
Maximum gradient norm for gradient clipping. Can be reduced to 1 or 0.5 for more conservative updates
Whether to normalize returns for critic training. Stabilizes the critic but may affect learning dynamics
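To make the roles of the hyperparameters above concrete, here is a minimal single-sample sketch of the PPO objective they feed into: the clipped surrogate policy loss, a squared-error value loss weighted by the value coefficient, and an entropy bonus weighted by the entropy coefficient. The numeric values are illustrative.

```python
import math

# Illustrative hyperparameter values (see the descriptions above).
clip_eps = 0.2       # PPO clipping parameter
value_coef = 0.5     # value-function loss coefficient
entropy_coef = 0.01  # entropy-bonus coefficient

def ppo_loss(log_prob_new, log_prob_old, advantage, value, value_target, entropy):
    # Probability ratio between the new and old policies.
    ratio = math.exp(log_prob_new - log_prob_old)
    # Clip the ratio into [1 - eps, 1 + eps] and take the pessimistic surrogate.
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    policy_loss = -min(ratio * advantage, clipped * advantage)
    value_loss = (value - value_target) ** 2
    # Entropy is subtracted: higher entropy lowers the loss, encouraging exploration.
    return policy_loss + value_coef * value_loss - entropy_coef * entropy

# With identical old/new log-probs the ratio is 1 and the surrogate
# reduces to -advantage.
loss = ppo_loss(0.0, 0.0, advantage=1.0, value=0.0, value_target=0.0, entropy=0.0)
```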
Training Configuration
Number of parallel environments to run
Number of environment steps to collect per policy update (per environment)
Minibatch size for PPO updates
Number of epochs to train on each batch of collected experience
Maximum number of episodes to train for
Whether to use CL1 hardware or run in simulation mode
Network Architecture
Hidden layer size for encoder, decoder, and value networks
Logging and Checkpointing
Directory for TensorBoard logs
Directory for saving model checkpoints
Save checkpoint every N episodes
Evaluate policy every N episodes
Reward Shaping
Reward threshold above which positive feedback is triggered. Requires tuning
Reward threshold below which negative feedback is triggered. Requires tuning
Terminal reward bonus for armor-related objectives. Requires tuning
Gain multiplier for aim alignment reward shaping
Maximum distance at which aim alignment reward is computed
Bonus reward for accurate aim alignment
Angle threshold in degrees for aim alignment bonus
Scaling factor for movement velocity rewards
Use simplified reward function. When false, the reward additionally includes the manually shaped aim-alignment and velocity terms
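One way the aim-alignment parameters above could combine is sketched below: a dense term scaled by the gain, gated by the maximum distance, plus a sparse bonus when the aim angle falls within the threshold. The formula and values are assumptions for illustration, not the project's exact shaping function.

```python
# Illustrative shaping parameters (names are assumptions).
aim_gain = 0.1                 # gain multiplier for alignment reward
aim_max_distance = 300.0       # beyond this, no alignment reward
aim_bonus = 1.0                # sparse bonus for accurate aim
aim_angle_threshold_deg = 5.0  # angular window for the bonus

def aim_alignment_reward(angle_to_enemy_deg, distance):
    if distance > aim_max_distance:
        return 0.0  # enemy too far: no shaping signal
    # Dense term: grows as the crosshair approaches the enemy.
    reward = aim_gain * max(0.0, 1.0 - abs(angle_to_enemy_deg) / 180.0)
    # Sparse bonus when aim is within the angular threshold.
    if abs(angle_to_enemy_deg) <= aim_angle_threshold_deg:
        reward += aim_bonus
    return reward
```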
Feedback Stimulation - Step-level Rewards
Stimulation amplitude (μA) for positive step-level feedback
Stimulation frequency (Hz) for positive step-level feedback
Number of pulses for positive step-level feedback
Stimulation amplitude (μA) for negative step-level feedback
Stimulation frequency (Hz) for negative step-level feedback
Number of pulses for negative step-level feedback
Feedback Stimulation - Episode-level Rewards
Number of pulses for positive episode-level feedback
Stimulation frequency (Hz) for positive episode-level feedback
Number of pulses for negative episode-level feedback
Stimulation frequency (Hz) for negative episode-level feedback
Whether to provide feedback only at episode end (disables step-level feedback)
Reward Feedback Channels
Whether to enable reward-based feedback stimulation
Channel indices for positive reward feedback
Channel indices for negative reward feedback
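Putting the feedback options together, step-level feedback might be gated and routed as sketched below: disabled entirely, suppressed when episode-only feedback is selected, or routed to the positive or negative channel group by comparing the step reward against the thresholds. Flag names, channel indices, and thresholds are assumptions for illustration.

```python
# Hypothetical feedback routing state (names and values are assumptions).
reward_feedback_enabled = True
episode_feedback_only = False
positive_feedback_channels = [60, 61]
negative_feedback_channels = [62, 63]

def step_feedback(reward, positive_threshold=0.5, negative_threshold=-0.5):
    """Return (channels, polarity) for a step reward, or None for no feedback."""
    if not reward_feedback_enabled or episode_feedback_only:
        return None  # feedback disabled, or deferred to episode end
    if reward > positive_threshold:
        return positive_feedback_channels, "positive"
    if reward < negative_threshold:
        return negative_feedback_channels, "negative"
    return None  # reward inside the dead zone: no stimulation
```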
Event-based Feedback
Distance threshold for triggering movement-based events
Dictionary mapping event names to EventFeedbackConfig objects. Contains configurations for:
enemy_kill: Feedback when killing an enemy
armor_pickup: Feedback when collecting armor
took_damage: Feedback when taking damage
ammo_waste: Feedback when wasting ammo
approach_target: Feedback when moving closer to the target
retreat_target: Feedback when moving away from the target
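The event dictionary above might look like the following. The shape of EventFeedbackConfig shown here (channels, amplitude, frequency, pulse count) is an assumption modeled on the stimulation parameters elsewhere in this config; the real class and values may differ.

```python
from dataclasses import dataclass

@dataclass
class EventFeedbackConfig:
    # Hypothetical fields; the real EventFeedbackConfig may differ.
    channels: list        # feedback channels to stimulate
    amplitude_ua: float   # stimulation amplitude in microamps
    frequency_hz: float   # burst frequency in Hz
    num_pulses: int       # pulses per feedback burst

event_feedback = {
    "enemy_kill":      EventFeedbackConfig([60, 61], 2.0, 50.0, 10),
    "armor_pickup":    EventFeedbackConfig([60, 61], 1.5, 40.0, 10),
    "took_damage":     EventFeedbackConfig([62, 63], 2.0, 5.0, 10),
    "ammo_waste":      EventFeedbackConfig([62, 63], 1.0, 5.0, 5),
    "approach_target": EventFeedbackConfig([60, 61], 1.0, 30.0, 5),
    "retreat_target":  EventFeedbackConfig([62, 63], 1.0, 5.0, 5),
}
```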
Decoder Configuration
Whether to enforce non-negative weights in decoder linear readout heads. Experimental - requires testing
Whether to freeze decoder weights during training. Experimental - requires testing
Whether to zero out decoder biases. Recommended to be true - bias can cause decoder to generate its own predictions
Whether to use MLP decoder instead of linear readout. Prefer false - MLP can learn to play the game instead of relying on neural responses
Hidden layer size for the MLP decoder. Only used if decoder_use_mlp is true. Experimental value - requires testing
L2 regularization coefficient for decoder weights. Untuned
L2 regularization coefficient for decoder biases. Untuned
Ablation mode for testing decoder behavior. Valid values:
none: Normal operation
random: Replace spike features with random values
zero: Replace spike features with zeros
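The three ablation modes amount to a simple substitution on the decoder's input features, sketched below (function name is illustrative): pass the spike features through unchanged, zero them out, or replace them with random noise as a control for whether the decoder relies on the neural signal at all.

```python
import random

def apply_ablation(spike_features, mode="none"):
    """Replace decoder input features according to the ablation mode."""
    if mode == "none":
        return spike_features                             # normal operation
    if mode == "zero":
        return [0.0] * len(spike_features)                # remove neural signal
    if mode == "random":
        return [random.random() for _ in spike_features]  # noise control
    raise ValueError(f"unknown ablation mode: {mode}")
```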
Wall Detection
Number of raycasts for wall detection. Probably less necessary with CNN encoder
Maximum range for wall detection raycasts. Keep as is
Maximum depth distance for wall detection normalization. Already calibrated - keep as is
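The wall-detection raycasts plausibly feed the policy as clipped, normalized depth values, along the lines of the sketch below. The clip-then-scale formula and the numeric constants are assumptions for illustration.

```python
# Hypothetical wall-detection constants (values are illustrative).
num_raycasts = 16
max_raycast_range = 500.0  # rays are clipped to this distance
wall_depth_norm = 500.0    # normalization constant for depth features

def normalize_wall_depths(depths):
    # Clip each ray's hit distance to the maximum range, then scale to [0, 1].
    return [min(d, max_raycast_range) / wall_depth_norm for d in depths]
```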
Encoder Configuration
Whether encoder network weights are trainable. Recommended to be true for reasonable PPO policy gradients, especially with decoder_use_mlp: false
Entropy penalty coefficient for the encoder (uses Beta distribution sampling)
Whether to use a CNN for visual feature extraction. Testing shows the CNN does not overfit or learn the task on its own, so keeping this true is useful
Number of base channels for CNN encoder. Arbitrary value - can be adjusted
Downsampling factor for CNN input. Arbitrary value - can be adjusted
Episode Feedback Events
Specific event name to use for positive episode feedback. If None, uses overall episode reward
Specific event name to use for negative episode feedback. If None, uses overall episode reward
Surprise-based Feedback Scaling
Gain for surprise-based feedback scaling. Tune as needed based on neuron responses
Maximum scaling factor for surprise-based feedback. Tune as needed based on neuron responses
Frequency gain for surprise-based feedback. Tune as needed based on neuron responses
Amplitude gain for surprise-based feedback. Tune as needed based on neuron responses
Maximum frequency scale for surprise-based feedback. Tune as needed based on neuron responses
Maximum amplitude scale for surprise-based feedback. Tune as needed based on neuron responses
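One plausible way the surprise gains and caps above could interact is sketched below: a surprise score scaled by the overall gain and capped, which then inflates the feedback frequency and amplitude up to their respective maxima. The formula is an assumption; only the parameter roles come from the descriptions above.

```python
# Hypothetical surprise-scaling parameters (values are illustrative).
surprise_gain = 1.0
surprise_max_scale = 3.0
surprise_freq_gain = 1.0
surprise_amp_gain = 0.5
surprise_max_freq_scale = 2.0
surprise_max_amp_scale = 1.5

def scaled_feedback(base_freq_hz, base_amp_ua, surprise):
    # Overall surprise score, scaled by the gain and capped.
    s = min(surprise_gain * surprise, surprise_max_scale)
    # Per-dimension scaling, each with its own gain and ceiling.
    freq_scale = min(1.0 + surprise_freq_gain * s, surprise_max_freq_scale)
    amp_scale = min(1.0 + surprise_amp_gain * s, surprise_max_amp_scale)
    return base_freq_hz * freq_scale, base_amp_ua * amp_scale
```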
Distance Normalization
Normalization constant for enemy distance features. Already calibrated - do not change