Overview
Beyond the encoder-decoder loop, the DOOM Neuron system provides event-based feedback stimulation to biological neurons. This feedback acts as an auxiliary teaching signal, delivering reward/punishment information through dedicated neural channels based on game events and temporal-difference (TD) prediction errors.
Feedback Architecture
The feedback system operates in parallel to the main encoder-decoder loop:
┌─────────────────────────────────────────────────────────┐
│                     Training System                     │
│                                                         │
│  Game Events ──▶ Surprise Calculation ──▶ Feedback      │
│  (kills, damage)  (TD error scaling)      Commands      │
│                                              │          │
│                                              ▼          │
│                                     UDP Feedback Port   │
│                                         (12348)         │
└─────────────────────────────────────────────────────────┘
                            │
                            │ UDP packets
                            ▼
┌─────────────────────────────────────────────────────────┐
│                       CL1 Device                        │
│                                                         │
│                    Feedback Socket                      │
│                          │                              │
│                          ▼                              │
│                apply_feedback_command()                 │
│                          │                              │
│          ┌───────────────┼───────────────┐              │
│          ▼               ▼               ▼              │
│  Reward Channels   Event Channels    Interrupt          │
│   [19, 20, 22]      [35, 36, 38]    (stop stim)         │
│   [23, 24, 26]      [44, 47, 48]                        │
│                     [39, 40, 43]                        │
│                     [52, 54, 55]                        │
│                     [5, 6, 11]                          │
│                     [12, 15, 16]                        │
└─────────────────────────────────────────────────────────┘
Feedback Channel Types
Reward Feedback Channels
Dedicated channels for general positive/negative reward signals:
# From PPOConfig in training_server.py:182-183
reward_feedback_positive_channels = [19, 20, 22]  # 3 channels
reward_feedback_negative_channels = [23, 24, 26]  # 3 channels
Purpose: Provide binary reward/punishment feedback based on step-level rewards:
if reward > feedback_positive_threshold:  # Default: +1
    # Stimulate positive reward channels at 20 Hz, 2.0 μA
    send_feedback(channels=[19, 20, 22], frequency=20, amplitude=2.0)
elif reward < feedback_negative_threshold:  # Default: -1
    # Stimulate negative reward channels at 60 Hz, 2.0 μA
    send_feedback(channels=[23, 24, 26], frequency=60, amplitude=2.0)
Reward feedback channels are separate from encoding/action channels to avoid confounding the encoder-decoder learning. The decoder doesn’t read from these channels.
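This separation can also be checked programmatically. Below is a hypothetical sanity check; `assert_disjoint` and the encoding channel list are illustrative, not part of the project:

```python
def assert_disjoint(feedback_channels, encoding_channels):
    """Fail loudly if any feedback channel is also used for encoding/actions."""
    overlap = set(feedback_channels) & set(encoding_channels)
    if overlap:
        raise ValueError(f"feedback/encoding channel overlap: {sorted(overlap)}")

# Reward channels from the config vs. a hypothetical encoding channel set
assert_disjoint([19, 20, 22, 23, 24, 26], [0, 1, 2, 3, 4])
```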
Event Feedback Channels
Dedicated channels for specific game events with surprise-based scaling:
# From PPOConfig.event_feedback_settings in training_server.py:185-249
event_feedback_settings = {
    'enemy_kill': EventFeedbackConfig(
        channels=[35, 36, 38],
        base_frequency=20.0,
        base_amplitude=2.5,
        base_pulses=40,
        info_key='event_enemy_kill',
        td_sign='positive',  # Reward event
        freq_gain=0.20,
        freq_max_scale=2.5,
    ),
    'took_damage': EventFeedbackConfig(
        channels=[44, 47, 48],
        base_frequency=90.0,
        base_amplitude=2.2,
        base_pulses=50,
        info_key='event_took_damage',
        td_sign='negative',  # Punishment event
        unpredictable=True,  # Enable unpredictable stimulation
        unpredictable_frequency=5.0,
        unpredictable_duration_sec=4.0,
    ),
    'armor_pickup': EventFeedbackConfig(
        channels=[39, 40, 43],
        base_frequency=20.0,
        base_amplitude=2.0,
        base_pulses=35,
        td_sign='positive',
    ),
    # ... more events ...
}
Event Types:
All Event Feedback Channels
Positive Events (Rewards):
enemy_kill [35, 36, 38]: Agent eliminates an enemy
armor_pickup [39, 40, 43]: Agent collects armor item
approach_target [5, 6, 11]: Agent moves closer to enemy
Negative Events (Punishments):
took_damage [44, 47, 48]: Agent receives damage
ammo_waste [52, 54, 55]: Agent shoots without hitting
retreat_target [12, 15, 16]: Agent moves away from enemy
Surprise Scaling
Temporal-Difference (TD) Error
Feedback intensity is modulated by surprise - how unexpected the event was:
# Conceptual implementation
td_error = reward + gamma * next_value - current_value

# For positive events (kills, armor)
if td_sign == 'positive':
    surprise = max(0, td_error)   # Positive surprise only
# For negative events (damage, ammo waste)
elif td_sign == 'negative':
    surprise = max(0, -td_error)  # Negative surprise (unexpected bad outcome)
# For magnitude-based scaling
elif td_sign == 'absolute':
    surprise = abs(td_error)
TD error measures prediction error:
Positive TD: Event better than expected (surprising reward)
Negative TD: Event worse than expected (surprising punishment)
Zero TD: Event perfectly predicted (no surprise)
By scaling feedback with TD error, the system emphasizes unexpected outcomes that provide the most learning value.
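The sign rules above can be collapsed into one small helper. This is a minimal sketch: the `td_sign` values mirror the config fields shown earlier, but the function name is ours, not the project's:

```python
def surprise_from_td(td_error: float, td_sign: str) -> float:
    """Map a TD error to a non-negative surprise value per event polarity."""
    if td_sign == "positive":   # reward events: only better-than-expected counts
        return max(0.0, td_error)
    if td_sign == "negative":   # punishment events: only worse-than-expected counts
        return max(0.0, -td_error)
    return abs(td_error)        # 'absolute': magnitude of prediction error

# Example: an unexpected hit (negative TD error) on a punishment event
print(surprise_from_td(-3.0, "negative"))  # 3.0
print(surprise_from_td(-3.0, "positive"))  # 0.0
```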
Surprise-Scaled Parameters
Feedback stimulation parameters scale with surprise magnitude:
# From EventFeedbackConfig fields
class EventFeedbackConfig:
    # Base parameters (used when surprise = 0)
    base_frequency: float = 20.0  # Hz
    base_amplitude: float = 2.5   # μA
    base_pulses: int = 40         # Number of pulses

    # Surprise scaling gains
    freq_gain: float = 0.20   # How much surprise affects frequency
    amp_gain: float = 0.20    # How much surprise affects amplitude
    pulse_gain: float = 0.20  # How much surprise affects pulse count

    # Maximum scaling factors
    freq_max_scale: float = 2.5   # Max frequency multiplier
    amp_max_scale: float = 1.6    # Max amplitude multiplier
    pulse_max_scale: float = 2.5  # Max pulse count multiplier

# Scaling computation
freq_scale = 1.0 + min(freq_gain * surprise, freq_max_scale - 1.0)
amp_scale = 1.0 + min(amp_gain * surprise, amp_max_scale - 1.0)
pulse_scale = 1.0 + min(pulse_gain * surprise, pulse_max_scale - 1.0)

final_frequency = base_frequency * freq_scale
final_amplitude = base_amplitude * amp_scale
final_pulses = int(base_pulses * pulse_scale)
Example: Enemy kill at low vs. high surprise

# Low surprise (kill was expected)
td_error = 0.5
frequency = 20.0 * (1.0 + 0.20 * 0.5)  # = 22.0 Hz
amplitude = 2.5 * (1.0 + 0.20 * 0.5)   # = 2.75 μA
pulses = 40 * (1.0 + 0.20 * 0.5)       # = 44

# High surprise (unexpected kill)
td_error = 8.0
frequency = 20.0 * (1.0 + min(0.20 * 8.0, 1.5))  # = 20.0 * 2.5 = 50.0 Hz
amplitude = 2.5 * (1.0 + min(0.20 * 8.0, 0.6))   # = 2.5 * 1.6 = 4.0 μA
pulses = 40 * (1.0 + min(0.20 * 8.0, 1.5))       # = 40 * 2.5 = 100
Surprise scaling acts as a natural curriculum: Early in training, most events are surprising (high TD errors), producing strong feedback. As the value function improves, only genuinely unexpected events trigger strong feedback.
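The capped scaling formula can be wrapped in a helper that reproduces the worked numbers above. The function name is ours; the arithmetic follows the document's formulas:

```python
def scale_feedback(base: float, gain: float, max_scale: float, surprise: float) -> float:
    """Scale a base stimulation parameter by surprise, capped at max_scale."""
    return base * (1.0 + min(gain * surprise, max_scale - 1.0))

# Reproduce the high-surprise enemy_kill example (td_error = 8.0):
freq = scale_feedback(20.0, 0.20, 2.5, 8.0)       # 50.0 Hz (frequency cap hit)
amp = scale_feedback(2.5, 0.20, 1.6, 8.0)         # 4.0 μA (amplitude cap hit)
pulses = int(scale_feedback(40, 0.20, 2.5, 8.0))  # 100 pulses
```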
Exponential Moving Average (EMA)
To stabilize surprise estimates, TD errors are smoothed over time:
# From EventFeedbackConfig
ema_beta: float = 0.99  # Smoothing factor

# Update formula
ema_td_error = ema_beta * ema_td_error + (1 - ema_beta) * current_td_error
surprise = abs(ema_td_error)
EMA prevents single outlier TD errors from dominating feedback:
ema_beta = 0.99: Heavy smoothing, slow adaptation
ema_beta = 0.90: Faster adaptation to changing predictions
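The update rule is a one-liner; a minimal sketch showing how the two `ema_beta` settings react to a single outlier TD error:

```python
def ema_update(ema: float, td: float, beta: float = 0.99) -> float:
    """One EMA step: blend the running estimate with the newest TD error."""
    return beta * ema + (1.0 - beta) * td

# A single outlier (td = 10.0) barely moves the heavily smoothed estimate:
slow = ema_update(0.5, 10.0, beta=0.99)  # ≈ 0.595
# while a lighter beta adapts much faster:
fast = ema_update(0.5, 10.0, beta=0.90)  # ≈ 1.45
```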
Unpredictable Stimulation
Damage Aversion Learning
Certain negative events (like taking damage) use unpredictable stimulation to create aversion:
# From EventFeedbackConfig for 'took_damage'
unpredictable: bool = True
unpredictable_frequency: float = 5.0     # Hz (low frequency, irregular)
unpredictable_duration_sec: float = 4.0  # 4 seconds of stimulation
unpredictable_rest_sec: float = 4.0      # 4 seconds rest
unpredictable_channels: List[int] = [44, 47, 48]
unpredictable_amplitude: float = 2.2     # μA
Purpose: Create persistent, uncomfortable stimulation that the agent learns to avoid.
Mechanism:
1. Agent takes damage
2. Unpredictable stimulation is triggered on the damage channels
3. Low-frequency (5 Hz) irregular pulses for 4 seconds
4. Rest for 4 seconds
5. Repeat the pattern if damage continues
Unpredictable stimulation differs from regular feedback:
Duration: Lasts seconds, not milliseconds
Pattern: Irregular, low-frequency (harder for neurons to adapt to)
Channels: Same as event feedback but with different parameters
Goal: Aversion learning, not just event signaling
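The repeating on/off cycle can be sketched as a duty-cycle test against a start timestamp. This is a hypothetical helper; the actual controller runs on the CL1 side and may be structured differently:

```python
def unpredictable_active(start_time: float, now: float,
                         duration_sec: float = 4.0, rest_sec: float = 4.0) -> bool:
    """Return True while inside the 'stimulate' phase of the on/off cycle."""
    phase = (now - start_time) % (duration_sec + rest_sec)
    return phase < duration_sec

# 0-4 s: stimulate, 4-8 s: rest, 8-12 s: stimulate again, ...
assert unpredictable_active(0.0, 1.0)
assert not unpredictable_active(0.0, 5.0)
assert unpredictable_active(0.0, 9.0)
```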
Feedback Command Protocol
# From udp_protocol.py:235-309
def pack_feedback_command(
    feedback_type: str,   # "interrupt", "event", or "reward"
    channels: List[int],  # Channel numbers to stimulate
    frequency: int,       # Hz
    amplitude: float,     # μA
    pulses: int,          # Number of pulses
    unpredictable: bool,  # Unpredictable stimulation flag
    event_name: str       # Event identifier
) -> bytes:
    # Returns a 120-byte binary packet
Packet Structure (120 bytes total):
[8 bytes] timestamp (microseconds)
[1 byte] feedback_type (0=interrupt, 1=event, 2=reward)
[1 byte] num_channels
[64 bytes] channel array (0xFF padding for unused)
[4 bytes] frequency (int)
[4 bytes] amplitude (float)
[4 bytes] pulses (int)
[1 byte] unpredictable flag (0 or 1)
[32 bytes] event_name (null-padded string)
[1 byte] padding
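A layout like this can be packed with Python's struct module. Only the field sizes come from the document; the little-endian, unpadded format string and the exact field encoding below are our assumptions, not necessarily the project's wire format:

```python
import struct
import time

FEEDBACK_TYPES = {"interrupt": 0, "event": 1, "reward": 2}
_FMT = "<QBB64sifiB32sB"  # 8+1+1+64+4+4+4+1+32+1 = 120 bytes, little-endian

def pack_feedback(feedback_type, channels, frequency, amplitude, pulses,
                  unpredictable, event_name):
    """Serialize one feedback command into the 120-byte packet sketched above."""
    chan_bytes = bytes(channels) + b"\xff" * (64 - len(channels))  # 0xFF padding
    return struct.pack(
        _FMT,
        int(time.time() * 1e6),                   # timestamp (microseconds)
        FEEDBACK_TYPES[feedback_type],            # type code
        len(channels),                            # num_channels
        chan_bytes,                               # channel array
        frequency,                                # Hz (int)
        amplitude,                                # μA (float)
        pulses,                                   # pulse count (int)
        1 if unpredictable else 0,                # flag
        event_name.encode().ljust(32, b"\x00"),   # null-padded name
        0,                                        # trailing padding byte
    )

pkt = pack_feedback("event", [35, 36, 38], 50, 4.0, 100, False, "enemy_kill")
assert len(pkt) == 120
```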
Feedback Types
Type 0: Interrupt

feedback_type = "interrupt"
channels = [19, 20, 22, 23, 24, 26]  # All reward channels
frequency = 0
amplitude = 0
pulses = 0

Stops ongoing stimulation on specified channels. Used to clear feedback before new events.

Type 1: Event Feedback

feedback_type = "event"
channels = [35, 36, 38]  # Enemy kill channels
frequency = 50           # Hz (surprise-scaled)
amplitude = 4.0          # μA (surprise-scaled)
pulses = 100             # (surprise-scaled)
event_name = "enemy_kill"

Delivers event-specific feedback with surprise scaling.

Type 2: Reward Feedback

feedback_type = "reward"
channels = [19, 20, 22]  # Positive reward channels
frequency = 20           # Hz
amplitude = 2.0          # μA
pulses = 30
event_name = "positive_reward"

Delivers binary reward/punishment signals based on step rewards.
CL1 Feedback Application
Applying Feedback to Hardware
# From cl1_neural_interface.py:238-290
def apply_feedback_command(
    self,
    neurons: cl.Neurons,
    feedback_type: str,
    channels: list,
    frequency: int,
    amplitude: float,
    pulses: int,
    unpredictable: bool,
    event_name: str
):
    """Apply feedback stimulation to neural hardware."""
    # Handle interrupt command
    if feedback_type == "interrupt":
        if channels:
            channel_set = cl.ChannelSet(*channels)
            neurons.interrupt(channel_set)
        return

    # Skip invalid parameters
    if not channels or frequency <= 0 or amplitude <= 0:
        return

    # Create channel set
    channel_set = cl.ChannelSet(*channels)

    # Create stimulation design (cached)
    cache_key = (feedback_type, tuple(channels), frequency, round(amplitude, 4))

    def _factory():
        stim_design = cl.StimDesign(
            phase1_duration=120,
            phase1_amplitude=-amplitude,
            phase2_duration=120,
            phase2_amplitude=amplitude
        )
        burst_design = cl.BurstDesign(pulses, frequency)
        return (stim_design, burst_design)

    stim_design, burst_design = self._stim_cache.get_or_set(cache_key, _factory)

    # Apply stimulation
    neurons.stim(channel_set, stim_design, burst_design)
    self.feedback_commands_received += 1
Key Points:
Interrupt commands clear ongoing feedback
Event/reward feedback uses same biphasic pulse design as encoder
Stimulation designs are cached (LRU, maxsize=2048)
Non-blocking socket prevents loop stalls
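The `get_or_set` cache used above can be sketched as a small LRU built on OrderedDict. The interface matches the call in `apply_feedback_command`; the internals are illustrative, not the project's exact implementation:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache with a get_or_set interface, as used for stim designs."""
    def __init__(self, maxsize: int = 2048):
        self._data = OrderedDict()
        self._maxsize = maxsize

    def get_or_set(self, key, factory):
        if key in self._data:
            self._data.move_to_end(key)        # mark as most recently used
            return self._data[key]
        value = factory()                      # build the design only on a miss
        self._data[key] = value
        if len(self._data) > self._maxsize:
            self._data.popitem(last=False)     # evict the least-recently used
        return value
```

Caching matters here because `cl.StimDesign`/`cl.BurstDesign` objects are re-requested with identical parameters on every matching event; only the first request pays construction cost.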
Feedback Timing
Step-Level Feedback
Reward feedback is sent after each environment step :
# In training loop
for step in range(steps_per_update):
    # ... encoder, stimulation, spike collection, action ...
    reward, done, info = env.step(actions)

    # Send reward feedback
    if use_reward_feedback and not episode_only_feedback:
        if reward > feedback_positive_threshold:
            send_feedback_command(
                type="reward",
                channels=reward_feedback_positive_channels,
                frequency=feedback_positive_frequency,
                amplitude=feedback_positive_amplitude,
                pulses=feedback_positive_pulses
            )
        elif reward < feedback_negative_threshold:
            send_feedback_command(
                type="reward",
                channels=reward_feedback_negative_channels,
                frequency=feedback_negative_frequency,
                amplitude=feedback_negative_amplitude,
                pulses=feedback_negative_pulses
            )
Event-Level Feedback
Event feedback is sent when specific events occur:
# Check for events
for event_name, event_config in event_feedback_settings.items():
    if info[event_config.info_key] > 0:  # Event occurred
        # Compute surprise
        td_error = compute_td_error(reward, value, next_value)
        surprise = scale_surprise(td_error, event_config)

        # Scale feedback parameters
        freq = event_config.base_frequency * (1 + surprise * event_config.freq_gain)
        amp = event_config.base_amplitude * (1 + surprise * event_config.amp_gain)
        pulses = int(event_config.base_pulses * (1 + surprise * event_config.pulse_gain))

        # Send feedback
        send_feedback_command(
            type="event",
            channels=event_config.channels,
            frequency=freq,
            amplitude=amp,
            pulses=pulses,
            unpredictable=event_config.unpredictable,
            event_name=event_name
        )
Episode-Level Feedback
Optional feedback at episode end based on total episode performance:
if use_episode_feedback and done:
    if total_episode_reward > 0:
        # Positive episode outcome
        send_feedback_command(
            type="event",
            channels=event_feedback_settings['enemy_kill'].channels,
            frequency=feedback_episode_positive_frequency,
            amplitude=feedback_positive_amplitude,
            pulses=feedback_episode_positive_pulses
        )
    else:
        # Negative episode outcome
        send_feedback_command(
            type="event",
            channels=event_feedback_settings['took_damage'].channels,
            frequency=feedback_episode_negative_frequency,
            amplitude=feedback_negative_amplitude,
            pulses=feedback_episode_negative_pulses
        )
Configuration
Feedback Parameters
# From PPOConfig in training_server.py

# General feedback settings
use_reward_feedback: bool = True
use_episode_feedback: bool = True
episode_only_feedback: bool = False  # If True, skip step-level feedback
episode_feedback_surprise_scaling: bool = True

# Reward feedback thresholds
feedback_positive_threshold: float = 1.0
feedback_negative_threshold: float = -1.0

# Step-level reward feedback
feedback_positive_frequency: float = 20.0  # Hz
feedback_positive_amplitude: float = 2.0   # μA
feedback_positive_pulses: int = 30
feedback_negative_frequency: float = 60.0  # Hz
feedback_negative_amplitude: float = 2.0   # μA
feedback_negative_pulses: int = 90

# Episode-level feedback
feedback_episode_positive_frequency: float = 40.0
feedback_episode_positive_pulses: int = 80
feedback_episode_negative_frequency: float = 120.0
feedback_episode_negative_pulses: int = 160

# Surprise scaling
feedback_surprise_gain: float = 0.25
feedback_surprise_max_scale: float = 2.0
feedback_surprise_freq_gain: float = 0.65
feedback_surprise_amp_gain: float = 0.35
Start with conservative feedback parameters (low amplitude, low pulse counts) and gradually increase if neurons don’t respond. Excessive feedback can cause adaptation or desensitization.
Design Rationale
Why Separate Feedback Channels?
Avoid Confounding: The encoder-decoder loop learns from spike responses without reward information leaking in
Clear Attribution: Feedback channels explicitly signal reward, not game state
Biological Plausibility: Mimics reward pathways (dopamine, etc.) separate from sensory processing
Debugging: Feedback can be disabled without affecting encoder-decoder functionality
Why Surprise Scaling?
Learning Efficiency: Focus neural resources on unexpected events
Curriculum Learning: Automatic adjustment as the value function improves
Biological Relevance: Mimics prediction-error signals in animal brains
Sample Efficiency: Strong feedback when it matters most
Why Unpredictable Stimulation?
Aversion Learning: Irregular patterns are harder to adapt to, maintaining discomfort
Safety Incentive: Encourages damage-avoidance behaviors
Biological Realism: Pain responses in animals are persistent and irregular
Monitoring Feedback
The CL1 interface logs feedback commands:
# From cl1_neural_interface.py:454-455
if self.feedback_commands_received <= 5:
    print(f"[FEEDBACK] {feedback_type} on {len(channels)} channels: "
          f"{frequency} Hz, {amplitude} μA, {pulses} pulses ({event_name})")
Statistics:
Stats: 1000 ticks | Recv: 10.0 pkt/s | Send: 10.0 pkt/s | Events: 15 | Feedback: 42 | Avg spikes: 12.34/tick
Events: Episode metadata logged
Feedback: Total feedback commands processed
Avg spikes: Overall neural activity level
Future Directions
Adaptive Feedback Scaling
Automatically tune base_amplitude and base_frequency based on neural response
Detect and compensate for neural adaptation over time
Multi-Modal Feedback
Combine frequency/amplitude/pulse count scaling
Explore temporal patterns (bursts, ramps)
Channel-Specific Learning
Learn which channels are most effective for reward signaling
Adaptively allocate feedback across channel subsets
Closed-Loop Feedback
Adjust feedback based on decoder confidence
Reduce feedback when decoder is certain, increase when uncertain