Overview
DOOM Neuron automatically saves training checkpoints during PPO training. Checkpoints allow you to:
- Resume training after interruptions
- Evaluate trained policies in watch mode
- Transfer learning between scenarios
- Back up training progress
Checkpoint Location
By default, checkpoints are saved to:
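(The run-directory layout below is inferred from the paths used elsewhere in this guide; l5_2048_rand is the example run name used throughout, and the exact location depends on your configured checkpoint_dir.)

```
checkpoints/
└── l5_2048_rand/          # per-run directory (example name)
    ├── episode_1000.pt    # periodic checkpoints
    ├── episode_2000.pt
    ├── final_model.pt     # written on completion or shutdown
    └── logs/              # TensorBoard event files
```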
What’s Saved in Checkpoints
Each .pt file contains:
- Policy network weights (encoder + decoder)
- Value network weights
- Optimizer state (Adam parameters)
- Episode number
- Training statistics
- PPO configuration
Saving Checkpoints
Automatic Saving
Checkpoints are saved automatically during training (sketched below):
- Every N episodes (configured in code)
- When training completes (final_model.pt)
- On graceful shutdown (Ctrl+C)
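A minimal sketch of how this automatic saving behaves. The networks, helper, and save_every value here are illustrative, not DOOM Neuron's actual internals:

```python
import os
import torch
import torch.nn as nn

# Illustrative stand-ins -- DOOM Neuron's real networks and config differ.
policy = nn.Linear(8, 4)
value_net = nn.Linear(8, 1)
optimizer = torch.optim.Adam(
    list(policy.parameters()) + list(value_net.parameters())
)

def save_checkpoint(path, episode):
    """Bundle the items listed above into one pickled dictionary."""
    torch.save({
        "policy_state_dict": policy.state_dict(),        # encoder + decoder weights
        "value_state_dict": value_net.state_dict(),      # value network weights
        "optimizer_state_dict": optimizer.state_dict(),  # Adam parameters
        "episode": episode,
        "stats": {},   # training statistics would go here
        "config": {},  # PPO configuration would go here
    }, path)

os.makedirs("checkpoints", exist_ok=True)
save_every = 500  # "every N episodes" -- configured in code
episode = 0
try:
    for episode in range(1, 2001):
        # ... one PPO training episode would run here ...
        if episode % save_every == 0:
            save_checkpoint(f"checkpoints/episode_{episode}.pt", episode)
    save_checkpoint("checkpoints/final_model.pt", episode)  # training completed
except KeyboardInterrupt:
    save_checkpoint("checkpoints/final_model.pt", episode)  # graceful Ctrl+C shutdown
```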
Manual Backup
To preserve a checkpoint directory:
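(A plain copy or archive is enough; the destination paths below are illustrative.)

```bash
# Copy the run directory with a date stamp
cp -r checkpoints/l5_2048_rand checkpoints/l5_2048_rand_$(date +%Y%m%d)

# Or pack it into a single archive
tar -czf l5_2048_rand_backup.tar.gz checkpoints/l5_2048_rand
```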
Loading Checkpoints
Load checkpoints to resume training or run inference.
Resume Training
Resume from a specific checkpoint:
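(The train.py entry point below is a placeholder for your actual training script; the --checkpoint flag matches the one shown in Troubleshooting.)

```bash
python train.py --checkpoint checkpoints/episode_5000.pt
```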
This will:
- Restore policy, value, and optimizer states
- Continue from episode 5000
- Preserve learning rate schedule
- Use existing TensorBoard logs
Watch Mode (Inference)
Run a trained policy without further learning:
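(This assumes the entry point exposes a watch-mode flag; both names below are placeholders, so check your script's --help.)

```bash
python train.py --checkpoint checkpoints/final_model.pt --watch
```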
- Runs trainer.train() without policy updates
- Policy network frozen (no gradient updates)
- Useful for evaluation and demonstration
- Still requires CL1 interface running
Advanced Scenarios
Transfer Learning Between Scenarios
Load a checkpoint trained on one scenario and continue on another.
Train on Easier Scenario
Start with deadly_corridor_1.cfg and train until convergence, producing checkpoints/corridor_stage1/final_model.pt.
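(For example; the entry-point name and the checkpoint-directory option are placeholders for your actual script.)

```bash
# PPOConfig.doom_config set to "deadly_corridor_1.cfg" in code
python train.py --checkpoint-dir checkpoints/corridor_stage1
```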
Resume on Harder Scenario
Load the stage 1 checkpoint and continue on stage 2. Update PPOConfig.doom_config in code to "deadly_corridor_2.cfg" before running.
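A command sketch (the entry point is again a placeholder; --checkpoint matches the flag used in Troubleshooting):

```bash
# PPOConfig.doom_config must already point at "deadly_corridor_2.cfg"
python train.py --checkpoint checkpoints/corridor_stage1/final_model.pt
```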
Checkpoint Comparison
Evaluate multiple checkpoints to find the best performer:
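(One lightweight approach is to read the training statistics stored in each checkpoint; the key names follow the hypothetical layout shown in Checkpoint File Format below.)

```python
import glob
import torch

# Print the metadata stored in each periodic checkpoint.
for path in sorted(glob.glob("checkpoints/episode_*.pt")):
    ckpt = torch.load(path, map_location="cpu")
    print(f"{path}: episode={ckpt.get('episode')} stats={ckpt.get('stats')}")
```

For a definitive comparison, run each checkpoint in watch mode and average the episode reward over a fixed number of evaluation episodes.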
Backup Strategy
Recommended backup workflow for long training runs:
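(A sketch combining the commands used elsewhere in this guide; the destination paths and backup host are placeholders.)

```bash
# 1. Periodically archive the run directory with a timestamp
tar -czf backups/l5_2048_rand_$(date +%Y%m%d).tar.gz checkpoints/l5_2048_rand

# 2. Compress older periodic checkpoints in place to reclaim space
gzip checkpoints/l5_2048_rand/episode_*.pt

# 3. Mirror archives to external storage
rsync -av backups/ user@backup-host:/srv/doom-neuron-backups/
```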
Checkpoint File Format
Checkpoint files are PyTorch .pt files (pickled dictionaries):
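(A plausible sketch of the dictionary layout, reconstructed from the What’s Saved in Checkpoints list above; the key names are hypothetical.)

```python
{
    "policy_state_dict":    ...,   # encoder + decoder weights
    "value_state_dict":     ...,   # value network weights
    "optimizer_state_dict": ...,   # Adam parameters
    "episode":              5000,  # episode number at save time
    "stats":                ...,   # training statistics
    "config":               ...,   # PPO configuration
}
```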
Inspecting Checkpoints
Load and inspect checkpoint contents:
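(Loading with map_location='cpu' works on machines without a GPU; the key names follow the hypothetical layout above.)

```python
import torch

# map_location="cpu" lets you inspect GPU-trained checkpoints anywhere
ckpt = torch.load("checkpoints/episode_5000.pt", map_location="cpu")

print("keys:", list(ckpt.keys()))
print("episode:", ckpt.get("episode"))

# Dump tensor shapes from the policy network
for name, tensor in ckpt["policy_state_dict"].items():
    print(f"  {name}: {tuple(tensor.shape)}")
```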
Troubleshooting
Checkpoint Not Found
Error: FileNotFoundError: checkpoints/episode_5000.pt
Solutions:
- Verify the file exists: ls -lh checkpoints/
- Use an absolute path: --checkpoint /home/user/doom-neuron/checkpoints/episode_5000.pt
- Check for typos in the filename
Incompatible Checkpoint
Error: RuntimeError: Error loading state_dict
Solutions:
- The checkpoint may be from a different model architecture
- Ensure PPOConfig matches the checkpoint configuration
- Check PyTorch version compatibility
- Try loading with map_location='cpu' first
Missing TensorBoard Logs
Symptom: TensorBoard shows no data after resuming
Solutions:
- Ensure checkpoint_dir matches the original training directory
- TensorBoard logs are in checkpoints/l5_2048_rand/logs/
- Point --logdir at the correct logs directory
- New logs append to existing event files
Disk Space Issues
Symptom: Training fails with a disk-full error
Solutions:
- Monitor disk usage: df -h
- Remove old checkpoints: rm checkpoints/episode_*.pt
- Keep only periodic checkpoints (every 1000 episodes)
- Compress old checkpoints: gzip checkpoints/episode_*.pt
- Use external storage for long-term backups
Output Files Summary
Training outputs:
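(File names below follow the examples used throughout this guide, under the run directory checkpoints/l5_2048_rand/.)
- episode_N.pt: periodic checkpoints, saved every N episodes
- final_model.pt: written on training completion or graceful Ctrl+C shutdown
- logs/: TensorBoard event files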
Best Practices
Save Frequently
Configure checkpoint frequency based on training duration:
- Short runs (under 1000 episodes): Save every 100 episodes
- Long runs (over 10000 episodes): Save every 500-1000 episodes
Keep Multiple Versions
Don’t rely on a single checkpoint:
- Keep last 3-5 periodic checkpoints
- Save milestone checkpoints (curriculum completions)
- Back up final_model.pt before new training runs
Next Steps
- Learn about DOOM scenarios to train across different challenges
- Set up remote training for production checkpoint workflows
- Monitor training with TensorBoard: tensorboard --logdir checkpoints/l5_2048_rand/logs