Overview
The motion tracking controller is a reinforcement learning agent that learns to track kinematic reference motions in physics simulation. It uses Proximal Policy Optimization (PPO) with DeepMimic-style rewards in Isaac Gym.
Key Components:
- PPO algorithm with GAE
- DeepMimic tracking rewards
- Isaac Gym GPU-accelerated physics
- Parallel training across thousands of environments
- Adaptive motion weighting
Source: PARC/motion_tracker/learning/dm_ppo_agent.py
Architecture
Environment: DeepMimic
Simulation Setup
Platform: NVIDIA Isaac Gym
Physics: Position-Based Dynamics (PBD)
Timestep: 60Hz (0.0167s)
Parallel envs: Typically 4000-16000
Source: PARC/motion_tracker/envs/ig_parkour/dm_env.py:19-93
Character Control
The simulated character uses PD (Proportional-Derivative) controllers:
# Agent outputs target joint positions
target_dof_pos = policy(observation)
# PD controller computes torques
kp = 1000.0 # Proportional gain
kd = 100.0 # Derivative gain
error_pos = target_dof_pos - current_dof_pos
error_vel = -current_dof_vel # Target velocity is 0
torque = kp * error_pos + kd * error_vel
torque = clamp(torque, -torque_limit, torque_limit)
Torques are applied to the joints at each simulation step.
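As a minimal, framework-free sketch of this loop (plain Python, one scalar joint; the gains mirror the pseudocode above, and the torque limit value is illustrative):

```python
def pd_torque(target_pos, pos, vel, kp=1000.0, kd=100.0, torque_limit=300.0):
    """Torque from a PD controller tracking a target position (target velocity 0)."""
    torque = kp * (target_pos - pos) + kd * (0.0 - vel)
    # Clamp to the actuator's torque limit
    return max(-torque_limit, min(torque_limit, torque))

# A small position error yields a proportional torque
print(pd_torque(0.125, 0.0, 0.0))  # 125.0
# A large error saturates at the torque limit
print(pd_torque(1.0, 0.0, 0.0))    # 300.0
```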
Terrain Layout
Environments are arranged in a 2D grid to avoid numerical issues:
num_envs = num_motions * terrains_per_motion
# Grid dimensions
num_envs_x = ceil(sqrt(num_envs))
num_envs_y = ceil(sqrt(num_envs))
# Each environment has offset
env_offset_x = (env_id % num_envs_x) * terrain_width
env_offset_y = (env_id // num_envs_x) * terrain_height
Terrains are loaded from motion files and converted to Isaac Gym trimeshes.
Source: PARC/motion_tracker/envs/ig_parkour/dm_env.py:182-199
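The grid placement above can be sketched in a few lines (plain Python; `terrain_width` and `terrain_height` stand in for the per-environment terrain extents):

```python
import math

def env_origin(env_id, num_envs, terrain_width, terrain_height):
    """Origin of environment env_id on a near-square 2D grid."""
    num_envs_x = math.ceil(math.sqrt(num_envs))
    x = (env_id % num_envs_x) * terrain_width
    y = (env_id // num_envs_x) * terrain_height
    return x, y

# 10 envs round up to a 4-wide grid; env 5 lands in column 1, row 1
print(env_origin(5, 10, 8.0, 6.0))  # (8.0, 6.0)
```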
Characters
Two characters per environment:
- Simulated character (index 0) - Controlled by agent, physics-based
- Reference character (index 1) - Kinematic, plays back reference motion
The reference character is rendered transparent/ghosted for visualization.
Source: PARC/motion_tracker/envs/ig_parkour/dm_env.py:16-17
Observation Space
The agent observes:
obs_components = [
# Character state
"root_pos", # 3D (relative to ref)
"root_rot", # 4D quaternion
"root_vel", # 3D linear velocity
"root_ang_vel", # 3D angular velocity
"dof_pos", # N_dof joint positions
"dof_vel", # N_dof joint velocities
# Reference motion (target)
"ref_root_pos", # 3D
"ref_root_rot", # 4D quaternion
"ref_dof_pos", # N_dof
"ref_body_pos", # N_body x 3D (from FK)
"ref_body_rot", # N_body x 4D quaternions
# Terrain
"heightmap", # H x W grid
# Contacts
"ref_contacts", # N_body binary labels
]
Character-state observations are expressed relative to the reference (sim - ref) to make learning easier.
Observation Normalization
Running mean/std normalization:
class Normalizer:
    def update(self, obs):
        # EMA of observation statistics (updated during rollout,
        # not on every normalize call)
        self.mean = 0.99 * self.mean + 0.01 * obs.mean()
        self.std = 0.99 * self.std + 0.01 * obs.std()

    def normalize(self, obs):
        return (obs - self.mean) / (self.std + 1e-5)
Some observations excluded from normalization:
- Heightmaps (already in [-1, 1])
- Contact labels (binary)
Source: PARC/motion_tracker/learning/dm_ppo_agent.py:49-86
Observations are clipped to [-5, 5] after normalization to prevent extreme values from destabilizing training.
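A self-contained scalar sketch of this scheme, combining EMA statistics with post-normalization clipping (the 0.01 EMA weight and [-5, 5] clip follow the values above; the real agent operates on batched tensors):

```python
class RunningNormalizer:
    def __init__(self, alpha=0.01, clip=5.0, eps=1e-5):
        self.alpha, self.clip, self.eps = alpha, clip, eps
        self.mean, self.std = 0.0, 1.0

    def update(self, batch):
        """EMA update of mean/std from a batch of scalar observations."""
        m = sum(batch) / len(batch)
        s = (sum((x - m) ** 2 for x in batch) / len(batch)) ** 0.5
        self.mean = (1 - self.alpha) * self.mean + self.alpha * m
        self.std = (1 - self.alpha) * self.std + self.alpha * s

    def normalize(self, obs):
        z = (obs - self.mean) / (self.std + self.eps)
        # Clip to [-clip, clip] so outliers cannot destabilize training
        return max(-self.clip, min(self.clip, z))

norm = RunningNormalizer()
# An extreme observation is clipped to the [-5, 5] range
print(norm.normalize(100.0))  # 5.0
```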
Reward Function
DeepMimic Rewards
Reward is a weighted sum of exponential tracking terms (each term in [0, 1]; the weights below sum to 1):
reward = (
    w_root_pos * exp(-k_pos * ||sim_root - ref_root||^2) +
    w_root_rot * exp(-k_rot * angle_diff(sim_rot, ref_rot)^2) +
    w_body_pos * exp(-k_body * mean(||sim_body - ref_body||^2)) +
    w_body_rot * exp(-k_body_rot * mean(angle_diff(sim_body_rot, ref_body_rot)^2)) +
    w_dof_vel * exp(-k_vel * ||sim_vel - ref_vel||^2) +
    w_end_eff * exp(-k_eff * mean(||sim_eff - ref_eff||^2))
)
Reward Weights
reward_weights:
w_root_pos: 0.3
w_root_rot: 0.2
w_body_pos: 0.3
w_body_rot: 0.1
w_dof_vel: 0.05
w_end_eff: 0.05
Reward Scales
reward_scales:
k_pos: 10.0 # Position error -> reward
k_rot: 5.0 # Rotation error -> reward
k_body: 20.0 # Body pos error -> reward
k_vel: 0.1 # Velocity error -> reward
A higher scale makes the term decay faster, i.e. more sensitive to tracking error.
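To make the scales concrete, here is the exponential kernel every term uses, evaluated with k_pos = 10.0 from the table above:

```python
import math

def tracking_term(error_sq, k):
    """Exponential tracking kernel: 1.0 at zero error, decaying with scale k."""
    return math.exp(-k * error_sq)

# With k_pos = 10.0, a 10 cm root error still earns ~0.90,
# while a 50 cm error earns only ~0.08
print(round(tracking_term(0.10 ** 2, 10.0), 2))  # 0.9
print(round(tracking_term(0.50 ** 2, 10.0), 2))  # 0.08
```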
Early Termination
An episode ends early if any of the following holds:
# Tracking error too high
root_pos_error > 1.0 # 1 meter
root_rot_error > pi/2 # 90 degrees
body_pos_error > 0.5 # Per body
# Character falls
root_height < 0.3 # Below threshold
# Motion complete
motion_time >= motion_length
Early termination prevents learning invalid behaviors.
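The checks above can be bundled into one function (a plain-Python sketch; the argument names are illustrative, not the env's actual API):

```python
import math

def check_termination(root_pos_err, root_rot_err, body_pos_err,
                      root_height, motion_time, motion_length):
    """Return (done, reason) using the thresholds listed above."""
    if root_pos_err > 1.0:
        return True, "root position error"
    if root_rot_err > math.pi / 2:
        return True, "root rotation error"
    if body_pos_err > 0.5:
        return True, "body position error"
    if root_height < 0.3:
        return True, "fall"
    if motion_time >= motion_length:
        return True, "motion complete"
    return False, ""

# A character that has dropped below the height threshold
print(check_termination(0.2, 0.1, 0.1, 0.25, 1.0, 3.0))  # (True, 'fall')
```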
PPO Algorithm
Class: DMPPOAgent
Source: PARC/motion_tracker/learning/dm_ppo_agent.py:17-363
Hyperparameters
# Rollout
rollout_length: 8 # Steps per rollout
num_envs: 4096 # Parallel environments
# Optimization
mini_batch_size: 4096 # Samples per mini-batch
num_epochs: 5 # Epochs per update
learning_rate: 0.0001
clip_epsilon: 0.2 # PPO clipping
# GAE (Generalized Advantage Estimation)
discount: 0.99 # Gamma
td_lambda: 0.95 # Lambda for GAE
# Normalization
norm_adv_clip: 5.0 # Advantage clipping
norm_obs_clip: 5.0 # Observation clipping
Policy and Value Networks
class DMPPOModel:
def __init__(self):
# Policy network
self.policy = MLP([
obs_dim,
1024,
512,
512,
action_dim * 2 # Mean and log_std
])
# Value network (can be shared or separate)
self.value = MLP([
obs_dim,
1024,
512,
512,
1 # State value
])
Action distribution is diagonal Gaussian:
mean, log_std = policy(obs).chunk(2, dim=-1)
std = exp(log_std)
action_dist = Normal(mean, std)
action = action_dist.sample()
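PPO's probability ratio is computed from log-probabilities under this Gaussian; a per-dimension sketch (for the full diagonal policy, sum over action dimensions):

```python
import math

def gaussian_log_prob(action, mean, log_std):
    """Log-density of a 1-D Gaussian N(mean, exp(log_std)^2)."""
    std = math.exp(log_std)
    return (-0.5 * ((action - mean) / std) ** 2
            - log_std - 0.5 * math.log(2.0 * math.pi))

# Density is highest at the mean; one standard deviation away,
# the log-probability drops by exactly 0.5
lp_mean = gaussian_log_prob(0.0, 0.0, 0.0)
lp_one_std = gaussian_log_prob(1.0, 0.0, 0.0)
print(round(lp_mean - lp_one_std, 3))  # 0.5
```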
Training Loop
for iteration in range(max_iterations):
    # Rollout phase
    for step in range(rollout_length):
        action = policy.sample(obs)
        next_obs, reward, done, info = env.step(action)
        buffer.store(obs, action, reward, done, value)
        obs = next_obs
    # Compute advantages with GAE
    advantages = compute_gae(rewards, values, dones, gamma, lam)
    # Normalize advantages
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    # Update phase
    for epoch in range(num_epochs):
        for batch in buffer.mini_batches(mini_batch_size):
            # Probability ratio from the stored log-probs
            ratio = exp(new_log_prob - old_log_prob)
            # Compute policy loss (clipped)
            clipped_ratio = clamp(ratio, 1 - eps, 1 + eps)
            policy_loss = -min(ratio * adv, clipped_ratio * adv)
            # Compute value loss
            value_loss = (value - target_value)^2
            # Entropy bonus for exploration
            entropy_loss = -entropy(action_dist)
            # Total loss
            loss = policy_loss + 0.5 * value_loss + 0.01 * entropy_loss
            # Update
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
Source: PARC/motion_tracker/learning/ppo_agent.py
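A single-sample worked example of the clipped surrogate (plain Python; in training this is computed per sample and averaged over each mini-batch):

```python
def clipped_policy_loss(ratio, advantage, eps=0.2):
    """PPO clipped surrogate loss for one sample."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return -min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: pushing the ratio beyond 1 + eps gains nothing
print(clipped_policy_loss(1.5, 1.0))   # -1.2
# Negative advantage: min() keeps the more pessimistic branch
print(clipped_policy_loss(0.5, -1.0))  # 0.8
```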
GAE (Generalized Advantage Estimation)
def compute_gae(rewards, values, dones, gamma, lam):
    # values has length T+1: the last entry bootstraps the final state
    advantages = []
    gae = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - dones[t]
        # TD error (the next value is zeroed at episode boundaries)
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        # GAE recursion (the carry-over is also reset at episode boundaries)
        gae = delta + gamma * lam * not_done * gae
        advantages.insert(0, gae)
    return advantages
(The parameter is named lam because lambda is a reserved word in Python.)
Source: PARC/motion_tracker/learning/rl_util.py
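A hand-worked check of the recursion on a two-step rollout (illustrative numbers; `values` carries one extra bootstrap entry for the final state):

```python
gamma, lam = 0.99, 0.95
rewards = [1.0, 1.0]
values = [0.5, 0.6, 0.7]   # last entry bootstraps the final state
# No episode boundary in this rollout, so nothing is zeroed

# t = 1: plain TD error, no carry-over yet
delta1 = rewards[1] + gamma * values[2] - values[1]   # 1.093
gae1 = delta1
# t = 0: TD error plus the discounted carry-over gamma * lam * gae1
delta0 = rewards[0] + gamma * values[1] - values[0]   # 1.094
gae0 = delta0 + gamma * lam * gae1                    # ~2.122

print(round(gae1, 3), round(gae0, 3))  # 1.093 2.122
```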
Experience Buffer
Stores rollout data:
class ExperienceBuffer:
buffers = {
"obs": (rollout_length, num_envs, obs_dim),
"action": (rollout_length, num_envs, action_dim),
"reward": (rollout_length, num_envs),
"done": (rollout_length, num_envs),
"value": (rollout_length, num_envs),
"log_prob": (rollout_length, num_envs),
# Computed after rollout
"advantage": (rollout_length, num_envs),
"target_value": (rollout_length, num_envs),
}
Source: PARC/motion_tracker/learning/dm_ppo_agent.py:239-264
Adaptive Motion Weighting
Failure Rate Tracking
Each motion has a tracked failure rate:
motion_id_fail_rates = torch.ones(num_motions) # Initialize to 1.0
# After each episode
if done:
if success:
fail_rate_update = 0.0
else:
fail_rate_update = 1.0
# Exponential moving average
ema_weight = 0.01
motion_id_fail_rates[motion_id] = (
(1 - ema_weight) * motion_id_fail_rates[motion_id] +
ema_weight * fail_rate_update
)
Source: PARC/motion_tracker/envs/ig_parkour/dm_env.py:86-90
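A quick numeric illustration of how gradually the 0.01 EMA weight moves the estimate (plain Python):

```python
def update_fail_rate(rate, failed, ema_weight=0.01):
    """EMA step toward 1.0 on failure, 0.0 on success."""
    return (1.0 - ema_weight) * rate + ema_weight * (1.0 if failed else 0.0)

# A motion initialized at fail rate 1.0 that then always succeeds:
# after 100 successful episodes the estimate is 0.99^100 ~ 0.366
rate = 1.0
for _ in range(100):
    rate = update_fail_rate(rate, failed=False)
print(round(rate, 3))  # 0.366
```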
Sampling Weights
Motions with higher fail rates are sampled more often:
fail_rate_quantiles = [0.1, 0.3, 0.5, 0.7]  # Thresholds
# Assign weights based on quantile
for motion_id in range(num_motions):
    fail_rate = motion_id_fail_rates[motion_id]
    if fail_rate < fail_rate_quantiles[0]:
        weight = 0.1  # Easy motion - low weight
    elif fail_rate < fail_rate_quantiles[1]:
        weight = 0.3
    elif fail_rate < fail_rate_quantiles[2]:
        weight = 0.5
    elif fail_rate < fail_rate_quantiles[3]:
        weight = 0.7
    else:
        weight = 1.0  # Hard motion - high weight
    sampling_weights[motion_id] = weight
Focuses training on difficult motions.
Source: PARC/motion_tracker/envs/ig_parkour/dm_env.py:32-33
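Putting the pieces together, a sketch of quantile-based weighting followed by weighted motion sampling (plain Python; `random.choices` stands in for the env's GPU-side sampler):

```python
import random

def assign_weight(fail_rate, thresholds=(0.1, 0.3, 0.5, 0.7),
                  weights=(0.1, 0.3, 0.5, 0.7, 1.0)):
    """Map a motion's fail rate to a sampling weight via fixed thresholds."""
    for t, w in zip(thresholds, weights):
        if fail_rate < t:
            return w
    return weights[-1]

fail_rates = [0.05, 0.4, 0.9]                  # easy, medium, hard motion
sampling_weights = [assign_weight(r) for r in fail_rates]
print(sampling_weights)                         # [0.1, 0.5, 1.0]

# The hard motion is drawn roughly 10x as often as the easy one
random.seed(0)
picks = random.choices(range(3), weights=sampling_weights, k=1000)
print(picks.count(2) > picks.count(0))          # True
```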
Weight Persistence
Fail rates saved to checkpoint:
# Save
torch.save(motion_id_fail_rates.cpu(), "fail_rates_iter.pt")
# Load for next iteration
motion_id_fail_rates = torch.load("fail_rates_iter.pt").to(device)
Source: PARC/motion_tracker/learning/dm_ppo_agent.py:360-362
Training Details
Curriculum Learning
Start with easier motions and gradually add harder ones:
- Iteration 0: Small dataset of clean reference motions
- Iteration 1: Add some generated motions (easier terrain)
- Iteration 2+: Progressively harder terrain and longer motions
Testing During Training
Periodic evaluation:
if iter % iters_per_test == 0:
policy.eval()
test_info = test_model(num_test_episodes=100)
metrics = {
"mean_return": test_info["mean_return"],
"mean_ep_len": test_info["mean_ep_len"],
"root_pos_err": test_info["root_pos_err"],
"root_rot_err": test_info["root_rot_err"],
"body_pos_err": test_info["body_pos_err"],
}
logger.log(metrics)
policy.train()
Source: PARC/motion_tracker/learning/dm_ppo_agent.py:94-180
Checkpointing
if iter % iters_per_checkpoint == 0:
checkpoint = {
"model_state_dict": model.state_dict(),
"optimizer_state_dict": optimizer.state_dict(),
"obs_normalizer": obs_norm.state_dict(),
"iter": iter,
}
torch.save(checkpoint, f"model_{iter}.pt")
Source: PARC/motion_tracker/learning/ppo_agent.py
Tracking Error Metrics
Detailed error tracking:
class TrackingErrorTracker:
def update(self, error_dict, done):
self.root_pos_err += error_dict["root_pos_err"]
self.root_rot_err += error_dict["root_rot_err"]
self.body_pos_err += error_dict["body_pos_err"]
self.body_rot_err += error_dict["body_rot_err"]
self.dof_vel_err += error_dict["dof_vel_err"]
self.root_vel_err += error_dict["root_vel_err"]
# Count episodes
self.num_episodes += done.sum()
Source: PARC/motion_tracker/learning/tracking_error_tracker.py
GPU Acceleration
Isaac Gym runs physics on GPU:
- All tensors on GPU
- No CPU-GPU transfers during rollout
- Parallel simulation of thousands of envs
Batch Processing
All operations are batched:
# Single forward pass for all envs
actions = policy(obs) # Shape: (num_envs, action_dim)
# Single simulation step for all envs
next_obs, rewards, dones, infos = env.step(actions)
Memory Layout
Contiguous tensors for efficiency:
# Flatten multi-dimensional observations
obs = torch.cat([
root_pos.view(num_envs, -1),
root_rot.view(num_envs, -1),
dof_pos.view(num_envs, -1),
heightmap.view(num_envs, -1),
], dim=-1)
Configuration Example
# Environment
env:
num_envs: 4096
sim_device: "cuda:0"
headless: true
dm:
motion_file: "data/motions.yaml"
terrains_per_motion: 1
random_reset_pos: false
terrain_build_mode: "file" # Or "square", "wide"
fail_rate_quantiles: [0.1, 0.3, 0.5, 0.7]
min_motion_weight: 0.01
heightmap:
horizontal_scale: 0.1
padding: 1.0
reward_weights:
w_root_pos: 0.3
w_root_rot: 0.2
w_body_pos: 0.3
w_body_rot: 0.1
w_dof_vel: 0.05
w_end_eff: 0.05
# Agent
agent:
algorithm: "DM_PPO"
rollout:
rollout_length: 8
training:
num_epochs: 5
mini_batch_size: 4096
learning_rate: 0.0001
ppo:
clip_epsilon: 0.2
gae:
discount: 0.99
td_lambda: 0.95
normalization:
norm_adv_clip: 5.0
norm_obs_clip: 5.0
# Model
model:
policy_layers: [1024, 512, 512]
value_layers: [1024, 512, 512]
activation: "relu"
# Training loop
max_samples: 100000000 # 100M samples
iters_per_output: 100
iters_per_checkpoint: 500
test_episodes: 100
Tips
Warm start: Initialize the tracker from a previous iteration’s checkpoint to speed up training on new motions.
Reward scaling: Ensure rewards are properly scaled (typically [0, 1] range). Improperly scaled rewards can destabilize PPO.
Parallelization: More parallel environments yield higher sample throughput per wall-clock second. Aim for at least 2048-4096 environments for stable training.
Testing frequency: Test every 100-200 iterations to monitor progress without slowing down training significantly.