
Overview

The Motion Diffusion Model (MDM) is a transformer-based denoising diffusion probabilistic model that generates kinematic motion sequences conditioned on terrain geometry and target directions.

Key features:
  • Transformer encoder architecture with self-attention
  • CNN-based heightfield conditioning
  • Autoregressive generation support
  • DDIM sampling for fast inference
  • Predict-x0 formulation
Source: PARC/motion_generator/mdm.py:116-1188

Architecture

Motion Representation

Feature Components

Motions are represented as sequences of frames, where each frame can include:
frame_components = [
    "ROOT_POS",      # 3D position
    "ROOT_ROT",      # Rotation (representation varies)
    "JOINT_ROT",     # Per-joint rotations
    "CONTACTS"       # Binary contact labels
]
Source: PARC/motion_generator/mdm.py:287-326

Rotation Representations

Supports multiple rotation types (configured via rot_type):
  • Quaternion (4D) - Default, no gimbal lock
  • Exponential Map (3D) - Compact, used in some losses
  • 6D Rotation (6D) - Continuous, good for learning
  • Rotation Matrix (9D) - Redundant but explicit
Rotations are converted internally as needed. Source: PARC/motion_generator/mdm.py:178-179
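As an illustration of why the 6D representation is convenient, here is a minimal sketch of the standard conversion (first two matrix columns out, Gram-Schmidt back in); PARC's own conversion helpers may differ in naming and layout:

```python
import torch

def matrix_to_6d(R: torch.Tensor) -> torch.Tensor:
    """First two columns of a rotation matrix (..., 3, 3) -> (..., 6), column-major."""
    return R[..., :, :2].transpose(-1, -2).reshape(*R.shape[:-2], 6)

def d6_to_matrix(d6: torch.Tensor) -> torch.Tensor:
    """Recover a rotation matrix from 6D via Gram-Schmidt orthonormalization."""
    a1, a2 = d6[..., 0:3], d6[..., 3:6]
    b1 = torch.nn.functional.normalize(a1, dim=-1)
    # Remove the component of a2 along b1, then normalize
    b2 = torch.nn.functional.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.linalg.cross(b1, b2, dim=-1)
    return torch.stack([b1, b2, b3], dim=-1)  # columns b1, b2, b3
```

Because the map back to a rotation matrix is continuous, 6D targets avoid the discontinuities that make quaternions and exponential maps harder to regress.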

Standardization

All features are standardized (z-score normalization):
standardized = (features - mean) / std
Mean and std computed across entire dataset and cached to sampler_stats.txt. Source: PARC/motion_generator/mdm.py:442-513
Contact labels (0 or 1) are NOT standardized: their mean is fixed to 0 and std to 1, so the z-score transform leaves those dimensions unchanged.
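A minimal sketch of how such stats could be computed, with the contact dimensions pinned to the identity transform (the function name and slice-based indexing are illustrative, not PARC's API):

```python
import torch

def compute_feature_stats(features: torch.Tensor, contact_dims: slice):
    """Per-dimension z-score stats over the whole dataset.

    features: (num_frames, feature_dim), all motions stacked.
    contact_dims: slice covering the binary contact labels.
    """
    mean = features.mean(dim=0)
    std = features.std(dim=0).clamp_min(1e-6)  # guard against constant dims
    mean[contact_dims] = 0.0  # identity transform for binary contact labels
    std[contact_dims] = 1.0
    return mean, std
```

With these stats, `standardized = (features - mean) / std` normalizes every continuous dimension while passing contacts through untouched.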

Transformer Architecture

Class: MDMTransformer
Source: PARC/motion_generator/mdm_transformer.py:16-193

Token Structure

The transformer processes a sequence of tokens:
  1. Timestep token (1) - Embedded diffusion timestep
  2. Observation tokens (M) - CNN-encoded heightfield features
  3. Target token (1) - Target direction embedding
  4. Noise indicator token (1) - Whether previous state is noisy
  5. Motion tokens (N) - Motion sequence frames
Total sequence length: 1 + M + 1 + 1 + N. Source: PARC/motion_generator/mdm_transformer.py:44-96

Timestep Embedding

Sinusoidal positional encoding for diffusion timestep:
class TimestepEmbedder(nn.Module):
    def __init__(self, d_model, max_timesteps):
        # Projects timestep to d_model dimensions
        # Uses learnable MLP after sinusoidal encoding
Source: PARC/motion_generator/diffusion_util.py
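A typical complete implementation of this pattern looks like the following sketch (the exact layer sizes, activation, and frequency base in diffusion_util.py may differ):

```python
import math
import torch
import torch.nn as nn

class TimestepEmbedder(nn.Module):
    """Sinusoidal encoding of the diffusion timestep followed by a learnable MLP."""

    def __init__(self, d_model: int, max_timesteps: int = 10000):
        super().__init__()
        half = d_model // 2
        # Geometric frequency ladder, as in transformer positional encodings
        freqs = torch.exp(-math.log(max_timesteps) * torch.arange(half) / half)
        self.register_buffer("freqs", freqs)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model), nn.SiLU(), nn.Linear(d_model, d_model)
        )

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: (batch,) integer timesteps -> (batch, d_model) embeddings
        args = t.float()[:, None] * self.freqs[None, :]
        enc = torch.cat([torch.sin(args), torch.cos(args)], dim=-1)
        return self.mlp(enc)
```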

Heightfield Conditioning

Local heightfield is encoded via CNN:
cnn:
  net_name: "HeightfieldCNN"  
  # Typical config:
  # - Conv layers to extract spatial features
  # - Global pooling or flattening
  # - Output: M tokens of dimension d_token
CNN produces multiple tokens (e.g., M=4) that attend with motion tokens. Source: PARC/motion_generator/mdm_transformer.py:54-62

Attention Mask

Key padding mask controls which tokens attend to which:
key_padding_mask = torch.zeros(batch_size, full_seq_len, dtype=torch.bool)
# True entries are masked (ignored) by attention
key_padding_mask[:, obs_token_idx] = ~obs_flag      # conditional dropout
key_padding_mask[:, target_token_idx] = ~target_flag
Enables classifier-free guidance and conditional dropout. Source: PARC/motion_generator/mdm_transformer.py:99-135

Transformer Layers

nn.TransformerEncoderLayer(
    d_model=512,            # Token dimension
    nhead=8,                # Attention heads
    dim_feedforward=2048,   # FFN hidden dimension
    dropout=0.1,
    activation="gelu",
    batch_first=True
)
Typically 8-12 layers deep. Source: PARC/motion_generator/mdm_transformer.py:85-88
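Stacking these layers with `nn.TransformerEncoder`, using the hyperparameters above (the token count 67 assumes M=4 observation tokens and N=60 motion frames, purely for illustration):

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(
    d_model=512, nhead=8, dim_feedforward=2048,
    dropout=0.1, activation="gelu", batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=8)

# Example token sequence: 1 timestep + 4 obs + 1 target + 1 noise-ind + 60 motion = 67
tokens = torch.randn(2, 67, 512)   # (batch, seq_len, d_model)
out = encoder(tokens)              # same shape: (2, 67, 512)
```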

Diffusion Process

Forward Diffusion

Adds noise to clean motion over T timesteps:
def forward_diffusion(x_0, t):
    noise = torch.randn_like(x_0)
    sqrt_alpha_bar = sqrt_alphas_cumprod[t]
    sigma = sqrt_one_minus_alphas_cumprod[t]
    x_t = sqrt_alpha_bar * x_0 + sigma * noise
    return x_t
Uses variance schedule (linear, cosine, etc.). Source: PARC/motion_generator/mdm.py:777-788
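A runnable version of this closed-form noising step, assuming a linear beta schedule with the common DDPM defaults (the schedule PARC actually uses is configurable); it also returns the sampled noise, which epsilon-prediction variants need for the loss:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # linear schedule (DDPM defaults)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # alpha_bar_t
sqrt_alphas_cumprod = alphas_cumprod.sqrt()
sqrt_one_minus_alphas_cumprod = (1.0 - alphas_cumprod).sqrt()

def forward_diffusion(x_0: torch.Tensor, t: torch.Tensor):
    """Sample q(x_t | x_0) in closed form; t is a (batch,) tensor of timesteps."""
    noise = torch.randn_like(x_0)
    # Broadcast per-sample scalars over the feature dimensions
    a = sqrt_alphas_cumprod[t].view(-1, *([1] * (x_0.dim() - 1)))
    s = sqrt_one_minus_alphas_cumprod[t].view(-1, *([1] * (x_0.dim() - 1)))
    return a * x_0 + s * noise, noise
```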

Reverse Diffusion (DDPM)

Denoises motion step-by-step:
for t in reversed(range(T)):
    # Predict clean motion from noisy x_t
    x_0_pred = model(x_t, conds, t)

    # Compute posterior q(x_{t-1} | x_t, x_0)
    mean = coef1 * x_0_pred + coef2 * x_t
    noise = torch.randn_like(x_t) if t > 0 else 0  # no noise at the final step
    x_t = mean + posterior_std * noise
Requires T forward passes. Source: PARC/motion_generator/mdm.py:855-894

DDIM Sampling

Faster sampling with stride > 1:
timesteps = reversed(range(0, T, stride))
for t in timesteps:
    x_0_pred = model(x_t, conds, t)
    # Deterministic update (no noise added); clamp so the last step lands on t=0
    x_t = ddim_step(x_0_pred, x_t, t, max(t - stride, 0))
With stride=10, only 100 steps instead of 1000. Source: PARC/motion_generator/mdm.py:932-988
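The predict-x0 formulation makes the deterministic (eta=0) DDIM update simple: recover the noise implied by the x0 prediction, then jump straight to the earlier timestep. A sketch of what `ddim_step` can look like (the real function's signature and schedule handling may differ):

```python
import torch

def ddim_step(x_0_pred, x_t, t, t_prev, alphas_cumprod):
    """Deterministic DDIM (eta=0) update in the predict-x0 formulation."""
    a_t = alphas_cumprod[t]
    a_prev = alphas_cumprod[t_prev] if t_prev >= 0 else torch.tensor(1.0)
    # Noise implied by the x0 prediction at the current timestep
    eps = (x_t - a_t.sqrt() * x_0_pred) / (1.0 - a_t).sqrt()
    # Jump to t_prev along the deterministic trajectory
    return a_prev.sqrt() * x_0_pred + (1.0 - a_prev).sqrt() * eps
```

If the x0 prediction is exact, this reproduces the closed-form noising at `t_prev` with the same noise sample, which is why large strides lose little quality.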
Use DDIM with stride 10-20 for fast inference with minimal quality loss. DDPM (stride=1) gives best quality but is slowest.

Conditioning

Heightfield Observations

Local terrain around character:
heightmap:
  local_grid:
    num_x_neg: 10  # Cells behind character
    num_x_pos: 30  # Cells in front  
    num_y_neg: 10  # Cells to left
    num_y_pos: 10  # Cells to right
  horizontal_scale: 0.1  # Cell size in meters
  max_h: 5.0  # Max height for normalization
Heightfield normalized to [-1, 1] range. Source: PARC/motion_generator/mdm.py:184-200

Target Direction

Desired movement direction:
target_type = "XY_DIR"  # Unit vector in XY plane
# Alternatives:
# - "XY_POS": Target position (2D)
# - "XY_POS_AND_HEADING": Position + heading angle (3D)
Target computed from end of motion sequence during training. Source: PARC/motion_generator/mdm.py:208-217
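A minimal sketch of an XY_DIR target, assuming the direction is taken from the first to the last root position of the window (the exact frames used are defined in mdm.py):

```python
import torch

def compute_xy_dir_target(root_pos: torch.Tensor) -> torch.Tensor:
    """Unit XY direction across a (num_frames, 3) root trajectory."""
    delta = root_pos[-1, :2] - root_pos[0, :2]
    return delta / delta.norm().clamp_min(1e-8)  # guard against zero displacement
```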

Previous States

Temporal coherence via autoregression:
num_prev_states = 2  # Condition on last 2 frames
Previous states can be:
  • Clean (noise indicator = 1) - For autoregressive generation
  • Noisy (noise indicator = 0) - For training robustness
Source: PARC/motion_generator/mdm.py:143-145

Dropout Strategies

Random dropout during training:
obs_dropout: 0.1        # Ignore heightfield 10% of time
target_dropout: 0.1     # Ignore target 10% of time  
prev_state_attention_dropout: 0.1
prev_state_noise_ind_chance: 0.3  # Use noisy prev state
Improves robustness and enables classifier-free guidance. Source: PARC/motion_generator/mdm.py:135-145

Loss Functions

Multiple loss terms weighted and summed:

Simple Losses

# Root position L2 loss
root_pos_loss = ||pred_root_pos - true_root_pos||^2

# Rotation losses use quaternion angle difference  
root_rot_loss = angle_diff(pred_root_rot, true_root_rot)^2
joint_rot_loss = sum(angle_diff(pred_joint_rot, true_joint_rot)^2)

# Contact loss (L2 on the binary contact labels)
contact_loss = ||pred_contacts - true_contacts||^2
Source: PARC/motion_generator/mdm.py:622-637
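The rotation losses rely on a quaternion angle difference; the actual `angle_diff` lives in PARC's utilities, but a minimal helper might look like this (the dot-product form works for any component ordering, and the `abs` handles the quaternion double cover q ≡ -q):

```python
import torch

def quat_angle_diff(q1: torch.Tensor, q2: torch.Tensor) -> torch.Tensor:
    """Geodesic angle in radians between unit quaternions, shape (..., 4)."""
    dot = (q1 * q2).sum(dim=-1).abs().clamp(max=1.0)  # clamp guards acos domain
    return 2.0 * torch.acos(dot)
```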

Velocity Losses

# Finite difference velocities
dt = 1.0 / fps
root_vel = (root_pos[t+1] - root_pos[t]) / dt
vel_loss = ||pred_root_vel - true_root_vel||^2

# Angular velocities from quaternion differences
root_ang_vel = quat_to_exp_map(quat_diff(rot[t], rot[t+1])) / dt  
ang_vel_loss = ||pred_ang_vel - true_ang_vel||^2
Source: PARC/motion_generator/mdm.py:600-620

Forward Kinematics Losses

# Compute body positions via FK
body_pos, body_rot = forward_kinematics(root_pos, root_rot, joint_rot)

# Match body positions in 3D space
fk_body_pos_loss = ||pred_body_pos - true_body_pos||^2
fk_body_rot_loss = sum(angle_diff(pred_body_rot, true_body_rot)^2)
Ensures physically plausible poses. Source: PARC/motion_generator/mdm.py:646-655

Heightfield Collision Loss

# Sample points on character body
char_points = sample_body_points(body_pos, body_rot)

# Compute SDF (signed distance function) to heightfield  
sdf = compute_sdf(char_points, heightfield)

# Penalize penetration (sdf < 0)
collision_loss = sum(clamp(sdf, max=0)^2)
Prevents ground penetration. Source: PARC/motion_generator/mdm.py:669-677
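A simplified sketch of the penetration penalty, using a nearest-cell height lookup in place of the full SDF computation (the function name and grid indexing are illustrative):

```python
import torch

def heightfield_collision_loss(points_xyz, heightfield, cell_size):
    """Penalize body sample points that sink below the terrain surface.

    points_xyz: (N, 3) world-space sample points on the character body.
    heightfield: (H, W) grid of terrain heights; cell_size: meters per cell.
    """
    ix = (points_xyz[:, 0] / cell_size).round().long().clamp(0, heightfield.shape[0] - 1)
    iy = (points_xyz[:, 1] / cell_size).round().long().clamp(0, heightfield.shape[1] - 1)
    sdf = points_xyz[:, 2] - heightfield[ix, iy]  # signed height above ground
    return (sdf.clamp(max=0.0) ** 2).sum()        # only penetration (sdf < 0) costs
```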

Loss Weighting

w_simple_root_pos: 1.0
w_simple_root_rot: 10.0  
w_simple_joint_rot: 10.0
w_simple_contacts: 0.1
w_vel_root_pos: 0.1
w_vel_root_rot: 1.0
w_vel_joint_rot: 0.01  
w_body_pos: 1.0
w_body_rot: 1.0
w_hf: 10.0  # Collision loss
Tuned based on importance and scale. Source: PARC/motion_generator/mdm.py:149-171

Training

Training Loop

for epoch in range(epochs):
    for iter in range(iters_per_epoch):
        # Sample batch from dataset
        motion_data, hfs, target_info = sampler.sample(batch_size)
        
        # Construct conditions
        conds = construct_conds(motion_data, hfs, target_info)
        x_0 = assemble_features(motion_data)
        
        # Random timestep per batch element
        t = torch.randint(0, diffusion_timesteps, (batch_size,))
        
        # Forward diffusion  
        x_t = forward_diffusion(x_0, t)
        
        # Predict clean motion
        x_0_pred = model(x_t, conds, t)
        
        # Compute losses
        loss = compute_losses(x_0, x_0_pred, conds)
        
        # Backprop and update
        optimizer.zero_grad()
        loss.backward()
        clip_grad_norm_(parameters, 1.0)
        optimizer.step()
        
        # Update EMA model
        ema.update(model)
Source: PARC/motion_generator/mdm.py:1067-1096

Exponential Moving Average (EMA)

Maintains smoothed model weights:
if step > ema_start:
    for p, p_ema in zip(model.parameters(), ema_model.parameters()):
        p_ema.data = decay * p_ema.data + (1 - decay) * p.data
Typically decay=0.9999. Source: PARC/motion_generator/mdm.py:250-285
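A self-contained version of this update (the in-place `mul_`/`add_` form is equivalent to the assignment above):

```python
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(model: nn.Module, ema_model: nn.Module, decay: float = 0.9999):
    """Blend the current weights into the EMA copy, in place."""
    for p, p_ema in zip(model.parameters(), ema_model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)

# The EMA model starts as a frozen copy of the training model
model = nn.Linear(4, 4)
ema_model = copy.deepcopy(model)
ema_update(model, ema_model)
```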

Gradient Clipping

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
Prevents exploding gradients. Source: PARC/motion_generator/mdm.py:1091

Inference

Autoregressive Generation

Generate long sequences by chaining:
# Initialize with starting pose
prev_states = initial_pose

for segment in path:
    # Sample heightfield at current position
    hf = sample_heightfield(position, heading)
    
    # Compute target from path
    target = compute_target(position, path)
    
    # Generate next segment
    conds = {
        "prev_state": prev_states,
        "obs": hf,
        "target": target
    }
    
    motion_segment = model.generate(conds, ddim_stride=10)
    
    # Update for next iteration  
    prev_states = motion_segment[-2:]
    position += compute_displacement(motion_segment)
Source: PARC/motion_synthesis/procgen/mdm_path.py

Classifier-Free Guidance

Strengthen conditioning at inference:
# Generate with and without conditioning
x_0_cond = model(x_t, conds, t)
x_0_uncond = model(x_t, empty_conds, t)

# Interpolate predictions
scale = 1.5  # Guidance scale
x_0 = x_0_uncond + scale * (x_0_cond - x_0_uncond)
Increases adherence to terrain/target at cost of diversity. Source: PARC/motion_generator/mdm.py:836-851

Configuration Example

# Model architecture
d_model: 512
num_heads: 8  
d_hid: 2048
num_layers: 8
dropout: 0.1

# Sequence parameters  
sequence_duration: 2.0  # seconds
sequence_fps: 30
num_prev_states: 2

# Diffusion
diffusion_timesteps: 1000
predict_mode: "PREDICT_X0"
test_mode: "MODE_DDIM"  
test_ddim_stride: 10

# Training
batch_size: 512
epochs: 1000  
iters_per_epoch: 100
lr: 0.0001
weight_decay: 0.0

# Features
features:
  rot_type: "QUAT"  
  frame_components:
    - "ROOT_POS"
    - "ROOT_ROT"
    - "JOINT_ROT"  
    - "CONTACTS"

# Conditioning
use_heightmap_obs: true
use_target_obs: true
target_type: "XY_DIR"
obs_dropout: 0.1
target_dropout: 0.1

# Loss weights
w_simple_root_pos: 1.0
w_simple_root_rot: 10.0
w_simple_joint_rot: 10.0  
w_vel_root_pos: 0.1
w_body_pos: 1.0
w_hf: 10.0

Implementation Tips

Memory optimization: Use gradient checkpointing for transformer layers if GPU memory is limited. This typically allows roughly 2x larger batches at the cost of ~20% slower training.
Quaternion normalization: Always normalize quaternions after prediction to ensure unit length. The model may predict non-normalized values.
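The projection back to unit length is one line (the epsilon guard avoids dividing by zero on degenerate predictions):

```python
import torch

def normalize_quat(q: torch.Tensor) -> torch.Tensor:
    """Project predicted quaternions (..., 4) back onto the unit sphere."""
    return q / q.norm(dim=-1, keepdim=True).clamp_min(1e-8)
```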
EMA for inference: Use the EMA model for generation, not the latest training model. EMA weights are more stable and produce better quality.
