
Overview

Rotary Position Embedding (RoPE) is a position encoding method that applies rotation matrices to query and key vectors in attention mechanisms. Unlike learned absolute position embeddings, RoPE:
  • Encodes relative positions through geometric rotations
  • Enables length extrapolation beyond training sequence length
  • Requires no additional parameters
  • Provides better inductive bias for position-dependent patterns
Paper: Su et al. (2021) - RoFormer: Enhanced Transformer with Rotary Position Embedding

RoPE has become the standard position encoding method in modern LLMs including LLaMA, PaLM, and GPT-NeoX.

Mathematical formulation

Core intuition

The key insight of RoPE is to encode position information through rotations in complex space:
  1. Treat pairs of dimensions as complex numbers: (x₁, x₂) → x₁ + ix₂
  2. Rotate by an angle proportional to position: rotation(m) = e^(imθ)
  3. The relative position between tokens becomes the difference in rotation angles
For a 2D vector at position m, apply a rotation by angle mθ:
[x₁']   [cos(mθ)  -sin(mθ)]   [x₁]
[x₂'] = [sin(mθ)   cos(mθ)] × [x₂]
Relative position property: When computing attention between positions m and n:
qₘᵀ kₙ = (Rotation(mθ) × query)ᵀ × (Rotation(nθ) × key)
       = queryᵀ × Rotation((n-m)θ) × key
The attention score depends only on the relative position n-m, not absolute positions.
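The complex-number view above can be checked numerically. The following toy sketch (hypothetical helper names, a single dimension pair, arbitrarily chosen θ) shows that the rotated inner product sees only the offset between positions:

```python
import cmath

def rope_pair(x1: float, x2: float, m: int, theta: float = 0.1) -> complex:
    """Treat the dimension pair (x1, x2) as x1 + i*x2 and rotate by m*theta."""
    return complex(x1, x2) * cmath.exp(1j * m * theta)

def score(q, k, m, n):
    """Real inner product of the rotated pair: Re(q' * conj(k')) — sees only m - n."""
    qm, kn = rope_pair(*q, m), rope_pair(*k, n)
    return (qm * kn.conjugate()).real

q, k = (0.3, -1.2), (0.8, 0.5)
s_a = score(q, k, m=5, n=3)      # relative offset +2
s_b = score(q, k, m=105, n=103)  # same offset, both positions shifted by 100
assert abs(s_a - s_b) < 1e-9     # identical attention score
```

Because q'·conj(k') = q·conj(k)·e^(i(m-n)θ), shifting both positions by the same amount leaves the score unchanged.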

Implementation

Modern LLM implements RoPE efficiently using precomputed sine/cosine factors:
attention.py:190-211
def _apply_rope(self, tensor: Tensor, seq_len: int, offset: int = 0) -> Tensor:
    """Apply rotary position embeddings to queries or keys."""
    # Get precomputed cos/sin factors
    cos, sin = self._get_rope_factors(seq_len + offset, tensor.device, tensor.dtype)
    
    # Handle offset for attention sinks
    cos = cos[offset : offset + seq_len]
    sin = sin[offset : offset + seq_len]
    
    # Add batch and head dimensions for broadcasting
    cos = cos.unsqueeze(0).unsqueeze(0)  # (1, 1, seq_len, head_dim)
    sin = sin.unsqueeze(0).unsqueeze(0)
    
    # Apply rotation: x * cos + rotate_half(x) * sin
    return (tensor * cos) + (self._rotate_half(tensor) * sin)

def _get_rope_factors(self, seq_len: int, device: torch.device, dtype: torch.dtype) -> Tuple[Tensor, Tensor]:
    """Compute cos and sin factors for RoPE."""
    # Compute position × frequency for each position and frequency
    freqs = torch.outer(
        torch.arange(seq_len, device=device),  # positions: [0, 1, 2, ...]
        self.inv_freq.to(device=device)        # frequencies: [θ₀, θ₁, ...]
    )
    
    # Apply frequency scaling if configured
    if self.config.rope_scaling:
        freqs = freqs * self.config.rope_scaling
    
    # Compute cos and sin; duplicate the frequencies so the two halves of the
    # head dimension (paired by _rotate_half) share the same rotation angle
    emb = torch.cat([freqs, freqs], dim=-1)
    cos = torch.cos(emb).to(dtype=dtype)
    sin = torch.sin(emb).to(dtype=dtype)
    return cos, sin

@staticmethod
def _rotate_half(x: Tensor) -> Tensor:
    """Rotate half the dimensions: [-x₂, x₁, -x₄, x₃, ...]"""
    x1, x2 = x[..., : x.size(-1) // 2], x[..., x.size(-1) // 2 :]
    return torch.cat([-x2, x1], dim=-1)
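A torch-free sketch of the same split-half scheme (a hypothetical `rope` helper mirroring `_apply_rope` and `_rotate_half` above, not the library API) makes the relative-position property easy to verify end to end:

```python
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Split-half RoPE on one vector x of even length d."""
    d = x.shape[-1]
    inv_freq = 1.0 / base ** (np.arange(0, d, 2) / d)       # (d/2,) frequencies
    angles = pos * inv_freq
    cos = np.concatenate([np.cos(angles), np.cos(angles)])  # both halves share angles
    sin = np.concatenate([np.sin(angles), np.sin(angles)])
    x1, x2 = x[: d // 2], x[d // 2 :]
    rotated = np.concatenate([-x2, x1])                     # rotate_half
    return x * cos + rotated * sin

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)
s1 = rope(q, 5) @ rope(k, 3)      # offset +2
s2 = rope(q, 105) @ rope(k, 103)  # same offset, shifted by 100
assert np.isclose(s1, s2)         # score depends only on the offset
```

Note that each pair (x_i, x_{i+d/2}) is rotated by the same angle pos·θ_i, which is why the cos/sin factors are concatenated rather than interleaved.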

Key features

Relative position encoding

The fundamental property of RoPE is that attention scores depend only on relative positions:
attention(qₘ, kₙ) = f(qₘ, kₙ, m-n)
Relative positions are more meaningful than absolute positions because:
  1. Translation invariance: “The cat sat on the mat” has the same structure as “Yesterday, the cat sat on the mat”
  2. Generalization: Patterns learned at position 10 apply at position 100
  3. Extrapolation: Model can handle longer sequences than seen during training
Example:
Query at position 5:  "dog"
Key at position 3:    "the"
Relative position:    +2

Query at position 105: "dog" 
Key at position 103:   "the"
Relative position:     +2  (same relationship!)
For positions m and n, after applying RoPE:
qₘ' = Rotation(mθ) × qₘ
kₙ' = Rotation(nθ) × kₙ

qₘ'ᵀ kₙ' = qₘᵀ Rotation(mθ)ᵀ Rotation(nθ) kₙ
          = qₘᵀ Rotation((n-m)θ) kₙ
Where we used the rotation property:
R(mθ)ᵀ R(nθ) = R(-mθ) R(nθ) = R((n-m)θ)
Thus, the attention score depends only on (n-m), the relative position.
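The rotation identity used in the last step can be confirmed numerically; a minimal pure-Python check (helper names `R`, `matmul`, `transpose` are illustrative):

```python
import math

def R(a):
    """2x2 rotation matrix by angle a, as a tuple of rows."""
    return ((math.cos(a), -math.sin(a)), (math.sin(a), math.cos(a)))

def matmul(A, B):
    return tuple(tuple(sum(A[i][k] * B[k][j] for k in range(2))
                       for j in range(2)) for i in range(2))

def transpose(A):
    return tuple(tuple(A[j][i] for j in range(2)) for i in range(2))

theta, m, n = 0.1, 5, 3
lhs = matmul(transpose(R(m * theta)), R(n * theta))  # R(mθ)ᵀ R(nθ)
rhs = R((n - m) * theta)                             # R((n-m)θ)
assert all(abs(lhs[i][j] - rhs[i][j]) < 1e-12
           for i in range(2) for j in range(2))
```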

Length extrapolation

RoPE enables models to generalize to longer sequences than seen during training:
Training: Model learns on sequences up to length 2048
  • Sees relative positions from -2048 to +2048
  • Learns attention patterns for these ranges
Inference: Generate sequence of length 4096
  • Relative positions from -4096 to +4096
  • RoPE rotations are smooth and continuous
  • Model can interpolate to unseen relative positions
Contrast with learned embeddings:
  • Learned position embeddings have fixed size (e.g., 2048)
  • Cannot generate position 2049 (out of bounds)
  • RoPE has no such limit
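A toy comparison (plain-Python stand-ins, not the real modules) illustrates the contrast: a fixed-size embedding table fails past its last index, while the RoPE angle is defined for any position:

```python
import math

MAX_SEQ_LEN = 2048
learned = [[0.0] * 8 for _ in range(MAX_SEQ_LEN)]  # stand-in for nn.Embedding

def learned_pos(p):
    return learned[p]  # raises IndexError for p >= MAX_SEQ_LEN

def rope_angle(p, i=0, d=8, base=10000.0):
    return p / base ** (2 * i / d)  # defined for any position p

try:
    learned_pos(2049)
    overflowed = False
except IndexError:
    overflowed = True

assert overflowed                        # learned table is out of bounds
assert math.isfinite(rope_angle(1_000_000))  # RoPE angle still well defined
```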

No additional parameters

Unlike learned position embeddings, RoPE adds zero parameters to the model:
# Learned absolute embeddings (e.g., GPT-2)
self.position_embeddings = nn.Embedding(max_seq_len, d_model)
# Parameters: max_seq_len × d_model = 2048 × 768 = 1,572,864

# RoPE (Modern LLM)
inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
self.register_buffer("inv_freq", inv_freq, persistent=False)
# Parameters: 0 (just a fixed buffer)
For a 12-layer model with d=768, max_seq_len=2048, RoPE saves 1.57M parameters compared to learned position embeddings.
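The parameter counts can be sanity-checked with plain arithmetic, using the example sizes above:

```python
max_seq_len, d_model, head_dim = 2048, 768, 64

learned_params = max_seq_len * d_model  # one weight per (position, dim)
rope_params = 0                         # inv_freq is a fixed buffer, not trained
rope_buffer_entries = head_dim // 2     # cached inv_freq values per head dim

assert learned_params == 1_572_864      # ≈ 1.57M parameters saved
assert rope_params == 0
assert rope_buffer_entries == 32
```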

Hyperparameters

Base frequency (theta)

The base frequency controls the wavelength spectrum:
rope_theta = 10000.0  # Default
Higher theta → longer wavelengths → better long-range dependencies:
| Theta     | Min wavelength | Max wavelength | Use case              |
|-----------|----------------|----------------|-----------------------|
| 1,000     | 6.28           | 6,280          | Short sequences       |
| 10,000    | 6.28           | 62,800         | Standard (2-4K)       |
| 100,000   | 6.28           | 628,000        | Long context (8-16K)  |
| 1,000,000 | 6.28           | 6,280,000      | Extreme length (32K+) |
Wavelength calculation:
λᵢ = 2π × theta^(2i/d)
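The table values follow from this formula; a small sketch (head_dim=64 is an assumed example) computes the per-dimension wavelengths, with the shortest always 2π and the longest approaching 2π·theta (the table rounds it to 2π·theta):

```python
import math

def wavelengths(theta: float, head_dim: int = 64):
    """λ_i = 2π · theta^(2i/d) for i = 0 .. d/2 - 1."""
    return [2 * math.pi * theta ** (2 * i / head_dim)
            for i in range(head_dim // 2)]

lam = wavelengths(10_000.0)
assert abs(lam[0] - 2 * math.pi) < 1e-9  # min wavelength ≈ 6.28
assert lam[-1] < 2 * math.pi * 10_000.0  # max approaches 2π·theta ≈ 62,832
assert lam[-1] > 2 * math.pi * 1_000.0
```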

Position scaling

For length extrapolation beyond training:
rope_scaling = 0.5  # Compress positions by 2×
Rule of thumb: Set rope_scaling = train_length / target_length when the target length is 2-4× the training length. Beyond 4×, consider retraining with longer sequences or using advanced techniques like YaRN (Peng et al., 2023).

Comparison to other position encodings

| Method                            | Type    | Parameters | Extrapolation | Relative/Absolute |
|-----------------------------------|---------|------------|---------------|-------------------|
| Learned embeddings (GPT-2)        | Learned | L × d      | Poor          | Absolute          |
| Sinusoidal (Original Transformer) | Fixed   | 0          | Good          | Absolute          |
| Relative embeddings (T5)          | Learned | O(L)       | Moderate      | Relative          |
| ALiBi (Press et al., 2022)        | Fixed   | 0          | Excellent     | Relative          |
| RoPE (Su et al., 2021)            | Fixed   | 0          | Excellent     | Relative          |
Modern consensus: RoPE and ALiBi are the top choices for position encoding in 2023+ LLMs. RoPE is more widely adopted and has slightly better performance on benchmarks.

Common issues and solutions

Error: RoPE requires an even head dimension

Cause: RoPE operates on pairs of dimensions, so the head dimension must be even.

Solution:
# Ensure head_dim is even
assert config.d_model % config.n_heads == 0
head_dim = config.d_model // config.n_heads
assert head_dim % 2 == 0, f"Head dim {head_dim} must be even"

# Example valid configurations:
config.d_model = 768; config.n_heads = 12  # head_dim=64 ✓
config.d_model = 512; config.n_heads = 8   # head_dim=64 ✓

# Invalid:
config.d_model = 768; config.n_heads = 7   # 768 not divisible by 7 ✗
config.d_model = 100; config.n_heads = 4   # head_dim=25 is odd ✗
Issue: Poor quality beyond the training length

Symptoms: Model performance degrades on sequences longer than the training length.

Solutions:
  1. Use position scaling:
    rope_scaling = train_length / inference_length
    
  2. Increase base frequency:
    rope_theta = 100000.0  # from 10000.0
    
  3. Fine-tune on longer sequences:
    # Continue training with longer sequences
    config.max_seq_len = 8192  # from 2048
    # Train for a few thousand steps
    
Issue: How does RoPE work with attention sinks?

Solution: Apply RoPE with a position offset for queries, and without an offset for keys:
attention.py:108-110
if self.config.use_rope:
    # Query positions are offset by num_sinks
    q = self._apply_rope(q, seq_len, offset=num_attention_sinks)
    # Key positions start at 0 for sinks, then regular sequence
    k = self._apply_rope(k, seq_len)
This ensures sink tokens have positions 0, 1, … and regular tokens continue from there.

Advanced topics

Per-head frequencies

Use different frequency ranges for different heads:
# Each head gets different frequency range
for head_idx in range(n_heads):
    scale = 2.0 ** (head_idx / n_heads)  # 1.0 to 2.0
    freqs[head_idx] = base_freqs * scale
Benefits:
  • Some heads specialize in short-range dependencies
  • Other heads handle long-range dependencies
  • Improves representational capacity
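The loop above can be written as a short vectorized sketch (NumPy stand-in; names and sizes are assumed for illustration) to show the resulting frequency layout:

```python
import numpy as np

n_heads, head_dim, base = 8, 64, 10000.0
base_freqs = 1.0 / base ** (np.arange(0, head_dim, 2) / head_dim)  # (d/2,)

# Per-head scale in [1.0, 2.0): later heads rotate faster,
# giving them shorter wavelengths (short-range specialists).
scales = 2.0 ** (np.arange(n_heads) / n_heads)          # (n_heads,)
freqs = scales[:, None] * base_freqs[None, :]           # (n_heads, d/2)

assert freqs.shape == (n_heads, head_dim // 2)
assert np.all(freqs[-1] > freqs[0])  # last head is uniformly higher-frequency
```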
2D RoPE for vision

Extend RoPE to 2D for vision transformers:
# Separate frequencies for height and width
freqs_h = compute_freqs(height_positions, theta_h)
freqs_w = compute_freqs(width_positions, theta_w)

# Apply to half the dimensions each
x[:, :d//2] = apply_rope(x[:, :d//2], freqs_h)
x[:, d//2:] = apply_rope(x[:, d//2:], freqs_w)
Used in models like ViT-RoPE for image understanding.
YaRN

An advanced technique for extreme length extrapolation (Peng et al., 2023):
  • Dynamically adjust frequencies based on attention distance
  • Temperature scaling for attention scores
  • Enables 64K+ context with 4K training
See paper: YaRN: Efficient Context Window Extension of Large Language Models

References

RoFormer: Enhanced Transformer with Rotary Position Embedding

Su et al., 2021 - Original RoPE paper

Extending Context Window of Large Language Models via Position Interpolation

Chen et al., 2023 - Position interpolation technique

YaRN: Efficient Context Window Extension

Peng et al., 2023 - Advanced extrapolation method

LLaMA: Open and Efficient Foundation Language Models

Touvron et al., 2023 - Modern LLM using RoPE

See also

Architecture overview

Learn about the full model architecture

Attention sinks

Combine RoPE with attention sinks for long-context generation

Multi-head attention

See full attention implementation
