
Overview

Rotary Position Embedding (RoPE) is a position encoding method that applies rotation matrices to query and key vectors in attention mechanisms. Unlike learned absolute position embeddings, RoPE:
  • Encodes relative positions through geometric rotations
  • Enables length extrapolation beyond training sequence length
  • Requires no additional parameters
  • Provides better inductive bias for position-dependent patterns
Paper: Su et al. (2021) - RoFormer: Enhanced Transformer with Rotary Position Embedding

RoPE has become the standard position encoding method in modern LLMs including LLaMA, PaLM, and GPT-NeoX.

Mathematical formulation

Core intuition

The key insight of RoPE is to encode position information through rotations in complex space:
  1. Treat pairs of dimensions as complex numbers: (x₁, x₂) → x₁ + ix₂
  2. Rotate by an angle proportional to position: rotation(m) = e^(imθ)
  3. The relative position between tokens becomes the difference in rotation angles
For a 2D vector at position m, apply a rotation by angle mθ:
[x₁']   [cos(mθ)  -sin(mθ)]   [x₁]
[x₂'] = [sin(mθ)   cos(mθ)] × [x₂]
Relative position property: When computing attention between positions m and n:
qₘᵀ kₙ = (Rotation(mθ) × query)ᵀ × (Rotation(nθ) × key)
       = queryᵀ × Rotation((n-m)θ) × key
The attention score depends only on the relative position n-m, not absolute positions.
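The complex-number view above can be checked numerically. The following toy sketch (hypothetical helper names, a single dimension pair, arbitrarily chosen θ) shows that the rotated inner product sees only the offset between positions:

```python
import cmath

def rope_pair(x1: float, x2: float, m: int, theta: float = 0.1) -> complex:
    """Treat the dimension pair (x1, x2) as x1 + i*x2 and rotate by m*theta."""
    return complex(x1, x2) * cmath.exp(1j * m * theta)

def score(q, k, m, n):
    """Real inner product of the rotated pair: Re(q' * conj(k')) — sees only m - n."""
    qm, kn = rope_pair(*q, m), rope_pair(*k, n)
    return (qm * kn.conjugate()).real

q, k = (0.3, -1.2), (0.8, 0.5)
s_a = score(q, k, m=5, n=3)      # relative offset +2
s_b = score(q, k, m=105, n=103)  # same offset, both positions shifted by 100
assert abs(s_a - s_b) < 1e-9     # identical attention score
```

Because q'·conj(k') = q·conj(k)·e^(i(m-n)θ), shifting both positions by the same amount leaves the score unchanged.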

Implementation

Modern LLM implements RoPE efficiently using precomputed sine/cosine factors:
attention.py:190-211
def _apply_rope(self, tensor: Tensor, seq_len: int, offset: int = 0) -> Tensor:
    """Apply rotary position embeddings to queries or keys."""
    # Get precomputed cos/sin factors
    cos, sin = self._get_rope_factors(seq_len + offset, tensor.device, tensor.dtype)
    
    # Handle offset for attention sinks
    cos = cos[offset : offset + seq_len]
    sin = sin[offset : offset + seq_len]
    
    # Add batch and head dimensions for broadcasting
    cos = cos.unsqueeze(0).unsqueeze(0)  # (1, 1, seq_len, head_dim)
    sin = sin.unsqueeze(0).unsqueeze(0)
    
    # Apply rotation: x * cos + rotate_half(x) * sin
    return (tensor * cos) + (self._rotate_half(tensor) * sin)

def _get_rope_factors(self, seq_len: int, device: torch.device, dtype: torch.dtype) -> Tuple[Tensor, Tensor]:
    """Compute cos and sin factors for RoPE."""
    # Compute position × frequency for each position and frequency
    freqs = torch.outer(
        torch.arange(seq_len, device=device),  # positions: [0, 1, 2, ...]
        self.inv_freq.to(device=device)        # frequencies: [θ₀, θ₁, ...]
    )
    
    # Apply frequency scaling if configured
    if self.config.rope_scaling:
        freqs = freqs * self.config.rope_scaling
    
    # Compute cos and sin; duplicate the frequencies so the two halves of the
    # head dimension (paired by _rotate_half) share the same rotation angle
    emb = torch.cat([freqs, freqs], dim=-1)
    cos = torch.cos(emb).to(dtype=dtype)
    sin = torch.sin(emb).to(dtype=dtype)
    return cos, sin

@staticmethod
def _rotate_half(x: Tensor) -> Tensor:
    """Rotate half the dimensions: [-x₂, x₁, -x₄, x₃, ...]"""
    x1, x2 = x[..., : x.size(-1) // 2], x[..., x.size(-1) // 2 :]
    return torch.cat([-x2, x1], dim=-1)
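A torch-free sketch of the same split-half scheme (a hypothetical `rope` helper mirroring `_apply_rope` and `_rotate_half` above, not the library API) makes the relative-position property easy to verify end to end:

```python
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Split-half RoPE on one vector x of even length d."""
    d = x.shape[-1]
    inv_freq = 1.0 / base ** (np.arange(0, d, 2) / d)       # (d/2,) frequencies
    angles = pos * inv_freq
    cos = np.concatenate([np.cos(angles), np.cos(angles)])  # both halves share angles
    sin = np.concatenate([np.sin(angles), np.sin(angles)])
    x1, x2 = x[: d // 2], x[d // 2 :]
    rotated = np.concatenate([-x2, x1])                     # rotate_half
    return x * cos + rotated * sin

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)
s1 = rope(q, 5) @ rope(k, 3)      # offset +2
s2 = rope(q, 105) @ rope(k, 103)  # same offset, shifted by 100
assert np.isclose(s1, s2)         # score depends only on the offset
```

Note that each pair (x_i, x_{i+d/2}) is rotated by the same angle pos·θ_i, which is why the cos/sin factors are concatenated rather than interleaved.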

Key features

Relative position encoding

The fundamental property of RoPE is that attention scores depend only on relative positions:
attention(qₘ, kₙ) = f(qₘ, kₙ, m-n)
Relative positions are more meaningful than absolute positions because:
  1. Translation invariance: “The cat sat on the mat” has the same structure as “Yesterday, the cat sat on the mat”
  2. Generalization: Patterns learned at position 10 apply at position 100
  3. Extrapolation: Model can handle longer sequences than seen during training
Example:
Query at position 5:  "dog"
Key at position 3:    "the"
Relative position:    +2

Query at position 105: "dog" 
Key at position 103:   "the"
Relative position:     +2  (same relationship!)
For positions m and n, after applying RoPE:
qₘ' = Rotation(mθ) × qₘ
kₙ' = Rotation(nθ) × kₙ

qₘ'ᵀ kₙ' = qₘᵀ Rotation(mθ)ᵀ Rotation(nθ) kₙ
          = qₘᵀ Rotation((n-m)θ) kₙ
Where we used the rotation property:
R(mθ)ᵀ R(nθ) = R(-mθ) R(nθ) = R((n-m)θ)
Thus, the attention score depends only on (n-m), the relative position.
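The rotation identity used in the last step can be confirmed numerically; a minimal pure-Python check (helper names `R`, `matmul`, `transpose` are illustrative):

```python
import math

def R(a):
    """2x2 rotation matrix by angle a, as a tuple of rows."""
    return ((math.cos(a), -math.sin(a)), (math.sin(a), math.cos(a)))

def matmul(A, B):
    return tuple(tuple(sum(A[i][k] * B[k][j] for k in range(2))
                       for j in range(2)) for i in range(2))

def transpose(A):
    return tuple(tuple(A[j][i] for j in range(2)) for i in range(2))

theta, m, n = 0.1, 5, 3
lhs = matmul(transpose(R(m * theta)), R(n * theta))  # R(mθ)ᵀ R(nθ)
rhs = R((n - m) * theta)                             # R((n-m)θ)
assert all(abs(lhs[i][j] - rhs[i][j]) < 1e-12
           for i in range(2) for j in range(2))
```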

Length extrapolation

RoPE enables models to generalize to longer sequences than seen during training:
Training: Model learns on sequences up to length 2048
  • Sees relative positions from -2048 to +2048
  • Learns attention patterns for these ranges
Inference: Generate sequence of length 4096
  • Relative positions from -4096 to +4096
  • RoPE rotations are smooth and continuous
  • Model can interpolate to unseen relative positions
Contrast with learned embeddings:
  • Learned position embeddings have fixed size (e.g., 2048)
  • Cannot generate position 2049 (out of bounds)
  • RoPE has no such limit
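A toy comparison (plain-Python stand-ins, not the real modules) illustrates the contrast: a fixed-size embedding table fails past its last index, while the RoPE angle is defined for any position:

```python
import math

MAX_SEQ_LEN = 2048
learned = [[0.0] * 8 for _ in range(MAX_SEQ_LEN)]  # stand-in for nn.Embedding

def learned_pos(p):
    return learned[p]  # raises IndexError for p >= MAX_SEQ_LEN

def rope_angle(p, i=0, d=8, base=10000.0):
    return p / base ** (2 * i / d)  # defined for any position p

try:
    learned_pos(2049)
    overflowed = False
except IndexError:
    overflowed = True

assert overflowed                        # learned table is out of bounds
assert math.isfinite(rope_angle(1_000_000))  # RoPE angle still well defined
```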

No additional parameters

Unlike learned position embeddings, RoPE adds zero parameters to the model:
# Learned absolute embeddings (e.g., GPT-2)
self.position_embeddings = nn.Embedding(max_seq_len, d_model)
# Parameters: max_seq_len × d_model = 2048 × 768 = 1,572,864

# RoPE (Modern LLM)
inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
self.register_buffer("inv_freq", inv_freq, persistent=False)
# Parameters: 0 (just a fixed buffer)
For a 12-layer model with d=768, max_seq_len=2048, RoPE saves 1.57M parameters compared to learned position embeddings.
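The parameter counts can be sanity-checked with plain arithmetic, using the example sizes above:

```python
max_seq_len, d_model, head_dim = 2048, 768, 64

learned_params = max_seq_len * d_model  # one weight per (position, dim)
rope_params = 0                         # inv_freq is a fixed buffer, not trained
rope_buffer_entries = head_dim // 2     # cached inv_freq values per head dim

assert learned_params == 1_572_864      # ≈ 1.57M parameters saved
assert rope_params == 0
assert rope_buffer_entries == 32
```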

Hyperparameters

Base frequency (theta)

The base frequency controls the wavelength spectrum:
rope_theta = 10000.0  # Default
Higher theta → longer wavelengths → better long-range dependencies:
| Theta     | Min wavelength | Max wavelength | Use case              |
|-----------|----------------|----------------|-----------------------|
| 1,000     | 6.28           | 6,280          | Short sequences       |
| 10,000    | 6.28           | 62,800         | Standard (2-4K)       |
| 100,000   | 6.28           | 628,000        | Long context (8-16K)  |
| 1,000,000 | 6.28           | 6,280,000      | Extreme length (32K+) |
Wavelength calculation:
λᵢ = 2π × theta^(2i/d)
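The table values follow from this formula; a small sketch (head_dim=64 is an assumed example) computes the per-dimension wavelengths, with the shortest always 2π and the longest approaching 2π·theta (the table rounds it to 2π·theta):

```python
import math

def wavelengths(theta: float, head_dim: int = 64):
    """λ_i = 2π · theta^(2i/d) for i = 0 .. d/2 - 1."""
    return [2 * math.pi * theta ** (2 * i / head_dim)
            for i in range(head_dim // 2)]

lam = wavelengths(10_000.0)
assert abs(lam[0] - 2 * math.pi) < 1e-9  # min wavelength ≈ 6.28
assert lam[-1] < 2 * math.pi * 10_000.0  # max approaches 2π·theta ≈ 62,832
assert lam[-1] > 2 * math.pi * 1_000.0
```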

Position scaling

For length extrapolation beyond training:
rope_scaling = 0.5  # Compress positions by 2×
Rule of thumb: Set rope_scaling = train_length / target_length when the target length is 2-4× the training length. Beyond 4×, consider retraining with longer sequences or using advanced techniques like YaRN (Peng et al., 2023).

Comparison to other position encodings

| Method                            | Type    | Parameters | Extrapolation | Relative/Absolute |
|-----------------------------------|---------|------------|---------------|-------------------|
| Learned embeddings (GPT-2)        | Learned | L × d      | Poor          | Absolute          |
| Sinusoidal (Original Transformer) | Fixed   | 0          | Good          | Absolute          |
| Relative embeddings (T5)          | Learned | O(L)       | Moderate      | Relative          |
| ALiBi (Press et al., 2022)        | Fixed   | 0          | Excellent     | Relative          |
| RoPE (Su et al., 2021)            | Fixed   | 0          | Excellent     | Relative          |
Modern consensus: RoPE and ALiBi are the top choices for position encoding in 2023+ LLMs. RoPE is more widely adopted and has slightly better performance on benchmarks.

Common issues and solutions

Error: RoPE requires an even head dimension

Cause: RoPE operates on pairs of dimensions, so the head dimension must be even.

Solution:
# Ensure head_dim is even
assert config.d_model % config.n_heads == 0
head_dim = config.d_model // config.n_heads
assert head_dim % 2 == 0, f"Head dim {head_dim} must be even"

# Example valid configurations:
config.d_model = 768; config.n_heads = 12  # head_dim=64 ✓
config.d_model = 512; config.n_heads = 8   # head_dim=64 ✓

# Invalid:
config.d_model = 768; config.n_heads = 7   # 768 not divisible by 7 ✗
config.d_model = 100; config.n_heads = 4   # head_dim=25 is odd ✗
Issue: Poor quality beyond the training length

Symptoms: Model performance degrades on sequences longer than the training length.

Solutions:
  1. Use position scaling:
    rope_scaling = train_length / inference_length
    
  2. Increase base frequency:
    rope_theta = 100000.0  # from 10000.0
    
  3. Fine-tune on longer sequences:
    # Continue training with longer sequences
    config.max_seq_len = 8192  # from 2048
    # Train for a few thousand steps
    
Issue: How does RoPE work with attention sinks?

Solution: Apply RoPE with a position offset for queries, and without an offset for keys:
attention.py:108-110
if self.config.use_rope:
    # Query positions are offset by num_sinks
    q = self._apply_rope(q, seq_len, offset=num_attention_sinks)
    # Key positions start at 0 for sinks, then regular sequence
    k = self._apply_rope(k, seq_len)
This ensures sink tokens have positions 0, 1, … and regular tokens continue from there.

Advanced topics

Per-head frequencies

Use different frequency ranges for different heads:
# Each head gets different frequency range
for head_idx in range(n_heads):
    scale = 2.0 ** (head_idx / n_heads)  # 1.0 to 2.0
    freqs[head_idx] = base_freqs * scale
Benefits:
  • Some heads specialize in short-range dependencies
  • Other heads handle long-range dependencies
  • Improves representational capacity
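The loop above can be written as a short vectorized sketch (NumPy stand-in; names and sizes are assumed for illustration) to show the resulting frequency layout:

```python
import numpy as np

n_heads, head_dim, base = 8, 64, 10000.0
base_freqs = 1.0 / base ** (np.arange(0, head_dim, 2) / head_dim)  # (d/2,)

# Per-head scale in [1.0, 2.0): later heads rotate faster,
# giving them shorter wavelengths (short-range specialists).
scales = 2.0 ** (np.arange(n_heads) / n_heads)          # (n_heads,)
freqs = scales[:, None] * base_freqs[None, :]           # (n_heads, d/2)

assert freqs.shape == (n_heads, head_dim // 2)
assert np.all(freqs[-1] > freqs[0])  # last head is uniformly higher-frequency
```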
2D RoPE for vision

Extend RoPE to 2D for vision transformers:
# Separate frequencies for height and width
freqs_h = compute_freqs(height_positions, theta_h)
freqs_w = compute_freqs(width_positions, theta_w)

# Apply to half the dimensions each
x[:, :d//2] = apply_rope(x[:, :d//2], freqs_h)
x[:, d//2:] = apply_rope(x[:, d//2:], freqs_w)
Used in models like ViT-RoPE for image understanding.
YaRN

An advanced technique for extreme length extrapolation (Peng et al., 2023):
  • Dynamically adjust frequencies based on attention distance
  • Temperature scaling for attention scores
  • Enables 64K+ context with 4K training
See paper: YaRN: Efficient Context Window Extension of Large Language Models

References

RoFormer: Enhanced Transformer with Rotary Position Embedding

Su et al., 2021 - Original RoPE paper

Extending Context Window of Large Language Models via Position Interpolation

Chen et al., 2023 - Position interpolation technique

YaRN: Efficient Context Window Extension

Peng et al., 2023 - Advanced extrapolation method

LLaMA: Open and Efficient Foundation Language Models

Touvron et al., 2023 - Modern LLM using RoPE

See also

Architecture overview

Learn about the full model architecture

Attention sinks

Combine RoPE with attention sinks for long-context generation

Multi-head attention

See full attention implementation
