
Overview

The Phoenix recommendation system makes several deliberate architectural choices that differentiate it from traditional approaches. This page explains the why behind these decisions, exploring the trade-offs and benefits of each choice.

1. No Hand-Engineered Features

The Decision

Phoenix relies entirely on the Grok-based transformer to learn relevance from user engagement sequences. There is no manual feature engineering for content relevance.

Rationale

Traditional recommendation systems require extensive manual feature engineering:
# Traditional approach: hundreds of hand-crafted features
features = {
    'post_age_hours': calculate_age(post),
    'author_follower_count': get_follower_count(author),
    'user_author_interaction_count': count_interactions(user, author),
    'post_reply_count': get_reply_count(post),
    'post_like_count': get_like_count(post),
    'text_length': len(post.text),
    'has_media': post.media is not None,
    'is_verified_author': author.verified,
    # ... hundreds more ...
}
Issues:
  • Requires domain expertise and constant iteration
  • Each feature needs its own data pipeline
  • Feature interactions are hard to capture manually
  • Maintenance burden grows over time
  • Different features for different content types
Phoenix learns directly from raw engagement sequences:
# Phoenix approach: learn from behavior patterns
input = {
    'user': user_id,
    'history': [
        (post_1, author_1, 'like'),
        (post_2, author_2, 'reply'),
        (post_3, author_3, 'repost'),
        # User's engagement sequence
    ],
    'candidates': [post_A, post_B, post_C],
}

# Transformer discovers patterns automatically
scores = transformer(input)
Benefits:
  • No feature engineering required
  • Model discovers complex patterns automatically
  • Single unified architecture for all content types
  • Simpler data pipelines (only IDs and actions)

Impact

Infrastructure

  • 10× simpler data pipelines: Only need to store user IDs, post IDs, and actions
  • No feature stores: Eliminate complex feature computation infrastructure
  • Faster iteration: Deploy model improvements without updating feature pipelines

Performance

  • Better generalization: Model learns from behavior, not hand-tuned proxies
  • Automatic adaptation: Learns new patterns without manual intervention
  • Unified understanding: Same architecture handles all content types
From README.md: “We have eliminated every single hand-engineered feature and most heuristics from the system. The Grok-based transformer does all the heavy lifting.”

2. Candidate Isolation in Ranking

The Decision

During transformer inference, candidates cannot attend to each other—only to the user context. This is enforced through a custom attention mask.

Rationale

Standard transformer attention allows all positions to interact:
# Standard transformer: candidates can see each other
attention_scores = {
    'candidate_A → candidate_B': 0.3,  # A sees B
    'candidate_A → candidate_C': 0.2,  # A sees C
    'candidate_B → candidate_A': 0.4,  # B sees A
}

# Consequence: A's score depends on B and C being in the batch
score(A | batch=[A,B,C]) ≠ score(A | batch=[A,D,E])
Issues:
  • Scores are batch-dependent and inconsistent
  • Cannot cache scores (different batches = different scores)
  • Cannot pre-compute scores offline
  • A/B tests are unreliable due to batch effects
  • Model may learn to game batch composition
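Candidate isolation fixes this by masking candidate-to-candidate attention while leaving attention to the user context intact. A minimal sketch of such a mask (pure Python; the function name and the causal-context choice are illustrative, not Phoenix's exact implementation):

```python
def build_isolation_mask(context_len, num_candidates):
    """True = attention allowed. Context positions attend causally;
    candidate positions attend to the context and themselves only."""
    total = context_len + num_candidates
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if q < context_len:
                mask[q][k] = k <= q                     # causal within context
            else:
                mask[q][k] = k < context_len or k == q  # context + self only
    return mask

mask = build_isolation_mask(context_len=3, num_candidates=2)
assert mask[3][4] is False  # candidate A cannot see candidate B
assert mask[4][3] is False  # candidate B cannot see candidate A
assert mask[3][0] is True   # candidates still attend to the user context
```

Because no candidate position can read another candidate, each candidate's score depends only on the user context, which is what makes caching and offline scoring possible.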

Impact

# Example: Caching enables efficient serving

# Day 1: Compute scores for popular candidates
cache = {}
for candidate in popular_candidates:
    cache[(user_id, candidate.id)] = model.score(user_id, candidate)

# Day 2: Serve from cache, only score new candidates
def get_feed(user_id):
    candidates = retrieve_candidates(user_id)
    
    scores = []
    for candidate in candidates:
        if (user_id, candidate.id) in cache:
            scores.append(cache[(user_id, candidate.id)])  # Cache hit
        else:
            scores.append(model.score(user_id, candidate))  # Score on-demand
    
    return rank_by_score(candidates, scores)
See the Candidate Isolation page for detailed implementation.

3. Hash-based Embeddings

The Decision

Both retrieval and ranking use multiple hash functions for embedding lookup instead of traditional embedding tables.

Rationale

Traditional embedding tables don’t scale to billions of entities:
# Traditional approach
user_embeddings = nn.Embedding(
    num_embeddings=1_000_000_000,   # 1B users
    embedding_dim=256
)
# Memory: 1B × 256 × 4 bytes = 1 TB

post_embeddings = nn.Embedding(
    num_embeddings=10_000_000_000,  # 10B posts
    embedding_dim=256
)
# Memory: 10B × 256 × 4 bytes = 10 TB

# Total: 11+ TB just for embeddings
Issues:
  • Prohibitive memory requirements
  • Slow training (sparse gradient updates)
  • Cold start: new users/posts have no embeddings
  • Cannot handle growing vocabulary
Hash functions map IDs to fixed-size buckets:
phoenix/recsys_model.py
@dataclass
class HashConfig:
    num_user_hashes: int = 2
    num_item_hashes: int = 2
    num_author_hashes: int = 2

# Map any ID to 2 hash values
def get_embedding(entity_id):
    hash_1 = hash_fn_1(entity_id) % num_buckets
    hash_2 = hash_fn_2(entity_id) % num_buckets
    
    emb_1 = embedding_table[hash_1]  # [D]
    emb_2 = embedding_table[hash_2]  # [D]
    
    # Combine via learned projection
    combined = concat([emb_1, emb_2])  # [2D]
    return projection @ combined        # [D]
# Fixed memory regardless of vocabulary size
embedding_table = nn.Embedding(
    num_embeddings=10_000_000,   # 10M buckets
    embedding_dim=256
)
# Memory: 10M × 256 × 4 bytes = 10 GB (1000× reduction)
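The lookup sketched above can be made concrete with a runnable toy version (tiny bucket count and dimension for illustration; the hash salts, table values, and projection here are random stand-ins for Phoenix's learned parameters):

```python
import hashlib
import random

NUM_BUCKETS, DIM = 1000, 4
random.seed(0)

# Random stand-ins for learned parameters
embedding_table = [[random.uniform(-1, 1) for _ in range(DIM)]
                   for _ in range(NUM_BUCKETS)]
projection = [[random.uniform(-1, 1) for _ in range(2 * DIM)]
              for _ in range(DIM)]  # [D, 2D]

def hash_fn(entity_id, salt):
    """Deterministic salted hash into a fixed number of buckets."""
    digest = hashlib.blake2b(f"{salt}:{entity_id}".encode()).digest()
    return int.from_bytes(digest[:8], 'big') % NUM_BUCKETS

def get_embedding(entity_id):
    emb_1 = embedding_table[hash_fn(entity_id, salt=1)]
    emb_2 = embedding_table[hash_fn(entity_id, salt=2)]
    combined = emb_1 + emb_2  # concat -> [2D]
    # Learned projection back to [D]
    return [sum(p * c for p, c in zip(row, combined)) for row in projection]

vec = get_embedding(123_456_789)
assert len(vec) == 4
assert get_embedding(123_456_789) == vec  # any ID maps deterministically
```

Note that any ID, including one never seen in training, maps to a valid embedding; that is the cold-start property claimed above.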

Why Multiple Hash Functions?

Using 2 hash functions provides collision robustness:
# Single hash: collision = identical embeddings
user_A_id = 123
user_B_id = 456

if hash1(user_A_id) == hash1(user_B_id):  # Collision!
    embedding_A == embedding_B             # Identical

# Multiple hashes: colliding on every hash is far less likely
if hash1(user_A_id) == hash1(user_B_id):      # Collision on hash1
    if hash2(user_A_id) == hash2(user_B_id):  # Extremely unlikely (~1/num_buckets²)
        # Only in this case are the combined embeddings identical
        embedding_A = project([emb1, emb2])
        embedding_B = project([emb1, emb2])    # Same as embedding_A
    else:
        # hash1 collided, but hash2 differs
        embedding_A = project([emb1, emb2_A])  # Different
        embedding_B = project([emb1, emb2_B])  # Different
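The "extremely unlikely" case is easy to quantify: with independent uniform hashes into B buckets, two IDs collide on one hash with probability 1/B and on both with probability 1/B². Using the bucket count from the example above:

```python
import math

num_buckets = 10_000_000  # 10M buckets, matching the example above

p_single = 1 / num_buckets  # probability two IDs collide on one hash
p_both = p_single ** 2      # probability they collide on both hashes

assert math.isclose(p_single, 1e-7)
assert math.isclose(p_both, 1e-14)
# With 2 hashes, two entities share a full embedding only ~1 time in 10^14
```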

Trade-offs

  • 1000× memory reduction: 10 GB vs 11 TB
  • Cold start handling: New entities automatically get embeddings
  • Fixed capacity: No need to resize tables as vocabulary grows
  • Faster training: Denser gradient updates
  • Information loss: Colliding IDs share bucket embeddings (mitigated by using multiple hashes)
From README.md: “Both retrieval and ranking use multiple hash functions for embedding lookup” — Phoenix defaults to 2 hash functions as a sweet spot between memory and accuracy.
See the Hash-based Embeddings page for implementation details.

4. Multi-Action Prediction

The Decision

Rather than predicting a single “relevance” score, Phoenix predicts probabilities for 14+ different actions (like, reply, repost, block, etc.).

Rationale

A single relevance score cannot capture engagement nuance:
# What does a single score mean?
relevance_score = 0.73  # High score, but...

# Possible interpretations:
# - User will click but not engage?
# - User will like but might block author later?
# - High engagement but low quality (clickbait)?
# - Informative but not entertaining?

# Cannot distinguish these scenarios!
Issues:
  • Treats all engagement as equivalent
  • No signal for negative actions (mute, block, report)
  • Cannot optimize for different product goals
  • Poor calibration (what does 0.73 mean?)
Predict probabilities for specific actions:
predictions = {
    'P(like)': 0.35,           # Likely to like
    'P(reply)': 0.05,          # Unlikely to reply
    'P(repost)': 0.02,         # Unlikely to repost
    'P(click)': 0.60,          # Likely to click
    'P(dwell_long)': 0.40,     # Moderate dwell time
    'P(follow_author)': 0.03,  # Unlikely to follow
    'P(block)': 0.001,         # Very unlikely to block
    'P(report)': 0.0001,       # Extremely unlikely to report
}

# Clear interpretation: passive engagement, low deep engagement
Benefits:
  • Nuanced understanding: Distinguish passive vs active engagement
  • Negative signals: Avoid content likely to be blocked/reported
  • Flexible optimization: Different weights for different goals
  • Better calibration: Probabilities are interpretable

Flexible Weighting

The same model serves different product objectives:
# Optimize for deep engagement
engagement_weights = {
    'like': 1.0,
    'reply': 3.0,       # Value conversations highly
    'repost': 2.0,
    'click': 0.1,       # Low weight for passive actions
    'block': -10.0,
}

# Optimize for discovery
discovery_weights = {
    'like': 1.0,
    'follow_author': 10.0,  # Value new connections
    'profile_click': 2.0,   # Value exploration
    'click': 1.0,
    'block': -10.0,
}

# Optimize for safety
safety_weights = {
    'like': 1.0,
    'report': -50.0,    # Strongly avoid reportable content
    'block': -30.0,     # Strongly avoid blockable content
    'not_interested': -10.0,
}

# Same model, different objectives!
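The per-action probabilities and a weight profile combine into a single ranking score via a weighted sum. A minimal sketch, reusing the illustrative numbers from above (the exact combination Phoenix uses is not specified here):

```python
def weighted_score(predictions, weights):
    """Combine per-action probabilities into one ranking score."""
    return sum(weights.get(action, 0.0) * p
               for action, p in predictions.items())

# Illustrative probabilities and weights from the examples above
predictions = {'like': 0.35, 'reply': 0.05, 'repost': 0.02,
               'click': 0.60, 'block': 0.001}
engagement_weights = {'like': 1.0, 'reply': 3.0, 'repost': 2.0,
                      'click': 0.1, 'block': -10.0}

score = weighted_score(predictions, engagement_weights)
# 0.35·1 + 0.05·3 + 0.02·2 + 0.60·0.1 + 0.001·(-10) = 0.59
assert abs(score - 0.59) < 1e-9
```

Swapping in `discovery_weights` or `safety_weights` changes the ranking objective without touching the model, which is exactly the flexibility claimed above.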

Multi-Task Learning Benefits

Predicting multiple actions improves generalization:
# Rare actions benefit from common actions
training_data = {
    'like': 1_000_000,   # Common: strong signal
    'reply': 100_000,    # Less common
    'report': 1_000,     # Rare: weak signal alone
}

# Shared transformer learns features useful for all actions
# "report" benefits from patterns learned for "like" and "reply"
# Result: better predictions even for rare actions
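Architecturally, this is a shared trunk with one small prediction head per action: every head reads the same features, so gradient signal from common actions shapes representations that the rare-action heads reuse. A toy pure-Python sketch (the real heads sit on top of the Grok-based transformer; dimensions, weights, and names here are made up):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_actions(shared_features, heads):
    """One logistic head per action over the same shared feature vector."""
    return {
        action: sigmoid(sum(w * f for w, f in zip(weights, shared_features)) + bias)
        for action, (weights, bias) in heads.items()
    }

# Shared features from the (hypothetical) transformer trunk
features = [0.2, -0.5, 1.1]

# Per-action heads; during training all heads backpropagate into the trunk
heads = {
    'like':   ([0.8, 0.1, 0.3], -0.2),
    'reply':  ([0.2, 0.4, 0.1], -1.5),
    'report': ([-0.5, 0.9, -0.7], -4.0),
}

probs = predict_actions(features, heads)
assert all(0.0 < p < 1.0 for p in probs.values())
```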

Impact

Model Quality

  • Better calibrated probabilities
  • Robust to rare events via multi-task learning
  • Captures full spectrum of user behavior

Product Flexibility

  • Tune weights without retraining model
  • A/B test different optimization objectives
  • Adapt to changing product priorities
See the Multi-Action Prediction page for architecture details.

5. Composable Pipeline Architecture

The Decision

The candidate-pipeline framework provides a trait-based composable architecture for building recommendation pipelines.

Rationale

Traditional recommendation systems tightly couple business logic with execution:
# Monolithic approach: everything in one place
def get_recommendations(user_id):
    # Fetching
    in_network = fetch_in_network(user_id)
    out_network = fetch_out_network(user_id)
    
    # Hydration
    candidates = hydrate_metadata(in_network + out_network)
    
    # Filtering
    candidates = remove_duplicates(candidates)
    candidates = remove_blocked_authors(candidates, user_id)
    candidates = remove_muted_keywords(candidates, user_id)
    
    # Scoring
    for candidate in candidates:
        candidate.score = ml_model.score(user_id, candidate)
    
    # Selection
    return sorted(candidates, key=lambda c: c.score, reverse=True)[:100]

# Issues:
# - Hard to test individual stages
# - Difficult to add new stages
# - No parallelization of independent stages
# - Tight coupling between stages
Separate concerns via traits:
// Each stage is a composable trait
pub trait Source {
    fn fetch_candidates(&self, query: &Query) -> Vec<Candidate>;
}

pub trait Hydrator {
    fn hydrate(&self, candidates: &mut [Candidate]);
}

pub trait Filter {
    fn filter(&self, candidates: Vec<Candidate>) -> Vec<Candidate>;
}

pub trait Scorer {
    fn score(&self, candidates: &mut [Candidate]);
}

// Compose into pipeline
let pipeline = CandidatePipeline::builder()
    .add_source(ThunderSource::new())           // In-network
    .add_source(PhoenixRetrievalSource::new())  // Out-of-network
    .add_hydrator(CoreDataHydrator::new())
    .add_filter(DeduplicateFilter::new())
    .add_filter(BlockedAuthorsFilter::new())
    .add_scorer(PhoenixRankingScorer::new())
    .add_scorer(AuthorDiversityScorer::new())
    .build();

Benefits

// Easy to add new stages
pipeline.add_filter(NewFeatureFilter::new());

// Easy to swap implementations
pipeline.replace_scorer(
    /* old */ PhoenixV1Scorer,
    /* new */ PhoenixV2Scorer,
);

// Easy to A/B test
if experiment.is_enabled(user_id) {
    pipeline.add_scorer(ExperimentalScorer::new());
}
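The same composition pattern can be sketched in Python, with plain callables standing in for the Rust traits (all names and the scoring rule here are hypothetical, and hydrators are omitted for brevity):

```python
class CandidatePipeline:
    """Minimal sketch of the source -> filter -> score -> rank flow."""
    def __init__(self):
        self.sources, self.filters, self.scorers = [], [], []

    def add_source(self, fn): self.sources.append(fn); return self
    def add_filter(self, fn): self.filters.append(fn); return self
    def add_scorer(self, fn): self.scorers.append(fn); return self

    def run(self, query):
        # Gather candidates from every source
        candidates = [c for source in self.sources for c in source(query)]
        # Apply filters in order
        for f in self.filters:
            candidates = f(candidates)
        # Sum scorer contributions and rank
        scores = {c: sum(s(c) for s in self.scorers) for c in candidates}
        return sorted(candidates, key=scores.get, reverse=True)

pipeline = (CandidatePipeline()
            .add_source(lambda q: ['post_1', 'post_2'])      # in-network
            .add_source(lambda q: ['post_2', 'post_3'])      # out-of-network
            .add_filter(lambda cs: list(dict.fromkeys(cs)))  # deduplicate
            .add_scorer(lambda c: int(c[-1])))               # stand-in scorer

ranked = pipeline.run(query='user_42')
assert ranked == ['post_3', 'post_2', 'post_1']
```

Each stage is an independent callable, so individual stages can be unit-tested, swapped, or added behind an experiment flag, mirroring the Rust builder above.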

Impact

Development Velocity

  • Add new features without touching existing code
  • Test stages in isolation
  • Easy experimentation and A/B testing

Performance

  • Automatic parallelization of independent stages
  • Efficient resource utilization
  • Built-in error handling and retries
From README.md: “The framework runs sources and hydrators in parallel where possible, with configurable error handling and logging.”

Summary

The Phoenix recommendation system makes five key architectural choices:
| Decision | Benefit | Trade-off |
|---|---|---|
| No hand-engineered features | Simpler infrastructure, automatic pattern discovery | Requires more training data |
| Candidate isolation | Consistent scores, cacheability | Slightly lower model expressiveness |
| Hash-based embeddings | 1000× memory reduction, cold start handling | Information loss from collisions |
| Multi-action prediction | Nuanced understanding, flexible optimization | More complex tuning |
| Composable pipeline | Modularity, parallelization, easy experimentation | Framework overhead |
These decisions collectively enable Phoenix to:
  • Scale to billions of users and posts
  • Adapt to changing user behavior without manual intervention
  • Iterate rapidly on model and product improvements
  • Serve predictions with low latency and high throughput
