
Overview

The Phoenix recommendation system makes several deliberate architectural choices that differentiate it from traditional approaches. This page explains the why behind these decisions, exploring the trade-offs and benefits of each choice.

1. No Hand-Engineered Features

The Decision

Phoenix relies entirely on the Grok-based transformer to learn relevance from user engagement sequences. There is no manual feature engineering for content relevance.

Rationale

Traditional recommendation systems require extensive manual feature engineering:
# Traditional approach: hundreds of hand-crafted features
features = {
    'post_age_hours': calculate_age(post),
    'author_follower_count': get_follower_count(author),
    'user_author_interaction_count': count_interactions(user, author),
    'post_reply_count': get_reply_count(post),
    'post_like_count': get_like_count(post),
    'text_length': len(post.text),
    'has_media': post.media is not None,
    'is_verified_author': author.verified,
    # ... hundreds more ...
}
Issues:
  • Requires domain expertise and constant iteration
  • Each feature needs its own data pipeline
  • Feature interactions are hard to capture manually
  • Maintenance burden grows over time
  • Different features for different content types
Phoenix learns directly from raw engagement sequences:
# Phoenix approach: learn from behavior patterns
input = {
    'user': user_id,
    'history': [
        (post_1, author_1, 'like'),
        (post_2, author_2, 'reply'),
        (post_3, author_3, 'repost'),
        # User's engagement sequence
    ],
    'candidates': [post_A, post_B, post_C],
}

# Transformer discovers patterns automatically
scores = transformer(input)
Benefits:
  • No feature engineering required
  • Model discovers complex patterns automatically
  • Single unified architecture for all content types
  • Simpler data pipelines (only IDs and actions)

Impact

Infrastructure

  • 10× simpler data pipelines: Only need to store user IDs, post IDs, and actions
  • No feature stores: Eliminate complex feature computation infrastructure
  • Faster iteration: Deploy model improvements without updating feature pipelines

Performance

  • Better generalization: Model learns from behavior, not hand-tuned proxies
  • Automatic adaptation: Learns new patterns without manual intervention
  • Unified understanding: Same architecture handles all content types
From README.md: “We have eliminated every single hand-engineered feature and most heuristics from the system. The Grok-based transformer does all the heavy lifting.”

2. Candidate Isolation in Ranking

The Decision

During transformer inference, candidates cannot attend to each other—only to the user context. This is enforced through a custom attention mask.

Rationale

Standard transformer attention allows all positions to interact:
# Standard transformer: candidates can see each other
attention_scores = {
    'candidate_A → candidate_B': 0.3,  # A sees B
    'candidate_A → candidate_C': 0.2,  # A sees C
    'candidate_B → candidate_A': 0.4,  # B sees A
}

# Consequence: A's score depends on B and C being in the batch
score(A | batch=[A,B,C]) ≠ score(A | batch=[A,D,E])
Issues:
  • Scores are batch-dependent and inconsistent
  • Cannot cache scores (different batches = different scores)
  • Cannot pre-compute scores offline
  • A/B tests are unreliable due to batch effects
  • Model may learn to game batch composition
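Candidate isolation fixes this by masking candidate-to-candidate attention while leaving attention to the user context intact. A minimal sketch of such a mask (pure Python; the function name and the causal-context choice are illustrative, not Phoenix's exact implementation):

```python
def build_isolation_mask(context_len, num_candidates):
    """True = attention allowed. Context positions attend causally;
    candidate positions attend to the context and themselves only."""
    total = context_len + num_candidates
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if q < context_len:
                mask[q][k] = k <= q                     # causal within context
            else:
                mask[q][k] = k < context_len or k == q  # context + self only
    return mask

mask = build_isolation_mask(context_len=3, num_candidates=2)
assert mask[3][4] is False  # candidate A cannot see candidate B
assert mask[4][3] is False  # candidate B cannot see candidate A
assert mask[3][0] is True   # candidates still attend to the user context
```

Because no candidate position can read another candidate, each candidate's score depends only on the user context, which is what makes caching and offline scoring possible.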

Impact

# Example: Caching enables efficient serving

# Day 1: Compute scores for popular candidates
cache = {}
for candidate in popular_candidates:
    cache[(user_id, candidate.id)] = model.score(user_id, candidate)

# Day 2: Serve from cache, only score new candidates
def get_feed(user_id):
    candidates = retrieve_candidates(user_id)
    
    scores = []
    for candidate in candidates:
        if (user_id, candidate.id) in cache:
            scores.append(cache[(user_id, candidate.id)])  # Cache hit
        else:
            scores.append(model.score(user_id, candidate))  # Score on-demand
    
    return rank_by_score(candidates, scores)
See the Candidate Isolation page for detailed implementation.

3. Hash-based Embeddings

The Decision

Both retrieval and ranking use multiple hash functions for embedding lookup instead of traditional embedding tables.

Rationale

Traditional embedding tables don’t scale to billions of entities:
# Traditional approach
user_embeddings = nn.Embedding(
    num_embeddings=1_000_000_000,   # 1B users
    embedding_dim=256
)
# Memory: 1B × 256 × 4 bytes = 1 TB

post_embeddings = nn.Embedding(
    num_embeddings=10_000_000_000,  # 10B posts
    embedding_dim=256
)
# Memory: 10B × 256 × 4 bytes = 10 TB

# Total: 11+ TB just for embeddings
Issues:
  • Prohibitive memory requirements
  • Slow training (sparse gradient updates)
  • Cold start: new users/posts have no embeddings
  • Cannot handle growing vocabulary
Hash functions map IDs to fixed-size buckets:
phoenix/recsys_model.py
@dataclass
class HashConfig:
    num_user_hashes: int = 2
    num_item_hashes: int = 2
    num_author_hashes: int = 2

# Map any ID to 2 hash values
def get_embedding(entity_id):
    hash_1 = hash_fn_1(entity_id) % num_buckets
    hash_2 = hash_fn_2(entity_id) % num_buckets
    
    emb_1 = embedding_table[hash_1]  # [D]
    emb_2 = embedding_table[hash_2]  # [D]
    
    # Combine via learned projection
    combined = concat([emb_1, emb_2])  # [2D]
    return projection @ combined        # [D]
# Fixed memory regardless of vocabulary size
embedding_table = nn.Embedding(
    num_embeddings=10_000_000,   # 10M buckets
    embedding_dim=256
)
# Memory: 10M × 256 × 4 bytes = 10 GB (1000× reduction)
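The lookup sketched above can be made concrete with a runnable toy version (tiny bucket count and dimension for illustration; the hash salts, table values, and projection here are random stand-ins for Phoenix's learned parameters):

```python
import hashlib
import random

NUM_BUCKETS, DIM = 1000, 4
random.seed(0)

# Random stand-ins for learned parameters
embedding_table = [[random.uniform(-1, 1) for _ in range(DIM)]
                   for _ in range(NUM_BUCKETS)]
projection = [[random.uniform(-1, 1) for _ in range(2 * DIM)]
              for _ in range(DIM)]  # [D, 2D]

def hash_fn(entity_id, salt):
    """Deterministic salted hash into a fixed number of buckets."""
    digest = hashlib.blake2b(f"{salt}:{entity_id}".encode()).digest()
    return int.from_bytes(digest[:8], 'big') % NUM_BUCKETS

def get_embedding(entity_id):
    emb_1 = embedding_table[hash_fn(entity_id, salt=1)]
    emb_2 = embedding_table[hash_fn(entity_id, salt=2)]
    combined = emb_1 + emb_2  # concat -> [2D]
    # Learned projection back to [D]
    return [sum(p * c for p, c in zip(row, combined)) for row in projection]

vec = get_embedding(123_456_789)
assert len(vec) == 4
assert get_embedding(123_456_789) == vec  # any ID maps deterministically
```

Note that any ID, including one never seen in training, maps to a valid embedding; that is the cold-start property claimed above.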

Why Multiple Hash Functions?

Using 2 hash functions provides collision robustness:
# Single hash: collision = identical embeddings
user_A_id = 123
user_B_id = 456

if hash1(user_A_id) == hash1(user_B_id):  # Collision!
    embedding_A == embedding_B             # Identical

# Multiple hashes: colliding on every hash is far less likely
if hash1(user_A_id) == hash1(user_B_id):      # Collision on hash1
    if hash2(user_A_id) == hash2(user_B_id):  # Extremely unlikely (~1/num_buckets²)
        # Only in this case are the combined embeddings identical
        embedding_A = project([emb1, emb2])
        embedding_B = project([emb1, emb2])    # Same as embedding_A
    else:
        # hash1 collided, but hash2 differs
        embedding_A = project([emb1, emb2_A])  # Different
        embedding_B = project([emb1, emb2_B])  # Different
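The "extremely unlikely" case is easy to quantify: with independent uniform hashes into B buckets, two IDs collide on one hash with probability 1/B and on both with probability 1/B². Using the bucket count from the example above:

```python
import math

num_buckets = 10_000_000  # 10M buckets, matching the example above

p_single = 1 / num_buckets  # probability two IDs collide on one hash
p_both = p_single ** 2      # probability they collide on both hashes

assert math.isclose(p_single, 1e-7)
assert math.isclose(p_both, 1e-14)
# With 2 hashes, two entities share a full embedding only ~1 time in 10^14
```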

Trade-offs

  • 1000× memory reduction: 10 GB vs 11 TB
  • Cold start handling: New entities automatically get embeddings
  • Fixed capacity: No need to resize tables as vocabulary grows
  • Faster training: Denser gradient updates
  • Information loss: Colliding IDs share bucket embeddings (mitigated by using multiple hashes)
From README.md: “Both retrieval and ranking use multiple hash functions for embedding lookup” — Phoenix defaults to 2 hash functions as a sweet spot between memory and accuracy.
See the Hash-based Embeddings page for implementation details.

4. Multi-Action Prediction

The Decision

Rather than predicting a single “relevance” score, Phoenix predicts probabilities for 14+ different actions (like, reply, repost, block, etc.).

Rationale

A single relevance score cannot capture engagement nuance:
# What does a single score mean?
relevance_score = 0.73  # High score, but...

# Possible interpretations:
# - User will click but not engage?
# - User will like but might block author later?
# - High engagement but low quality (clickbait)?
# - Informative but not entertaining?

# Cannot distinguish these scenarios!
Issues:
  • Treats all engagement as equivalent
  • No signal for negative actions (mute, block, report)
  • Cannot optimize for different product goals
  • Poor calibration (what does 0.73 mean?)
Predict probabilities for specific actions:
predictions = {
    'P(like)': 0.35,           # Likely to like
    'P(reply)': 0.05,          # Unlikely to reply
    'P(repost)': 0.02,         # Unlikely to repost
    'P(click)': 0.60,          # Likely to click
    'P(dwell_long)': 0.40,     # Moderate dwell time
    'P(follow_author)': 0.03,  # Unlikely to follow
    'P(block)': 0.001,         # Very unlikely to block
    'P(report)': 0.0001,       # Extremely unlikely to report
}

# Clear interpretation: passive engagement, low deep engagement
Benefits:
  • Nuanced understanding: Distinguish passive vs active engagement
  • Negative signals: Avoid content likely to be blocked/reported
  • Flexible optimization: Different weights for different goals
  • Better calibration: Probabilities are interpretable

Flexible Weighting

The same model serves different product objectives:
# Optimize for deep engagement
engagement_weights = {
    'like': 1.0,
    'reply': 3.0,       # Value conversations highly
    'repost': 2.0,
    'click': 0.1,       # Low weight for passive actions
    'block': -10.0,
}

# Optimize for discovery
discovery_weights = {
    'like': 1.0,
    'follow_author': 10.0,  # Value new connections
    'profile_click': 2.0,   # Value exploration
    'click': 1.0,
    'block': -10.0,
}

# Optimize for safety
safety_weights = {
    'like': 1.0,
    'report': -50.0,    # Strongly avoid reportable content
    'block': -30.0,     # Strongly avoid blockable content
    'not_interested': -10.0,
}

# Same model, different objectives!
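The per-action probabilities and a weight profile combine into a single ranking score via a weighted sum. A minimal sketch, reusing the illustrative numbers from above (the exact combination Phoenix uses is not specified here):

```python
def weighted_score(predictions, weights):
    """Combine per-action probabilities into one ranking score."""
    return sum(weights.get(action, 0.0) * p
               for action, p in predictions.items())

# Illustrative probabilities and weights from the examples above
predictions = {'like': 0.35, 'reply': 0.05, 'repost': 0.02,
               'click': 0.60, 'block': 0.001}
engagement_weights = {'like': 1.0, 'reply': 3.0, 'repost': 2.0,
                      'click': 0.1, 'block': -10.0}

score = weighted_score(predictions, engagement_weights)
# 0.35·1 + 0.05·3 + 0.02·2 + 0.60·0.1 + 0.001·(-10) = 0.59
assert abs(score - 0.59) < 1e-9
```

Swapping in `discovery_weights` or `safety_weights` changes the ranking objective without touching the model, which is exactly the flexibility claimed above.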

Multi-Task Learning Benefits

Predicting multiple actions improves generalization:
# Rare actions benefit from common actions
training_data = {
    'like': 1_000_000,   # Common: strong signal
    'reply': 100_000,    # Less common
    'report': 1_000,     # Rare: weak signal alone
}

# Shared transformer learns features useful for all actions
# "report" benefits from patterns learned for "like" and "reply"
# Result: better predictions even for rare actions
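Architecturally, this is a shared trunk with one small prediction head per action: every head reads the same features, so gradient signal from common actions shapes representations that the rare-action heads reuse. A toy pure-Python sketch (the real heads sit on top of the Grok-based transformer; dimensions, weights, and names here are made up):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_actions(shared_features, heads):
    """One logistic head per action over the same shared feature vector."""
    return {
        action: sigmoid(sum(w * f for w, f in zip(weights, shared_features)) + bias)
        for action, (weights, bias) in heads.items()
    }

# Shared features from the (hypothetical) transformer trunk
features = [0.2, -0.5, 1.1]

# Per-action heads; during training all heads backpropagate into the trunk
heads = {
    'like':   ([0.8, 0.1, 0.3], -0.2),
    'reply':  ([0.2, 0.4, 0.1], -1.5),
    'report': ([-0.5, 0.9, -0.7], -4.0),
}

probs = predict_actions(features, heads)
assert all(0.0 < p < 1.0 for p in probs.values())
```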

Impact

Model Quality

  • Better calibrated probabilities
  • Robust to rare events via multi-task learning
  • Captures full spectrum of user behavior

Product Flexibility

  • Tune weights without retraining model
  • A/B test different optimization objectives
  • Adapt to changing product priorities
See the Multi-Action Prediction page for architecture details.

5. Composable Pipeline Architecture

The Decision

The candidate-pipeline framework provides a trait-based composable architecture for building recommendation pipelines.

Rationale

Traditional recommendation systems tightly couple business logic with execution:
# Monolithic approach: everything in one place
def get_recommendations(user_id):
    # Fetching
    in_network = fetch_in_network(user_id)
    out_network = fetch_out_network(user_id)
    
    # Hydration
    candidates = hydrate_metadata(in_network + out_network)
    
    # Filtering
    candidates = remove_duplicates(candidates)
    candidates = remove_blocked_authors(candidates, user_id)
    candidates = remove_muted_keywords(candidates, user_id)
    
    # Scoring
    for candidate in candidates:
        candidate.score = ml_model.score(user_id, candidate)
    
    # Selection
    return sorted(candidates, key=lambda c: c.score, reverse=True)[:100]

# Issues:
# - Hard to test individual stages
# - Difficult to add new stages
# - No parallelization of independent stages
# - Tight coupling between stages
Separate concerns via traits:
// Each stage is a composable trait
pub trait Source {
    fn fetch_candidates(&self, query: &Query) -> Vec<Candidate>;
}

pub trait Hydrator {
    fn hydrate(&self, candidates: &mut [Candidate]);
}

pub trait Filter {
    fn filter(&self, candidates: Vec<Candidate>) -> Vec<Candidate>;
}

pub trait Scorer {
    fn score(&self, candidates: &mut [Candidate]);
}

// Compose into pipeline
let pipeline = CandidatePipeline::builder()
    .add_source(ThunderSource::new())           // In-network
    .add_source(PhoenixRetrievalSource::new())  // Out-of-network
    .add_hydrator(CoreDataHydrator::new())
    .add_filter(DeduplicateFilter::new())
    .add_filter(BlockedAuthorsFilter::new())
    .add_scorer(PhoenixRankingScorer::new())
    .add_scorer(AuthorDiversityScorer::new())
    .build();

Benefits

// Easy to add new stages
pipeline.add_filter(NewFeatureFilter::new());

// Easy to swap implementations
pipeline.replace_scorer(
    /* old */ PhoenixV1Scorer,
    /* new */ PhoenixV2Scorer,
);

// Easy to A/B test
if experiment.is_enabled(user_id) {
    pipeline.add_scorer(ExperimentalScorer::new());
}
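The same composition pattern can be sketched in Python, with plain callables standing in for the Rust traits (all names and the scoring rule here are hypothetical, and hydrators are omitted for brevity):

```python
class CandidatePipeline:
    """Minimal sketch of the source -> filter -> score -> rank flow."""
    def __init__(self):
        self.sources, self.filters, self.scorers = [], [], []

    def add_source(self, fn): self.sources.append(fn); return self
    def add_filter(self, fn): self.filters.append(fn); return self
    def add_scorer(self, fn): self.scorers.append(fn); return self

    def run(self, query):
        # Gather candidates from every source
        candidates = [c for source in self.sources for c in source(query)]
        # Apply filters in order
        for f in self.filters:
            candidates = f(candidates)
        # Sum scorer contributions and rank
        scores = {c: sum(s(c) for s in self.scorers) for c in candidates}
        return sorted(candidates, key=scores.get, reverse=True)

pipeline = (CandidatePipeline()
            .add_source(lambda q: ['post_1', 'post_2'])      # in-network
            .add_source(lambda q: ['post_2', 'post_3'])      # out-of-network
            .add_filter(lambda cs: list(dict.fromkeys(cs)))  # deduplicate
            .add_scorer(lambda c: int(c[-1])))               # stand-in scorer

ranked = pipeline.run(query='user_42')
assert ranked == ['post_3', 'post_2', 'post_1']
```

Each stage is an independent callable, so individual stages can be unit-tested, swapped, or added behind an experiment flag, mirroring the Rust builder above.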

Impact

Development Velocity

  • Add new features without touching existing code
  • Test stages in isolation
  • Easy experimentation and A/B testing

Performance

  • Automatic parallelization of independent stages
  • Efficient resource utilization
  • Built-in error handling and retries
From README.md: “The framework runs sources and hydrators in parallel where possible, with configurable error handling and logging.”

Summary

The Phoenix recommendation system makes five key architectural choices:
| Decision | Benefit | Trade-off |
|---|---|---|
| No hand-engineered features | Simpler infrastructure, automatic pattern discovery | Requires more training data |
| Candidate isolation | Consistent scores, cacheability | Slightly lower model expressiveness |
| Hash-based embeddings | 1000× memory reduction, cold start handling | Information loss from collisions |
| Multi-action prediction | Nuanced understanding, flexible optimization | More complex tuning |
| Composable pipeline | Modularity, parallelization, easy experimentation | Framework overhead |
These decisions collectively enable Phoenix to:
  • Scale to billions of users and posts
  • Adapt to changing user behavior without manual intervention
  • Iterate rapidly on model and product improvements
  • Serve predictions with low latency and high throughput
