Overview
Phoenix uses a hash-based embedding approach for all categorical features (users, posts, authors). Instead of maintaining a single massive embedding table, the system uses multiple hash functions to map each ID to multiple embedding vectors, which are then combined through learned projections.

Why Hash-based Embeddings?
Traditional embedding approaches face scalability challenges.

Problem: Massive Vocabulary Size
With billions of users and posts, a single embedding table would require:
- Memory: `vocab_size × embedding_dim × 4 bytes` (e.g., 1B users × 256 dims × 4 bytes = 1 TB)
- Training: sparse updates make training inefficient
- Cold start: New users/posts have no embeddings
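The memory figure above follows from quick arithmetic (sizes are the illustrative ones used in this section):

```python
# Memory for one full-vocabulary embedding table (float32 = 4 bytes per value).
vocab_size = 1_000_000_000   # 1B users
embedding_dim = 256
bytes_per_value = 4

table_bytes = vocab_size * embedding_dim * bytes_per_value
print(table_bytes)  # 1_024_000_000_000 bytes, i.e. ~1 TB
```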
Solution: Hash Functions
Hash functions map IDs to fixed-size buckets:
- Fixed memory: `num_buckets × embedding_dim × 4 bytes` (e.g., 1M buckets × 256 dims × 4 bytes = 1 GB)
- Automatic coverage: all IDs (including new ones) get embeddings
- Collision handling: Multiple hash functions reduce collision impact
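A minimal sketch of the idea (the helper name `bucket_for` is illustrative, not Phoenix's actual API): any ID, including one never seen during training, deterministically maps to a bucket in each table, so coverage is automatic.

```python
import hashlib

NUM_BUCKETS = 1_000_000

def bucket_for(feature_id: str, seed: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Deterministically map an ID to a bucket using a seeded hash.

    Different seeds act as independent hash functions over the same ID space.
    """
    digest = hashlib.md5(f"{seed}:{feature_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets

# Even a brand-new ID gets a valid bucket under every hash function.
b1 = bucket_for("brand_new_user_123", seed=0)
b2 = bucket_for("brand_new_user_123", seed=1)  # second, independent hash
assert 0 <= b1 < NUM_BUCKETS and 0 <= b2 < NUM_BUCKETS
```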
Architecture
Configuration
The system uses multiple hash functions for each feature type (see `phoenix/recsys_model.py`).
Default: Each feature type uses 2 hash functions, providing redundancy and reducing collision impact.
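The actual configuration lives in `phoenix/recsys_model.py`; as a hypothetical sketch of its shape (all field names and values here are assumptions, apart from the 2-hash default stated above):

```python
# Hypothetical configuration shape: each feature type gets its own
# set of hash functions (2 by default) and bucket count.
HASH_CONFIG = {
    "user":   {"num_hashes": 2, "num_buckets": 1_000_000, "embedding_dim": 256},
    "post":   {"num_hashes": 2, "num_buckets": 1_000_000, "embedding_dim": 256},
    "author": {"num_hashes": 2, "num_buckets": 1_000_000, "embedding_dim": 256},
}

# Every feature type uses the 2-hash default.
assert all(cfg["num_hashes"] == 2 for cfg in HASH_CONFIG.values())
```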
Input Structure
Features are represented as hash arrays (see `phoenix/recsys_model.py`).
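Concretely, each raw ID becomes a small integer array holding one bucket index per hash function; a sketch (the helper name `hash_array` is hypothetical):

```python
import hashlib

def hash_array(feature_id: str, num_hashes: int = 2,
               num_buckets: int = 1_000_000) -> list[int]:
    """Represent one ID as an array of bucket indices, one per hash function."""
    out = []
    for seed in range(num_hashes):
        digest = hashlib.md5(f"{seed}:{feature_id}".encode()).digest()
        out.append(int.from_bytes(digest[:8], "big") % num_buckets)
    return out

arr = hash_array("post_42")
print(len(arr))  # 2 — one bucket index per hash function
```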
Hash Embedding Workflow
Step 1: Hash Lookup
Each hash value is used to look up an embedding vector.

Step 2: Embedding Combination

Multiple hash embeddings are concatenated and projected to the model dimension (see `phoenix/recsys_model.py`).
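Steps 1 and 2 together, as a dependency-free sketch with tiny sizes; the real tables and learned projection live in `phoenix/recsys_model.py`, and the random values here merely stand in for learned parameters:

```python
import random

random.seed(0)
NUM_BUCKETS, EMB_DIM, MODEL_DIM, NUM_HASHES = 8, 4, 3, 2

# One small embedding table per hash function (learned in the real model).
tables = [[[random.uniform(-1, 1) for _ in range(EMB_DIM)]
           for _ in range(NUM_BUCKETS)] for _ in range(NUM_HASHES)]

# Learned projection matrix: (NUM_HASHES * EMB_DIM) -> MODEL_DIM.
proj = [[random.uniform(-1, 1) for _ in range(NUM_HASHES * EMB_DIM)]
        for _ in range(MODEL_DIM)]

def embed(hash_values: list[int]) -> list[float]:
    # Step 1: look up one vector per hash value.
    vecs = [tables[i][h] for i, h in enumerate(hash_values)]
    # Step 2: concatenate, then project to the model dimension.
    concat = [x for v in vecs for x in v]
    return [sum(w * x for w, x in zip(row, concat)) for row in proj]

out = embed([3, 5])  # bucket indices from the two hash functions
print(len(out))      # 3 == MODEL_DIM
```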
Step 3: Feature Combination
For history and candidates, multiple feature types (post, author, actions, product surface) are combined (see `phoenix/recsys_model.py`).
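Step 3 can be sketched as pooling the per-feature projected embeddings into one item vector. Whether Phoenix sums, concatenates, or otherwise pools is defined in `phoenix/recsys_model.py`; the elementwise sum below is an assumption for illustration:

```python
MODEL_DIM = 3

def combine(feature_embeddings: list[list[float]]) -> list[float]:
    """Pool several feature embeddings (post, author, actions, ...)
    into a single item embedding via elementwise sum (assumed pooling)."""
    combined = [0.0] * MODEL_DIM
    for emb in feature_embeddings:
        for i, x in enumerate(emb):
            combined[i] += x
    return combined

post_emb   = [1.0, 2.0, 3.0]
author_emb = [0.5, 0.5, 0.5]
item = combine([post_emb, author_emb])
print(item)  # [1.5, 2.5, 3.5]
```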
Example: Processing a History Item
A single history item flows through the three steps above: its post and author IDs are turned into hash arrays, each hash value is looked up and the results are concatenated and projected (Steps 1–2), and the per-feature embeddings are combined into one item vector (Step 3).

Benefits
1. Memory Efficiency
- Traditional approach: `vocab_size × embedding_dim` parameters (e.g., ~1 TB for 1B IDs at 256 dims)
- Hash-based approach: `num_buckets × embedding_dim` parameters (e.g., ~1 GB for 1M buckets at 256 dims)
2. Cold Start Handling
New users and posts automatically get embeddings, since every ID hashes to an existing bucket.

3. Collision Robustness
Multiple hash functions provide redundancy:
- If `hash1(user_A) == hash1(user_B)` (collision)
- Likely `hash2(user_A) != hash2(user_B)` (different embedding)
- The projection learns to combine both into a unique representation
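This can be demonstrated directly: with a deliberately tiny bucket count, collisions under one hash function are guaranteed by the pigeonhole principle, yet a second independent hash usually separates the colliding pair (the helper name `bucket_for` is illustrative):

```python
import hashlib

def bucket_for(feature_id: str, seed: int, num_buckets: int) -> int:
    digest = hashlib.md5(f"{seed}:{feature_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets

NUM_BUCKETS = 50

# With 200 IDs and only 50 buckets, a collision under hash 0 is guaranteed.
seen: dict[int, str] = {}
pair = None
for uid in (f"user_{i}" for i in range(200)):
    b = bucket_for(uid, 0, NUM_BUCKETS)
    if b in seen:
        pair = (seen[b], uid)
        break
    seen[b] = uid

a, b = pair
assert bucket_for(a, 0, NUM_BUCKETS) == bucket_for(b, 0, NUM_BUCKETS)
# Check whether the second hash separates the pair (it usually does).
print(bucket_for(a, 1, NUM_BUCKETS) != bucket_for(b, 1, NUM_BUCKETS))
```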
4. Training Efficiency
Smaller embedding tables mean:
- Faster gradient updates
- Better cache utilization
- More uniform training signal (fewer sparse updates)
Trade-offs
Tuning: The number of hash functions and bucket size represent a memory-accuracy trade-off. Phoenix defaults to 2 hash functions as a sweet spot.
Implementation in Retrieval
The retrieval model uses the same hash-based embedding approach (see `phoenix/recsys_retrieval_model.py`).
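A hedged sketch of how a retrieval (two-tower) model could reuse one hash-embedding module for both the user and post towers; all names here are illustrative assumptions, and the mean over hash lookups stands in for the learned projection — the actual code is in `phoenix/recsys_retrieval_model.py`:

```python
import hashlib
import random

random.seed(0)
NUM_BUCKETS, EMB_DIM, NUM_HASHES = 16, 4, 2

tables = [[[random.uniform(-1, 1) for _ in range(EMB_DIM)]
           for _ in range(NUM_BUCKETS)] for _ in range(NUM_HASHES)]

def hash_embed(feature_id: str) -> list[float]:
    """Shared hash-based embedding used by both towers (mean of hash
    lookups stands in for the learned projection)."""
    vecs = []
    for seed in range(NUM_HASHES):
        digest = hashlib.md5(f"{seed}:{feature_id}".encode()).digest()
        vecs.append(tables[seed][int.from_bytes(digest[:8], "big") % NUM_BUCKETS])
    return [sum(xs) / NUM_HASHES for xs in zip(*vecs)]

def score(user_id: str, post_id: str) -> float:
    """Retrieval score as a dot product between tower outputs."""
    u, p = hash_embed(user_id), hash_embed(post_id)
    return sum(a * b for a, b in zip(u, p))

s = score("user_1", "post_9")
print(type(s).__name__)  # float
```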
Related Concepts
Candidate Isolation
Learn how candidates are scored independently in the ranking transformer
Multi-action Prediction
See how hash embeddings feed into multi-task prediction heads