Overview
Filters remove candidates that should not be shown to users. The pipeline applies filters at two stages:
- Pre-Scoring Filters: Run before ML scoring to reduce compute costs
- Post-Selection Filters: Run after selection for final validation
Filters are applied sequentially, with each filter receiving the kept candidates from the previous filter.
Filter Architecture
All filters implement the Filter trait:
- Observability: Track how many candidates each filter removes
- Debugging: Inspect which candidates were filtered and why
- Metrics: Monitor filter effectiveness and false positive rates
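As a rough illustration of sequential filtering with kept and removed candidates tracked separately, here is a minimal sketch. The `Candidate`, `FilterResult`, `ExampleSelfPostFilter`, and `run_filters` names, and the exact shape of the `Filter` trait, are assumptions made for this example; the real trait may differ (for instance, it may be async or take a query context).

```rust
/// Hypothetical candidate type; the real pipeline carries far more metadata.
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub post_id: u64,
    pub author_id: u64,
}

/// Result of one filter pass: survivors plus the candidates that were dropped.
#[derive(Debug, Default)]
pub struct FilterResult {
    pub kept: Vec<Candidate>,
    pub removed: Vec<Candidate>,
}

/// Assumed shape of the trait, for illustration only.
pub trait Filter {
    fn name(&self) -> &'static str;
    fn filter(&self, candidates: Vec<Candidate>) -> FilterResult;
}

/// Example filter: drops posts authored by the viewer.
pub struct ExampleSelfPostFilter {
    pub viewer_id: u64,
}

impl Filter for ExampleSelfPostFilter {
    fn name(&self) -> &'static str {
        "ExampleSelfPostFilter"
    }

    fn filter(&self, candidates: Vec<Candidate>) -> FilterResult {
        // partition puts matches (viewer's own posts) first, the rest second.
        let (removed, kept): (Vec<Candidate>, Vec<Candidate>) = candidates
            .into_iter()
            .partition(|c| c.author_id == self.viewer_id);
        FilterResult { kept, removed }
    }
}

/// Applies filters in order; each filter sees only what the previous one kept.
/// Returns the survivors and a per-filter removal count usable for metrics.
pub fn run_filters(
    filters: &[Box<dyn Filter>],
    mut candidates: Vec<Candidate>,
) -> (Vec<Candidate>, Vec<(&'static str, usize)>) {
    let mut removal_counts = Vec::with_capacity(filters.len());
    for f in filters {
        let result = f.filter(candidates);
        removal_counts.push((f.name(), result.removed.len()));
        candidates = result.kept;
    }
    (candidates, removal_counts)
}
```

Returning the removed candidates rather than discarding them is what makes per-filter counts and after-the-fact inspection possible in this sketch.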
Pre-Scoring Filters
These filters run before the Phoenix scorer to reduce the number of candidates that need ML predictions.
DropDuplicatesFilter
Purpose: Remove duplicate post IDs within the same candidate batch.
Implementation: home-mixer/filters/drop_duplicates_filter.rs
CoreDataHydrationFilter
Purpose: Remove posts that failed to hydrate core metadata.
Removes candidates missing:
- Post text
- Author information
- Timestamps
- Media URLs (for media posts)
AgeFilter
Purpose: Remove posts older than a specified maximum age.
Default threshold: Typically 24-48 hours for the For You feed
Implementation: home-mixer/filters/age_filter.rs
Post IDs use Snowflake encoding, which embeds the creation timestamp. This allows age calculation without database lookups.
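The age check can be sketched from the ID alone. In the Snowflake layout, the bits above the low 22 hold milliseconds since the Snowflake epoch (1288834974657 in Unix milliseconds). The function names and the way the current time and threshold are passed in are assumptions for illustration, not the actual filter's API.

```rust
use std::time::Duration;

/// Snowflake epoch (2010-11-04T01:42:54.657Z) in Unix milliseconds.
pub const SNOWFLAKE_EPOCH_MS: u64 = 1_288_834_974_657;

/// Creation time in Unix milliseconds, decoded from the ID.
/// The low 22 bits (worker and sequence fields) are shifted away.
pub fn snowflake_created_at_ms(post_id: u64) -> u64 {
    (post_id >> 22) + SNOWFLAKE_EPOCH_MS
}

/// Returns true if the post is no older than `max_age` at time `now_ms`.
pub fn is_within_age(post_id: u64, now_ms: u64, max_age: Duration) -> bool {
    let created_ms = snowflake_created_at_ms(post_id);
    // saturating_sub treats IDs slightly "from the future" (clock skew) as age zero.
    let age_ms = now_ms.saturating_sub(created_ms);
    u128::from(age_ms) <= max_age.as_millis()
}
```

Passing `now_ms` explicitly, instead of reading the clock inside the function, keeps the check deterministic and easy to test.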
SelfTweetFilter
Purpose: Remove posts authored by the viewing user.
Implementation: home-mixer/filters/self_tweet_filter.rs
RetweetDeduplicationFilter
Purpose: Deduplicate reposts, keeping only the first occurrence of a post (whether as original or repost).
Algorithm:
- Track seen post IDs (both original and reposted)
- For each candidate:
  - If it’s a repost: Check whether the original post ID has been seen, then mark it as seen
  - If it’s an original: Check whether its own post ID has been seen, then mark it as seen
- Keep only the first occurrence
Implementation: home-mixer/filters/retweet_deduplication_filter.rs
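A minimal sketch of this deduplication, assuming a hypothetical `RepostCandidate` type in which a repost carries the ID of the post it reposts. The field names and the `dedup_reposts` function are illustrative and are not taken from the actual implementation.

```rust
use std::collections::HashSet;

/// Hypothetical candidate shape for this example.
#[derive(Debug, Clone, PartialEq)]
pub struct RepostCandidate {
    pub post_id: u64,
    /// Set when this candidate is a repost; holds the original post's ID.
    pub reposted_post_id: Option<u64>,
}

/// Keeps the first candidate for each underlying post, in input order.
/// Returns (kept, removed).
pub fn dedup_reposts(
    candidates: Vec<RepostCandidate>,
) -> (Vec<RepostCandidate>, Vec<RepostCandidate>) {
    let mut seen: HashSet<u64> = HashSet::new();
    let mut kept = Vec::new();
    let mut removed = Vec::new();
    for c in candidates {
        // The underlying content is the original post for a repost,
        // otherwise the post itself.
        let content_id = c.reposted_post_id.unwrap_or(c.post_id);
        // HashSet::insert returns false when the ID was already present.
        if seen.insert(content_id) {
            kept.push(c);
        } else {
            removed.push(c);
        }
    }
    (kept, removed)
}
```

Because the input order decides which occurrence survives, any ordering of candidates by preference would need to happen before this step.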
IneligibleSubscriptionFilter
Purpose: Remove paywalled/subscription content that the user cannot access.
Removes:
- Posts requiring paid subscription when user is not subscribed
- Premium content behind authentication walls
PreviouslySeenPostsFilter
Purpose: Filter out posts the user has already seen in previous sessions.
Uses two mechanisms:
- Bloom Filters
- Explicit Seen IDs
Bloom Filters: Probabilistic data structures sent from the client containing hashes of previously seen post IDs.
Advantages:
- Memory efficient (about 10 bits per ID at a 1% false-positive rate, versus 64 bits for a raw ID)
- Fast lookups (O(1) with respect to the number of stored IDs)
Error behavior:
- False positives possible (may filter a post that wasn’t actually seen)
- No false negatives (a post recorded in the filter is always detected as seen)
Implementation: home-mixer/filters/previously_seen_posts_filter.rs
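To make the false-positive and false-negative behavior concrete, here is a small Bloom filter sketch combined with an explicit seen-ID set. The hashing scheme (two `DefaultHasher` passes combined by double hashing), the sizing parameters, and all type and function names are assumptions for illustration. The format the client actually sends is not described here and will differ.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Illustrative Bloom filter over post IDs.
pub struct BloomFilter {
    bits: Vec<u64>,
    num_bits: u64,
    num_hashes: u32,
}

impl BloomFilter {
    pub fn new(num_bits: u64, num_hashes: u32) -> Self {
        assert!(num_bits > 0 && num_hashes > 0);
        let words = ((num_bits + 63) / 64) as usize;
        Self { bits: vec![0; words], num_bits, num_hashes }
    }

    /// Two base hashes; bit positions are derived as h1 + i * h2.
    fn base_hashes(post_id: u64) -> (u64, u64) {
        let mut a = DefaultHasher::new();
        post_id.hash(&mut a);
        let mut b = DefaultHasher::new();
        (post_id, 0x9E37_79B9_7F4A_7C15u64).hash(&mut b);
        // Force h2 odd so the stride is never zero.
        (a.finish(), b.finish() | 1)
    }

    fn positions(&self, post_id: u64) -> impl Iterator<Item = u64> + '_ {
        let (h1, h2) = Self::base_hashes(post_id);
        (0..u64::from(self.num_hashes))
            .map(move |i| h1.wrapping_add(i.wrapping_mul(h2)) % self.num_bits)
    }

    pub fn insert(&mut self, post_id: u64) {
        // Collect first so the shared borrow ends before mutating the bits.
        let positions: Vec<u64> = self.positions(post_id).collect();
        for p in positions {
            self.bits[(p / 64) as usize] |= 1u64 << (p % 64);
        }
    }

    /// True if every derived bit is set: "possibly seen".
    /// False means "definitely not inserted".
    pub fn might_contain(&self, post_id: u64) -> bool {
        self.positions(post_id)
            .all(|p| (self.bits[(p / 64) as usize] & (1u64 << (p % 64))) != 0)
    }
}

/// Combines both mechanisms: exact IDs first, then the probabilistic check.
pub fn is_previously_seen(
    post_id: u64,
    bloom: &BloomFilter,
    explicit_seen: &HashSet<u64>,
) -> bool {
    explicit_seen.contains(&post_id) || bloom.might_contain(post_id)
}
```

An inserted ID always sets all of its bits, so it can never be reported as unseen. An ID that was never inserted can still collide on every bit, which is the source of false positives.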
PreviouslyServedPostsFilter
Purpose: Remove posts that were already served earlier in the same request session.
Difference from PreviouslySeenPostsFilter:
- PreviouslySeenPostsFilter: Posts seen in past sessions (stored client-side)
- PreviouslyServedPostsFilter: Posts served in this pagination sequence (server-side tracking)
MutedKeywordFilter
Purpose: Remove posts containing keywords the user has muted.
Algorithm:
- Tokenize user’s muted keywords
- Tokenize post text
- Check for matches using pattern matching
Implementation: home-mixer/filters/muted_keyword_filter.rs
Tokenization is required because users may mute phrases (“word1 word2”) rather than exact strings.
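One way to picture the tokenize-then-match steps is the sketch below. It lowercases text, splits on non-alphanumeric characters, and treats a muted keyword as matching when its tokens appear as a contiguous run in the post's tokens. These tokenization rules and the function names are assumptions for illustration; the production tokenizer and matcher are likely more sophisticated.

```rust
/// Splits text into lowercase alphanumeric tokens.
pub fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// True if `phrase_tokens` occurs as a contiguous run inside `post_tokens`.
pub fn contains_phrase(post_tokens: &[String], phrase_tokens: &[String]) -> bool {
    // windows(0) would panic, so empty phrases are rejected first.
    !phrase_tokens.is_empty()
        && post_tokens
            .windows(phrase_tokens.len())
            .any(|w| w == phrase_tokens)
}

/// True if the post matches any of the user's muted keywords or phrases.
pub fn is_muted(post_text: &str, muted_keywords: &[&str]) -> bool {
    let post_tokens = tokenize(post_text);
    muted_keywords
        .iter()
        .any(|kw| contains_phrase(&post_tokens, &tokenize(kw)))
}
```

Matching on tokens rather than raw substrings means that, in this sketch, muting "car" does not hide a post about "cars", and punctuation between the words of a phrase does not prevent a match.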
AuthorSocialgraphFilter
Purpose: Remove posts from authors the user has blocked or muted.
Implementation: home-mixer/filters/author_socialgraph_filter.rs
Post-Selection Filters
These filters run after selection on the top K candidates. They handle expensive operations or final validation.
VFFilter (Visibility Filtering)
Purpose: Remove posts that violate content policies or are otherwise ineligible for display.
Removes posts that are:
- Deleted
- Marked as spam
- Violent or gory
- NSFW (when the user has safe mode enabled)
- In violation of copyright
- From suspended accounts
Implementation: home-mixer/filters/vf_filter.rs
DedupConversationFilter
Purpose: Deduplicate multiple branches of the same conversation thread.
Problem it solves: If a viral post has many popular replies, the feed could show:
- Original post
- Reply A to original post
- Reply B to original post
- Reply C to original post
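One plausible way to collapse such a cluster is sketched below. It assumes each candidate carries a conversation ID and a score, and it keeps only the highest-scored candidate per conversation. Both the data shape and the keep-the-best rule are assumptions for illustration; the rule the actual filter applies is not stated here.

```rust
use std::collections::HashMap;

/// Hypothetical candidate shape for this example.
#[derive(Debug, Clone, PartialEq)]
pub struct ScoredCandidate {
    pub post_id: u64,
    /// Assumed: ID of the root post of the conversation this post belongs to.
    pub conversation_id: u64,
    pub score: f64,
}

/// Keeps one candidate per conversation: the one with the highest score.
/// A conversation keeps the output position of its first-seen candidate.
pub fn dedup_conversations(candidates: Vec<ScoredCandidate>) -> Vec<ScoredCandidate> {
    // conversation_id -> index into `kept` of the current best candidate.
    let mut best: HashMap<u64, usize> = HashMap::new();
    let mut kept: Vec<ScoredCandidate> = Vec::new();
    for c in candidates {
        let existing = best.get(&c.conversation_id).copied();
        match existing {
            Some(i) => {
                // Replace only on a strictly higher score; ties keep the earlier one.
                if c.score > kept[i].score {
                    kept[i] = c;
                }
            }
            None => {
                best.insert(c.conversation_id, kept.len());
                kept.push(c);
            }
        }
    }
    kept
}
```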
Filter Performance
Typical Removal Rates
Based on production metrics for a typical For You feed request:
| Filter | Removal Rate | Description |
|---|---|---|
| DropDuplicatesFilter | 1-3% | Low duplicate rate from sources |
| CoreDataHydrationFilter | 0.5-1% | Most candidates hydrate successfully |
| AgeFilter | 5-10% | Depends on source recency |
| SelfTweetFilter | 0.1-0.5% | Few self-posts in sources |
| RetweetDeduplicationFilter | 10-15% | Many viral posts get reposted |
| IneligibleSubscriptionFilter | 2-5% | Subscription content is limited |
| PreviouslySeenPostsFilter | 20-40% | High removal rate for engaged users |
| PreviouslyServedPostsFilter | 5-10% | Pagination deduplication |
| MutedKeywordFilter | 1-3% | Most users have few muted keywords |
| AuthorSocialgraphFilter | 3-8% | Blocks/mutes accumulate over time |
| VFFilter | 1-2% | Most selected content is policy-compliant |
| DedupConversationFilter | 2-5% | Occasional conversation clusters |
Latency Impact
- Pre-Scoring Filters (Total): 5-15ms
  - Most filters are in-memory operations (O(n) or O(n log n))
  - AuthorSocialgraphFilter and MutedKeywordFilter are slightly more expensive
- Post-Selection Filters (Total): 5-10ms
  - VFFilter may call external services (cached aggressively)
  - Runs on small candidate set (top K only)
Observability
Each filter emits metrics. These support:
- Monitoring: Alert when filters remove unexpectedly high/low amounts
- Debugging: Understand why specific candidates were filtered
- Optimization: Identify slow filters that need optimization
Related Pages
- Pipeline Stages: Complete overview of the recommendation pipeline
- Scoring and Ranking: How candidates are scored after filtering
- Candidate Hydration: How candidate metadata is enriched before filtering
- User Features: User preferences and socialgraph used by filters