
Overview

Filters remove candidates that should not be shown to users. The pipeline applies filters at two stages:
  1. Pre-Scoring Filters: Run before ML scoring to reduce compute costs
  2. Post-Selection Filters: Run after selection for final validation
Filters are applied sequentially, with each filter receiving the kept candidates from the previous filter.
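The sequential chaining can be sketched as follows. This is a simplified, synchronous sketch; the production pipeline is async, and `Candidate`, `run_filters`, and the closure-based filter shape here are illustrative, not the real types:

```rust
// Simplified sketch of sequential filter application: each filter receives
// only the candidates kept by the previous filter.
struct Candidate {
    tweet_id: u64,
}

fn run_filters<F>(mut candidates: Vec<Candidate>, filters: &[F]) -> Vec<Candidate>
where
    F: Fn(Vec<Candidate>) -> (Vec<Candidate>, Vec<Candidate>),
{
    for f in filters {
        // Each filter splits its input into (kept, removed).
        let (kept, _removed) = f(candidates);
        // Only the kept candidates flow into the next filter.
        candidates = kept;
    }
    candidates
}
```

The removed side is discarded here for brevity; in production it feeds the per-filter observability metrics described below.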

Filter Architecture

All filters implement the Filter trait:
#[async_trait]
pub trait Filter<Q, C> {
    async fn filter(
        &self,
        query: &Q,
        candidates: Vec<C>,
    ) -> Result<FilterResult<C>, String>;
}

pub struct FilterResult<C> {
    pub kept: Vec<C>,
    pub removed: Vec<C>,
}
This enables:
  • Observability: Track how many candidates each filter removes
  • Debugging: Inspect which candidates were filtered and why
  • Metrics: Monitor filter effectiveness and false positive rates
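As a concrete illustration, here is a minimal synchronous analogue of the trait together with a toy filter implementing it. The production trait is async; `Query`, `Cand`, and `MinScoreFilter` are illustrative names, not production types:

```rust
pub struct FilterResult<C> {
    pub kept: Vec<C>,
    pub removed: Vec<C>,
}

// Synchronous sketch of the Filter trait; the real version is async.
pub trait Filter<Q, C> {
    fn filter(&self, query: &Q, candidates: Vec<C>) -> Result<FilterResult<C>, String>;
}

// Toy query/candidate types for the example.
pub struct Query {
    pub min_score: f64,
}
pub struct Cand {
    pub score: f64,
}

// A filter that drops candidates below a per-query score threshold.
pub struct MinScoreFilter;

impl Filter<Query, Cand> for MinScoreFilter {
    fn filter(&self, query: &Query, candidates: Vec<Cand>) -> Result<FilterResult<Cand>, String> {
        let (kept, removed): (Vec<Cand>, Vec<Cand>) = candidates
            .into_iter()
            .partition(|c| c.score >= query.min_score);
        Ok(FilterResult { kept, removed })
    }
}
```

Returning both `kept` and `removed` (rather than just the survivors) is what makes the observability, debugging, and metrics listed above possible.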

Pre-Scoring Filters

These filters run before the Phoenix scorer to reduce the number of candidates that need ML predictions.

DropDuplicatesFilter

Purpose: Remove duplicate post IDs within the same candidate batch. Implementation:
home-mixer/filters/drop_duplicates_filter.rs
use std::collections::HashSet;

let mut seen_ids = HashSet::new();
let (mut kept, mut removed) = (Vec::new(), Vec::new());
for candidate in candidates {
    if seen_ids.insert(candidate.tweet_id) {
        kept.push(candidate);
    } else {
        removed.push(candidate);
    }
}
Why this matters: Duplicates can occur when multiple sources return the same post (e.g., Thunder and Phoenix both surface a viral post).

CoreDataHydrationFilter

Purpose: Remove posts that failed to hydrate core metadata. Removes candidates missing:
  • Post text
  • Author information
  • Timestamps
  • Media URLs (for media posts)
Why this matters: Candidates without core data cannot be properly displayed or scored.
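A hedged sketch of what this check might look like; `HydratedPost` and its field names are illustrative, not the production candidate schema:

```rust
// Illustrative candidate shape; the real hydrated candidate has many more fields.
struct HydratedPost {
    tweet_text: Option<String>,
    author_id: Option<u64>,
    created_at_ms: Option<i64>,
    has_media: bool,
    media_urls: Vec<String>,
}

// Keep a candidate only if every piece of core metadata hydrated successfully.
fn has_core_data(c: &HydratedPost) -> bool {
    c.tweet_text.is_some()
        && c.author_id.is_some()
        && c.created_at_ms.is_some()
        // Media URLs are only required for media posts.
        && (!c.has_media || !c.media_urls.is_empty())
}
```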

AgeFilter

Purpose: Remove posts older than a specified maximum age. Default threshold: typically 24-48 hours for the For You feed. Implementation:
home-mixer/filters/age_filter.rs
pub struct AgeFilter {
    pub max_age: Duration,
}

impl AgeFilter {
    fn is_within_age(&self, tweet_id: i64) -> bool {
        // Decode the creation timestamp embedded in the Snowflake ID;
        // treat undecodable IDs as out of range.
        snowflake::duration_since_creation_opt(tweet_id)
            .map(|age| age <= self.max_age)
            .unwrap_or(false)
    }
}
Post IDs use Snowflake encoding, which embeds the creation timestamp. This allows age calculation without database lookups.
Why this matters: Users expect fresh, recent content in their For You feed.
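The Snowflake decoding behind `duration_since_creation_opt` can be sketched like this, using Twitter's public Snowflake layout (the upper bits of the ID hold milliseconds since the custom epoch 1288834974657); `age_at` is an illustrative analogue, not the production helper:

```rust
use std::time::Duration;

// Twitter's Snowflake epoch: 2010-11-04T01:42:54.657Z, in Unix milliseconds.
const TWITTER_EPOCH_MS: u64 = 1_288_834_974_657;

// Extract the creation time (Unix ms) embedded in a Snowflake post ID:
// the bits above the low 22 are a millisecond offset from the custom epoch.
fn creation_unix_ms(tweet_id: u64) -> u64 {
    (tweet_id >> 22) + TWITTER_EPOCH_MS
}

// Illustrative analogue of duration_since_creation_opt: the post's age
// relative to `now_ms`, or None if the ID decodes to a future timestamp.
fn age_at(tweet_id: u64, now_ms: u64) -> Option<Duration> {
    now_ms
        .checked_sub(creation_unix_ms(tweet_id))
        .map(Duration::from_millis)
}
```

Because the timestamp is embedded in the ID itself, the AgeFilter needs no database lookup at all.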

SelfTweetFilter

Purpose: Remove posts authored by the viewing user. Implementation:
home-mixer/filters/self_tweet_filter.rs
let viewer_id = query.user_id as u64;
let (kept, removed): (Vec<_>, Vec<_>) = candidates
    .into_iter()
    .partition(|c| c.author_id != viewer_id);
Why this matters: Users don’t want to see their own posts in their personalized feed (they already know what they posted).

RetweetDeduplicationFilter

Purpose: Deduplicate reposts, keeping only the first occurrence of a post (whether as original or repost). Algorithm:
  1. Track seen post IDs (both originals and the targets of reposts)
  2. For each candidate:
    • If it's a repost: keep it only if the original post's ID hasn't been seen yet, then mark that ID as seen
    • If it's an original: mark its post ID as seen and keep it
  3. Later reposts of an already-seen post are removed
Implementation:
home-mixer/filters/retweet_deduplication_filter.rs
use std::collections::HashSet;

let mut seen_tweet_ids: HashSet<u64> = HashSet::new();
let (mut kept, mut removed) = (Vec::new(), Vec::new());
for candidate in candidates {
    match candidate.retweeted_tweet_id {
        Some(retweeted_id) => {
            // insert() returns true only on the first occurrence; keep the
            // first sighting of this tweet and remove later duplicates
            if seen_tweet_ids.insert(retweeted_id) {
                kept.push(candidate);
            } else {
                removed.push(candidate);
            }
        }
        None => {
            // Mark this original tweet ID as seen so retweets of it get filtered
            seen_tweet_ids.insert(candidate.tweet_id as u64);
            kept.push(candidate);
        }
    }
}
Why this matters: Showing both an original post and a repost of it is redundant and wastes feed space.

IneligibleSubscriptionFilter

Purpose: Remove paywalled/subscription content that the user cannot access. Removes:
  • Posts requiring paid subscription when user is not subscribed
  • Premium content behind authentication walls
Why this matters: Surfacing content users can’t access creates a poor user experience.
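A minimal sketch of the eligibility check. The field names and the per-author subscription model here are assumptions for illustration, not the production schema:

```rust
use std::collections::HashSet;

// Illustrative shapes; the real types carry much more state.
struct Post {
    author_id: u64,
    requires_subscription: bool,
}

struct Viewer {
    subscribed_author_ids: HashSet<u64>,
}

// Keep a candidate unless it is paywalled by an author the viewer
// is not subscribed to.
fn is_accessible(viewer: &Viewer, post: &Post) -> bool {
    !post.requires_subscription || viewer.subscribed_author_ids.contains(&post.author_id)
}
```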

PreviouslySeenPostsFilter

Purpose: Filter out posts the user has already seen in previous sessions. Uses two mechanisms:
  1. Seen ID list: Explicit post IDs recently recorded as seen, sent with the request
  2. Bloom filters: Probabilistic data structures sent from the client containing hashes of previously seen post IDs
Bloom filter advantages:
  • Memory efficient (can represent millions of IDs in a few KB)
  • Fast lookups (O(1))
Trade-offs:
  • False positives possible (may filter a post that wasn't actually seen)
  • No false negatives (will never show a post that was definitely seen)
Implementation:
home-mixer/filters/previously_seen_posts_filter.rs
let bloom_filters = query
    .bloom_filter_entries
    .iter()
    .map(BloomFilter::from_entry)
    .collect::<Vec<_>>();

let (removed, kept): (Vec<_>, Vec<_>) = candidates.into_iter().partition(|c| {
    get_related_post_ids(c).iter().any(|&post_id| {
        query.seen_ids.contains(&post_id)
            || bloom_filters
                .iter()
                .any(|filter| filter.may_contain(post_id))
    })
});
Why this matters: Showing posts users have already seen reduces engagement and satisfaction.

PreviouslyServedPostsFilter

Purpose: Remove posts that were already served earlier in the same request session. Difference from PreviouslySeenPostsFilter:
  • PreviouslySeenPostsFilter: Posts seen in past sessions (stored client-side)
  • PreviouslyServedPostsFilter: Posts served in this pagination sequence (server-side tracking)
Why this matters: Prevents duplicates when user scrolls through multiple pages of results.
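A sketch of the server-side pagination tracking; the `ServedState` session-state shape is an assumption for illustration:

```rust
use std::collections::HashSet;

// Per-session set of post IDs already returned on earlier pages.
struct ServedState {
    served_ids: HashSet<u64>,
}

impl ServedState {
    fn new() -> Self {
        ServedState { served_ids: HashSet::new() }
    }

    // Drop anything served on a previous page, then record this page's IDs
    // so the next page deduplicates against it.
    fn filter_page(&mut self, candidates: Vec<u64>) -> Vec<u64> {
        let kept: Vec<u64> = candidates
            .into_iter()
            .filter(|id| !self.served_ids.contains(id))
            .collect();
        self.served_ids.extend(kept.iter().copied());
        kept
    }
}
```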

MutedKeywordFilter

Purpose: Remove posts containing keywords the user has muted. Algorithm:
  1. Tokenize user’s muted keywords
  2. Tokenize post text
  3. Check for matches using pattern matching
Implementation:
home-mixer/filters/muted_keyword_filter.rs
let token_sequences: Vec<TokenSequence> = muted_keywords
    .iter()
    .map(|k| self.tokenizer.tokenize(k))
    .collect();
let user_mutes = UserMutes::new(token_sequences);
let matcher = MatchTweetGroup::new(user_mutes);

let (mut kept, mut removed) = (Vec::new(), Vec::new());
for candidate in candidates {
    let tweet_text_token_sequence = self.tokenizer.tokenize(&candidate.tweet_text);
    if matcher.matches(&tweet_text_token_sequence) {
        removed.push(candidate);
    } else {
        kept.push(candidate);
    }
}
Tokenization is required because users may mute phrases (“word1 word2”) rather than exact strings.
Why this matters: Respects user preferences for content they explicitly don’t want to see.

AuthorSocialgraphFilter

Purpose: Remove posts from authors the user has blocked or muted. Implementation:
home-mixer/filters/author_socialgraph_filter.rs
let viewer_blocked_user_ids = query.user_features.blocked_user_ids.clone();
let viewer_muted_user_ids = query.user_features.muted_user_ids.clone();

let (mut kept, mut removed) = (Vec::new(), Vec::new());
for candidate in candidates {
    let author_id = candidate.author_id as i64;
    let muted = viewer_muted_user_ids.contains(&author_id);
    let blocked = viewer_blocked_user_ids.contains(&author_id);
    if muted || blocked {
        removed.push(candidate);
    } else {
        kept.push(candidate);
    }
}
Why this matters: Critical for user safety and satisfaction. Users explicitly signal they don’t want content from these authors.

Post-Selection Filters

These filters run after selection on the top K candidates. They handle expensive operations or final validation.

VFFilter (Visibility Filtering)

Purpose: Remove posts that violate content policies or are otherwise ineligible for display. Removes posts that are:
  • Deleted
  • Marked as spam
  • Violent or gory
  • NSFW (when the user has safe mode enabled)
  • In violation of copyright
  • From suspended accounts
Implementation:
home-mixer/filters/vf_filter.rs
fn should_drop(reason: &Option<FilteredReason>) -> bool {
    match reason {
        Some(FilteredReason::SafetyResult(safety_result)) => {
            matches!(safety_result.action, Action::Drop(_))
        }
        Some(_) => true,
        None => false,
    }
}
VF filtering happens post-selection because it requires calling external safety/visibility services, which would be too expensive to run on all candidates pre-scoring.
Why this matters: Ensures platform safety and content policy compliance.

DedupConversationFilter

Purpose: Deduplicate multiple branches of the same conversation thread. Problem it solves: If a viral post has many popular replies, the feed could show:
  1. Original post
  2. Reply A to original post
  3. Reply B to original post
  4. Reply C to original post
This creates a repetitive experience. Solution: Keep only the highest-scoring branch of each conversation, ensuring variety. Why this matters: Maintains feed diversity and prevents conversation threads from dominating the feed.
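Since post-selection candidates are already ranked by score, keeping the highest-scoring branch per conversation reduces to keeping the first candidate seen for each conversation ID. A sketch, assuming a `conversation_id` field (illustrative, not the production type):

```rust
use std::collections::HashSet;

// Illustrative candidate: input is assumed sorted best-first, and every
// member of the same thread shares one conversation ID.
struct RankedPost {
    tweet_id: u64,
    conversation_id: u64,
}

// Keeps only the top-scoring member of each conversation thread.
fn dedup_conversations(candidates: Vec<RankedPost>) -> Vec<RankedPost> {
    let mut seen = HashSet::new();
    candidates
        .into_iter()
        // insert() returns false for conversations already represented.
        .filter(|c| seen.insert(c.conversation_id))
        .collect()
}
```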

Filter Performance

Typical Removal Rates

Based on production metrics for a typical For You feed request:
Filter                         Removal Rate   Description
DropDuplicatesFilter           1-3%           Low duplicate rate from sources
CoreDataHydrationFilter        0.5-1%         Most candidates hydrate successfully
AgeFilter                      5-10%          Depends on source recency
SelfTweetFilter                0.1-0.5%       Few self-posts in sources
RetweetDeduplicationFilter     10-15%         Many viral posts get reposted
IneligibleSubscriptionFilter   2-5%           Subscription content is limited
PreviouslySeenPostsFilter      20-40%         High removal rate for engaged users
PreviouslyServedPostsFilter    5-10%          Pagination deduplication
MutedKeywordFilter             1-3%           Most users have few muted keywords
AuthorSocialgraphFilter        3-8%           Blocks/mutes accumulate over time
VFFilter                       1-2%           Most selected content is policy-compliant
DedupConversationFilter        2-5%           Occasional conversation clusters

Latency Impact

  • Pre-Scoring Filters (Total): 5-15ms
    • Most filters are in-memory operations (O(n) or O(n log n))
    • AuthorSocialgraphFilter and MutedKeywordFilter are slightly more expensive
  • Post-Selection Filters (Total): 5-10ms
    • VFFilter may call external services (cached aggressively)
    • Runs on small candidate set (top K only)

Observability

Each filter emits metrics:
filter.candidates_in: 1500
filter.candidates_kept: 1200
filter.candidates_removed: 300
filter.removal_rate: 0.20
filter.latency_ms: 2.3
This enables:
  • Monitoring: Alert when filters remove unexpectedly high/low amounts
  • Debugging: Understand why specific candidates were filtered
  • Optimization: Identify slow filters that need optimization
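These counters can be derived directly from a filter's `FilterResult` plus a timer around the filter call. A sketch with illustrative names (`FilterMetrics`, `metrics_for` are not production APIs):

```rust
// Illustrative metrics derived from one filter run.
struct FilterMetrics {
    candidates_in: usize,
    candidates_kept: usize,
    candidates_removed: usize,
    removal_rate: f64,
}

fn metrics_for(kept: usize, removed: usize) -> FilterMetrics {
    let total = kept + removed;
    FilterMetrics {
        candidates_in: total,
        candidates_kept: kept,
        candidates_removed: removed,
        // Guard against empty input to avoid dividing by zero.
        removal_rate: if total == 0 {
            0.0
        } else {
            removed as f64 / total as f64
        },
    }
}
```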

Related Pages

  • Pipeline Stages: Complete overview of the recommendation pipeline
  • Scoring and Ranking: How candidates are scored after filtering
  • Candidate Hydration: How candidate metadata is enriched before filtering
  • User Features: User preferences and socialgraph used by filters
