
Overview

Filters remove candidates that should not be shown to users. The pipeline applies filters at two stages:
  1. Pre-Scoring Filters: Run before ML scoring to reduce compute costs
  2. Post-Selection Filters: Run after selection for final validation
Filters are applied sequentially, with each filter receiving the kept candidates from the previous filter.
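The sequential chaining can be sketched as follows. This is a simplified, synchronous sketch; the production pipeline is async, and `Candidate`, `run_filters`, and the closure-based filter shape here are illustrative, not the real types:

```rust
// Simplified sketch of sequential filter application: each filter receives
// only the candidates kept by the previous filter.
struct Candidate {
    tweet_id: u64,
}

fn run_filters<F>(mut candidates: Vec<Candidate>, filters: &[F]) -> Vec<Candidate>
where
    F: Fn(Vec<Candidate>) -> (Vec<Candidate>, Vec<Candidate>),
{
    for f in filters {
        // Each filter splits its input into (kept, removed).
        let (kept, _removed) = f(candidates);
        // Only the kept candidates flow into the next filter.
        candidates = kept;
    }
    candidates
}
```

The removed side is discarded here for brevity; in production it feeds the per-filter observability metrics described below.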

Filter Architecture

All filters implement the Filter trait:
#[async_trait]
pub trait Filter<Q, C> {
    async fn filter(
        &self,
        query: &Q,
        candidates: Vec<C>,
    ) -> Result<FilterResult<C>, String>;
}

pub struct FilterResult<C> {
    pub kept: Vec<C>,
    pub removed: Vec<C>,
}
This enables:
  • Observability: Track how many candidates each filter removes
  • Debugging: Inspect which candidates were filtered and why
  • Metrics: Monitor filter effectiveness and false positive rates
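As a concrete illustration, here is a minimal synchronous analogue of the trait together with a toy filter implementing it. The production trait is async; `Query`, `Cand`, and `MinScoreFilter` are illustrative names, not production types:

```rust
pub struct FilterResult<C> {
    pub kept: Vec<C>,
    pub removed: Vec<C>,
}

// Synchronous sketch of the Filter trait; the real version is async.
pub trait Filter<Q, C> {
    fn filter(&self, query: &Q, candidates: Vec<C>) -> Result<FilterResult<C>, String>;
}

// Toy query/candidate types for the example.
pub struct Query {
    pub min_score: f64,
}
pub struct Cand {
    pub score: f64,
}

// A filter that drops candidates below a per-query score threshold.
pub struct MinScoreFilter;

impl Filter<Query, Cand> for MinScoreFilter {
    fn filter(&self, query: &Query, candidates: Vec<Cand>) -> Result<FilterResult<Cand>, String> {
        let (kept, removed): (Vec<Cand>, Vec<Cand>) = candidates
            .into_iter()
            .partition(|c| c.score >= query.min_score);
        Ok(FilterResult { kept, removed })
    }
}
```

Returning both `kept` and `removed` (rather than just the survivors) is what makes the observability, debugging, and metrics listed above possible.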

Pre-Scoring Filters

These filters run before the Phoenix scorer to reduce the number of candidates that need ML predictions.

DropDuplicatesFilter

Purpose: Remove duplicate post IDs within the same candidate batch. Implementation:
home-mixer/filters/drop_duplicates_filter.rs
use std::collections::HashSet;

let mut seen_ids = HashSet::new();
let (mut kept, mut removed) = (Vec::new(), Vec::new());
for candidate in candidates {
    if seen_ids.insert(candidate.tweet_id) {
        kept.push(candidate);
    } else {
        removed.push(candidate);
    }
}
Why this matters: Duplicates can occur when multiple sources return the same post (e.g., Thunder and Phoenix both surface a viral post).

CoreDataHydrationFilter

Purpose: Remove posts that failed to hydrate core metadata. Removes candidates missing:
  • Post text
  • Author information
  • Timestamps
  • Media URLs (for media posts)
Why this matters: Candidates without core data cannot be properly displayed or scored.
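A hedged sketch of what this check might look like; `HydratedPost` and its field names are illustrative, not the production candidate schema:

```rust
// Illustrative candidate shape; the real hydrated candidate has many more fields.
struct HydratedPost {
    tweet_text: Option<String>,
    author_id: Option<u64>,
    created_at_ms: Option<i64>,
    has_media: bool,
    media_urls: Vec<String>,
}

// Keep a candidate only if every piece of core metadata hydrated successfully.
fn has_core_data(c: &HydratedPost) -> bool {
    c.tweet_text.is_some()
        && c.author_id.is_some()
        && c.created_at_ms.is_some()
        // Media URLs are only required for media posts.
        && (!c.has_media || !c.media_urls.is_empty())
}
```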

AgeFilter

Purpose: Remove posts older than a specified maximum age. Default threshold: typically 24-48 hours for the For You feed. Implementation:
home-mixer/filters/age_filter.rs
pub struct AgeFilter {
    pub max_age: Duration,
}

impl AgeFilter {
    fn is_within_age(&self, tweet_id: i64) -> bool {
        // Decode the creation timestamp embedded in the Snowflake ID;
        // treat undecodable IDs as out of range.
        snowflake::duration_since_creation_opt(tweet_id)
            .map(|age| age <= self.max_age)
            .unwrap_or(false)
    }
}
Post IDs use Snowflake encoding, which embeds the creation timestamp. This allows age calculation without database lookups.
Why this matters: Users expect fresh, recent content in their For You feed.
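The Snowflake decoding behind `duration_since_creation_opt` can be sketched like this, using Twitter's public Snowflake layout (the upper bits of the ID hold milliseconds since the custom epoch 1288834974657); `age_at` is an illustrative analogue, not the production helper:

```rust
use std::time::Duration;

// Twitter's Snowflake epoch: 2010-11-04T01:42:54.657Z, in Unix milliseconds.
const TWITTER_EPOCH_MS: u64 = 1_288_834_974_657;

// Extract the creation time (Unix ms) embedded in a Snowflake post ID:
// the bits above the low 22 are a millisecond offset from the custom epoch.
fn creation_unix_ms(tweet_id: u64) -> u64 {
    (tweet_id >> 22) + TWITTER_EPOCH_MS
}

// Illustrative analogue of duration_since_creation_opt: the post's age
// relative to `now_ms`, or None if the ID decodes to a future timestamp.
fn age_at(tweet_id: u64, now_ms: u64) -> Option<Duration> {
    now_ms
        .checked_sub(creation_unix_ms(tweet_id))
        .map(Duration::from_millis)
}
```

Because the timestamp is embedded in the ID itself, the AgeFilter needs no database lookup at all.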

SelfTweetFilter

Purpose: Remove posts authored by the viewing user. Implementation:
home-mixer/filters/self_tweet_filter.rs
let viewer_id = query.user_id as u64;
let (kept, removed): (Vec<_>, Vec<_>) = candidates
    .into_iter()
    .partition(|c| c.author_id != viewer_id);
Why this matters: Users don’t want to see their own posts in their personalized feed (they already know what they posted).

RetweetDeduplicationFilter

Purpose: Deduplicate reposts, keeping only the first occurrence of a post (whether as original or repost). Algorithm:
  1. Track seen post IDs (both originals and the targets of reposts)
  2. For each candidate:
    • If it's a repost: keep it only if the original post's ID hasn't been seen yet, then mark that ID as seen
    • If it's an original: mark its post ID as seen and keep it
  3. Later reposts of an already-seen post are removed
Implementation:
home-mixer/filters/retweet_deduplication_filter.rs
use std::collections::HashSet;

let mut seen_tweet_ids: HashSet<u64> = HashSet::new();
let (mut kept, mut removed) = (Vec::new(), Vec::new());
for candidate in candidates {
    match candidate.retweeted_tweet_id {
        Some(retweeted_id) => {
            // insert() returns true only on the first occurrence; keep the
            // first sighting of this tweet and remove later duplicates
            if seen_tweet_ids.insert(retweeted_id) {
                kept.push(candidate);
            } else {
                removed.push(candidate);
            }
        }
        None => {
            // Mark this original tweet ID as seen so retweets of it get filtered
            seen_tweet_ids.insert(candidate.tweet_id as u64);
            kept.push(candidate);
        }
    }
}
Why this matters: Showing both an original post and a repost of it is redundant and wastes feed space.

IneligibleSubscriptionFilter

Purpose: Remove paywalled/subscription content that the user cannot access. Removes:
  • Posts requiring paid subscription when user is not subscribed
  • Premium content behind authentication walls
Why this matters: Surfacing content users can’t access creates a poor user experience.
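A minimal sketch of the eligibility check. The field names and the per-author subscription model here are assumptions for illustration, not the production schema:

```rust
use std::collections::HashSet;

// Illustrative shapes; the real types carry much more state.
struct Post {
    author_id: u64,
    requires_subscription: bool,
}

struct Viewer {
    subscribed_author_ids: HashSet<u64>,
}

// Keep a candidate unless it is paywalled by an author the viewer
// is not subscribed to.
fn is_accessible(viewer: &Viewer, post: &Post) -> bool {
    !post.requires_subscription || viewer.subscribed_author_ids.contains(&post.author_id)
}
```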

PreviouslySeenPostsFilter

Purpose: Filter out posts the user has already seen in previous sessions. Uses two mechanisms:
  1. Seen ID list: Explicit post IDs recently recorded as seen, sent with the request
  2. Bloom filters: Probabilistic data structures sent from the client containing hashes of previously seen post IDs
Bloom filter advantages:
  • Memory efficient (can represent millions of IDs in a few KB)
  • Fast lookups (O(1))
Trade-offs:
  • False positives possible (may filter a post that wasn't actually seen)
  • No false negatives (will never show a post that was definitely seen)
Implementation:
home-mixer/filters/previously_seen_posts_filter.rs
let bloom_filters = query
    .bloom_filter_entries
    .iter()
    .map(BloomFilter::from_entry)
    .collect::<Vec<_>>();

let (removed, kept): (Vec<_>, Vec<_>) = candidates.into_iter().partition(|c| {
    get_related_post_ids(c).iter().any(|&post_id| {
        query.seen_ids.contains(&post_id)
            || bloom_filters
                .iter()
                .any(|filter| filter.may_contain(post_id))
    })
});
Why this matters: Showing posts users have already seen reduces engagement and satisfaction.

PreviouslyServedPostsFilter

Purpose: Remove posts that were already served earlier in the same request session. Difference from PreviouslySeenPostsFilter:
  • PreviouslySeenPostsFilter: Posts seen in past sessions (stored client-side)
  • PreviouslyServedPostsFilter: Posts served in this pagination sequence (server-side tracking)
Why this matters: Prevents duplicates when user scrolls through multiple pages of results.
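A sketch of the server-side pagination tracking; the `ServedState` session-state shape is an assumption for illustration:

```rust
use std::collections::HashSet;

// Per-session set of post IDs already returned on earlier pages.
struct ServedState {
    served_ids: HashSet<u64>,
}

impl ServedState {
    fn new() -> Self {
        ServedState { served_ids: HashSet::new() }
    }

    // Drop anything served on a previous page, then record this page's IDs
    // so the next page deduplicates against it.
    fn filter_page(&mut self, candidates: Vec<u64>) -> Vec<u64> {
        let kept: Vec<u64> = candidates
            .into_iter()
            .filter(|id| !self.served_ids.contains(id))
            .collect();
        self.served_ids.extend(kept.iter().copied());
        kept
    }
}
```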

MutedKeywordFilter

Purpose: Remove posts containing keywords the user has muted. Algorithm:
  1. Tokenize user’s muted keywords
  2. Tokenize post text
  3. Check for matches using pattern matching
Implementation:
home-mixer/filters/muted_keyword_filter.rs
let token_sequences: Vec<TokenSequence> = muted_keywords
    .iter()
    .map(|k| self.tokenizer.tokenize(k))
    .collect();
let user_mutes = UserMutes::new(token_sequences);
let matcher = MatchTweetGroup::new(user_mutes);

let (mut kept, mut removed) = (Vec::new(), Vec::new());
for candidate in candidates {
    let tweet_text_token_sequence = self.tokenizer.tokenize(&candidate.tweet_text);
    if matcher.matches(&tweet_text_token_sequence) {
        removed.push(candidate);
    } else {
        kept.push(candidate);
    }
}
Tokenization is required because users may mute phrases (“word1 word2”) rather than exact strings.
Why this matters: Respects user preferences for content they explicitly don’t want to see.

AuthorSocialgraphFilter

Purpose: Remove posts from authors the user has blocked or muted. Implementation:
home-mixer/filters/author_socialgraph_filter.rs
let viewer_blocked_user_ids = query.user_features.blocked_user_ids.clone();
let viewer_muted_user_ids = query.user_features.muted_user_ids.clone();

let (mut kept, mut removed) = (Vec::new(), Vec::new());
for candidate in candidates {
    let author_id = candidate.author_id as i64;
    let muted = viewer_muted_user_ids.contains(&author_id);
    let blocked = viewer_blocked_user_ids.contains(&author_id);
    if muted || blocked {
        removed.push(candidate);
    } else {
        kept.push(candidate);
    }
}
Why this matters: Critical for user safety and satisfaction. Users explicitly signal they don’t want content from these authors.

Post-Selection Filters

These filters run after selection on the top K candidates. They handle expensive operations or final validation.

VFFilter (Visibility Filtering)

Purpose: Remove posts that violate content policies or are otherwise ineligible for display. Removes posts that are:
  • Deleted
  • Marked as spam
  • Violent or gory
  • NSFW (when the user has safe mode enabled)
  • In violation of copyright
  • From suspended accounts
Implementation:
home-mixer/filters/vf_filter.rs
fn should_drop(reason: &Option<FilteredReason>) -> bool {
    match reason {
        Some(FilteredReason::SafetyResult(safety_result)) => {
            matches!(safety_result.action, Action::Drop(_))
        }
        Some(_) => true,
        None => false,
    }
}
VF filtering happens post-selection because it requires calling external safety/visibility services, which would be too expensive to run on all candidates pre-scoring.
Why this matters: Ensures platform safety and content policy compliance.

DedupConversationFilter

Purpose: Deduplicate multiple branches of the same conversation thread. Problem it solves: If a viral post has many popular replies, the feed could show:
  1. Original post
  2. Reply A to original post
  3. Reply B to original post
  4. Reply C to original post
This creates a repetitive experience. Solution: Keep only the highest-scoring branch of each conversation, ensuring variety. Why this matters: Maintains feed diversity and prevents conversation threads from dominating the feed.
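Since post-selection candidates are already ranked by score, keeping the highest-scoring branch per conversation reduces to keeping the first candidate seen for each conversation ID. A sketch, assuming a `conversation_id` field (illustrative, not the production type):

```rust
use std::collections::HashSet;

// Illustrative candidate: input is assumed sorted best-first, and every
// member of the same thread shares one conversation ID.
struct RankedPost {
    tweet_id: u64,
    conversation_id: u64,
}

// Keeps only the top-scoring member of each conversation thread.
fn dedup_conversations(candidates: Vec<RankedPost>) -> Vec<RankedPost> {
    let mut seen = HashSet::new();
    candidates
        .into_iter()
        // insert() returns false for conversations already represented.
        .filter(|c| seen.insert(c.conversation_id))
        .collect()
}
```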

Filter Performance

Typical Removal Rates

Based on production metrics for a typical For You feed request:
Filter                         Removal Rate   Description
DropDuplicatesFilter           1-3%           Low duplicate rate from sources
CoreDataHydrationFilter        0.5-1%         Most candidates hydrate successfully
AgeFilter                      5-10%          Depends on source recency
SelfTweetFilter                0.1-0.5%       Few self-posts in sources
RetweetDeduplicationFilter     10-15%         Many viral posts get reposted
IneligibleSubscriptionFilter   2-5%           Subscription content is limited
PreviouslySeenPostsFilter      20-40%         High removal rate for engaged users
PreviouslyServedPostsFilter    5-10%          Pagination deduplication
MutedKeywordFilter             1-3%           Most users have few muted keywords
AuthorSocialgraphFilter        3-8%           Blocks/mutes accumulate over time
VFFilter                       1-2%           Most selected content is policy-compliant
DedupConversationFilter        2-5%           Occasional conversation clusters

Latency Impact

  • Pre-Scoring Filters (Total): 5-15ms
    • Most filters are in-memory operations (O(n) or O(n log n))
    • AuthorSocialgraphFilter and MutedKeywordFilter are slightly more expensive
  • Post-Selection Filters (Total): 5-10ms
    • VFFilter may call external services (cached aggressively)
    • Runs on small candidate set (top K only)

Observability

Each filter emits metrics:
filter.candidates_in: 1500
filter.candidates_kept: 1200
filter.candidates_removed: 300
filter.removal_rate: 0.20
filter.latency_ms: 2.3
This enables:
  • Monitoring: Alert when filters remove unexpectedly high/low amounts
  • Debugging: Understand why specific candidates were filtered
  • Optimization: Identify slow filters that need optimization
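These counters can be derived directly from a filter's `FilterResult` plus a timer around the filter call. A sketch with illustrative names (`FilterMetrics`, `metrics_for` are not production APIs):

```rust
// Illustrative metrics derived from one filter run.
struct FilterMetrics {
    candidates_in: usize,
    candidates_kept: usize,
    candidates_removed: usize,
    removal_rate: f64,
}

fn metrics_for(kept: usize, removed: usize) -> FilterMetrics {
    let total = kept + removed;
    FilterMetrics {
        candidates_in: total,
        candidates_kept: kept,
        candidates_removed: removed,
        // Guard against empty input to avoid dividing by zero.
        removal_rate: if total == 0 {
            0.0
        } else {
            removed as f64 / total as f64
        },
    }
}
```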

Related Pages

  • Pipeline Stages: Complete overview of the recommendation pipeline
  • Scoring and Ranking: How candidates are scored after filtering
  • Candidate Hydration: How candidate metadata is enriched before filtering
  • User Features: User preferences and socialgraph used by filters
