Discovery Pipeline
The Discovery Pipeline is a multi-stage system that automatically finds, evaluates, and onboards high-quality creators across multiple social media platforms.
Architecture Overview
The pipeline uses a queue-based architecture with three main stages: discovery, evaluation, and aggregation.
Core Components
The pipeline consists of several interconnected services:
- QueueConfig: Manages BullMQ queues and workers
- DiscoveryScheduler: Schedules recurring discovery jobs
- TrendingDiscovery: Searches for creators by trending topics
- CreatorEvaluator: Evaluates creator quality and suitability
- DuplicateDetector: Prevents duplicate creator entries
- DataAggregationEngine: Enriches creator profiles with cross-platform data
Queue System
The pipeline uses three specialized queues, each with concurrency settings tuned to its workload:
Creator Discovery Queue
Concurrency: 5 workers
Job Types:
- DISCOVER_TRENDING - Find creators from trending topics
- EXPLORE_CATEGORY - Explore specific categories across platforms
- REFRESH_EXISTING - Update existing creator data
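As a sketch of how these job types might be routed, the enum values below mirror the list above; the handler return values are placeholders standing in for the real processing methods, not the actual implementation:

```typescript
// Illustrative job router for the discovery queue. The JobType values
// come from this page; the dispatch bodies are placeholders.
enum JobType {
  DISCOVER_TRENDING = "DISCOVER_TRENDING",
  EXPLORE_CATEGORY = "EXPLORE_CATEGORY",
  REFRESH_EXISTING = "REFRESH_EXISTING",
}

interface DiscoveryJob {
  type: JobType;
  payload: Record<string, unknown>;
}

function processDiscoveryJob(job: DiscoveryJob): string {
  switch (job.type) {
    case JobType.DISCOVER_TRENDING:
      return "discoverTrending"; // would invoke TrendingDiscovery
    case JobType.EXPLORE_CATEGORY:
      return "exploreCategory";
    case JobType.REFRESH_EXISTING:
      return "refreshExisting";
    default:
      throw new Error(`Unknown job type: ${job.type}`);
  }
}
```

Adding a new discovery source means adding an enum value and a matching case here, which is the registration step described in the FAQ.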
Creator Evaluation Queue
Concurrency: 10 workers
Purpose: Evaluates discovered creators based on:
- Follower count and growth rate
- Engagement rate metrics
- Content quality indicators
- Audience demographics
- Brand safety factors
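A minimal scoring sketch based on the factors above. The metric names, weights, and thresholds here are assumptions for illustration; the real CreatorEvaluator defines its own:

```typescript
// Hypothetical weighted quality score; weights and thresholds are
// illustrative, not the CreatorEvaluator's actual values.
interface CreatorMetrics {
  followerCount: number;
  growthRate: number;      // e.g. 0.05 = 5% monthly growth
  engagementRate: number;  // e.g. 0.03 = 3%
  contentQuality: number;  // normalized 0..1
  audienceQuality: number; // normalized 0..1
  brandSafety: number;     // normalized 0..1
}

function qualityScore(m: CreatorMetrics): number {
  // Log-scale followers so mega-accounts don't dominate the score.
  const followerScore = Math.min(Math.log10(Math.max(m.followerCount, 1)) / 7, 1);
  return (
    0.2 * followerScore +
    0.1 * Math.min(m.growthRate * 10, 1) +
    0.3 * Math.min(m.engagementRate * 20, 1) +
    0.15 * m.contentQuality +
    0.15 * m.audienceQuality +
    0.1 * m.brandSafety
  );
}

type Decision = "add" | "monitor" | "reject";

function decide(score: number): Decision {
  if (score >= 0.7) return "add";     // above threshold: onboard
  if (score >= 0.4) return "monitor"; // mid-range: watch list
  return "reject";
}
```

The add/monitor/reject split matches the behavior described in the FAQ below; only the numeric cutoffs are invented here.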
Creator Aggregation Queue
Concurrency: 3 workers
Purpose: Aggregates cross-platform data and calculates composite scores
Discovery Methods
Trending Topic Discovery
Searches for creators based on viral topics and hashtags:
lib/discovery/discovery-pipeline.ts:131-177
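The actual logic lives at the location above; as a hedged sketch of the general shape, assuming per-platform clients that can search by hashtag (the `PlatformClient` interface and method name are assumptions):

```typescript
// Illustrative trending-topic discovery: fan out over platforms and
// hashtags, de-duplicating usernames per platform.
interface PlatformClient {
  searchByHashtag(tag: string, limit: number): Promise<string[]>; // usernames
}

async function discoverTrending(
  clients: Record<string, PlatformClient>,
  trendingTags: string[],
  perTagLimit = 20,
): Promise<Map<string, Set<string>>> {
  const found = new Map<string, Set<string>>(); // platform -> usernames
  for (const [platform, client] of Object.entries(clients)) {
    const usernames = new Set<string>();
    for (const tag of trendingTags) {
      for (const u of await client.searchByHashtag(tag, perTagLimit)) {
        usernames.add(u); // Set de-dupes within a platform
      }
    }
    found.set(platform, usernames);
  }
  return found;
}
```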
Category Exploration
Systematically explores content categories across all platforms:
lib/discovery/discovery-pipeline.ts:182-231
Creator Refresh
Updates existing creator profiles with latest metrics:
lib/discovery/discovery-pipeline.ts:236-259
Pipeline Control
Starting the Pipeline
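A minimal sketch of what starting might look like; the class shape, field names, and method bodies here are assumptions, not the project's actual API:

```typescript
// Hypothetical lifecycle sketch. The real start() would create the
// BullMQ queues, register workers, and kick off the scheduler.
class DiscoveryPipeline {
  private running = false;

  async start(): Promise<void> {
    if (this.running) return; // idempotent: ignore repeated starts
    // ...create queues, register workers, start the scheduler...
    this.running = true;
  }

  isRunning(): boolean {
    return this.running;
  }
}

const pipeline = new DiscoveryPipeline();
pipeline.start();
```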
Monitoring Pipeline Status
lib/discovery/discovery-pipeline.ts:373-392
Stopping the Pipeline
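Stopping should drain in-flight jobs rather than aborting them; with BullMQ each closer might wrap a `worker.close()` call, which waits for active jobs to finish. The helper below is an illustrative sketch, not the actual shutdown code:

```typescript
// Hypothetical graceful-shutdown helper: each closer drains one
// worker or connection. Failures are collected, not thrown early,
// so one bad worker doesn't block the rest of the shutdown.
type Closer = () => Promise<void>;

async function stopPipeline(closers: Closer[]): Promise<string[]> {
  const errors: string[] = [];
  const results = await Promise.allSettled(closers.map((c) => c()));
  results.forEach((r, i) => {
    if (r.status === "rejected") errors.push(`closer ${i}: ${r.reason}`);
  });
  return errors;
}
```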
Duplicate Detection
Before queuing creators for evaluation, the pipeline checks for duplicates:
- Username matching across platforms
- Profile URL comparison
- Fuzzy matching for display names
- Cross-platform identifier correlation
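The fuzzy display-name check might be sketched like this; the normalization rules, edit-distance approach, and 0.85 similarity threshold are assumptions, not the DuplicateDetector's actual parameters:

```typescript
// Illustrative duplicate check: normalized usernames plus a simple
// edit-distance similarity on display names.
function normalize(name: string): string {
  return name.toLowerCase().replace(/[^a-z0-9]/g, "");
}

// Standard Levenshtein distance via dynamic programming.
function editDistance(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,
        dp[i][j - 1] + 1,
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1),
      );
    }
  }
  return dp[a.length][b.length];
}

function isLikelyDuplicate(aName: string, bName: string): boolean {
  const a = normalize(aName);
  const b = normalize(bName);
  if (a === b) return true; // exact match after normalization
  const maxLen = Math.max(a.length, b.length);
  if (maxLen === 0) return false;
  // Similarity = 1 - distance/maxLen; flag near-identical names.
  return 1 - editDistance(a, b) / maxLen >= 0.85;
}
```

For example, "Creator.One" and "creator_one" normalize to the same string and are flagged as duplicates.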
Discovery Reports
Generate comprehensive reports on discovery performance:
lib/discovery/discovery-pipeline.ts:397-480
Report Structure
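A hypothetical shape for the report, inferred from the metrics this page describes; the actual structure is defined in lib/discovery/discovery-pipeline.ts:397-480, and the sample numbers below are invented for illustration:

```typescript
// Hypothetical report shape and sample, not the real structure.
interface DiscoveryReport {
  periodStart: string;          // ISO-8601 timestamp
  periodEnd: string;
  discovered: number;           // creators found by all discovery jobs
  evaluated: number;
  added: number;                // passed the quality threshold
  monitored: number;
  rejected: number;
  duplicatesSkipped: number;
  byPlatform: Record<string, number>;
}

const sample: DiscoveryReport = {
  periodStart: "2024-01-01T00:00:00Z",
  periodEnd: "2024-01-02T00:00:00Z",
  discovered: 340,
  evaluated: 320,
  added: 45,
  monitored: 110,
  rejected: 165,
  duplicatesSkipped: 20,
  byPlatform: { tiktok: 150, instagram: 90, twitter: 60, youtube: 40 },
};
```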
Configuration Options
Queue Concurrency
Adjust worker concurrency based on your infrastructure:
Job Priorities
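A hedged sketch covering both settings. The concurrency values mirror the queue descriptions above; the priority numbers are assumptions. In BullMQ, concurrency is a Worker option and priority is a per-job option where a lower number means higher priority:

```typescript
// Concurrency per queue (matches the defaults documented above).
const queueConcurrency = {
  creatorDiscovery: 5,
  creatorEvaluation: 10,
  creatorAggregation: 3,
};

// Hypothetical priority scheme: lower number = higher priority.
const jobPriority = {
  DISCOVER_TRENDING: 1, // time-sensitive: trends decay quickly
  REFRESH_EXISTING: 2,
  EXPLORE_CATEGORY: 3,  // systematic, not urgent
};
```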
Scheduler Configuration
The DiscoveryScheduler manages recurring jobs:
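The cadences below follow the guidance in the FAQ (trending several times daily, category exploration weekly, refresh daily); the cron patterns themselves are illustrative. With BullMQ these would typically be registered as repeatable jobs, e.g. `queue.add(name, data, { repeat: { pattern } })` (older BullMQ versions use `cron` instead of `pattern`):

```typescript
// Hypothetical recurring-job schedule; cron patterns are examples.
const schedule = [
  { type: "DISCOVER_TRENDING", pattern: "0 */6 * * *" }, // every 6 hours
  { type: "EXPLORE_CATEGORY", pattern: "0 3 * * 1" },    // weekly, Mon 03:00
  { type: "REFRESH_EXISTING", pattern: "0 1 * * *" },    // daily at 01:00
];
```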
Performance Optimization
Batch Processing
Process multiple creators in batches to reduce database queries:
Queue Metrics
Monitor queue health and performance:
Error Handling
The pipeline includes comprehensive error handling:
Retry Strategy
Configure job retries in queue settings:
Database Models
The pipeline interacts with these key models:
CreatorProfile
Stores basic creator information and discovery metadata:
Platform-Specific Metrics
- TiktokProfile - TikTok engagement metrics
- InstagramProfile - Instagram profile data
- TwitterMetrics - Twitter analytics
- YoutubeMetrics - YouTube channel stats
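The model names above might be sketched as TypeScript interfaces like the following; every field here is an assumption for illustration, as the real schemas live in the database models:

```typescript
// Hypothetical model shapes; field names are illustrative only.
interface TiktokProfile { username: string; followers: number; engagementRate: number; }
interface InstagramProfile { username: string; followers: number; }
interface TwitterMetrics { handle: string; followers: number; }
interface YoutubeMetrics { channelId: string; subscribers: number; }

interface CreatorProfile {
  id: string;
  displayName: string;
  discoveredAt: Date;
  discoverySource: "DISCOVER_TRENDING" | "EXPLORE_CATEGORY" | "REFRESH_EXISTING";
  status: "added" | "monitored" | "rejected";
  // Optional links to platform-specific metrics:
  tiktok?: TiktokProfile;
  instagram?: InstagramProfile;
  twitter?: TwitterMetrics;
  youtube?: YoutubeMetrics;
}
```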
FAQ: Discovery Pipeline
Q: How often should I run discovery jobs?
A: Trending discovery should run 2-4 times daily. Category exploration can run weekly. Creator refresh depends on your data freshness requirements - typically daily for active creators.

Q: What determines if a creator is added vs monitored?
A: The CreatorEvaluator uses a quality score based on engagement rate, follower count, content consistency, and audience quality. Scores above the threshold result in "add", mid-range scores get "monitor", and low scores are rejected.

Q: How do I prevent duplicate creators across platforms?
A: The DuplicateDetector checks profile URLs, usernames, and cross-platform identifiers. It also uses fuzzy matching on display names to catch variations.

Q: Can I customize evaluation criteria?
A: Yes, modify the CreatorEvaluator class to adjust scoring weights, minimum thresholds, and quality factors based on your requirements.

Q: What happens if a queue gets backed up?
A: Monitor queue metrics and increase worker concurrency if needed. You can also implement job prioritization to process high-value discoveries first.

Q: How do I add a new discovery source?
A: Add a new JobType enum value, create a processing method in DiscoveryPipeline, and register it in the job router (processDiscoveryJob method).

Next Steps
- Learn about Budget Management to control discovery costs
- Set up API Monitoring to track platform API usage
- Use CLI Tools to manually trigger discovery jobs