## Algorithm overview
The ranking score combines two factors: community upvotes and recency. Each notebook is scored as `upvotes * (0.95 ^ (age_days / 30))`. This exponential decay formula ensures that older notebooks gradually lose relevance without completely eliminating valuable historical content.
## Implementation

The ranking logic lives in `backend/services/notebook_service.py:471`.
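The following is a minimal reconstruction of that logic from the formula described on this page, not the actual code at that line; the `Notebook` shape and function names are assumptions:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional

DECAY_RATE = 0.95
DECAY_PERIOD_DAYS = 30
DEFAULT_AGE_DAYS = 730  # fallback when last_updated is missing


@dataclass
class Notebook:
    title: str
    upvotes: int
    last_updated: Optional[datetime] = None


def score(notebook: Notebook, now: Optional[datetime] = None) -> float:
    """Effective upvotes: raw upvotes discounted by exponential time decay."""
    now = now or datetime.now(timezone.utc)
    if notebook.last_updated is None:
        age_days = DEFAULT_AGE_DAYS
    else:
        age_days = (now - notebook.last_updated).days
    return notebook.upvotes * DECAY_RATE ** (age_days / DECAY_PERIOD_DAYS)


def rank_notebooks(notebooks: List[Notebook]) -> List[Notebook]:
    """Sort notebooks by decayed score, highest first."""
    return sorted(notebooks, key=score, reverse=True)
```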
## Why this algorithm?

### Problem: Balancing quality and freshness

Kaggle notebooks face a unique challenge:
- High-upvote notebooks often represent genuinely valuable insights
- Old notebooks may use deprecated libraries or outdated techniques
- Recent notebooks might lack community validation
A pure upvote-based ranking would favor 3-year-old notebooks that accumulated votes over time. A pure recency ranking would surface untested content.
### Solution: Exponential time decay

The decay factor `0.95 ^ (age_days / 30)` implements exponential decay: each 30 days of age multiplies a notebook's effective upvotes by 0.95, giving a half-life of roughly 405 days:
| Notebook age | Decay factor | Effective upvotes (1000 base) |
|---|---|---|
| 0 days | 1.00 | 1000 |
| 30 days | 0.95 | 950 |
| 90 days | 0.86 | 860 |
| 180 days | 0.74 | 740 |
| 365 days | 0.54 | 540 |
| 730 days | 0.29 | 290 |
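These factors follow directly from the formula; the implied half-life (the age at which a notebook keeps half its effective upvotes) can be checked with a few lines:

```python
import math

DECAY_RATE = 0.95
PERIOD_DAYS = 30


def decay_factor(age_days: float) -> float:
    """Fraction of upvotes retained at a given notebook age."""
    return DECAY_RATE ** (age_days / PERIOD_DAYS)


# Half-life: solve 0.95 ** (t / 30) == 0.5 for t
half_life_days = PERIOD_DAYS * math.log(0.5) / math.log(DECAY_RATE)
print(round(half_life_days))  # 405

# Reproduce the table rows above
for age in (0, 30, 90, 180, 365, 730):
    print(f"{age:>3} days -> {decay_factor(age):.2f}")
```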
## Ranking in practice

### Example comparison
Consider two notebooks for the Titanic competition:

Notebook A: “Classic EDA + Random Forest”
- Upvotes: 2500
- Age: 720 days
- Score: `2500 * (0.95 ^ (720/30)) ≈ 2500 * 0.292 = 730`

Notebook B:
- Upvotes: 800
- Age: 60 days
- Score: `800 * (0.95 ^ (60/30)) = 800 * 0.9025 = 722`

Despite being nearly two years old, Notebook A still narrowly outranks Notebook B.
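The same comparison, computed directly (purely illustrative of the formula above):

```python
DECAY_RATE = 0.95
PERIOD_DAYS = 30


def score(upvotes: int, age_days: int) -> float:
    """Effective upvotes after exponential time decay."""
    return upvotes * DECAY_RATE ** (age_days / PERIOD_DAYS)


notebook_a = score(2500, 720)  # ~730 effective upvotes
notebook_b = score(800, 60)    # ~722 effective upvotes
print(notebook_a > notebook_b)  # True: A still narrowly wins
```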
### Edge case: Missing dates

If a notebook lacks a `last_updated` field, it defaults to 730 days old (2 years).
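A sketch of that fallback; the function name and exact parsing details are assumptions, not the code in `notebook_service.py`:

```python
from datetime import datetime, timezone
from typing import Optional

DEFAULT_AGE_DAYS = 730  # notebooks without last_updated are treated as 2 years old


def age_in_days(last_updated: Optional[str], now: Optional[datetime] = None) -> float:
    """Parse an ISO-8601 timestamp; fall back to the 730-day default."""
    if not last_updated:
        return DEFAULT_AGE_DAYS
    now = now or datetime.now(timezone.utc)
    try:
        updated = datetime.fromisoformat(last_updated.replace("Z", "+00:00"))
    except ValueError:
        return DEFAULT_AGE_DAYS
    return max((now - updated).days, 0)
```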
## Integration with ingestion pipeline

The ranking algorithm runs after notebook metadata is fetched from the Kaggle API. KaggleIngest fetches 20 notebooks but only downloads and processes the top 3-10 (user-configurable) to balance context quality and processing cost.
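In pipeline terms this is just fetch, rank, slice; a sketch with assumed names and an assumed precomputed `score` field:

```python
FETCH_LIMIT = 20  # notebooks fetched per competition


def select_for_download(fetched: list, top_k: int = 3) -> list:
    """Rank all fetched notebooks by score, keep only top_k for full download."""
    top_k = max(3, min(top_k, 10))  # user-configurable, clamped to the 3-10 range
    ranked = sorted(fetched, key=lambda nb: nb.get("score", 0.0), reverse=True)
    return ranked[:top_k]
```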
### Upvote sources

Kaggle API responses may use different field names for upvotes:
- Kaggle API v1 (`total_votes`)
- Kaggle API v2 (`upvotes`)
- Missing data (defaults to `0`)
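A defensive extraction helper might look like the following sketch, using the field names listed above:

```python
def extract_upvotes(raw: dict) -> int:
    """Try each known Kaggle API upvote field, defaulting to 0."""
    for field in ("total_votes", "upvotes"):
        value = raw.get(field)
        if value is not None:
            return int(value)
    return 0
```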
## Alternative algorithms considered

### Why not logarithmic upvote scaling?
Some ranking systems (e.g., Reddit, Hacker News) use `log(upvotes)` to prevent viral content from dominating. KaggleIngest uses linear upvotes because:
- Kaggle notebooks rarely accumulate extreme upvote counts (max ~5000)
- High upvotes genuinely correlate with quality in the Kaggle ecosystem
- Linear scaling preserves the relative importance of community validation
### Why not machine learning-based ranking?
ML-based ranking (using features like code quality, markdown length, execution success) was considered but rejected because:
- Cold start problem: Requires training data
- Complexity: Adds model maintenance overhead
- Interpretability: Users can’t easily understand why notebooks rank as they do
- Sufficient quality: The current heuristic performs well in practice
## Tuning the decay rate

The decay rate `0.95` is configurable. Here's how different rates affect ranking:
| Decay rate | 30-day retention | 365-day retention | Use case |
|---|---|---|---|
| 0.90 | 90% | 28% | Fast-moving fields (ML research) |
| 0.95 | 95% | 54% | Current default |
| 0.98 | 98% | 78% | Stable knowledge domains |
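The retention columns follow directly from the decay formula, so candidate rates can be compared with a short script:

```python
def retention(decay_rate: float, age_days: float, period_days: int = 30) -> float:
    """Fraction of a notebook's upvotes still counted at a given age."""
    return decay_rate ** (age_days / period_days)


# Compare the three rates from the table above
for rate in (0.90, 0.95, 0.98):
    print(f"{rate}: 30d={retention(rate, 30):.0%}, 365d={retention(rate, 365):.0%}")
```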
## Performance impact

Ranking is an O(n log n) operation, dominated by the sort:
- Date parsing: O(n)
- Score calculation: O(n)
- Sorting: O(n log n)
- Total: O(n log n)
## Testing and validation

Ranking correctness is validated through unit tests.

## Future improvements
- Personalized ranking: Account for user preferences (e.g., prefer PyTorch over TensorFlow)
- Execution success weight: Boost notebooks that run without errors
- Comment sentiment: Incorporate positive/negative feedback from comments
- Medal weight: Prioritize notebooks from competition winners
## Related resources

- Implementation: `backend/services/notebook_service.py:471`
- Notebook data model: `backend/models/notebook.py`
- Ingestion pipeline: `backend/services/notebook_service.py:186`