Overview
The IMDb Scraper is a production-grade data extraction system that scrapes IMDb’s Top 250 movies chart. It combines anti-blocking capabilities, hybrid persistence (CSV + PostgreSQL), and a fully Dockerized deployment workflow. The project implements Clean Architecture and Domain-Driven Design (DDD) principles to keep the codebase maintainable, testable, and ready to scale.
Built as a technical demonstration, it showcases distributed scraping, architectural design patterns, and production-ready data pipelines.
What Does It Do?
The IMDb Scraper extracts comprehensive movie data from IMDb’s Top 250 chart, including:
- Title - Movie name
- Year - Release year
- Rating - IMDb user rating (0-10)
- Duration - Runtime in minutes
- Metascore - Metacritic score (0-100)
- Actors - Top 3 cast members from detail pages
Key Features
Distributed Network Strategy
The scraper implements a multi-layered evasion strategy to bypass IP-based blocking:
- VPN Integration - ProtonVPN running in Docker for geolocation shifting
- Premium Proxies - DataImpulse rotating proxies with automatic fallback
- TOR Network - Anonymous IP rotation via TOR as the final fallback layer
The scraper automatically validates IP changes and includes exponential backoff with intelligent retry logic.
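The layered fallback with exponential backoff described above might look roughly like the following. This is a minimal sketch, not the project's actual code: the names `fetch_with_fallback`, `backoff_delay`, and `AllLayersFailed` are hypothetical, each layer is modeled as a plain callable, and IP-change validation is left out.

```python
import random
import time
from typing import Callable, Sequence


class AllLayersFailed(Exception):
    """Raised when every network layer has exhausted its retries (hypothetical)."""


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff: base * 2**attempt seconds, capped at `cap`."""
    return min(cap, base * (2 ** attempt))


def fetch_with_fallback(
    url: str,
    layers: Sequence[tuple[str, Callable[[str], str]]],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[str, str]:
    """Try each layer in order, retrying with backoff before falling through.

    Returns (layer_name, body) from the first layer that succeeds.
    """
    for name, fetch in layers:
        for attempt in range(max_retries):
            try:
                return name, fetch(url)
            except ConnectionError:
                # Wait longer after each failure; a little jitter keeps
                # parallel workers from retrying in lockstep.
                sleep(backoff_delay(attempt) + random.uniform(0, 0.1))
    raise AllLayersFailed(url)
```

A caller would pass the layers in priority order, for example VPN first, then proxies, then TOR. The `sleep` parameter is injectable so the retry schedule can be tested without real waiting.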
Clean Architecture + DDD
The project follows a strict layered architecture:
- Separation of Concerns - Dependencies point inward; domain logic is framework-agnostic
- Testability - Core business logic can be unit tested in isolation
- Maintainability - Clear boundaries between layers make changes predictable
- Scalability - Easy to swap implementations (e.g., switch from requests to Playwright)
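To illustrate how inward-pointing dependencies make implementations swappable, here is a small sketch. All names (`PageFetcher`, `StaticFetcher`, `ListTitlesUseCase`) and the `<h3>` markup are assumptions for illustration; the real project's interfaces may differ.

```python
import re
from typing import Protocol


class PageFetcher(Protocol):
    """Port: anything that can turn a URL into HTML (hypothetical name)."""

    def fetch(self, url: str) -> str: ...


class StaticFetcher:
    """Stand-in adapter that returns canned HTML.

    A requests-backed or Playwright-backed adapter would expose the same
    fetch() method, so the use case below would not need to change.
    """

    def __init__(self, pages: dict[str, str]) -> None:
        self._pages = pages

    def fetch(self, url: str) -> str:
        return self._pages[url]


class ListTitlesUseCase:
    """Application-layer logic that depends only on the PageFetcher port."""

    def __init__(self, fetcher: PageFetcher) -> None:
        self._fetcher = fetcher

    def execute(self, url: str) -> list[str]:
        html = self._fetcher.fetch(url)
        # Assumed markup: each title sits inside an <h3> element.
        return re.findall(r"<h3[^>]*>(.*?)</h3>", html)
```

Because the use case only knows about the port, unit tests can pass in `StaticFetcher` and never touch the network.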
Hybrid Persistence
Data is saved simultaneously to two storage backends:

| Storage | Use Case | Format |
|---|---|---|
| CSV | Quick exports, data portability, Excel analysis | movies.csv, actors.csv, movie_actor.csv |
| PostgreSQL | Relational queries, analytics, production-grade storage | Normalized schema with N:M relationships |
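The sketch below shows one way the dual write and the N:M schema could fit together. It is an assumption-heavy illustration: `sqlite3` stands in for PostgreSQL so the example runs without a server, an in-memory buffer stands in for movies.csv, and the column names and `save_movie` helper are hypothetical.

```python
import csv
import io
import sqlite3

# Normalized schema: movies and actors are linked through movie_actor (N:M).
SCHEMA = """
CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT NOT NULL,
                     year INTEGER, rating REAL);
CREATE TABLE actors (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE movie_actor (
    movie_id INTEGER NOT NULL REFERENCES movies(id),
    actor_id INTEGER NOT NULL REFERENCES actors(id),
    PRIMARY KEY (movie_id, actor_id)
);
"""


def save_movie(conn: sqlite3.Connection, csv_writer, movie: dict) -> None:
    """Persist one movie to both backends: a CSV row and normalized tables."""
    csv_writer.writerow([movie["title"], movie["year"], movie["rating"]])
    cur = conn.execute(
        "INSERT INTO movies (title, year, rating) VALUES (?, ?, ?)",
        (movie["title"], movie["year"], movie["rating"]),
    )
    movie_id = cur.lastrowid
    for name in movie["actors"]:
        # Reuse an existing actor row so each actor is stored only once.
        conn.execute("INSERT OR IGNORE INTO actors (name) VALUES (?)", (name,))
        actor_id = conn.execute(
            "SELECT id FROM actors WHERE name = ?", (name,)
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO movie_actor (movie_id, actor_id) VALUES (?, ?)",
            (movie_id, actor_id),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for a PostgreSQL connection
conn.executescript(SCHEMA)
buffer = io.StringIO()              # stand-in for movies.csv
writer = csv.writer(buffer)
save_movie(conn, writer, {"title": "Example Movie A", "year": 1999,
                          "rating": 8.5, "actors": ["Actor One", "Actor Two"]})
save_movie(conn, writer, {"title": "Example Movie B", "year": 2001,
                          "rating": 8.1, "actors": ["Actor One"]})
```

With this layout, an actor who appears in several movies has one row in `actors` and one link row per movie in `movie_actor`.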
Concurrency & Performance
- ThreadPoolExecutor for parallel detail page scraping
- Configurable thread pool (default: 50 workers)
- Intelligent request throttling to avoid rate limits
- Traffic monitoring and logging
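As a rough sketch of how parallel detail-page scraping and throttling can be combined, consider the following. The `Throttle` class and `scrape_details` function are hypothetical names, and the minimum-interval policy is just one simple throttling approach; the project may use a different one.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


class Throttle:
    """Enforces a minimum interval between request starts across all threads."""

    def __init__(self, min_interval: float) -> None:
        self._min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self) -> None:
        # Reserve the next free time slot under the lock, then sleep outside
        # it so other threads can reserve their own slots in the meantime.
        with self._lock:
            now = time.monotonic()
            delay = max(0.0, self._next_allowed - now)
            self._next_allowed = max(now, self._next_allowed) + self._min_interval
        if delay:
            time.sleep(delay)


def scrape_details(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    max_workers: int = 50,
    min_interval: float = 0.0,
) -> list[str]:
    """Fetch every URL on a thread pool, spacing request starts by min_interval."""
    throttle = Throttle(min_interval)

    def task(url: str) -> str:
        throttle.wait()
        return fetch(url)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order even if tasks finish out of order.
        return list(pool.map(task, urls))
```

The default of 50 workers mirrors the figure quoted above; in practice the throttle interval, not the pool size, bounds the request rate.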
Architecture Highlights
Domain Models with Built-in Validation
Entities enforce their own business rules.
Factory Pattern & Dependency Injection
A centralized DependencyContainer manages object lifecycles and wiring.
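The two ideas in this section, self-validating entities and a central container, can be sketched together as follows. Only the `DependencyContainer` name and the 0-10 rating range come from this page; the `Movie` fields, the repository, the use case, and the container's attributes are assumptions made for illustration.

```python
from dataclasses import dataclass
from functools import cached_property


@dataclass(frozen=True)
class Movie:
    """Entity that rejects invalid state at construction time."""

    title: str
    year: int
    rating: float

    def __post_init__(self) -> None:
        if not self.title.strip():
            raise ValueError("title must not be empty")
        if not 0.0 <= self.rating <= 10.0:
            raise ValueError(f"rating out of range: {self.rating}")


class InMemoryMovieRepository:
    """Hypothetical repository; a CSV or PostgreSQL one would share save()."""

    def __init__(self) -> None:
        self._movies: list[Movie] = []

    def save(self, movie: Movie) -> None:
        self._movies.append(movie)

    def count(self) -> int:
        return len(self._movies)


class SaveMovieUseCase:
    def __init__(self, repository: InMemoryMovieRepository) -> None:
        self._repository = repository

    def execute(self, title: str, year: int, rating: float) -> Movie:
        movie = Movie(title, year, rating)  # validation happens here
        self._repository.save(movie)
        return movie


class DependencyContainer:
    """Builds each collaborator once and wires it into its consumers."""

    @cached_property
    def repository(self) -> InMemoryMovieRepository:
        return InMemoryMovieRepository()

    @cached_property
    def save_movie(self) -> SaveMovieUseCase:
        return SaveMovieUseCase(self.repository)
```

Using `cached_property` gives each container instance a single shared repository, so every use case resolved from the same container writes to the same store.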
When to Use This Scraper
Use Cases
- Building movie analytics dashboards
- Training ML models on film data
- Creating recommendation systems
- Academic research on cinema trends
- Portfolio/demonstration projects
Not Suitable For
- Real-time production scraping (check IMDb’s ToS)
- High-frequency data extraction
- Commercial resale of IMDb data
- Projects requiring live updates
Technology Stack
- Python 3.x - Core language
- BeautifulSoup4 - HTML parsing
- Requests - HTTP client with SOCKS proxy support
- PostgreSQL 15 - Relational database
- Docker & Docker Compose - Container orchestration
- TOR - Anonymous network routing
- ProtonVPN - VPN integration via Gluetun
What’s Next?
- Quickstart Guide - Get the scraper running in 5 minutes with Docker
- Architecture Deep Dive - Explore the Clean Architecture implementation and design patterns
Project Author: Andrés Ruiz
GitHub: frankdevg