
Overview

The IMDb Scraper is a production-grade data extraction system designed to scrape IMDb’s Top 250 movies chart with advanced anti-blocking capabilities, hybrid persistence (CSV + PostgreSQL), and a completely Dockerized deployment workflow. This project demonstrates professional software engineering practices by implementing Clean Architecture and Domain-Driven Design (DDD) principles, ensuring the codebase is maintainable, testable, and ready to scale.
Built as a technical demonstration, this scraper showcases expertise in distributed scraping, architectural design patterns, and production-ready data pipelines.

What Does It Do?

The IMDb Scraper extracts comprehensive movie data from IMDb’s Top 250 chart, including:
  • Title - Movie name
  • Year - Release year
  • Rating - IMDb user rating (0-10)
  • Duration - Runtime in minutes
  • Metascore - Metacritic score (0-100)
  • Actors - Top 3 cast members from detail pages
All data is persisted in both CSV files (for portability) and PostgreSQL (for advanced SQL analytics).

Key Features

Distributed Network Strategy

The scraper implements a multi-layered evasion strategy to bypass IP-based blocking:

  • VPN Integration - ProtonVPN running in Docker for geolocation shifting
  • Premium Proxies - DataImpulse rotating proxies with automatic fallback
  • TOR Network - Anonymous IP rotation via the TOR network as the final fallback layer
The scraper automatically validates IP changes and includes exponential backoff with intelligent retry logic.
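The layered fallback with exponential backoff described above could be sketched as follows. The layer names, the `fetch` callable, and the retry parameters are illustrative assumptions, not the project's actual API:

```python
import random
import time

# Illustrative fallback order: VPN first, then proxies, TOR as last resort.
NETWORK_LAYERS = ["vpn", "proxy", "tor"]

def fetch_with_backoff(fetch, url, max_retries=3, base_delay=1.0):
    """Try each network layer in turn, backing off exponentially on failure.

    `fetch(url, layer)` is a hypothetical callable that raises on a blocked
    or failed request and returns the response body otherwise.
    """
    for layer in NETWORK_LAYERS:
        for attempt in range(max_retries):
            try:
                return fetch(url, layer)
            except Exception:
                # Exponential backoff with jitter: ~1s, ~2s, ~4s, ...
                delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
                time.sleep(delay)
    raise RuntimeError(f"All network layers exhausted for {url}")
```

The jitter term spreads retries out so parallel workers do not hammer the site in lockstep.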

Clean Architecture + DDD

The project follows a strict layered architecture:
├── domain/          # Business entities and interfaces
├── application/     # Use cases orchestrating business logic
├── infrastructure/  # Technical implementations (scraper, DB, network)
├── presentation/    # CLI entry point
└── shared/          # Config, logging, utilities
Benefits:
  • Separation of Concerns - Dependencies point inward; domain logic is framework-agnostic
  • Testability - Core business logic can be unit tested in isolation
  • Maintainability - Clear boundaries between layers make changes predictable
  • Scalability - Easy to swap implementations (e.g., switch from requests to Playwright)
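The "dependencies point inward" rule above can be illustrated with a port-and-adapter pair. `MovieRepository` and `InMemoryMovieRepository` are hypothetical names for this sketch, not necessarily the project's:

```python
from abc import ABC, abstractmethod
from typing import List

# Domain layer: a port (abstract interface) the business logic depends on.
class MovieRepository(ABC):
    @abstractmethod
    def save(self, movie: dict) -> None: ...

# Infrastructure layer: one concrete adapter. A PostgresMovieRepository
# could be swapped in without touching the domain or application layers.
class InMemoryMovieRepository(MovieRepository):
    def __init__(self):
        self.movies: List[dict] = []

    def save(self, movie: dict) -> None:
        self.movies.append(movie)
```

Because use cases only ever see the abstract port, they can be unit tested against the in-memory adapter with no database running.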

Hybrid Persistence

Data is saved simultaneously to two storage backends:
Storage      Use Case                                            Format
CSV          Quick exports, data portability, Excel analysis     movies.csv, actors.csv, movie_actor.csv
PostgreSQL   Relational queries, analytics, production storage   Normalized schema with N:M relationships
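One way to save to both backends simultaneously is to fan each row out to a list of writer callables. This is a minimal sketch under that assumption; the real project may wire its persistence differently, and the PostgreSQL side is only indicated in a comment:

```python
import csv
import io

class CompositeWriter:
    """Fan each movie row out to every configured storage backend."""
    def __init__(self, backends):
        self.backends = backends  # each backend is a callable taking one row dict

    def save(self, row):
        for backend in self.backends:
            backend(row)

def make_csv_backend(fileobj, fieldnames):
    """Build a CSV backend: writes the header once, then one row per call."""
    writer = csv.DictWriter(fileobj, fieldnames=fieldnames)
    writer.writeheader()
    return writer.writerow

# A PostgreSQL backend would wrap an INSERT in the same callable shape,
# e.g. lambda row: cursor.execute("INSERT INTO movies (...) VALUES (...)", row).
```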

Concurrency & Performance

  • ThreadPoolExecutor for parallel detail page scraping
  • Configurable thread pool (default: 50 workers)
  • Intelligent request throttling to avoid rate limits
  • Traffic monitoring and logging
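The parallel detail-page scraping above boils down to a bounded `ThreadPoolExecutor`. The sketch below uses a placeholder `scrape_detail` instead of a real network call; the 50-worker default mirrors the configuration mentioned above:

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_detail(url):
    """Placeholder for the real detail-page scraper (HTTP fetch + parse)."""
    return {"url": url, "actors": []}

def scrape_all(urls, max_workers=50):
    # One thread per in-flight detail request, capped at max_workers.
    # pool.map preserves the input order of the Top 250 chart.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(scrape_detail, urls))
```

Threads suit this workload because each task is I/O-bound: while one thread waits on a response, others keep fetching.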

Architecture Highlights

Domain Models with Built-in Validation

Entities enforce their own business rules:
import re
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Movie:
    id: Optional[int]
    imdb_id: str
    title: str
    year: int
    rating: float
    duration_minutes: Optional[int]
    metascore: Optional[int]
    actors: List[Actor] = field(default_factory=list)

    def __post_init__(self):
        # Validates IMDb ID format
        if not re.match(r"^tt\d{7,}$", self.imdb_id):
            raise ValueError(f"Invalid IMDb ID: '{self.imdb_id}'")

        # Validates year range
        if not (1888 <= self.year <= 2030):
            raise ValueError(f"Invalid year: {self.year}")
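For instance, the IMDb ID rule accepts `tt` followed by at least seven digits. The standalone helper below mirrors that check for illustration only; the entity itself enforces it in `__post_init__`:

```python
import re

# Same pattern the Movie entity uses: "tt" plus seven or more digits.
IMDB_ID_PATTERN = re.compile(r"^tt\d{7,}$")

def is_valid_imdb_id(imdb_id: str) -> bool:
    """Illustrative mirror of the entity's IMDb ID validation."""
    return bool(IMDB_ID_PATTERN.match(imdb_id))
```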

Factory Pattern & Dependency Injection

A centralized DependencyContainer manages object lifecycles and wiring:
container = DependencyContainer(config)
scraper = container.get_scraper()
scraper.scrape()
container.close_db_connection()
This decouples the application from concrete implementations, making it trivial to swap scraping engines (e.g., Playwright, Selenium) or persistence layers.
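A minimal sketch of what such a container might look like, assuming lazy construction with caching; `FakeScraper` and the internal details are illustrative, not the project's actual implementation:

```python
class FakeScraper:
    """Stand-in engine; anything with a scrape() method satisfies the contract."""
    def scrape(self):
        return "scraped"

class DependencyContainer:
    """Build collaborators from config lazily and cache them for reuse."""
    def __init__(self, config):
        self.config = config
        self._scraper = None

    def get_scraper(self):
        if self._scraper is None:
            # Swap engines here (requests, Playwright, Selenium) without
            # changing any calling code.
            self._scraper = FakeScraper()
        return self._scraper
```

Callers ask the container for a scraper and never import a concrete engine, which is what makes the swap trivial.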

When to Use This Scraper

Use Cases

  • Building movie analytics dashboards
  • Training ML models on film data
  • Creating recommendation systems
  • Academic research on cinema trends
  • Portfolio/demonstration projects

Not Suitable For

  • Real-time production scraping (check IMDb’s ToS)
  • High-frequency data extraction
  • Commercial resale of IMDb data
  • Projects requiring live updates
This scraper is built for educational and demonstration purposes. Always respect website Terms of Service and robots.txt policies when scraping.

Technology Stack

  • Python 3.x - Core language
  • BeautifulSoup4 - HTML parsing
  • Requests - HTTP client with SOCKS proxy support
  • PostgreSQL 15 - Relational database
  • Docker & Docker Compose - Container orchestration
  • TOR - Anonymous network routing
  • ProtonVPN - VPN integration via Gluetun

What’s Next?

  • Quickstart Guide - Get the scraper running in 5 minutes with Docker
  • Architecture Deep Dive - Explore the Clean Architecture implementation and design patterns

Project Author: Andrés Ruiz
Email: [email protected]
GitHub: frankdevg
