Architecture Overview

Introduction

The IMDb Scraper is built using Clean Architecture and Domain-Driven Design (DDD) principles to ensure maintainability, scalability, and testability. This architecture enables the system to evolve without coupling business logic to technical implementation details.

Architecture Diagram

Layer Responsibilities

Domain Layer

The core of the application containing business entities and rules.

Entities

Movie, Actor, MovieActor models with built-in validation

Interfaces

Repository and service contracts (abstractions)

Business Rules

Domain validation logic embedded in entities

Zero Dependencies

No dependencies on external frameworks or libraries
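The points above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the field names (`imdb_id`, `title`, `year`, `rating`), the specific validation rules, and the `save` method are assumptions made for the example. It uses only the standard library, in keeping with the zero-dependency rule.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Movie:
    """Hypothetical Movie entity; the fields and rules are assumptions."""

    imdb_id: str
    title: str
    year: int
    rating: float

    def __post_init__(self) -> None:
        # Validation runs on construction, so an invalid Movie
        # can never exist elsewhere in the system.
        if not self.title.strip():
            raise ValueError("title must not be empty")
        if not 0.0 <= self.rating <= 10.0:
            raise ValueError("rating must be between 0 and 10")


class MovieRepository(ABC):
    """Contract defined in the domain and implemented by infrastructure.

    The method name `save` is an assumption for this sketch.
    """

    @abstractmethod
    def save(self, movie: Movie) -> None:
        ...
```

Because the contract lives in the domain, outer layers implement it, and the domain never imports them.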

Application Layer

Orchestrates business logic through use cases.

SaveMovieWithActorsCsvUseCase: Persists data to CSV files
SaveMovieWithActorsPostgresUseCase: Persists data to PostgreSQL
CompositeSaveMovieWithActorsUseCase: Executes multiple use cases concurrently

Use cases depend only on domain interfaces, never on concrete implementations. This enables easy testing and swapping of implementations.
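As a rough sketch of the composite idea, the example below fans one payload out to several use cases at once. The class name `CompositeSaveUseCase`, the `execute` method, and the use of `ThreadPoolExecutor` here are assumptions for illustration; the project's own composite may be structured differently.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Sequence


class UseCaseInterface(ABC):
    """Hypothetical use-case contract; the `execute` signature is assumed."""

    @abstractmethod
    def execute(self, data: Any) -> None:
        ...


class CompositeSaveUseCase(UseCaseInterface):
    """Runs several use cases concurrently on the same input."""

    def __init__(self, use_cases: Sequence[UseCaseInterface]) -> None:
        self._use_cases = list(use_cases)

    def execute(self, data: Any) -> None:
        if not self._use_cases:
            return  # ThreadPoolExecutor rejects max_workers=0
        with ThreadPoolExecutor(max_workers=len(self._use_cases)) as pool:
            futures = [pool.submit(uc.execute, data) for uc in self._use_cases]
            for future in futures:
                # result() re-raises any exception from the worker thread,
                # so a failed save is not silently swallowed.
                future.result()
```

The composite only knows about `UseCaseInterface`, so a CSV-backed and a PostgreSQL-backed use case can be combined without either knowing about the other.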

Infrastructure Layer

Provides concrete implementations of domain interfaces.

Persistence

CSV repositories for file-based storage
PostgreSQL repositories for relational database storage
Connection pooling and resource management

Scraping

IMDb scraper implementation
Retry logic with exponential backoff
Concurrent scraping with ThreadPoolExecutor
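Retry with exponential backoff can be sketched as follows. The function name `retry_with_backoff`, its parameters, and the default values are hypothetical; the scraper's real retry policy (attempt count, delays, which exceptions it catches) is not specified here.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Waits base_delay, then 2x, then 4x, and so on.
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter is a design choice for this sketch: tests can substitute a recorder instead of actually waiting.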

Network

Proxy provider (DataImpulse integration)
TOR rotator for IP rotation
VPN integration via Docker

Factory

DependencyContainer for dependency injection
Centralized object creation and lifecycle management
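One way a container like this could work is shown below. Only the class name `DependencyContainer` comes from the project; the `register` and `resolve` methods and the lazy-singleton behavior are assumptions made for this sketch.

```python
from typing import Any, Callable, Dict


class DependencyContainer:
    """Illustrative container: lazy creation, one shared instance per name."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[["DependencyContainer"], Any]] = {}
        self._instances: Dict[str, Any] = {}

    def register(
        self, name: str, factory: Callable[["DependencyContainer"], Any]
    ) -> None:
        # Store a recipe only; nothing is built until it is first needed.
        self._factories[name] = factory

    def resolve(self, name: str) -> Any:
        if name not in self._instances:
            # Factories receive the container so they can resolve
            # their own dependencies.
            self._instances[name] = self._factories[name](self)
        return self._instances[name]
```

With this shape, the CLI would register repositories and use cases once at startup and ask the container for the top-level use case.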

Presentation Layer

Entry points for the application.

CLI (run_scraper.py): Command-line interface for executing the scraper
Minimal logic - delegates to application layer

Dependency Direction

One of the key principles of Clean Architecture is that dependencies point inward:

Presentation → Application → Domain ← Infrastructure

The domain layer has zero dependencies on outer layers. Presentation depends on the application layer, and both the application and infrastructure layers depend on domain abstractions, never the reverse.

This design ensures:

✅ Testability: Domain and application layers can be tested without databases or external services
✅ Flexibility: Swap implementations (e.g., CSV to MongoDB) without changing business logic
✅ Maintainability: Changes to infrastructure don’t cascade to business logic
✅ Independence: Business rules aren’t coupled to frameworks, UI, or databases
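The testability point can be made concrete with an in-memory fake. Everything named below (`MovieRepository`, `SaveMovieUseCase`, `InMemoryMovieRepository`) is a simplified stand-in invented for this example, not the project's real classes.

```python
from typing import List, Protocol


class MovieRepository(Protocol):
    """Simplified stand-in for a domain repository contract."""

    def save(self, title: str) -> None:
        ...


class SaveMovieUseCase:
    """Depends on the contract only, never on CSV or PostgreSQL code."""

    def __init__(self, repository: MovieRepository) -> None:
        self._repository = repository

    def execute(self, title: str) -> None:
        self._repository.save(title.strip())


class InMemoryMovieRepository:
    """Test double: records saves in a list instead of touching storage."""

    def __init__(self) -> None:
        self.saved: List[str] = []

    def save(self, title: str) -> None:
        self.saved.append(title)


def test_use_case_saves_through_the_interface() -> None:
    repo = InMemoryMovieRepository()
    SaveMovieUseCase(repo).execute("  Example Movie  ")
    assert repo.saved == ["Example Movie"]
```

The test needs no database, file system, or network, because the use case never references a concrete repository.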

Directory Structure

imdb_scraper_project/
├── presentation/          # Entry points (CLI)
│   └── cli/
│       └── run_scraper.py
├── application/           # Use cases
│   └── use_cases/
│       ├── save_movie_with_actors_csv_use_case.py
│       ├── save_movie_with_actors_postgres_use_case.py
│       └── composite_save_movie_with_actors_use_case.py
├── domain/                # Business entities and contracts
│   ├── models/
│   │   ├── movie.py
│   │   ├── actor.py
│   │   └── movie_actor.py
│   ├── interfaces/
│   │   ├── scraper_interface.py
│   │   ├── use_case_interface.py
│   │   └── proxy_interface.py
│   └── repositories/
│       ├── movie_repository.py
│       ├── actor_repository.py
│       └── movie_actor_repository.py
├── infrastructure/         # Technical implementations
│   ├── factory/
│   │   └── dependency_container.py
│   ├── scraper/
│   │   └── imdb_scraper.py
│   ├── persistence/
│   │   ├── csv/
│   │   └── postgres/
│   └── network/
│       ├── proxy_provider.py
│       └── tor_rotator.py
└── shared/                # Cross-cutting concerns
    ├── config/
    └── logger/

Benefits of This Architecture

Testable

Each layer can be tested independently with mocks and stubs

Maintainable

Clear separation of concerns makes code easier to understand and modify

Scalable

Add new features without modifying existing code (Open/Closed Principle)

Flexible

Swap implementations (e.g., Playwright for requests) without business logic changes

Real-World Application

The architecture has proven its value in this project:

Hybrid Persistence: Simultaneously saves to CSV and PostgreSQL without duplicating business logic
Network Resilience: Easily integrated VPN, proxies, and TOR rotation
Future-Ready: Can add Playwright/Selenium scraper by implementing ScraperInterface
Concurrent Processing: Composite use case executes multiple persistence strategies in parallel
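To illustrate the "Future-Ready" point, the sketch below adds a second scraper behind the same contract. The `scrape(limit)` signature, the `BrowserScraper` class, and the `collect_titles` helper are hypothetical; the real `ScraperInterface` may define different methods, and the placeholder returns canned rows rather than driving a browser.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class ScraperInterface(ABC):
    """Assumed shape of the scraper contract for this example."""

    @abstractmethod
    def scrape(self, limit: int) -> List[Dict[str, str]]:
        ...


class BrowserScraper(ScraperInterface):
    """Placeholder where a Playwright- or Selenium-backed scraper would go."""

    def scrape(self, limit: int) -> List[Dict[str, str]]:
        # A real implementation would load pages in a browser;
        # canned rows keep this sketch runnable offline.
        return [{"title": f"Movie {i}"} for i in range(1, limit + 1)]


def collect_titles(scraper: ScraperInterface, limit: int) -> List[str]:
    # Calling code sees only the interface, so any scraper can be passed in.
    return [row["title"] for row in scraper.scrape(limit)]
```

Switching scrapers then means passing a different `ScraperInterface` implementation to the caller, which stays unchanged.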

This architecture lets a simple scraper grow into a production-ready system that can evolve with changing requirements.

Next Steps

Clean Architecture Details

Deep dive into Clean Architecture principles

Domain Models

Explore entities and validation logic

Dependency Injection

Learn how dependencies are wired together

Getting Started

Start using the IMDb Scraper
