Introduction
The IMDb Scraper is built using Clean Architecture and Domain-Driven Design (DDD) principles to ensure maintainability, scalability, and testability. This architecture enables the system to evolve without coupling business logic to technical implementation details.
Architecture Diagram
Layer Responsibilities
Domain Layer
The core of the application containing business entities and rules.
Entities
Movie, Actor, MovieActor models with built-in validation
Interfaces
Repository and service contracts (abstractions)
Business Rules
Domain validation logic embedded in entities
Zero Dependencies
No dependencies on external frameworks or libraries
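As a rough illustration of what "built-in validation" and "repository contracts" can look like, here is a minimal sketch using only the standard library. The field names (title, year, rating), the specific rules, and the MovieRepository name are assumptions made for this example, not the project's actual definitions.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Movie:
    """Domain entity. Fields and limits are illustrative assumptions."""

    title: str
    year: int
    rating: float

    def __post_init__(self) -> None:
        # Rules run on construction, so an invalid Movie can never exist.
        if not self.title.strip():
            raise ValueError("title must not be empty")
        if self.year < 1888:  # assumed lower bound for a plausible release year
            raise ValueError("year is implausibly early")
        if not 0.0 <= self.rating <= 10.0:
            raise ValueError("rating must be between 0 and 10")


class MovieRepository(ABC):
    """Hypothetical repository contract; infrastructure implements it."""

    @abstractmethod
    def save(self, movie: Movie) -> None:
        ...
```

Because both classes import nothing beyond the standard library, the domain layer stays free of framework dependencies.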
Application Layer
Orchestrates business logic through use cases.
- SaveMovieWithActorsCsvUseCase: Persists data to CSV files
- SaveMovieWithActorsPostgresUseCase: Persists data to PostgreSQL
- CompositeSaveMovieWithActorsUseCase: Executes multiple use cases concurrently
Use cases depend only on domain interfaces, never on concrete implementations. This enables easy testing and swapping of implementations.
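The composite idea can be sketched as follows. This is not the project's CompositeSaveMovieWithActorsUseCase; the class names, the execute method, and the dict payload are assumptions chosen to keep the example self-contained.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor


class SaveUseCase(ABC):
    """Hypothetical shared interface for the save use cases."""

    @abstractmethod
    def execute(self, movie: dict) -> None:
        ...


class RecordingUseCase(SaveUseCase):
    """Stand-in for a CSV or PostgreSQL use case; it only records calls."""

    def __init__(self) -> None:
        self.saved = []

    def execute(self, movie: dict) -> None:
        self.saved.append(movie)


class CompositeSaveUseCase(SaveUseCase):
    """Runs every wrapped use case concurrently on a thread pool."""

    def __init__(self, use_cases) -> None:
        self._use_cases = list(use_cases)

    def execute(self, movie: dict) -> None:
        # max_workers must be positive, even when no use cases are given.
        workers = max(1, len(self._use_cases))
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [pool.submit(uc.execute, movie) for uc in self._use_cases]
            for future in futures:
                future.result()  # re-raises any exception from a worker
```

The composite implements the same interface as the use cases it wraps, so callers do not need to know whether they are saving to one destination or several.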
Infrastructure Layer
Provides concrete implementations of domain interfaces.
Persistence
- CSV repositories for file-based storage
- PostgreSQL repositories for relational database storage
- Connection pooling and resource management
Scraping
- IMDb scraper implementation
- Retry logic with exponential backoff
- Concurrent scraping with ThreadPoolExecutor
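The retry and concurrency items above can be sketched together. The function names, parameters, and the injected fetch callable are hypothetical; the project's scraper may structure this differently.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def retry_with_backoff(func, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func(); after each failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:  # a real scraper would catch only network errors
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


def scrape_all(urls, fetch, max_workers=4, **retry_options):
    """Fetch every URL concurrently, retrying each one independently."""

    def fetch_one(url):
        return retry_with_backoff(lambda: fetch(url), **retry_options)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input URLs.
        return list(pool.map(fetch_one, urls))
```

Passing sleep as a parameter is a design choice made for this sketch: tests can substitute a recorder and assert on the delays without actually waiting.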
Network
- Proxy provider (DataImpulse integration)
- TOR rotator for IP rotation
- VPN integration via Docker
Factory
- DependencyContainer for dependency injection
- Centralized object creation and lifecycle management
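A container of this kind might look roughly like the sketch below. Only the DependencyContainer name comes from this page; its methods, the in-memory repository, and the simplified use case are assumptions for illustration.

```python
class InMemoryMovieRepository:
    """Hypothetical stand-in for a CSV or PostgreSQL repository."""

    def __init__(self):
        self.rows = []

    def save(self, movie):
        self.rows.append(movie)


class SaveMovieUseCase:
    """Simplified use case that receives its repository from outside."""

    def __init__(self, repository):
        self._repository = repository

    def execute(self, movie):
        self._repository.save(movie)


class DependencyContainer:
    """Creates each object once and hands out the shared instance."""

    def __init__(self):
        self._instances = {}

    def _singleton(self, key, factory):
        # Build lazily on first request, then reuse the cached object.
        if key not in self._instances:
            self._instances[key] = factory()
        return self._instances[key]

    def movie_repository(self):
        return self._singleton("movie_repository", InMemoryMovieRepository)

    def save_movie_use_case(self):
        return self._singleton(
            "save_movie_use_case",
            lambda: SaveMovieUseCase(self.movie_repository()),
        )
```

An entry point would then ask the container for a ready-to-use object, for example `DependencyContainer().save_movie_use_case().execute(movie)`, without constructing any dependencies itself.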
Presentation Layer
Entry points for the application.
- CLI (run_scraper.py): Command-line interface for executing the scraper
- Minimal logic: delegates to the application layer
Dependency Direction
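One of the key principles of Clean Architecture is that dependencies point inward: the presentation and infrastructure layers depend on the application and domain layers, never the reverse. This provides:
✅ Flexibility: Swap implementations (e.g., CSV to MongoDB) without changing business logic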
✅ Maintainability: Changes to infrastructure don’t cascade to business logic
✅ Independence: Business rules aren’t coupled to frameworks, UI, or databases
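To make the inward dependency rule concrete, the sketch below has inner logic that knows only a contract, with two interchangeable outer implementations. All names and the two-field signature are assumptions for this example rather than the project's real interfaces.

```python
import csv
import io
from typing import Protocol


class MovieRepository(Protocol):
    """Contract owned by the inner layer (hypothetical signature)."""

    def save(self, title: str, year: int) -> None: ...


class CsvMovieRepository:
    """Outer-layer implementation that writes rows to a text stream."""

    def __init__(self, stream):
        self._writer = csv.writer(stream)

    def save(self, title, year):
        self._writer.writerow([title, year])


class InMemoryMovieRepository:
    """Alternative implementation, useful in tests."""

    def __init__(self):
        self.rows = []

    def save(self, title, year):
        self.rows.append((title, year))


def save_movie(repository: MovieRepository, title: str, year: int) -> None:
    """Inner-layer logic: it sees the contract, never a concrete class."""
    repository.save(title.strip(), year)
```

Swapping `CsvMovieRepository(io.StringIO())` for `InMemoryMovieRepository()` requires no change to `save_movie`, which is the practical payoff of keeping dependencies pointed inward.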
Directory Structure
Benefits of This Architecture
Testable
Each layer can be tested independently with mocks and stubs
Maintainable
Clear separation of concerns makes code easier to understand and modify
Scalable
Add new features without modifying existing code (Open/Closed Principle)
Flexible
Swap implementations (e.g., Playwright for requests) without business logic changes
Real-World Application
The architecture has proven its value in this project:
- Hybrid Persistence: Simultaneously saves to CSV and PostgreSQL without duplicating business logic
- Network Resilience: Easily integrated VPN, proxies, and TOR rotation
- Future-Ready: Can add a Playwright/Selenium scraper by implementing ScraperInterface
- Concurrent Processing: Composite use case executes multiple persistence strategies in parallel
This architecture turns a simple scraper into a production-ready system that can evolve with changing requirements.
Next Steps
Clean Architecture Details
Deep dive into Clean Architecture principles
Domain Models
Explore entities and validation logic
Dependency Injection
Learn how dependencies are wired together
Getting Started
Start using the IMDb Scraper