The scraper uses concurrent.futures.ThreadPoolExecutor to parallelize movie detail extraction, significantly improving scraping performance while maintaining stability.
ThreadPoolExecutor Architecture
The scraper implements thread-based parallelism at two levels:
- Movie detail fetching - Parallel HTTP requests for movie pages
- Dual persistence - Concurrent CSV and PostgreSQL writes
Movie Detail Fetching
Implementation
The main scraping loop uses ThreadPoolExecutor to process multiple movies simultaneously:
infrastructure/scraper/imdb_scraper.py:40
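The code at that location is not reproduced here; a minimal sketch of the pattern, with a hypothetical scrape stub and URL list, looks like this:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

THREAD_COUNT = 50  # assumption: mirrors the configurable value in config.py

def scrape_movie_detail(url: str) -> dict:
    """Stand-in for the real HTTP fetch + parse of one movie page."""
    return {"url": url, "title": f"Movie at {url}"}

def scrape_all(movie_urls):
    results = []
    with ThreadPoolExecutor(max_workers=THREAD_COUNT) as executor:
        # One task per movie; up to THREAD_COUNT requests are in flight at once.
        futures = {executor.submit(scrape_movie_detail, u): u for u in movie_urls}
        for future in as_completed(futures):
            results.append(future.result())
    return results
```

Because the work is I/O-bound (waiting on HTTP responses), the GIL is released during network calls and threads overlap effectively.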
Worker Function
Each thread executes the_scrape_and_save_movie_detail method:
infrastructure/scraper/imdb_scraper.py:56
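The worker's shape can be sketched as follows; the collaborator objects (fetcher, parser, save use case) are hypothetical stand-ins, not the project's real classes:

```python
# Hypothetical sketch of the per-thread worker; each call handles one movie
# end to end, so threads never share mutable scraping state.
class MovieScraper:
    def __init__(self, fetcher, parser, save_use_case):
        self._fetcher = fetcher
        self._parser = parser
        self._save = save_use_case

    def _scrape_and_save_movie_detail(self, url: str):
        html = self._fetcher(url)   # blocking HTTP request (I/O-bound)
        movie = self._parser(html)  # extract title, year, cast, ...
        self._save(movie)           # hand off to the persistence layer
        return movie
```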
Thread Pool Configuration
Configurable Thread Count
The number of concurrent threads is configurable via config.py:
shared/config/config.py:53
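The setting might look like the following sketch; the variable name `SCRAPER_THREADS` and the environment-variable override are assumptions, not the real contents of `config.py`:

```python
import os

# Hypothetical config entry; the actual name in shared/config/config.py may differ.
# Defaults to 50 worker threads, overridable per environment.
SCRAPER_THREADS = int(os.getenv("SCRAPER_THREADS", "50"))
```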
Optimal Thread Count
The default configuration uses 50 threads, balancing:
- Performance: Parallel HTTP requests reduce total scraping time
- Resource usage: Prevents overwhelming the network or target server
- Rate limiting: Stays within acceptable request rates
Persistence Concurrency
Composite Use Case with ThreadPoolExecutor
The persistence layer also uses threads to write to CSV and PostgreSQL simultaneously:
application/use_cases/composite_save_movie_with_actors_use_case.py:25
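A minimal sketch of that composite pattern (class and method names here are illustrative, not the project's real API):

```python
from concurrent.futures import ThreadPoolExecutor

class CompositeSaveMovie:
    """Sketch of a composite use case: both backends are written concurrently."""

    def __init__(self, csv_repo, postgres_repo):
        self._repos = [csv_repo, postgres_repo]

    def execute(self, movie):
        # One worker per backend; the with-block waits for both writes.
        with ThreadPoolExecutor(max_workers=2) as pool:
            futures = [pool.submit(repo.save, movie) for repo in self._repos]
            for f in futures:
                f.result()  # re-raise any backend error in the caller
```

Calling `f.result()` on each future means a failed write surfaces in the calling thread instead of being silently swallowed.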
Parallel Persistence Strategies
When a movie is scraped, it is saved to both backends concurrently.
Thread Safety
CSV Thread Safety
The CSV repository uses threading locks to prevent race conditions:
infrastructure/persistence/csv/repositories/movie_csv_repository.py:39
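The locking pattern can be sketched as below (the repository's real fields and row layout are assumptions; a `StringIO` stands in for the CSV file):

```python
import csv
import io
import threading
from concurrent.futures import ThreadPoolExecutor

class MovieCsvRepository:
    """Sketch: a lock serializes writes so concurrent rows never interleave."""

    def __init__(self, stream):
        self._writer = csv.writer(stream)
        self._lock = threading.Lock()

    def save(self, movie):
        with self._lock:  # only one thread may write a row at a time
            self._writer.writerow([movie["title"], movie["year"]])
```

Without the lock, two threads could call `writerow` simultaneously and produce corrupted, half-merged lines in the file.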
PostgreSQL Thread Safety
PostgreSQL handles concurrent writes through its connection pooling and transaction isolation:
infrastructure/persistence/postgres/repositories/movie_postgres_repository.py:21
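With psycopg2 this is typically done via `psycopg2.pool.ThreadedConnectionPool`; the underlying one-connection-per-thread pattern can be sketched with `threading.local`, using the stdlib's sqlite3 as a stand-in so the example runs without a database server:

```python
import os
import sqlite3
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor

DB_PATH = os.path.join(tempfile.mkdtemp(), "movies.db")
_local = threading.local()

def get_conn():
    # Lazily open one connection per worker thread, mirroring what a
    # pool such as psycopg2's ThreadedConnectionPool hands out.
    if not hasattr(_local, "conn"):
        _local.conn = sqlite3.connect(DB_PATH)
    return _local.conn

def init_schema():
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS movies (title TEXT)")

def save_movie(title):
    conn = get_conn()
    with conn:  # one transaction per save; the database serializes writers
        conn.execute("INSERT INTO movies (title) VALUES (?)", (title,))
```

The key property is that no connection is ever shared between threads; each transaction commits independently, and the database's own isolation rules resolve concurrent writes.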
Performance Benefits
Sequential vs Parallel Execution
Sequential scraping (1 thread) processes movies one at a time, so total runtime is dominated by the sum of per-request network latencies; the thread pool overlaps those waits.
Real-World Performance
With network latency and retry logic:
- Sequential: ~15-20 minutes for 250 movies
- Parallel (50 threads): ~2-3 minutes for 250 movies
Resource Management
Automatic Cleanup
ThreadPoolExecutor automatically manages the thread lifecycle: the with block shuts the pool down and joins every worker thread on exit.
Error Isolation
Each thread handles its own errors without affecting other threads: a failure in one worker is caught and logged while the remaining tasks continue.
Execution Flow
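Putting automatic cleanup and error isolation together, the execution flow can be sketched as follows (the scrape stub and failure condition are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape(url):
    if url == "bad":
        raise ValueError("simulated fetch failure")
    return url.upper()

def run(urls):
    ok, failed = [], []
    # The with-block joins all workers on exit, even if some tasks raised.
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(scrape, u): u for u in urls}
        for fut in as_completed(futures):
            try:
                ok.append(fut.result())
            except Exception:
                failed.append(futures[fut])  # log and continue; others unaffected
    return ok, failed
```

A task that raises only poisons its own `Future`; the exception is re-raised where `result()` is called, not inside the pool, so other tasks run to completion.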
Configuration Options
Thread Pool Size
Adjust the thread count based on your needs.
Recommendations
| Use Case | Recommended Threads | Rationale |
|---|---|---|
| Development/Testing | 5-10 | Easier debugging, clearer logs |
| Production | 30-50 | Optimal balance |
| High-bandwidth environments | 75-100 | Maximum throughput |
Concurrency Trade-offs
Benefits
- Speed: Dramatically faster scraping
- Efficiency: Better CPU and network utilization
- Scalability: Handles large datasets efficiently
Considerations
- Rate limiting: Too many threads may trigger anti-bot measures
- Memory usage: Each thread consumes memory
- Log readability: Parallel execution creates interleaved logs
Alternative Approaches
AsyncIO (Not Used)
While asyncio could provide similar benefits, ThreadPoolExecutor was chosen because:
- Simpler implementation for I/O-bound tasks
- Better compatibility with synchronous libraries (requests, psycopg2)
- Easier error handling and debugging
Process-Based Parallelism (Not Used)
multiprocessing.Pool was considered but rejected:
- Higher overhead: Process creation is expensive
- Shared state complexity: Database connections can’t be pickled
- Overkill: Scraping is I/O-bound, not CPU-bound
Example: Adjusting Thread Count
To change the thread pool size, update the thread-count setting in shared/config/config.py.
Monitoring Concurrency
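Including the thread name in the log format makes interleaved output traceable; a sketch using the standard logging module (the format string and `thread_name_prefix` value are assumptions):

```python
import logging
from concurrent.futures import ThreadPoolExecutor

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(threadName)s] %(levelname)s %(message)s",
)
log = logging.getLogger("scraper")

def scrape(url):
    log.info("fetching %s", url)  # each line carries the worker thread's name
    return url

with ThreadPoolExecutor(max_workers=3, thread_name_prefix="scraper") as pool:
    results = list(pool.map(scrape, ["u1", "u2", "u3"]))
```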
The scraper logs concurrent operations, so output from different threads is interleaved in the log stream.
Next Steps
Scraping Engine
Learn about the scraping implementation
Network Evasion
Explore proxy and TOR integration