Overview
The property_bot/settings.py file contains all core Scrapy settings and the Playwright browser configuration. The scraper is tuned for high-concurrency headless scraping, with automatic throttling and retries for failed requests.
Key Scrapy Settings
Robots.txt Compliance
The scraper respects robots.txt directives from target websites. This supports ethical scraping practices.
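In Scrapy, robots.txt handling is controlled by the ROBOTSTXT_OBEY setting. As a rough illustration of what honoring those directives means, the sketch below checks URLs against a robots.txt policy using Python's standard-library parser as a stand-in (Scrapy uses its own parser internally). The robots.txt content, the `is_allowed` helper, the user agent string, and the example.com URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(ROBOTS_LINES)


def is_allowed(url, user_agent="property_bot"):
    """Return True if the parsed robots.txt rules permit fetching url."""
    return parser.can_fetch(user_agent, url)
```

With these rules, `is_allowed("https://example.com/private/x")` is rejected while a URL outside `/private/` is permitted.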
Concurrency Settings
- CONCURRENT_REQUESTS: Maximum of 32 simultaneous requests across all domains
- CONCURRENT_REQUESTS_PER_DOMAIN: Up to 32 concurrent requests per individual domain
- DOWNLOAD_DELAY: 0.5-second delay between consecutive requests to the same domain
- DOWNLOAD_TIMEOUT: 35-second timeout for page downloads
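Written out as they might appear in a Scrapy settings module, the values above look roughly like this. The setting names are standard Scrapy settings and the values are the ones listed; the comments are a paraphrase, not the project's own.

```python
# Concurrency and pacing (values as listed above).
CONCURRENT_REQUESTS = 32             # global cap on in-flight requests
CONCURRENT_REQUESTS_PER_DOMAIN = 32  # cap on in-flight requests per domain
DOWNLOAD_DELAY = 0.5                 # seconds between requests to one domain
DOWNLOAD_TIMEOUT = 35                # seconds before a download is abandoned
```

Because the per-domain cap equals the global cap, a crawl against a single site is limited in practice by the download delay rather than by the per-domain limit.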
Autothrottle Configuration
- AUTOTHROTTLE_ENABLED: Enables dynamic throttling
- AUTOTHROTTLE_START_DELAY: Initial delay of 0.5 seconds
- AUTOTHROTTLE_MAX_DELAY: Delay is capped at 5 seconds
- AUTOTHROTTLE_TARGET_CONCURRENCY: Targets 16 concurrent requests to each domain
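To see how these four values interact, here is a simplified model of the delay adjustment that Scrapy's AutoThrottle extension applies after each response. It is a sketch for building intuition, not the extension's actual code: `next_delay` is a hypothetical helper, and the lower bound is assumed to be the DOWNLOAD_DELAY of 0.5 seconds listed earlier.

```python
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 0.5
AUTOTHROTTLE_MAX_DELAY = 5.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 16.0
MIN_DELAY = 0.5  # assumption: lower bound taken from DOWNLOAD_DELAY


def next_delay(prev_delay, latency, status=200):
    """Approximate the next per-domain delay from the last response latency."""
    # Delay that would keep roughly TARGET_CONCURRENCY requests in flight.
    target = latency / AUTOTHROTTLE_TARGET_CONCURRENCY
    # Move halfway toward the target, but never below the target itself.
    new_delay = max(target, (prev_delay + target) / 2.0)
    # Clamp to the configured bounds.
    new_delay = min(max(MIN_DELAY, new_delay), AUTOTHROTTLE_MAX_DELAY)
    # Non-200 responses are not allowed to lower the delay.
    if status != 200 and new_delay <= prev_delay:
        return prev_delay
    return new_delay
```

Under this model a fast site stays at the 0.5-second floor, while a site answering with very high latency pushes the delay up until it reaches the 5-second cap.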
Request Retry Settings
- RETRY_TIMES: Failed requests are retried up to 3 times
- RETRY_HTTP_CODES: Automatically retries on:
  - 500: Internal Server Error
  - 502: Bad Gateway
  - 503: Service Unavailable
  - 504: Gateway Timeout
  - 408: Request Timeout
  - 429: Too Many Requests
  - 403: Forbidden (transient blocks)
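The retry settings above might be written as follows. The two setting names are standard Scrapy settings; `should_retry` is a hypothetical helper added only to illustrate the rule, and it is not how Scrapy's retry middleware is implemented.

```python
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429, 403]


def should_retry(status, retries_so_far):
    """Return True if a response with this status deserves another attempt."""
    # Retry only listed status codes, and only while retries remain.
    return status in RETRY_HTTP_CODES and retries_so_far < RETRY_TIMES
```

Under this rule a request can be downloaded at most four times: the initial attempt plus three retries.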
Download Handlers
Logging Configuration
Output Directories
- outputs/urls/: URL collection CSVs
- outputs/data/: Listing detail CSVs
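If these directories need to exist before a crawl writes to them, one possible approach is sketched below. The `ensure_output_dirs` helper and its `base` parameter are hypothetical; the project may create the directories in some other way.

```python
from pathlib import Path

# Subdirectory names taken from the list above.
OUTPUT_SUBDIRS = ("urls", "data")


def ensure_output_dirs(base="outputs"):
    """Create the output subdirectories under base if they are missing."""
    created = []
    for name in OUTPUT_SUBDIRS:
        path = Path(base) / name
        # parents=True also creates base; exist_ok=True makes reruns safe.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```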