## Common Issues
### No URLs found in CSV
**Problem:**

The listing spider starts but finds no URLs to scrape.

**Cause:**

You're trying to run a listing spider before collecting URLs.

**Solution:**

Run URL collection first, or run the URL spider directly, then verify that URLs were collected before starting the listing spider.
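The collection and verification steps might look like the following; the spider name and CSV path are illustrative placeholders, not necessarily this project's actual names:

```shell
# Spider and file names here are assumptions — substitute the ones
# defined in this project (see `scrapy list` for registered spiders).
scrapy list                      # show registered spiders
scrapy crawl jiji_url            # collect listing URLs first
wc -l data/jiji_urls.csv         # verify URLs were written
head -n 3 data/jiji_urls.csv     # eyeball a few collected URLs
```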
### Browser launch failures
**Problem:**

The spider fails at startup with a browser launch error.

**Cause:**

The Playwright browser binaries are not installed.

**Solution:**

Install the Chromium browser. On Linux, also install the system dependencies, then verify the installation. As an alternative, point Playwright at a system-installed Chromium by editing `property_bot/settings.py`.
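The install and verification steps above use the Playwright CLI that ships with the Python package:

```shell
# Install the Chromium binary that Playwright drives:
python -m playwright install chromium

# On Linux, also install the required system libraries:
python -m playwright install-deps chromium

# Confirm the CLI is available:
python -m playwright --version
```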
### Empty or partial field extraction
**Problem:**

Many fields are empty or missing in the output CSV:

- `bedrooms`, `bathrooms`, and `price` are `NaN`
- `amenities` is an empty list (`[]`)
- `description` is blank

**Cause:**

The target site changed its HTML, so the Jiji selectors (in `jiji_listing.py:72-144`) or Meqasa selectors (in `meqasa_listing.py:77-102`) no longer match. To confirm:

- Visit a listing URL manually and inspect the HTML
- Check whether the selectors still match

**Solution:**

Update the CSS selectors in the spider files:

- Open browser DevTools on the target listing page
- Find the correct selector for each field
- Update the spider parse methods with the new selectors
- Test with a single URL
### Dependency errors on install
**Problem:**

Installation fails with dependency or build errors.

**Cause:**

- Wrong Python version (3.12+ required)
- System Python blocking pip installs
- Missing build dependencies

**Solution:**

Install with uv (recommended) or with venv + pip. On Linux, install the build dependencies first.
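The install paths might look like this; the editable-install assumes a `pyproject.toml` at the repo root, which is an assumption about this project's layout:

```shell
# With uv (recommended) — creates .venv and installs dependencies:
uv sync

# Or with venv + pip (assumes a pyproject.toml at the repo root):
python3.12 -m venv .venv
source .venv/bin/activate
pip install -e .

# On Debian/Ubuntu, install build dependencies first:
sudo apt-get install -y build-essential python3-dev
```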
### Rate limiting or 403 errors
**Problem:**

Requests return 403 errors, or many requests time out after initial success.

**Cause:**

- Target site detecting bot traffic
- Too many concurrent requests
- Missing or blocked user agent

**Solution:**

In `property_bot/settings.py`:

- Enable autothrottle (already enabled by default)
- Verify the user agent is set
- Check robots.txt compliance
- Monitor retry stats
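The relevant throttling settings in `property_bot/settings.py` might look like the following; the setting names are standard Scrapy settings, but the values shown are illustrative, not necessarily this project's:

```python
# Illustrative throttling configuration — tune values to the target site.
AUTOTHROTTLE_ENABLED = True            # adapt delay to server response times
AUTOTHROTTLE_START_DELAY = 1.0         # initial download delay (seconds)
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per domain
CONCURRENT_REQUESTS = 8                # global cap on in-flight requests
DOWNLOAD_DELAY = 1.0                   # minimum delay between requests
ROBOTSTXT_OBEY = True                  # respect the site's robots.txt
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
```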
### Cleaning script errors
**Problem:**

The cleaning script fails with an exception.

**Cause:**

Unexpected data format in the raw CSV.

**Solution:**

Inspect the raw data first, then apply the relevant fix. Common fixes:

- Missing location column
- Non-numeric price values
- Properties field that is not valid JSON
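The three fixes above might be sketched as follows; the column names match the bullets, but the exact cleaning logic is an assumption about the project's script (demonstrated on an inline sample rather than the real raw CSV):

```python
import json
import pandas as pd

# Inline sample standing in for the raw CSV.
df = pd.DataFrame({
    "price": ["GHS 1,500", "2000", "Contact us"],
    "properties": ['{"parking": true}', "not-json", None],
})

# Missing location column: add it so downstream code doesn't KeyError.
if "location" not in df.columns:
    df["location"] = None

# Non-numeric price values: strip currency text, coerce failures to NaN.
df["price"] = pd.to_numeric(
    df["price"].astype(str).str.replace(r"[^\d.]", "", regex=True),
    errors="coerce",
)

# Properties field not JSON: parse what we can, default the rest to {}.
def safe_json(value):
    try:
        return json.loads(value)
    except (TypeError, ValueError):
        return {}

df["properties"] = df["properties"].map(safe_json)
print(df)
```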
### Resume mode not working
**Problem:**

Resume mode re-scrapes URLs that were already scraped.

**Cause:**

- URL mismatch between the URL CSV and the data CSV
- The data CSV was deleted or moved
- URLs in the data CSV have a different format (trailing slash, query params)

**Solution:**

URLs should match exactly. If they don't:

- Check for URL normalization issues
- Verify URL deduplication is working: confirm that `_seen_urls` loads correctly in `base_spider.py:79-93`
- Manually create the resume queue
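A normalization check might look like this; it assumes trailing slashes and query params are the only sources of mismatch, and the file paths in the comment are placeholders:

```python
import csv
from urllib.parse import urlsplit, urlunsplit

# Hypothetical normalizer: drop trailing slash and query string so the
# URL CSV and data CSV can be compared on equal terms.
def normalize(url):
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))

def urls_from_csv(path, column="url"):
    # Load one CSV column as a set of normalized URLs.
    with open(path, newline="") as fh:
        return {normalize(row[column]) for row in csv.DictReader(fh)}

# Remaining work = collected URLs minus already-scraped URLs, e.g.:
# todo = urls_from_csv("data/jiji_urls.csv") - urls_from_csv("data/jiji_raw.csv")
```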
## Debugging Tips
### Enable Debug Logging
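Debug logging can be enabled per run with Scrapy's log-level flag; the spider name below is a placeholder:

```shell
# Spider name is a placeholder — use one from `scrapy list`.
# -L sets the log level; --logfile keeps the noisy output out of the terminal.
scrapy crawl jiji_listing -L DEBUG --logfile debug.log
```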
### Test Single URL in Scrapy Shell
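The Scrapy shell fetches one page and drops you into an interpreter with the response loaded; the URL and selector below are placeholders:

```shell
# URL is a placeholder; use a real listing page.
scrapy shell "https://example.com/some-listing"

# Then, inside the shell:
#   response.status
#   response.css("div.qa-advert-price::text").get()   # placeholder selector
#   view(response)   # open the fetched page in your browser
```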
### Inspect Playwright Context
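A sketch of exposing the live Playwright page to a spider callback so you can screenshot or inspect it. It assumes scrapy-playwright is configured as the download handler, as in this project; the screenshot path is a placeholder:

```python
# Request meta that asks scrapy-playwright to render the page and expose
# the Playwright Page object on the response.
def playwright_meta():
    return {"playwright": True, "playwright_include_page": True}

# Inside a spider, the callback can then drive the browser directly:
async def parse(response):
    page = response.meta["playwright_page"]
    await page.screenshot(path="debug.png")  # what the browser actually saw
    print(await page.title())
    await page.close()  # always release the page when done
```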
### Check CSV Output
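A quick pandas sanity check surfaces dead selectors as columns full of missing values; the example runs on an inline sample rather than the project's real output path:

```python
import io
import pandas as pd

# Inline sample standing in for e.g. data/jiji_raw.csv (path is an assumption).
sample = io.StringIO("url,price,bedrooms\nhttp://a,1500,\nhttp://b,,2\n")
df = pd.read_csv(sample)

print(len(df))           # row count
print(df.isna().sum())   # missing values per column — spot dead selectors
```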
### Verify Playwright Installation
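A quick way to check both the Python packages and the browser binaries:

```shell
pip show playwright scrapy-playwright  # are the packages installed?
python -m playwright --version         # is the CLI on the path?
python -m playwright install --dry-run chromium  # is the browser downloaded?
```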
## Getting Help
### Check Scrapy Stats
At the end of a spider run, Scrapy prints useful statistics:

- `item_scraped_count` - Total items successfully saved
- `response_status_count/403` - Blocked requests
- `retry/max_reached` - Failed after max retries
### Review Settings
Key settings are in `property_bot/settings.py:218-226`.
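You can print the effective value of any setting without running a spider; run these from the project root so `scrapy.cfg` is picked up:

```shell
scrapy settings --get AUTOTHROTTLE_ENABLED
scrapy settings --get CONCURRENT_REQUESTS
scrapy settings --get USER_AGENT
```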
### Compare Output Schemas
Expected fields from the spiders:

- Jiji: `url`, `fetch_date`, `title`, `location`, `house_type`, `bedrooms`, `bathrooms`, `price`, `properties`, `amenities`, `description`
- Meqasa: `url`, `Title`, `Price`, `Rate`, `Description`, `fetch_date`, `Categories`, `Lease options`, `Bedrooms`, `Bathrooms`, `Garage`, `Furnished`, `Amenities`, `Address`, `Reference`, `details`
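A schema comparison can be automated with a few lines of pandas; the check below runs on a header-only inline sample standing in for a real output CSV:

```python
import io
import pandas as pd

# Expected Jiji schema, copied from the list above.
EXPECTED_JIJI = [
    "url", "fetch_date", "title", "location", "house_type", "bedrooms",
    "bathrooms", "price", "properties", "amenities", "description",
]

# Header-only sample standing in for a real output CSV.
sample = io.StringIO(",".join(EXPECTED_JIJI) + "\n")
df = pd.read_csv(sample)

missing = [c for c in EXPECTED_JIJI if c not in df.columns]
extra = [c for c in df.columns if c not in EXPECTED_JIJI]
print("missing:", missing, "extra:", extra)
```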
## Known Limitations
### Workarounds

For large scrapes:

- Split into smaller batches (50-100 pages at a time)
- Use `DOWNLOAD_DELAY` and `AUTOTHROTTLE`
- Run during off-peak hours

To reduce blocking:

- Reduce `CONCURRENT_REQUESTS`
- Increase `AUTOTHROTTLE_START_DELAY`
- Add random delays in spider logic

To catch data issues early:

- Manually review samples from the output CSVs
- Test selectors in the Scrapy shell before large runs
- Monitor the `failures` count during the scrape