Common Issues

Problem:
ERROR: No URLs found in outputs/urls/jiji_urls.csv
Cause: You’re trying to run a listing spider before collecting URLs.
Solution: Run URL collection first:
python main.py
# Select: "Collect listing URLs" → Choose site
Or run URL spider directly:
scrapy crawl jiji_urls -a start_page=1 -a max_pages=5
scrapy crawl meqasa_urls -a start_page=1 -a total_pages=5
Verify URLs were collected:
ls -lh outputs/urls/
# Should show jiji_urls.csv and/or meqasa_urls.csv

wc -l outputs/urls/jiji_urls.csv
# Should show more than 1 line (header + data)
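The same check can be scripted before kicking off a long run. A minimal sketch, assuming the `outputs/urls/` layout above; the `count_urls` helper and the `url` column name are illustrative, not project code:

```python
import csv
from pathlib import Path

def count_urls(csv_path, url_field="url"):
    """Count non-empty URLs in a collected URL CSV (hypothetical helper)."""
    path = Path(csv_path)
    if not path.exists():
        return 0
    with open(path, newline="", encoding="utf-8") as f:
        return sum(1 for row in csv.DictReader(f) if row.get(url_field, "").strip())

# Example: refuse to start a listing run with an empty queue
if count_urls("outputs/urls/jiji_urls.csv") == 0:
    print("No URLs collected yet - run the URL spider first")
```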
Problem:
playwright._impl._api_types.Error: Executable doesn't exist
or
Browser closed unexpectedly
Cause: Playwright browser binaries are not installed.
Solution: Install the Chromium browser:
python -m playwright install chromium
On Linux, also install system dependencies:
python -m playwright install-deps chromium
Verify installation:
python -m playwright install --help
# Shows install options and supported browser names; to confirm the binaries
# actually work, see "Verify Playwright Installation" below
Alternative - use a system Chromium install:
Edit property_bot/settings.py:
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "headless": True,
    "executable_path": "/usr/bin/chromium-browser",  # Adjust path
}
Problem: Many fields are empty or missing in output CSV:
  • bedrooms, bathrooms, price are NaN
  • amenities is empty []
  • description is blank
Cause: The target site's HTML structure changed, so the CSS selectors no longer match.
Diagnosis:
  1. Visit a listing URL manually and inspect HTML
  2. Check if selectors still match:
Jiji selectors (in jiji_listing.py:72-144):
# Title
"h1 div::text, .b-advert-title-outer h1::text"

# Location  
".b-advert-info-statistics--region::text"

# Price
".b-alt-advert-price-wrapper span.qa-advert-price-view-value::text"

# Properties
".b-advert-attribute"

# Amenities
".b-advert-attributes__tag"

# Description
".qa-description-text::text"
Meqasa selectors (in meqasa_listing.py:77-102):
# Title
"h1::text"

# Price
".price-wrapper > div:nth-child(1)::text"

# Details table
"table.table tr"

# Description
".description p::text"
Solution: Update the CSS selectors in the spider files:
  1. Open browser DevTools on target listing page
  2. Find correct selectors for each field
  3. Update spider parse methods with new selectors
  4. Test with single URL:
scrapy shell "https://jiji.com.gh/..."
>>> response.css("YOUR_NEW_SELECTOR::text").get()
Example fix for Jiji price selector:
# Old selector (stopped working)
price = response.css(".qa-advert-price-view-value::text").get()

# Updated selector (inspect actual HTML)
price = response.css(".price-container .amount::text").get()
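When a site is mid-redesign, old and new markup can coexist across listings. One defensive pattern is to try several selectors in order; this `first_match` helper is a sketch, not project code:

```python
def first_match(response, selectors):
    """Return the first non-empty text match from a list of CSS selectors.

    `response` is any object with a Scrapy-style .css(sel).get() API.
    """
    for sel in selectors:
        value = response.css(sel).get()
        if value and value.strip():
            return value.strip()
    return None

# Hypothetical usage in a parse method:
# price = first_match(response, [
#     ".qa-advert-price-view-value::text",  # old selector
#     ".price-container .amount::text",     # updated selector
# ])
```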
Problem:
ERROR: Could not find a version that satisfies the requirement scrapy
or
error: externally-managed-environment
Cause:
  • Wrong Python version (need 3.12+)
  • System Python blocking pip install
  • Missing build dependencies
Solution: Check the Python version:
python --version
# Should show Python 3.12.x or higher
Install with uv (recommended):
# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Sync dependencies
uv sync
Install with venv + pip:
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -e .
On Linux, install build dependencies:
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install python3-dev build-essential libssl-dev libffi-dev

# Fedora/RHEL
sudo dnf install python3-devel gcc openssl-devel libffi-devel
Problem:
HTTP 403 Forbidden
or many requests timing out after initial success.
Cause:
  • Target site detecting bot traffic
  • Too many concurrent requests
  • Missing or blocked user agent
Solution: Reduce concurrency in property_bot/settings.py:
CONCURRENT_REQUESTS = 8  # Reduce from 16
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # Reduce from 8
DOWNLOAD_DELAY = 2  # Increase from 0.5
Enable autothrottle (already enabled by default):
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
Verify user agent is set:
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ..."
Check robots.txt compliance:
ROBOTSTXT_OBEY = True  # Already set in project
Monitor retry stats:
# Look for retry logs during scrape
# Scrapy will show retry attempts and status codes
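Scrapy's built-in retry middleware can also be tuned explicitly. These are standard Scrapy settings; the values shown are illustrative, not the project's defaults:

```python
# In property_bot/settings.py - illustrative values
RETRY_ENABLED = True
RETRY_TIMES = 3  # retries per request after the first attempt
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]
```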
Problem:
KeyError: 'location'
or
ValueError: could not convert string to float: 'Negotiable'
Cause: Unexpected data format in the raw CSV.
Solution: Check the raw data:
head -n 5 outputs/data/jiji_data.csv
# Verify expected columns exist
Common fixes:
  1. Missing location column:
# In clean.py, add check before extraction
if "location" in df.columns:
    df["locality"] = df["location"].apply(extract_locality)
    df = df.drop(columns=["location"])
else:
    df["locality"] = "Unknown"
  2. Non-numeric price values:
# Filter out non-numeric prices before conversion
df = df[df["price"].str.contains(r"GH₵\s*\d", na=False)]
df["price"] = (
    df["price"]
    .str.replace("GH₵ ", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)
  3. Properties field not JSON:
# Add error handling in expand_column
def expand_column(df, col, sep=None):
    try:
        if sep:
            encoded = df[col].str.strip().str.get_dummies(sep=sep)
        else:
            encoded = df[col].apply(ast.literal_eval).apply(pd.Series)
        return df.join(encoded).drop(columns=[col])
    except Exception as e:
        print(f"Warning: Could not expand {col}: {e}")
        return df
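Fix 2 above can also be expressed as a standalone parser that turns non-numeric values into missing values instead of dropping rows. `parse_price` is a hypothetical helper; the `GH₵ 1,200` format follows the examples above:

```python
import re

def parse_price(raw):
    """Parse a 'GH₵ 1,200' style price string to float; None if non-numeric."""
    if not isinstance(raw, str):
        return None
    match = re.search(r"GH₵\s*([\d,]+(?:\.\d+)?)", raw)
    if not match:
        return None
    return float(match.group(1).replace(",", ""))
```

With pandas, `df["price"].apply(parse_price)` leaves rows like "Negotiable" as missing values rather than raising.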
Problem: Resume mode re-scrapes URLs that were already scraped.
Cause:
  • URL mismatch between URL CSV and data CSV
  • Data CSV was deleted or moved
  • URLs in data CSV have different format (trailing slash, query params)
Diagnosis:
# Compare URL formats
head -n 3 outputs/urls/jiji_urls.csv
head -n 3 outputs/data/jiji_data.csv

# Count unique URLs in each file
tail -n +2 outputs/urls/jiji_urls.csv | cut -d',' -f1 | sort | uniq | wc -l
tail -n +2 outputs/data/jiji_data.csv | cut -d',' -f1 | sort | uniq | wc -l
Solution: URLs should match exactly. If they don’t:
  1. Check for URL normalization issues:
# In listing spider start_requests(), normalize URLs:
for entry in self.urls:
    url = entry["url"].rstrip("/")  # Remove trailing slash
    yield scrapy.Request(url, ...)
  2. Verify URL deduplication is working:
Check that _seen_urls loads correctly in base_spider.py:79-93:
if hasattr(self, "OUTPUT_CSV") and self.OUTPUT_CSV.exists():
    with open(self.OUTPUT_CSV, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for row in reader:
            val = row.get(self.URL_FIELD, "").strip()
            if val:
                self._seen_urls.add(val)
  3. Manually create resume queue:
# Extract missing URLs
comm -23 \
  <(tail -n +2 outputs/urls/jiji_urls.csv | cut -d',' -f1 | sort) \
  <(tail -n +2 outputs/data/jiji_data.csv | cut -d',' -f1 | sort) \
  > missing_urls.txt

# Create resume queue CSV
echo "url" > outputs/urls/jiji_resume_queue.csv
cat missing_urls.txt >> outputs/urls/jiji_resume_queue.csv

# Run with custom queue
scrapy crawl jiji_listings -a csv_path=outputs/urls/jiji_resume_queue.csv
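A normalization helper makes the trailing-slash and query-parameter fixes above systematic. `normalize_url` is a sketch using only the standard library:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    """Canonicalize a listing URL: strip query string, fragment, trailing slash."""
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme, parts.netloc, parts.path.rstrip("/"), "", ""))
```

Applying the same normalization both when writing the URL CSV and when loading _seen_urls keeps the two files comparable.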

Debugging Tips

Enable Debug Logging

# Run spider with debug logs
scrapy crawl jiji_listings -L DEBUG

# Save logs to file
scrapy crawl jiji_listings -L DEBUG --logfile=debug.log

Test Single URL in Scrapy Shell

# Interactive shell for testing selectors
scrapy shell "https://jiji.com.gh/greater-accra/..."

# Test selectors
>>> response.css("h1::text").get()
>>> response.css(".qa-advert-price::text").get()
>>> response.css(".b-advert-attribute::text").getall()

Inspect Playwright Context

# In spider parse method, add debugging
async def parse(self, response):
    page = response.meta.get("playwright_page")
    
    # Take screenshot
    if page:
        await page.screenshot(path="debug.png")
    
    # Print page content
    content = await page.content()
    self.logger.debug(f"Page HTML: {content[:500]}")

Check CSV Output

# Validate CSV structure
head -n 1 outputs/data/jiji_data.csv  # Check headers
tail -n 5 outputs/data/jiji_data.csv  # Check recent entries

# Count records
wc -l outputs/data/jiji_data.csv

# Find empty fields
cut -d',' -f7 outputs/data/jiji_data.csv | grep -c '^$'  # Empty prices
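Note that `cut` miscounts fields when values contain quoted commas, so a csv-aware check is safer. `empty_field_counts` is a hypothetical helper:

```python
import csv
from collections import Counter

def empty_field_counts(csv_path):
    """Count empty values per column, parsing quoted fields correctly."""
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            for col, val in row.items():
                if not (val or "").strip():
                    counts[col] += 1
    return dict(counts)
```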

Verify Playwright Installation

# Show install options and supported browser names
python -m playwright install --help

# Test browser launch
python -c "from playwright.sync_api import sync_playwright; \
p = sync_playwright().start(); p.chromium.launch().close(); p.stop(); \
print('Browser launched successfully')"

Getting Help

Check Scrapy Stats

At end of spider run, Scrapy prints useful statistics:
'downloader/request_count': 150
'downloader/response_count': 145
'downloader/response_status_count/200': 140
'downloader/response_status_count/403': 5
'item_scraped_count': 138
'retry/count': 8
'retry/max_reached': 2
Key metrics:
  • item_scraped_count - Total items successfully saved
  • response_status_count/403 - Blocked requests
  • retry/max_reached - Failed after max retries
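These stats can also be pulled out of a saved log file programmatically. This regex sketch assumes the quoted `'key': value` format shown above:

```python
import re

def parse_scrapy_stats(log_text):
    """Extract integer-valued entries from Scrapy's end-of-run stats dump."""
    return {key: int(val) for key, val in re.findall(r"'([\w/]+)':\s*(\d+)", log_text)}

stats = parse_scrapy_stats("""
'downloader/response_status_count/403': 5,
'item_scraped_count': 138,
'retry/max_reached': 2,
""")
if stats.get("downloader/response_status_count/403", 0) > 0:
    print("Some requests were blocked - consider lowering concurrency")
```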

Review Settings

Key settings in property_bot/settings.py:218-226:
ROBOTSTXT_OBEY = True
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8
DOWNLOAD_DELAY = 0.5

# Playwright browser options
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "headless": True,
    "args": ["--disable-blink-features=AutomationControlled", ...]
}

Compare Output Schemas

Expected fields from spiders:
Jiji: url, fetch_date, title, location, house_type, bedrooms, bathrooms, price, properties, amenities, description
Meqasa: url, Title, Price, Rate, Description, fetch_date, Categories, Lease options, Bedrooms, Bathrooms, Garage, Furnished, Amenities, Address, Reference, details
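A quick header check catches schema drift before the cleaning step. The expected-field list mirrors the Jiji schema above; `schema_diff` itself is a hypothetical helper:

```python
import csv

EXPECTED_JIJI = ["url", "fetch_date", "title", "location", "house_type",
                 "bedrooms", "bathrooms", "price", "properties", "amenities",
                 "description"]

def schema_diff(header, expected):
    """Report columns missing from and unexpected in a CSV header row."""
    return {"missing": [c for c in expected if c not in header],
            "unexpected": [c for c in header if c not in expected]}

# Hypothetical usage:
# with open("outputs/data/jiji_data.csv", newline="", encoding="utf-8") as f:
#     print(schema_diff(next(csv.reader(f)), EXPECTED_JIJI))
```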

Known Limitations

  • Meqasa cleaning not implemented - Only Jiji data goes through clean.py
  • No proxy rotation - May hit rate limits on large scrapes
  • No CAPTCHA handling - Sites may require manual intervention
  • Selector fragility - HTML changes require selector updates

Workarounds

For large scrapes:
  • Split into smaller batches (50-100 pages at a time)
  • Use DOWNLOAD_DELAY and AUTOTHROTTLE
  • Run during off-peak hours
For rate limiting:
  • Reduce CONCURRENT_REQUESTS
  • Increase AUTOTHROTTLE_START_DELAY
  • Add random delays in spider logic
For data quality:
  • Manually review samples from output CSVs
  • Test selectors in Scrapy shell before large runs
  • Monitor failures count during scrape
