ScrapeAccraProperties uses a shared base class to provide consistent CSV writing, progress tracking, and URL deduplication across all spiders.

PropertyBaseSpider

All spiders inherit from PropertyBaseSpider, which extends Scrapy’s base Spider class with property scraping functionality.

Key Features

  • Incremental CSV Writes: items are written to CSV one at a time as they are scraped, with automatic deduplication
  • Progress Tracking: real-time UI showing scraped count, speed, errors, and page progress
  • URL Deduplication: duplicate URLs are detected and skipped automatically, across runs
  • Resume Support: existing URLs are loaded on startup to enable resume mode

Class Definition

# From base_spider.py:55-69
class PropertyBaseSpider(scrapy.Spider):
    OUTPUT_CSV: Path
    URL_FIELD: str = "url"
    OUTPUT_FIELDS: tuple[str, ...] | None = None

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_time_ts = time.time()
        self.scraped_count = 0
        self.failures = 0
        self._seen_urls = set()
        self._task_id = None
        self._csv_fieldnames = (
            list(self.OUTPUT_FIELDS) if self.OUTPUT_FIELDS is not None else []
        )

Class Attributes

Each spider must define OUTPUT_CSV; the other attributes have working defaults.
  • OUTPUT_CSV (Path, required): Path to the output CSV file where scraped data will be written
  • OUTPUT_FIELDS (tuple[str, ...] | None, default: None): Ordered field names for the CSV; if None, fields are discovered dynamically from the first scraped item
  • URL_FIELD (str, default: "url"): Name of the field containing the unique URL identifier used for deduplication
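The OUTPUT_FIELDS fallback can be sketched as follows. This is an illustration of the behavior described above, not the actual base_spider.py code; the function name is hypothetical.

```python
# Sketch of the OUTPUT_FIELDS fallback: with an explicit tuple the CSV
# header preserves that order; with None, the header comes from the keys
# of the first scraped item.
def resolve_fieldnames(output_fields, first_item):
    if output_fields is not None:
        return list(output_fields)
    return list(first_item.keys())

# Explicit tuple: order preserved exactly, regardless of item key order.
resolve_fieldnames(("url", "page"), {"page": 1, "url": "x"})  # → ["url", "page"]
# None: fields discovered from the first item (dict insertion order).
resolve_fieldnames(None, {"url": "x", "price": 100})  # → ["url", "price"]
```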

URL Deduplication

The base spider tracks URLs in two ways:
  1. On spider startup — Read existing URLs from the output CSV
  2. During scraping — Track newly scraped URLs in memory

Startup: Loading Existing URLs

# From base_spider.py:78-93
def _opened(self, spider):
    if hasattr(self, "OUTPUT_CSV") and self.OUTPUT_CSV.exists():
        try:
            with open(self.OUTPUT_CSV, newline="", encoding="utf-8") as f:
                reader = csv.DictReader(f)
                if reader.fieldnames:
                    self._csv_fieldnames = [k for k in reader.fieldnames if k]
                for row in reader:
                    try:
                        val = row.get(self.URL_FIELD, "").strip()
                        if val:
                            self._seen_urls.add(val)
                    except Exception:
                        continue
        except Exception:
            pass
This enables the spider to skip URLs that were already scraped in previous runs.

During Scraping: In-Memory Deduplication

# From base_spider.py:147-158
def save_item(self, item: dict):
    item = self._normalize_item(item)
    url_val = str(item.get(self.URL_FIELD, "")).strip()

    with tracker.csv_lock:
        if url_val and url_val in self._seen_urls:
            return  # Skip duplicate
        if url_val:
            self._seen_urls.add(url_val)
        # ... write to CSV
Deduplication happens at the URL level, not the item level. If two items have the same URL, only the first one is saved.
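The seen-set behavior can be demonstrated in isolation (the item data here is hypothetical):

```python
# URL-level deduplication: only the first item for a given URL survives,
# even if later items with the same URL carry different data.
seen_urls = set()
saved = []
items = [
    {"url": "https://example.com/a", "price": 100},
    {"url": "https://example.com/a", "price": 120},  # same URL, different price
    {"url": "https://example.com/b", "price": 300},
]
for item in items:
    url = item["url"].strip()
    if url and url in seen_urls:
        continue  # skip duplicate URL
    seen_urls.add(url)
    saved.append(item)

# saved now holds the first "a" item (price 100) and the "b" item.
```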

Incremental CSV Writing

Unlike Scrapy’s default feed exporters, PropertyBaseSpider writes items to CSV immediately after scraping:
# From base_spider.py:160-176
self.OUTPUT_CSV.parent.mkdir(parents=True, exist_ok=True)
file_exists = (
    self.OUTPUT_CSV.exists() and self.OUTPUT_CSV.stat().st_size > 0
)
fieldnames = self._get_output_fieldnames(item)
with open(self.OUTPUT_CSV, "a", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(
        f,
        fieldnames=fieldnames,
        extrasaction="ignore",
        quoting=csv.QUOTE_MINIMAL,
        lineterminator="\n",
    )
    if not file_exists:
        writer.writeheader()
    writer.writerow(item)

Benefits

  • Crash resilience: if the spider crashes, every item scraped before the crash is already saved to CSV
  • Resume support: the CSV can be read on startup to determine which URLs have already been scraped
  • Low memory use: items are written and released immediately rather than accumulated in memory
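The append-per-item pattern can be exercised standalone. This sketch mirrors the header-once logic shown above using a temporary file (paths and data are illustrative):

```python
import csv
import tempfile
from pathlib import Path

out = Path(tempfile.mkdtemp()) / "demo.csv"
fieldnames = ["url", "price"]

def write_row(row):
    # Header is written only when the file is missing or empty, so
    # repeated runs keep appending data rows under a single header.
    file_exists = out.exists() and out.stat().st_size > 0
    with open(out, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        if not file_exists:
            writer.writeheader()
        writer.writerow(row)

write_row({"url": "https://example.com/a", "price": 100})
write_row({"url": "https://example.com/b", "price": 200, "extra": "dropped"})

with open(out, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
# Two data rows under one header; the unknown "extra" key was ignored.
```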

Data Serialization

Complex Python objects are serialized to JSON before writing:
# From base_spider.py:106-116
@staticmethod
def _serialize_for_csv(value: Any) -> Any:
    if value is None:
        return ""
    if isinstance(value, dict):
        return json.dumps(value, ensure_ascii=False, sort_keys=True)
    if isinstance(value, (list, tuple, set)):
        return json.dumps(list(value), ensure_ascii=False)
    if isinstance(value, str):
        return " ".join(value.split())  # Normalize whitespace
    return value
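Applying these rules to sample values gives the following results (the helper is reproduced inline so the snippet runs standalone):

```python
import json
from typing import Any

# The serialization rules from above, reproduced for a runnable demo.
def serialize_for_csv(value: Any) -> Any:
    if value is None:
        return ""
    if isinstance(value, dict):
        return json.dumps(value, ensure_ascii=False, sort_keys=True)
    if isinstance(value, (list, tuple, set)):
        return json.dumps(list(value), ensure_ascii=False)
    if isinstance(value, str):
        return " ".join(value.split())  # collapse runs of whitespace
    return value

serialize_for_csv({"beds": 2, "area": "90 sqm"})  # → '{"area": "90 sqm", "beds": 2}'
serialize_for_csv(["pool", "gym"])                # → '["pool", "gym"]'
serialize_for_csv("  Spacious\n  flat ")          # → 'Spacious flat'
serialize_for_csv(None)                           # → ''
```

Note that sort_keys=True makes dict serialization deterministic, so identical dicts always produce identical CSV cells.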

Progress Tracking

The SpiderProgressTracker class provides a shared Rich progress UI for all running spiders.

Progress Display

Each spider gets a live progress bar showing:
  • Scraped — Total items scraped
  • Speed — Items per minute
  • Errors — Failed requests
  • Page — Current page (for URL spiders) or “scraped/total” (for listing spiders)
  • Elapsed — Time since spider started
# From base_spider.py:20-37
class SpiderProgressTracker:
    def __init__(self):
        self.progress = Progress(
            SpinnerColumn(),
            TextColumn("[cyan bold]{task.description:<22}"),
            TextColumn("[green]{task.fields[scraped]:>6,} scraped"),
            TextColumn("[yellow]{task.fields[speed]:>5.0f}/min"),
            TextColumn("[red]{task.fields[errors]:>2} err"),
            TextColumn("[dim]{task.fields[page]:<10}"),
            TimeElapsedColumn(),
            console=console,
            transient=False,
            refresh_per_second=10,
        )

Updating Progress

Spiders call update_ui() to refresh the progress display:
# From base_spider.py:177-194
def update_ui(self, current_page=None, total_pages=None):
    if self._task_id is None:
        return

    elapsed = time.time() - self.start_time_ts
    speed = (self.scraped_count / elapsed) * 60 if elapsed > 0 else 0

    page_text = f"p.{current_page}" if current_page else ""
    if total_pages:
        page_text += f"/{total_pages}"

    tracker.progress.update(
        self._task_id,
        scraped=self.scraped_count,
        speed=speed,
        errors=self.failures,
        page=page_text,
    )
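The speed figure is a plain items-per-minute rate; a worked example with illustrative numbers:

```python
# 40 items scraped over 120 seconds is 20 items per minute.
elapsed = 120.0
scraped_count = 40
speed = (scraped_count / elapsed) * 60 if elapsed > 0 else 0
# speed == 20.0
```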

Final Summary

When a spider completes, a summary panel is displayed:
# From base_spider.py:196-207
def _print_summary(self):
    mins, secs = divmod(int(time.time() - self.start_time_ts), 60)
    t = Table(show_header=False, box=box.SIMPLE, padding=(0, 2))
    t.add_column(style="dim")
    t.add_column(style="bold")
    t.add_row("Spider", f"[cyan]{self.name.upper()}[/]")
    t.add_row("Scraped", f"[green]{self.scraped_count:,}[/]")
    t.add_row("Errors", f"[red]{self.failures}[/]")
    t.add_row("Duration", f"{mins}m {secs}s")
    console.print(
        Panel(t, title="[bold] Complete[/]", border_style="green", expand=False)
    )

URL Spiders

URL spiders collect listing URLs from search result pages.

Jiji URL Spider

# From jiji_urls.py:16-33
class JijiUrlSpider(PropertyBaseSpider):
    name = "jiji_urls"
    OUTPUT_CSV = PROJECT_ROOT / "outputs" / "urls" / "jiji_urls.csv"
    URL_FIELD = "url"
    OUTPUT_FIELDS = ("url", "page", "fetch_date")

    def __init__(
        self, start_page=1, max_pages=None, total_listing=None, *args, **kwargs
    ):
        super().__init__(*args, **kwargs)
        self.start_page = int(start_page)
        self.max_pages = (
            math.ceil(int(total_listing) / LISTINGS_PER_PAGE)
            if total_listing
            else (int(max_pages) if max_pages else None)
        )
        self.total_count = self.max_pages
        self._detected = False
  • start_page (int, default: 1): First page to start scraping from
  • max_pages (int, optional): Fixed number of pages to scrape, if not using auto-detection
  • total_listing (int, optional): Expected total listings, converted to a page count (20 listings per page)
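The total_listing-to-pages conversion uses ceiling division, so a partial final page still gets scraped. For example:

```python
import math

LISTINGS_PER_PAGE = 20  # as documented above

# 125 listings → 6 full pages plus 5 leftover listings → 7 pages.
assert math.ceil(125 / LISTINGS_PER_PAGE) == 7
# An exact multiple adds no extra page.
assert math.ceil(120 / LISTINGS_PER_PAGE) == 6
```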

Meqasa URL Spider

# From meqasa_urls.py:16-27
class MeqasaUrlSpider(PropertyBaseSpider):
    name = "meqasa_urls"
    OUTPUT_CSV = PROJECT_ROOT / "outputs" / "urls" / "meqasa_urls.csv"
    URL_FIELD = "url"
    OUTPUT_FIELDS = ("url", "page", "fetch_date")

    def __init__(self, start_page=1, total_pages=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_page = int(start_page)
        self.total_pages = int(total_pages) if total_pages else None
        self.total_count = self.total_pages
        self._detected = False
  • start_page (int, default: 1): First page to start scraping from
  • total_pages (int, optional): Fixed number of pages to scrape, if not using auto-detection

Common Pattern: Dynamic Request Generation

Both URL spiders use the same pattern:
  1. If max_pages or total_pages is provided, generate all requests immediately
  2. Otherwise, make a single request with is_detector=True to auto-detect total count
  3. Once detected, generate remaining requests
# From jiji_urls.py:35-42
def start_requests(self):
    if self.max_pages:
        yield from (
            self._make_request(p)
            for p in range(self.start_page, self.start_page + self.max_pages)
        )
    else:
        yield self._make_request(self.start_page, is_detector=True)

Listing Spiders

Listing spiders read URLs from CSV files and extract detailed property data.

Jiji Listing Spider

# From jiji_listing.py:13-37
class JijiListingSpider(PropertyBaseSpider):
    name = "jiji_listings"
    OUTPUT_CSV = PROJECT_ROOT / "outputs" / "data" / "jiji_data.csv"
    URL_FIELD = "url"
    OUTPUT_FIELDS = (
        "url",
        "fetch_date",
        "title",
        "location",
        "house_type",
        "bedrooms",
        "bathrooms",
        "price",
        "properties",
        "amenities",
        "description",
    )

    def __init__(self, csv_path="outputs/urls/jiji_urls.csv", *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.csv_path = (
            Path(csv_path) if Path(csv_path).is_absolute() else PROJECT_ROOT / csv_path
        )
        self.urls = self._load_urls()
        self.total_count = len(self.urls)
  • csv_path (str, default: "outputs/urls/jiji_urls.csv"): Path to the URL CSV file (relative to the project root, or absolute)

URL Loading

Listing spiders load URLs from CSV on initialization:
# From jiji_listing.py:39-49
def _load_urls(self):
    today = datetime.now().strftime("%Y-%m-%d")
    urls = []
    if self.csv_path.exists():
        with open(self.csv_path, "r", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                if url := row.get("url", "").strip():
                    urls.append(
                        {"url": url, "fetch_date": row.get("fetch_date") or today}
                    )
    return urls
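The same loading logic can be exercised against an in-memory CSV (the rows here are hypothetical): blank URLs are skipped, and a missing fetch_date falls back to today's date.

```python
import csv
import io
from datetime import datetime

today = datetime.now().strftime("%Y-%m-%d")
raw = (
    "url,page,fetch_date\n"
    "https://example.com/a,1,2024-01-01\n"
    ",1,\n"                        # blank URL: skipped
    "https://example.com/b,2,\n"   # empty fetch_date: defaults to today
)
urls = []
for row in csv.DictReader(io.StringIO(raw)):
    if url := (row.get("url") or "").strip():
        urls.append({"url": url, "fetch_date": row.get("fetch_date") or today})

# urls has two entries; the blank-URL row was dropped.
```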

Meqasa Listing Spider

# From meqasa_listing.py:12-41
class MeqasaListingSpider(PropertyBaseSpider):
    name = "meqasa_listings"
    OUTPUT_CSV = PROJECT_ROOT / "outputs" / "data" / "meqasa_data.csv"
    URL_FIELD = "url"
    OUTPUT_FIELDS = (
        "url",
        "Title",
        "Price",
        "Rate",
        "Description",
        "fetch_date",
        "Categories",
        "Lease options",
        "Bedrooms",
        "Bathrooms",
        "Garage",
        "Furnished",
        "Amenities",
        "Address",
        "Reference",
        "details",
    )

    def __init__(self, csv_path="outputs/urls/meqasa_urls.csv", *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.csv_path = (
            Path(csv_path) if Path(csv_path).is_absolute() else PROJECT_ROOT / csv_path
        )
        self.urls = self._load_urls()
        self.total_count = len(self.urls)
  • csv_path (str, default: "outputs/urls/meqasa_urls.csv"): Path to the URL CSV file (relative to the project root, or absolute)

Error Handling

All spiders use an async error callback for Playwright requests:
# From base_spider.py:209-213
async def errback_close_page(self, failure):
    self.failures += 1
    if page := failure.request.meta.get("playwright_page"):
        await page.close()
This ensures that Playwright page objects are properly closed even when requests fail, preventing memory leaks.
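The close-on-failure pattern can be sketched without Scrapy or Playwright installed; the FakePage and FakeFailure stand-ins below are hypothetical, existing only to make the demo self-contained.

```python
import asyncio

class FakePage:
    """Hypothetical stand-in for a Playwright page."""
    def __init__(self):
        self.closed = False
    async def close(self):
        self.closed = True

class FakeFailure:
    """Hypothetical stand-in for a Scrapy Failure carrying request.meta."""
    def __init__(self, meta):
        self.request = type("Req", (), {"meta": meta})()

failure_count = [0]

async def errback_close_page(failure):
    # Count the failure, then close the attached page if one exists.
    failure_count[0] += 1
    if page := failure.request.meta.get("playwright_page"):
        await page.close()

page = FakePage()
asyncio.run(errback_close_page(FakeFailure({"playwright_page": page})))
# page.closed is now True and failure_count[0] == 1
```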

Usage Examples

Running URL Spiders Directly

scrapy crawl jiji_urls -a start_page=1

Running Listing Spiders Directly

scrapy crawl jiji_listings
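The spider arguments documented above can be combined with -a; for example (values illustrative):

```shell
# Scrape a fixed number of pages instead of auto-detecting the total
scrapy crawl jiji_urls -a start_page=1 -a max_pages=10

# Derive the page count from an expected listing total (20 per page)
scrapy crawl jiji_urls -a total_listing=500

# Point a listing spider at a custom URL CSV
scrapy crawl meqasa_listings -a csv_path=outputs/urls/meqasa_urls.csv
```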

Next Steps

Data Schema

Explore the CSV schemas and field definitions for each site

Configuration

Learn about Scrapy settings and Playwright configuration
