
Purpose

The BR-ACC ETL framework (bracc-etl) is a production data ingestion system that extracts, transforms, and loads data from 45+ Brazilian public data sources into a Neo4j knowledge graph. The framework is designed for:
  • Scale: Processing millions of entities (50M+ companies, 80M+ people)
  • Reliability: Production-grade error handling, retry logic, and data validation
  • Transparency: Full audit trails with IngestionRun nodes tracking every pipeline execution
  • Extensibility: Plugin architecture for adding new data sources

Architecture

The ETL framework follows a modular pipeline architecture:

Extract

Download raw data from government portals, APIs, and BigQuery exports

Transform

Normalize names, deduplicate records, validate documents, resolve entities

Load

Bulk insert into Neo4j using UNWIND for efficient batch processing
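
For orientation, a minimal sketch of UNWIND-based batch loading through the official neo4j Python driver follows; the Company label, its properties, and the connection details are illustrative placeholders, not the framework's actual schema:
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

BATCH_QUERY = """
UNWIND $rows AS row
MERGE (c:Company {cnpj: row.cnpj})
SET c.name = row.name
"""

def load_batch(rows: list[dict]) -> None:
    # One UNWIND transaction per batch: Neo4j plans the query once and
    # applies it to every row, avoiding per-record round trips.
    with driver.session() as session:
        session.execute_write(lambda tx: tx.run(BATCH_QUERY, rows=rows))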

Key Features

Multi-Format Support

The framework handles diverse data formats:
  • CSV files: Headerless Receita Federal CSVs (semicolon-delimited, latin-1 encoding); see the reading sketch after this list
  • BigQuery exports: Base dos Dados datasets with automatic column mapping
  • REST APIs: TSE, Câmara, Senado open data APIs
  • Streaming ingestion: Memory-efficient chunk processing for 100GB+ datasets
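
As a non-authoritative sketch (the file name and column names are illustrative, and process() stands in for the transform and load steps), the Receita Federal CSVs could be consumed in memory-bounded chunks like this:
import pandas as pd

for chunk in pd.read_csv(
    "empresas.csv",                  # illustrative file name
    sep=";",                         # Receita Federal delimiter
    encoding="latin-1",              # Receita Federal encoding
    header=None,                     # files ship without a header row
    names=["cnpj", "razao_social"],  # illustrative column subset
    dtype=str,                       # preserve leading zeros in documents
    chunksize=100_000,               # bounds memory regardless of file size
):
    process(chunk)                   # placeholder for transform/load steps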

Data Quality

Built-in validation and normalization:
  • Document validation (CPF/CNPJ with check-digit verification; see the sketch after this list)
  • Name normalization (case, diacritics, whitespace)
  • Date parsing (handles multiple Brazilian date formats)
  • Schema validation using Pandera
  • Deduplication on primary keys
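
For illustration only (this is the standard public algorithm, not necessarily the framework's validator), CPF check-digit verification works as follows:
import re

def validate_cpf(cpf: str) -> bool:
    """Verify the two CPF check digits (illustrative implementation)."""
    digits = re.sub(r"\D", "", cpf)
    if len(digits) != 11 or len(set(digits)) == 1:
        return False  # wrong length, or all-repeated digits like 111.111.111-11
    for i in (9, 10):
        # Weights run from i + 1 down to 2 over the first i digits.
        total = sum(int(digits[j]) * (i + 1 - j) for j in range(i))
        if (total * 10) % 11 % 10 != int(digits[i]):
            return False
    return True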

Operational Monitoring

Every pipeline run creates an IngestionRun node in Neo4j:
MATCH (r:IngestionRun)
WHERE r.source_id = 'cnpj'
RETURN r.status, r.rows_in, r.rows_loaded, r.started_at
ORDER BY r.started_at DESC
LIMIT 1
Statuses: running, loaded, quality_fail
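
On the write side, a pipeline might open its run with something like the hypothetical sketch below; the property names come from the query above, while the connection details and exact Cypher are assumptions:
from datetime import datetime, timezone
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

START_RUN = """
CREATE (r:IngestionRun {source_id: $source_id, status: 'running', started_at: $started_at})
"""

with driver.session() as session:
    # The status would later be updated to 'loaded' or 'quality_fail'.
    session.execute_write(
        lambda tx: tx.run(
            START_RUN,
            source_id="cnpj",
            started_at=datetime.now(timezone.utc).isoformat(),
        )
    )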

Pipeline Base Class

All pipelines inherit from Pipeline (see etl/src/bracc_etl/base.py:11):
from abc import ABC, abstractmethod

class Pipeline(ABC):
    name: str
    source_id: str

    @abstractmethod
    def extract(self) -> None:
        """Download raw data from source."""

    @abstractmethod
    def transform(self) -> None:
        """Normalize, deduplicate, and prepare data for loading."""

    @abstractmethod
    def load(self) -> None:
        """Load transformed data into Neo4j."""

    def run(self) -> None:
        """Execute the full ETL pipeline."""
        self.extract()
        self.transform()
        self.load()
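
A hypothetical subclass (the import path is inferred from base.py above, and the IBAMA source is just one example from the registry) would look like:
from bracc_etl.base import Pipeline  # import path assumed from base.py above

class IbamaPipeline(Pipeline):
    name = "IBAMA Embargos"
    source_id = "ibama"

    def extract(self) -> None:
        ...  # download raw embargo data

    def transform(self) -> None:
        ...  # normalize, deduplicate, validate

    def load(self) -> None:
        ...  # UNWIND batches into Neo4j

IbamaPipeline().run()  # runs extract, then transform, then load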

Data Sources

The framework currently ingests 45+ data sources across 10 categories:
Category      | Sources | Examples
--------------|---------|-------------------------------------------
Identity      | 1       | CNPJ (50M companies)
Electoral     | 3       | TSE elections, donations, candidate assets
Contracts     | 4       | Portal da Transparência, ComprasNet, PNCP
Sanctions     | 6       | CEIS, CNEP, OFAC, EU, UN, World Bank
Fiscal        | 3       | PGFN debt, tax waivers, SICONFI
Finance       | 5       | BNDES loans, BCB penalties, CVM
Legislative   | 4       | Câmara/Senado expenses, inquiries, CPIs
Judiciary     | 2       | STF court data, DataJud
Environment   | 1       | IBAMA embargos
International | 2       | ICIJ Offshore Leaks, OpenSanctions PEPs
See Data Sources for the complete registry.

Tech Stack

Python 3.12+

Modern Python with type hints and async support

Neo4j 5.x

Graph database for relationship-heavy data

Pandas

Vectorized data transformation and CSV processing

Click

CLI interface with rich options

Dependencies

Core dependencies (see etl/pyproject.toml:7):
  • neo4j>=5.27.0 — Database driver
  • pandas>=2.2.0 — Data processing
  • httpx>=0.28.0 — HTTP client for downloads
  • click>=8.1.0 — CLI framework
  • pydantic>=2.10.0 — Settings management
  • pandera>=0.21.0 — Schema validation
Optional extras:
  • bigquery — Google BigQuery integration
  • resolution — Splink entity resolution

Performance

Production benchmarks:
Pipeline      | Records                      | Processing Time | Mode
--------------|------------------------------|-----------------|----------
CNPJ          | 50M companies + 80M partners | 6 hours         | Streaming
TSE           | 30M donation records         | 45 minutes      | Batch
Transparência | 10M contracts                | 20 minutes      | Batch
PGFN          | 25M debt records             | 1 hour          | Streaming
Memory usage:
  • Batch mode: Loads full dataset into memory (up to 32GB)
  • Streaming mode: Fixed 2GB memory footprint with chunking

Next Steps

Pipeline Architecture

Learn about pipeline design patterns and the ETL workflow

Running Pipelines

Run your first pipeline locally

Creating Pipelines

Build a new pipeline for a custom data source

Data Sources

Browse the complete data source registry
