## Purpose
The BR-ACC ETL framework (bracc-etl) is a production data ingestion system that extracts, transforms, and loads data from 45+ Brazilian public data sources into a Neo4j knowledge graph. The framework is designed for:
- Scale: Processing millions of entities (50M+ companies, 80M+ people)
- Reliability: Production-grade error handling, retry logic, and data validation
- Transparency: Full audit trails with `IngestionRun` nodes tracking every pipeline execution
- Extensibility: Plugin architecture for adding new data sources
## Architecture
The ETL framework follows a modular pipeline architecture:

1. Extract: Download raw data from government portals, APIs, and BigQuery exports
2. Transform: Normalize names, deduplicate records, validate documents, resolve entities
3. Load: Bulk insert into Neo4j using `UNWIND` for efficient batch processing
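The load stage's `UNWIND` batching can be sketched as follows. This is an illustrative example, not the framework's actual API: the `MERGE_COMPANIES` Cypher, `chunked` helper, and `load_companies` function are assumptions.

```python
from itertools import islice

def chunked(rows, size):
    """Yield lists of at most `size` items from any iterable."""
    it = iter(rows)
    while batch := list(islice(it, size)):
        yield batch

# Illustrative Cypher: one round trip per batch; UNWIND expands the
# $rows parameter into individual MERGE operations server-side.
MERGE_COMPANIES = """
UNWIND $rows AS row
MERGE (c:Company {cnpj: row.cnpj})
SET c.name = row.name
"""

def load_companies(session, rows, batch_size=10_000):
    """Write rows to Neo4j in batches; `session` is a neo4j.Session."""
    for batch in chunked(rows, batch_size):
        session.execute_write(
            lambda tx, b=batch: tx.run(MERGE_COMPANIES, rows=b).consume()
        )
```

Batching keeps transaction size bounded: one parameterized query per 10,000 rows instead of one query per row.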
## Key Features
### Multi-Format Support
The framework handles diverse data formats:

- CSV files: Headerless Receita Federal CSVs (`;` delimiter, `latin-1` encoding)
- BigQuery exports: Base dos Dados datasets with automatic column mapping
- REST APIs: TSE, Câmara, and Senado open data APIs
- Streaming ingestion: Memory-efficient chunk processing for 100 GB+ datasets
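Reading a headerless `;`-delimited `latin-1` CSV in fixed-size chunks can be sketched with pandas as below. The column names are placeholders, not the real Receita Federal layout.

```python
import io
import pandas as pd

# Positional column names, because Receita Federal CSVs ship without a
# header row. These three names are illustrative only.
COLUMNS = ["cnpj", "razao_social", "natureza_juridica"]

def read_rf_csv(path_or_buf, chunksize=100_000):
    """Stream a headerless ';'-delimited latin-1 CSV in fixed-size chunks."""
    return pd.read_csv(
        path_or_buf,
        sep=";",
        encoding="latin-1",
        header=None,
        names=COLUMNS,
        dtype=str,          # keep CNPJs as strings to preserve leading zeros
        chunksize=chunksize,
    )

# Passing chunksize makes read_csv return an iterator of DataFrames,
# so only one chunk is ever held in memory.
reader = read_rf_csv(io.BytesIO("01234567000195;ACME LTDA;206\n".encode("latin-1")))
first_chunk = next(iter(reader))
```

`dtype=str` matters here: parsing CNPJs as integers silently drops leading zeros.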
### Data Quality
Built-in validation and normalization:

- Document validation (CPF/CNPJ with check-digit verification)
- Name normalization (case, diacritics, whitespace)
- Date parsing (handles multiple Brazilian date formats)
- Schema validation using Pandera
- Deduplication on primary keys
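As an example of check-digit verification, the standard Módulo 11 algorithm for CPFs can be sketched as follows (the function name is illustrative, not the framework's API):

```python
import re

def cpf_is_valid(cpf: str) -> bool:
    """Validate a CPF using its two Módulo 11 check digits."""
    digits = re.sub(r"\D", "", cpf)            # strip '.', '-' and spaces
    if len(digits) != 11 or digits == digits[0] * 11:
        return False                           # reject repeated-digit CPFs like 000.000.000-00
    for pos in (9, 10):                        # verify both check digits
        weights = range(pos + 1, 1, -1)        # 10..2 for the 1st, 11..2 for the 2nd
        total = sum(int(d) * w for d, w in zip(digits, weights))
        if (total * 10) % 11 % 10 != int(digits[pos]):
            return False
    return True

cpf_is_valid("111.444.777-35")   # well-known valid test CPF
```

CNPJ validation follows the same Módulo 11 scheme with 14 digits and different weight sequences.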
### Operational Monitoring
Every pipeline run creates an `IngestionRun` node in Neo4j that records the run's status as it moves through the lifecycle: `running`, `loaded`, `quality_fail`.
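A minimal sketch of that audit trail is below; the property names and Cypher are assumptions, not the framework's actual schema.

```python
from datetime import datetime, timezone
from uuid import uuid4

def new_run(source: str) -> dict:
    """Build the property map for a fresh IngestionRun audit node.

    Property names here are illustrative; the real schema may differ.
    """
    return {
        "run_id": str(uuid4()),
        "source": source,
        "status": "running",
        "started_at": datetime.now(timezone.utc).isoformat(),
    }

# Illustrative lifecycle Cypher: the node is created as `running`,
# then flipped to `loaded` or `quality_fail` when the run finishes.
CREATE_RUN = "CREATE (r:IngestionRun) SET r = $props"
FINISH_RUN = """
MATCH (r:IngestionRun {run_id: $run_id})
SET r.status = $status, r.finished_at = $finished_at
"""
```

Because the audit nodes live in the same graph as the data, provenance queries ("which run loaded this company?") are ordinary Cypher.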
## Pipeline Base Class
All pipelines inherit from the `Pipeline` base class (see `etl/src/bracc_etl/base.py:11`).
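The base class itself is not reproduced here; as a rough sketch of the contract it implies, under the assumption that each pipeline supplies an extract, transform, and load step (method names and the example subclass are hypothetical):

```python
from abc import ABC, abstractmethod
import pandas as pd

class Pipeline(ABC):
    """Hypothetical sketch of the contract in etl/src/bracc_etl/base.py."""
    name: str = "base"

    @abstractmethod
    def extract(self) -> pd.DataFrame: ...

    @abstractmethod
    def transform(self, df: pd.DataFrame) -> pd.DataFrame: ...

    @abstractmethod
    def load(self, df: pd.DataFrame) -> int: ...

    def run(self) -> int:
        """Run extract -> transform -> load and return the record count."""
        return self.load(self.transform(self.extract()))

class IbamaEmbargoPipeline(Pipeline):
    """Toy subclass showing where source-specific logic plugs in."""
    name = "ibama_embargos"

    def extract(self) -> pd.DataFrame:
        return pd.DataFrame({"name": ["  Acme  LTDA "]})   # stand-in for a download

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        # Whitespace and case normalization, as in the Data Quality section
        df["name"] = df["name"].str.split().str.join(" ").str.upper()
        return df

    def load(self, df: pd.DataFrame) -> int:
        return len(df)   # real pipelines write to Neo4j here
```

This is the plugin architecture mentioned above: adding a data source means adding one subclass, while orchestration, auditing, and error handling stay in the base class.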
## Data Sources
The framework currently ingests 45+ data sources across 10 categories:

| Category | Sources | Examples |
|---|---|---|
| Identity | 1 | CNPJ (50M companies) |
| Electoral | 3 | TSE elections, donations, candidate assets |
| Contracts | 4 | Portal da Transparência, ComprasNet, PNCP |
| Sanctions | 6 | CEIS, CNEP, OFAC, EU, UN, World Bank |
| Fiscal | 3 | PGFN debt, tax waivers, SICONFI |
| Finance | 5 | BNDES loans, BCB penalties, CVM |
| Legislative | 4 | Câmara/Senado expenses, inquiries, CPIs |
| Judiciary | 2 | STF court data, DataJud |
| Environment | 1 | IBAMA embargos |
| International | 2 | ICIJ Offshore Leaks, OpenSanctions PEPs |
## Tech Stack

- Python 3.12+: Modern Python with type hints and async support
- Neo4j 5.x: Graph database for relationship-heavy data
- Pandas: Vectorized data transformation and CSV processing
- Click: CLI interface with rich options
## Dependencies

Core dependencies (see `etl/pyproject.toml:7`):

- `neo4j>=5.27.0`: Database driver
- `pandas>=2.2.0`: Data processing
- `httpx>=0.28.0`: HTTP client for downloads
- `click>=8.1.0`: CLI framework
- `pydantic>=2.10.0`: Settings management
- `pandera>=0.21.0`: Schema validation

Optional extras:

- `bigquery`: Google BigQuery integration
- `resolution`: Splink entity resolution
## Performance

Production benchmarks:

| Pipeline | Records | Processing Time | Mode |
|---|---|---|---|
| CNPJ | 50M companies + 80M partners | 6 hours | Streaming |
| TSE | 30M donation records | 45 minutes | Batch |
| Transparência | 10M contracts | 20 minutes | Batch |
| PGFN | 25M debt records | 1 hour | Streaming |

- Batch mode: Loads the full dataset into memory (up to 32 GB)
- Streaming mode: Fixed 2 GB memory footprint with chunked processing
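The streaming mode above can be sketched as a driver loop over DataFrame chunks; `stream_pipeline` and its callables are illustrative names, not the framework's API.

```python
import pandas as pd

def stream_pipeline(chunks, transform, load):
    """Drive a streaming run: only one chunk is resident in memory at a time.

    `chunks` is any iterator of DataFrames (e.g. pandas read_csv with
    chunksize); `transform` and `load` are per-chunk callables, and
    `load` returns the number of records it wrote.
    """
    total = 0
    for chunk in chunks:
        total += load(transform(chunk))
        del chunk               # release the chunk before the next read
    return total
```

Peak memory is bounded by the chunk size rather than the dataset size, which is what keeps 100 GB+ sources inside the fixed footprint.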
## Next Steps

- Pipeline Architecture: Learn about pipeline design patterns and the ETL workflow
- Running Pipelines: Run your first pipeline locally
- Creating Pipelines: Build a new pipeline for a custom data source
- Data Sources: Browse the complete data source registry