System Overview
BR-ACC is built as a modern, containerized application stack that ingests Brazilian public data, normalizes it into a graph database, and exposes it through a REST API and web interface.
Architecture Layers
1. Data Sources Layer
BR-ACC integrates 45+ Brazilian public data sources, grouped into four categories:
- Government Portals
- Sector-Specific
- International
- Derived
Government Portals include:
- Receita Federal - CNPJ company registry (60M+ companies)
- Portal da Transparência - Federal spending, sanctions, contracts
- TSE - Electoral data, donations, candidate assets, party membership
- ComprasNet/PNCP - Federal procurement and bids
- TCU - Federal Court of Auditors sanctions
- TransfereGov - Federal transfers to states/municipalities
2. ETL Layer
The ETL (Extract, Transform, Load) layer is built with Python and follows a modular pipeline architecture.
Pipeline Structure
Each data source has a dedicated pipeline in etl/src/bracc_etl/pipelines/.
Common Transforms
Shared transformation utilities live in etl/src/bracc_etl/transforms/.
ETL Orchestration
The runner in etl/src/bracc_etl/runner.py:
- Loads the source contract from config/bootstrap_all_contract.yml
- Runs pipelines in dependency order
- Continues on errors and classifies outcomes
- Generates audit reports in audit-results/bootstrap-all/
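The "dependency order, continue on errors, classify outcomes" behavior can be sketched with the standard library alone. This is an illustration under assumptions, not the contents of runner.py: the function name, the shape of the two dictionaries, and the outcome labels ("success", "failed", "skipped") are all hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Callable


def run_in_dependency_order(
    pipelines: dict[str, Callable[[], None]],
    depends_on: dict[str, set[str]],
) -> dict[str, str]:
    """Run each pipeline after its dependencies and classify the outcome."""
    graph = {name: depends_on.get(name, set()) for name in pipelines}
    outcomes: dict[str, str] = {}
    # static_order() yields every node only after all of its dependencies.
    for name in TopologicalSorter(graph).static_order():
        # Assumed policy: skip a pipeline when a dependency did not succeed.
        if any(outcomes.get(dep) != "success" for dep in graph.get(name, set())):
            outcomes[name] = "skipped"
            continue
        try:
            pipelines[name]()
            outcomes[name] = "success"
        except Exception as exc:
            # Record the failure and keep going with the remaining pipelines.
            outcomes[name] = f"failed: {exc}"
    return outcomes
```

The returned mapping is the kind of per-source summary an audit report could be built from; whether the real runner skips dependents of a failed pipeline is not stated in this overview.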
3. Graph Database Layer
Neo4j 5 Community Edition stores all data as a labeled property graph.
Schema Overview
From infra/neo4j/init.cypher, the schema defines four groups:
- Core Entities
- Transactions
- Sanctions & Restrictions
- Sector-Specific
Core Entities:
- Person - Individuals (CPF)
- Company - Organizations (CNPJ)
- Partner - Company partners/shareholders
Graph Relationships
Common relationship types:
Query Examples
From api/src/bracc/queries/:
4. Backend API Layer
FastAPI (Python 3.12+) provides an async REST API with automatic OpenAPI documentation.
Project Structure
Key Endpoints
From api/src/bracc/routers/public.py and meta.py:
- Public Endpoints
- Health & Stats
- Pattern Detection
Privacy & Security
From api/src/bracc/main.py:
5. Frontend Layer
React 19 + TypeScript + Vite provides the web interface.
Tech Stack
From frontend/package.json:
Key Features
- Company Search - Search by CNPJ, name, or ID
- Graph Visualization - Interactive force-directed graph with react-force-graph-2d
- Entity Details - Comprehensive entity information panels
- Investigation Workspace - Create and manage investigation collections
- Pattern Highlighting - Visual indicators for detected patterns
- Multi-language - Portuguese (pt-BR) and English (en) via i18next
Docker Compose Architecture
From docker-compose.yml:
Data Flow Example
Let’s trace a complete request flow:
Enforce privacy policies
Filter results through public_guard.py:
- Remove Person nodes if PUBLIC_MODE=true
- Strip CPF fields from all nodes
- Filter sensitive properties
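The three filtering rules above can be illustrated with a small pure function. This is a sketch only and does not reproduce public_guard.py: the node shape (a dict with "labels" and "properties"), the function name, and the SENSITIVE_KEYS set are assumptions made for the example.

```python
from typing import Any

# Hypothetical list of property names treated as sensitive.
SENSITIVE_KEYS = frozenset({"cpf"})


def filter_public(nodes: list[dict[str, Any]], public_mode: bool) -> list[dict[str, Any]]:
    """Drop Person nodes in public mode and strip sensitive properties everywhere."""
    safe: list[dict[str, Any]] = []
    for node in nodes:
        # Rule 1: Person nodes never leave the API when public mode is on.
        if public_mode and "Person" in node.get("labels", []):
            continue
        # Rules 2 and 3: remove CPF and other sensitive keys from every node.
        props = {
            key: value
            for key, value in node.get("properties", {}).items()
            if key.lower() not in SENSITIVE_KEYS
        }
        # Build a new dict so the caller's data is left untouched.
        safe.append({**node, "properties": props})
    return safe
```

Applying the filter on the server, after the query and before serialization, keeps the guarantee independent of what the frontend chooses to render.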
Performance Considerations
Neo4j Optimization
Memory Configuration
Production scale (40M+ nodes) requires 32GB+ RAM (64GB recommended).
Query Optimization
- All unique constraints create indexes automatically
- Use elementId() for fast node lookup
- Limit depth in graph traversals (max 3 recommended)
- Use APOC procedures for complex graph operations
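As an illustration of the elementId() lookup and the depth limit, the helper below builds a bounded neighborhood query. It is a sketch: the function name, the parameter names ($element_id, $limit), and the query shape are assumptions, not queries taken from the project. The depth is validated and written into the query text because variable-length bounds are normally part of the pattern itself rather than a query parameter.

```python
MAX_DEPTH = 3  # recommended ceiling from the list above


def neighborhood_query(depth: int) -> str:
    """Return a Cypher query that expands at most `depth` hops from one node."""
    # Reject anything that is not a plain int in range before it reaches the query text.
    if type(depth) is not int or not 1 <= depth <= MAX_DEPTH:
        raise ValueError(f"depth must be an integer between 1 and {MAX_DEPTH}")
    return (
        "MATCH (start) WHERE elementId(start) = $element_id "
        f"MATCH path = (start)-[*1..{depth}]-(neighbor) "
        "RETURN path LIMIT $limit"
    )
```

Only the validated integer is interpolated; the node identifier and the row limit stay as parameters, so user input never becomes part of the query string.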
Batch Loading
ETL uses Neo4jBatchLoader with:
- Batch size: 1000-5000 nodes
- UNWIND for bulk inserts
- Periodic commits for large datasets
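The batching pattern can be sketched without a database connection. The code below is not Neo4jBatchLoader; it only shows the idea of slicing rows into fixed-size batches and pairing each batch with a single UNWIND statement. The function names and the cnpj property key are assumptions for the example.

```python
from itertools import islice
from typing import Any, Iterable, Iterator

# Hypothetical statement: one MERGE per row, driven by a single UNWIND.
MERGE_COMPANIES = (
    "UNWIND $rows AS row "
    "MERGE (c:Company {cnpj: row.cnpj}) "
    "SET c += row"
)


def batched_rows(
    rows: Iterable[dict[str, Any]], batch_size: int = 1000
) -> Iterator[list[dict[str, Any]]]:
    """Yield consecutive lists of at most `batch_size` rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def build_statements(
    rows: Iterable[dict[str, Any]], batch_size: int = 1000
) -> list[tuple[str, dict[str, Any]]]:
    """Pair the UNWIND query with one parameter dict per batch."""
    return [(MERGE_COMPANIES, {"rows": batch}) for batch in batched_rows(rows, batch_size)]
```

Each (query, parameters) pair could then be executed in its own transaction, which is one way to get the periodic-commit behavior listed above.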
API Performance
Caching
- Stats endpoint cached for 5 minutes
- Neo4j connection pooling (default: 50 connections)
- Response compression via FastAPI middleware
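A five-minute cache for a single expensive value, such as the stats payload, can be sketched as follows. How the project actually caches the endpoint is not shown in this overview; the class name, the loader callback, and the injectable clock are assumptions chosen to keep the example testable.

```python
import time
from typing import Any, Callable


class TTLCache:
    """Cache one computed value and refresh it after `ttl_seconds`."""

    def __init__(
        self,
        loader: Callable[[], Any],
        ttl_seconds: float = 300.0,  # five minutes
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Any = None
        self._expires_at = None  # None means nothing has been cached yet

    def get(self) -> Any:
        now = self._clock()
        # Reload on first use and whenever the cached value has expired.
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._loader()
            self._expires_at = now + self._ttl
        return self._value
```

Using a monotonic clock avoids surprises when the wall clock is adjusted, and passing the clock in makes expiry easy to exercise in tests.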
Rate Limiting
Async Operations
- FastAPI with async/await throughout
- Neo4j async driver (neo4j.AsyncDriver)
- Concurrent query execution where possible
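Concurrent execution of independent queries usually comes down to asyncio.gather. The sketch below uses a stand-in coroutine instead of real Neo4j calls; run_concurrently and fake_count are hypothetical names, and the mapping of labels to zero-argument coroutine factories is an assumption about how such a helper might be shaped.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def run_concurrently(
    queries: dict[str, Callable[[], Awaitable[Any]]],
) -> dict[str, Any]:
    """Start every query at once and return results keyed by name."""
    names = list(queries)
    # gather() preserves argument order, so results line up with names.
    results = await asyncio.gather(*(queries[name]() for name in names))
    return dict(zip(names, results))


async def fake_count(value: int, delay: float = 0.01) -> int:
    """Stand-in for an awaitable database query."""
    await asyncio.sleep(delay)
    return value


if __name__ == "__main__":
    totals = asyncio.run(
        run_concurrently(
            {"companies": lambda: fake_count(3), "contracts": lambda: fake_count(5)}
        )
    )
    print(totals)
```

With a real driver, each concurrent query would likely need its own session, since driver sessions are generally not meant to be shared between concurrently running tasks.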
Security Architecture
Defense in Depth
- Environment-based Configuration
  - Public mode disables sensitive endpoints
  - Tier system (community vs enterprise)
  - Feature flags for experimental features
- Middleware Stack
  - Security headers (CSP, HSTS, X-Frame-Options)
  - CPF masking in all responses
  - Rate limiting per IP
  - CORS with explicit origin whitelist
- Data Access Control
  - Public-safe defaults block Person nodes
  - Property-level sanitization
  - Query-level enforcement (not relying on frontend)
- Authentication (when enabled)
  - JWT tokens with secure secret
  - Password hashing with bcrypt
  - Invite-code system for registration
- Audit Trail
  - All ingestion runs logged to graph
  - Source attribution on every node
  - Temporal violation tracking
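The CPF masking step in the middleware stack could look roughly like the function below when applied to a serialized response body. This is a sketch, not the project's middleware: the pattern, the mask format, and the function name are assumptions. It matches 11-digit CPFs with or without punctuation and leaves 14-digit CNPJs alone.

```python
import re

# A CPF has 11 digits, commonly written as 000.000.000-00 or without punctuation.
# The lookarounds stop the pattern from matching inside longer digit runs.
CPF_PATTERN = re.compile(r"(?<!\d)\d{3}\.?\d{3}\.?\d{3}-?\d{2}(?!\d)")


def mask_cpf(text: str) -> str:
    """Replace every CPF-shaped number in `text` with a fixed mask."""
    return CPF_PATTERN.sub("***.***.***-**", text)
```

A purely pattern-based mask will also hide other standalone 11-digit numbers, so it works best as a last line of defense behind the property-level sanitization listed above.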
Deployment Architectures
- Local Development
- Production Single-Server
- Production Distributed
Local Development:
- Neo4j, API, Frontend in containers
- Seed data for testing
- Hot reload for development
Monitoring & Observability
Neo4j Metrics
- Query performance via dbms.querylog
- Memory usage
- Transaction throughput
- Cache hit rates
API Metrics
- Request rate and latency
- Error rates by endpoint
- Rate limit hits
- Health check status
ETL Metrics
- Pipeline success/failure rates
- Data quality metrics
- Source availability
- Ingestion duration
System Metrics
- CPU and memory usage
- Disk I/O
- Network throughput
- Docker container health
Next Steps
API Reference
Explore all available endpoints
ETL Deep Dive
Learn about data pipelines
Graph Schema
Complete Neo4j schema reference
Contributing
Contribute to the project