System Overview

BR-ACC is built as a modern, containerized application stack that ingests Brazilian public data, normalizes it into a graph database, and exposes it through a REST API and web interface.

Architecture Layers

1. Data Sources Layer

BR-ACC integrates 45+ Brazilian public data sources:
  • Receita Federal - CNPJ company registry (60M+ companies)
  • Portal da Transparência - Federal spending, sanctions, contracts
  • TSE - Electoral data, donations, candidate assets, party membership
  • ComprasNet/PNCP - Federal procurement and bids
  • TCU - Federal Court of Auditors sanctions
  • TransfereGov - Federal transfers to states/municipalities
All sources are accessed through their official APIs or open data portals. No scraping of private data occurs.

2. ETL Layer

The ETL (Extract, Transform, Load) layer is built with Python and follows a modular pipeline architecture.

Pipeline Structure

Each data source has a dedicated pipeline in etl/src/bracc_etl/pipelines/:
# Example: etl/src/bracc_etl/pipelines/cnpj.py
import pandas as pd
from neo4j import Driver

from bracc_etl.base import Pipeline
from bracc_etl.loader import Neo4jBatchLoader
from bracc_etl.transforms import (
    format_cnpj,
    format_cpf,
    normalize_name,
    parse_date,
    deduplicate_rows
)

class CNPJPipeline(Pipeline):
    def extract(self) -> pd.DataFrame:
        """Download and read CNPJ data from Receita Federal"""
        ...  # Download logic (elided in this excerpt)

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        """Normalize and clean data"""
        df['cnpj'] = df['cnpj'].apply(format_cnpj)
        df['name'] = df['razao_social'].apply(normalize_name)
        df['start_date'] = df['data_inicio'].apply(parse_date)
        return deduplicate_rows(df)

    def load(self, df: pd.DataFrame, driver: Driver) -> None:
        """Write to Neo4j in batches"""
        loader = Neo4jBatchLoader(driver)
        loader.load_companies(df)
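The three-stage contract can be illustrated without pandas or Neo4j. The following is a minimal sketch only: the real `bracc_etl.base.Pipeline` is not shown in this page, so the `run()` method, the abstract-method layout, and the toy `UppercasePipeline` are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any


class Pipeline(ABC):
    """Hypothetical base class: subclasses supply the three ETL stages."""

    @abstractmethod
    def extract(self) -> Any: ...

    @abstractmethod
    def transform(self, data: Any) -> Any: ...

    @abstractmethod
    def load(self, data: Any, driver: Any) -> None: ...

    def run(self, driver: Any) -> None:
        # Chain the stages so every source is executed the same way.
        self.load(self.transform(self.extract()), driver)


class UppercasePipeline(Pipeline):
    """Toy pipeline using plain lists instead of DataFrames and Neo4j."""

    def extract(self) -> list[str]:
        return ["acme ltda", "acme ltda", "beta sa"]

    def transform(self, data: list[str]) -> list[str]:
        # Normalize names, then drop duplicates while keeping order.
        return list(dict.fromkeys(name.upper() for name in data))

    def load(self, data: list[str], driver: list[str]) -> None:
        driver.extend(data)  # the "driver" here is just a list acting as a sink


sink: list[str] = []
UppercasePipeline().run(sink)
print(sink)
```

Keeping the stages separate means each one can be tested on its own, with a fake sink standing in for the database.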

Common Transforms

Shared transformation utilities in etl/src/bracc_etl/transforms/:
# transforms/document_formatting.py
from bracc_etl.transforms import format_cnpj, format_cpf, strip_document

# CNPJ: 12345678000190 → 12.345.678/0001-90
formatted = format_cnpj("12345678000190")

# CPF: 12345678900 → 123.456.789-00 (masked in public mode)
formatted = format_cpf("12345678900")

# Remove formatting: 12.345.678/0001-90 → 12345678000190
stripped = strip_document("12.345.678/0001-90")
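The formatting rules in the comments above are simple enough to sketch. The bodies below are an assumption about how such helpers could be written, not the project's actual code; in particular this sketch does no CPF masking and does not verify check digits.

```python
import re


def strip_document(value: str) -> str:
    """Remove every non-digit character from a CPF or CNPJ string."""
    return re.sub(r"\D", "", value)


def format_cnpj(value: str) -> str:
    """Format 14 digits as NN.NNN.NNN/NNNN-NN."""
    digits = strip_document(value)
    if len(digits) != 14:
        raise ValueError(f"CNPJ must have 14 digits, got {len(digits)}")
    return f"{digits[:2]}.{digits[2:5]}.{digits[5:8]}/{digits[8:12]}-{digits[12:]}"


def format_cpf(value: str) -> str:
    """Format 11 digits as NNN.NNN.NNN-NN (unmasked in this sketch)."""
    digits = strip_document(value)
    if len(digits) != 11:
        raise ValueError(f"CPF must have 11 digits, got {len(digits)}")
    return f"{digits[:3]}.{digits[3:6]}.{digits[6:9]}-{digits[9:]}"


print(format_cnpj("12345678000190"))         # 12.345.678/0001-90
print(format_cpf("12345678900"))             # 123.456.789-00
print(strip_document("12.345.678/0001-90"))  # 12345678000190
```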

ETL Orchestration

From etl/src/bracc_etl/runner.py:
# Run single pipeline
python -m bracc_etl.runner --source cnpj

# Run multiple pipelines
python -m bracc_etl.runner --source cnpj,tse,transparencia

# Full orchestration (all 45+ sources)
make bootstrap-all
The bootstrap-all workflow:
  1. Loads the source contract from config/bootstrap_all_contract.yml
  2. Runs pipelines in dependency order
  3. Continues on errors and classifies outcomes
  4. Generates audit reports in audit-results/bootstrap-all/
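Steps 2 and 3 of the workflow can be sketched with the standard library's `graphlib`. This is an illustration under stated assumptions: the contract is modeled as a plain dict of source name to prerequisite names, the outcome labels (`ok`, `failed`, `skipped`) are invented here, and skipping a source whose prerequisite failed is one possible policy rather than the documented behavior.

```python
from graphlib import TopologicalSorter
from typing import Callable


def run_in_dependency_order(
    dependencies: dict[str, set[str]],
    runners: dict[str, Callable[[], None]],
) -> dict[str, str]:
    """Run each source after its prerequisites; never abort on failure."""
    outcomes: dict[str, str] = {}
    for source in TopologicalSorter(dependencies).static_order():
        blocked = [d for d in dependencies.get(source, ()) if outcomes.get(d) != "ok"]
        if blocked:
            outcomes[source] = "skipped"  # a prerequisite did not succeed
            continue
        try:
            runners[source]()
            outcomes[source] = "ok"
        except Exception:
            outcomes[source] = "failed"  # record the error and keep going
    return outcomes


def unavailable() -> None:
    raise RuntimeError("source unavailable")


# Hypothetical contract: tse depends on cnpj, pncp depends on transparencia.
contract = {"cnpj": set(), "tse": {"cnpj"}, "transparencia": set(), "pncp": {"transparencia"}}
runners = {
    "cnpj": lambda: None,
    "tse": lambda: None,
    "transparencia": unavailable,
    "pncp": lambda: None,
}
outcomes = run_in_dependency_order(contract, runners)
print(outcomes)
```

The returned mapping is the kind of per-source classification an audit report could be built from.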

3. Graph Database Layer

Neo4j 5 Community Edition stores all data as a labeled property graph.

Schema Overview

From infra/neo4j/init.cypher, the schema defines:
Person - Individuals (CPF)
CREATE CONSTRAINT person_cpf_unique IF NOT EXISTS
  FOR (p:Person) REQUIRE p.cpf IS UNIQUE;
Company - Organizations (CNPJ)
CREATE CONSTRAINT company_cnpj_unique IF NOT EXISTS
  FOR (c:Company) REQUIRE c.cnpj IS UNIQUE;
Partner - Company partners/shareholders
CREATE CONSTRAINT partner_id_unique IF NOT EXISTS
  FOR (p:Partner) REQUIRE p.partner_id IS UNIQUE;

Graph Relationships

Common relationship types:
// Ownership and control
(Person)-[:PARTNER_OF]->(Company)
(Company)-[:OWNS]->(Company)  // Holdings
(Company)-[:HOLDING_GROUP]->(Holding)

// Transactions
(Company)-[:HAS_CONTRACT]->(Contract)
(Company)-[:SUBMITTED_BID]->(Bid)
(Company)-[:RECEIVED_FINANCE]->(Finance)

// Issues
(Company)-[:HAS_SANCTION]->(Sanction)
(Company)-[:HAS_EMBARGO]->(Embargo)
(Company)-[:UNDER_INVESTIGATION]->(Inquiry)

// Source attribution
(Node)-[:SOURCED_FROM]->(SourceDocument)

Query Examples

From api/src/bracc/queries/:
// queries/public_graph_company.cypher
MATCH (c:Company {cnpj: $company_identifier})
OPTIONAL MATCH path = (c)-[*1..3]-(related)
WHERE NOT related:Person OR $allow_person = true
RETURN c, collect(distinct related) as nodes,
       collect(distinct relationships(path)) as rels

4. Backend API Layer

FastAPI (Python 3.12+) provides an async REST API with automatic OpenAPI documentation.

Project Structure

api/src/bracc/
├── main.py              # FastAPI app, middleware, lifespan
├── config.py            # Settings (Pydantic BaseSettings)
├── dependencies.py      # Neo4j driver, auth dependencies
├── routers/             # API endpoints
│   ├── public.py        # Public graph/pattern endpoints
│   ├── meta.py          # Health, stats, sources
│   ├── search.py        # Entity search
│   ├── graph.py         # Graph expansion
│   ├── entity.py        # Entity details
│   ├── patterns.py      # Pattern detection
│   ├── investigation.py # Investigation management
│   └── auth.py          # Authentication
├── services/            # Business logic
│   ├── neo4j_service.py            # Query execution
│   ├── public_guard.py             # Privacy enforcement
│   ├── intelligence_provider.py    # Pattern engine
│   └── source_registry.py          # Data source metadata
├── models/              # Pydantic models
│   ├── entity.py
│   ├── graph.py
│   ├── pattern.py
│   └── investigation.py
├── middleware/          # HTTP middleware
│   ├── security_headers.py
│   ├── cpf_masking.py
│   └── rate_limit.py
└── queries/             # Cypher query templates
    └── *.cypher

Key Endpoints

From api/src/bracc/routers/public.py and meta.py:
# public.py
@router.get("/api/v1/public/meta")
async def public_meta(session: AsyncSession) -> dict:
    """Aggregated metrics and source health"""
    return {
        "total_nodes": 40_000_000,
        "company_count": 60_000_000,
        "contract_count": 5_000_000,
        "source_health": {...}
    }

@router.get("/api/v1/public/graph/company/{company_ref}")
async def public_graph_for_company(
    company_ref: str,
    depth: int = 2
) -> GraphResponse:
    """Public company subgraph (CNPJ or ID)"""
    # Enforces public-safe defaults
    # Filters out Person nodes in public mode
    # Returns nodes, edges, and center_id

Privacy & Security

From api/src/bracc/main.py:
# Middleware stack (runs in reverse order of registration: the last one added is outermost)
app.add_middleware(CPFMaskingMiddleware)        # Mask CPF in responses
app.add_middleware(SecurityHeadersMiddleware)   # CSP, HSTS, etc.
app.add_middleware(SlowAPIMiddleware)           # Rate limiting
app.add_middleware(CORSMiddleware)              # CORS headers

# Public-safe defaults enforced in services/public_guard.py
def enforce_person_access_policy(labels: list[str]) -> None:
    """Block Person nodes in public mode"""
    if settings.public_mode and has_person_labels(labels):
        raise HTTPException(403, "Person data not accessible in public mode")

def sanitize_public_properties(props: dict) -> dict:
    """Remove CPF and other sensitive fields"""
    return {k: v for k, v in props.items() if k not in SENSITIVE_KEYS}
Environment flags:
PUBLIC_MODE=true                    # Enable public-safe mode
PUBLIC_ALLOW_PERSON=false           # Block Person node access
PUBLIC_ALLOW_ENTITY_LOOKUP=false    # Block direct entity lookup
PUBLIC_ALLOW_INVESTIGATIONS=false   # Block investigation features
PATTERNS_ENABLED=false              # Disable pattern engine
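As an illustration of what response-level CPF masking could look like, here is a minimal sketch. The regular expression, the `mask_cpf` name, and the choice to replace every digit with `*` are assumptions; the actual CPFMaskingMiddleware may mask differently. Note that a bare 11-digit match can also hit unrelated numbers, so a real implementation would likely need more context.

```python
import re

# A formatted CPF (123.456.789-00) or exactly 11 bare digits that are
# not part of a longer digit run (so a 14-digit CNPJ is left alone).
CPF_PATTERN = re.compile(r"(?<!\d)(\d{3}\.\d{3}\.\d{3}-\d{2}|\d{11})(?!\d)")


def mask_cpf(text: str) -> str:
    """Replace each digit of every CPF-looking token with '*'."""
    return CPF_PATTERN.sub(lambda m: re.sub(r"\d", "*", m.group()), text)


body = '{"cpf": "123.456.789-00", "cnpj": "12345678000190"}'
print(mask_cpf(body))
```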

5. Frontend Layer

React 19 + TypeScript + Vite provides the web interface.

Tech Stack

From frontend/package.json:
{
  "dependencies": {
    "react": "^19.0.0",
    "react-dom": "^19.0.0",
    "react-router": "^7.0.0",
    "react-force-graph-2d": "^1.26.0",  // Graph visualization
    "react-hook-form": "^7.71.2",        // Form management
    "zustand": "^5.0.0",                  // State management
    "i18next": "^24.0.0",                 // Internationalization
    "lucide-react": "^0.575.0",          // Icons
    "zod": "^4.3.6"                       // Schema validation
  },
  "devDependencies": {
    "vite": "^6.0.0",
    "typescript": "^5.7.0",
    "vitest": "^3.0.0",
    "@testing-library/react": "^16.0.0"
  }
}

Key Features

  • Company Search - Search by CNPJ, name, or ID
  • Graph Visualization - Interactive force-directed graph with react-force-graph-2d
  • Entity Details - Comprehensive entity information panels
  • Investigation Workspace - Create and manage investigation collections
  • Pattern Highlighting - Visual indicators for detected patterns
  • Multi-language - Portuguese (pt-BR) and English (en) via i18next

Docker Compose Architecture

From docker-compose.yml:
services:
  neo4j:
    image: neo4j:5-community
    ports:
      - "7474:7474"  # Browser
      - "7687:7687"  # Bolt
    environment:
      NEO4J_AUTH: neo4j/${NEO4J_PASSWORD}
      NEO4J_PLUGINS: '["apoc"]'
    volumes:
      - neo4j-data:/data
      - ./infra/neo4j/init.cypher:/var/lib/neo4j/init.cypher
    healthcheck:
      test: cypher-shell -u neo4j -p $NEO4J_PASSWORD "RETURN 1"
      interval: 10s
      retries: 8

  api:
    build: ./api
    ports:
      - "8000:8000"
    environment:
      NEO4J_URI: bolt://neo4j:7687
      NEO4J_PASSWORD: ${NEO4J_PASSWORD}
    depends_on:
      neo4j:
        condition: service_healthy

  frontend:
    build: ./frontend
    ports:
      - "3000:3000"
    environment:
      VITE_API_URL: http://localhost:8000
    depends_on:
      api:
        condition: service_healthy

  etl:
    build: ./etl
    profiles: ["etl"]  # Only starts with --profile etl
    volumes:
      - .:/workspace
    depends_on:
      neo4j:
        condition: service_healthy

Data Flow Example

Let’s trace a complete request flow:
1. User requests company graph

Frontend calls: GET /api/v1/public/graph/company/12345678000190?depth=2
2. API routes request

FastAPI router in routers/public.py receives request:
@router.get("/graph/company/{company_ref}")
async def public_graph_for_company(
    company_ref: str,
    depth: int = 2
) -> GraphResponse:
3. Resolve company identifier

Helper function _resolve_company() queries Neo4j:
MATCH (c:Company)
WHERE c.cnpj = $company_identifier
   OR elementId(c) = $company_id
RETURN elementId(c) as entity_id, labels(c) as labels
4. Execute graph query

Load query from queries/public_graph_company.cypher:
MATCH (c:Company) WHERE elementId(c) = $company_id
CALL apoc.path.subgraphAll(c, {
    maxLevel: $depth,
    relationshipFilter: ">"
})
YIELD nodes, relationships
RETURN nodes, relationships, $company_id as center_id
5. Enforce privacy policies

Filter results through public_guard.py:
  • Remove Person nodes if PUBLIC_MODE=true
  • Strip CPF fields from all nodes
  • Filter sensitive properties
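The three filters can be sketched on plain dictionaries. This is an assumption-laden illustration, not the contents of public_guard.py: the node and edge shapes, the `SENSITIVE_KEYS` value, and the decision to drop edges that touch a removed node are all chosen here for the example.

```python
SENSITIVE_KEYS = {"cpf"}  # hypothetical; the real set is not shown in this page


def filter_public_subgraph(
    nodes: list[dict], edges: list[dict], public_mode: bool = True
) -> tuple[list[dict], list[dict]]:
    """Drop Person nodes in public mode and strip sensitive properties."""
    kept = []
    for node in nodes:
        if public_mode and "Person" in node["labels"]:
            continue
        props = {k: v for k, v in node["properties"].items() if k not in SENSITIVE_KEYS}
        kept.append({**node, "properties": props})
    kept_ids = {node["id"] for node in kept}
    # An edge is only meaningful if both of its endpoints survived.
    kept_edges = [e for e in edges if e["source"] in kept_ids and e["target"] in kept_ids]
    return kept, kept_edges


nodes = [
    {"id": "c1", "labels": ["Company"], "properties": {"razao_social": "Acme Corp"}},
    {"id": "p1", "labels": ["Person"], "properties": {"name": "Jane", "cpf": "123.456.789-00"}},
    {"id": "s1", "labels": ["Sanction"], "properties": {"reason": "fraud", "cpf": "123.456.789-00"}},
]
edges = [
    {"id": "e1", "source": "p1", "target": "c1"},
    {"id": "e2", "source": "c1", "target": "s1"},
]
public_nodes, public_edges = filter_public_subgraph(nodes, edges)
print([n["id"] for n in public_nodes], [e["id"] for e in public_edges])
```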
6. Transform to response model

Convert Neo4j results to Pydantic models:
nodes = [GraphNode(
    id=node.element_id,
    label=node['razao_social'],
    type=next(iter(node.labels)),  # node.labels is a frozenset, so it cannot be indexed
    document_id=node['cnpj'],
    properties=sanitize_properties(node),
    sources=[SourceAttribution(database="CNPJ")]
) for node in raw_nodes]
7. Return JSON response

FastAPI serializes to JSON:
{
  "nodes": [{"id": "...", "label": "Acme Corp", ...}],
  "edges": [{"id": "...", "source": "...", "target": "..."}],
  "center_id": "4:abc123:0"
}
8. Frontend renders graph

React component uses react-force-graph-2d to visualize:
<ForceGraph2D
  graphData={{nodes, links: edges}}
  nodeLabel={node => node.label}
  nodeColor={node => getColorByType(node.type)}
  onNodeClick={handleNodeClick}
/>

Performance Considerations

Neo4j Optimization

For production scale (40M+ nodes):
NEO4J_HEAP_INITIAL=4G
NEO4J_HEAP_MAX=8G
NEO4J_PAGECACHE=12G
These settings require 32GB+ RAM (64GB recommended).
Indexing and query guidelines:
  • All unique constraints create indexes automatically
  • Use elementId() for fast node lookup
  • Limit depth in graph traversals (max 3 recommended)
  • Use APOC procedures for complex graph operations
ETL uses Neo4jBatchLoader with:
  • Batch size: 1000-5000 nodes
  • UNWIND for bulk inserts
  • Periodic commits for large datasets
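The batching approach can be sketched as follows. Neo4jBatchLoader's real implementation is not shown in this page, so the Cypher statement, the `batched` and `load_companies` helpers, and the `RecordingSession` stand-in are assumptions; the only driver call relied on is `session.run(query, **parameters)`.

```python
from collections.abc import Iterable, Iterator
from itertools import islice

# One statement per batch: UNWIND expands the list parameter into rows.
MERGE_COMPANIES = """
UNWIND $rows AS row
MERGE (c:Company {cnpj: row.cnpj})
SET c.razao_social = row.razao_social
"""


def batched(rows: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Yield consecutive lists of at most `size` rows."""
    iterator = iter(rows)
    while batch := list(islice(iterator, size)):
        yield batch


def load_companies(session, rows: Iterable[dict], batch_size: int = 1000) -> int:
    """Send one UNWIND statement per batch; return the number of batches."""
    count = 0
    for batch in batched(rows, batch_size):
        session.run(MERGE_COMPANIES, rows=batch)  # each call commits on its own
        count += 1
    return count


class RecordingSession:
    """Stand-in for a Neo4j session so the sketch runs without a database."""

    def __init__(self) -> None:
        self.batch_sizes: list[int] = []

    def run(self, query: str, **params) -> None:
        self.batch_sizes.append(len(params["rows"]))


session = RecordingSession()
rows = ({"cnpj": f"{i:014d}", "razao_social": f"Company {i}"} for i in range(2500))
batches = load_companies(session, rows, batch_size=1000)
print(batches, session.batch_sizes)
```

Because `rows` is consumed lazily, the full dataset never has to sit in memory at once; only one batch does.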

API Performance

  • Stats endpoint cached for 5 minutes
  • Neo4j connection pooling (default: 50 connections)
  • Response compression via FastAPI middleware
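# Rate limiting with slowapi, keyed by client IP
from fastapi import Request
from slowapi import Limiter
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)

@app.get("/api/v1/public/graph/company/{ref}")
@limiter.limit("100/minute")
async def get_company_graph(request: Request, ref: str):
    ...  # slowapi requires the explicit `request` argument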
  • FastAPI with async/await throughout
  • Neo4j async driver (neo4j.AsyncDriver)
  • Concurrent query execution where possible
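The five-minute cache on the stats endpoint could be as simple as a single-value TTL cache. The sketch below is an assumption about one way to do it; the `TTLCache` class and the injectable clock are invented for illustration and are not taken from the codebase.

```python
import time
from typing import Any, Callable


class TTLCache:
    """Cache one computed value and recompute it after `ttl_seconds`."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Any = None
        self._expires_at = float("-inf")  # forces a compute on first use

    def get(self, compute: Callable[[], Any]) -> Any:
        now = self._clock()
        if now >= self._expires_at:
            self._value = compute()
            self._expires_at = now + self._ttl
        return self._value


# Demo with a fake clock so expiry can be shown without sleeping.
fake_now = [0.0]
query_count = [0]


def compute_stats() -> dict:
    query_count[0] += 1  # stands in for an expensive Neo4j aggregation
    return {"refresh": query_count[0]}


cache = TTLCache(ttl_seconds=300, clock=lambda: fake_now[0])
first = cache.get(compute_stats)
fake_now[0] = 299.0
second = cache.get(compute_stats)  # still inside the 5-minute window
fake_now[0] = 300.0
third = cache.get(compute_stats)   # window elapsed, value recomputed
print(first, second, third)
```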

Security Architecture

Defense in Depth

  1. Environment-based Configuration
    • Public mode disables sensitive endpoints
    • Tier system (community vs enterprise)
    • Feature flags for experimental features
  2. Middleware Stack
    • Security headers (CSP, HSTS, X-Frame-Options)
    • CPF masking in all responses
    • Rate limiting per IP
    • CORS with explicit origin whitelist
  3. Data Access Control
    • Public-safe defaults block Person nodes
    • Property-level sanitization
    • Query-level enforcement (not relying on frontend)
  4. Authentication (when enabled)
    • JWT tokens with secure secret
    • Password hashing with bcrypt
    • Invite-code system for registration
  5. Audit Trail
    • All ingestion runs logged to graph
    • Source attribution on every node
    • Temporal violation tracking

Deployment Architectures

docker compose up -d
Single-machine setup:
  • Neo4j, API, Frontend in containers
  • Seed data for testing
  • Hot reload for development

Monitoring & Observability

Neo4j Metrics

  • Query performance via dbms.querylog
  • Memory usage
  • Transaction throughput
  • Cache hit rates

API Metrics

  • Request rate and latency
  • Error rates by endpoint
  • Rate limit hits
  • Health check status

ETL Metrics

  • Pipeline success/failure rates
  • Data quality metrics
  • Source availability
  • Ingestion duration

System Metrics

  • CPU and memory usage
  • Disk I/O
  • Network throughput
  • Docker container health

