suspicions.xz.
System overview
Rosie subsystem
Rosie is a pure Python application with no persistent service. It is run on demand and exits after writing suspicions.xz.
Module structure
Core pipeline class
rosie/rosie/core/__init__.py implements the Core class, which is the generic analysis engine:
Fitted classifiers are cached as .pkl files via joblib so subsequent runs skip the fit step. MonthlySubquotaLimitClassifier is exempt from caching because its serialized size exceeds joblib’s practical limits.
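The caching behaviour described above follows a load-or-fit pattern. The sketch below is illustrative only and is not Rosie's actual code: it uses the standard library's pickle as a stand-in for joblib so that it runs without extra dependencies. MeanThresholdClassifier, load_or_fit, and the file naming scheme are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

# Classifiers that are always refit instead of cached (per the text above).
UNCACHED = {"MonthlySubquotaLimitClassifier"}


class MeanThresholdClassifier:
    """Hypothetical stand-in for a Rosie classifier."""

    def fit(self, values):
        # Flag anything above twice the mean of the training values.
        self.threshold_ = 2 * sum(values) / len(values)
        return self

    def predict(self, values):
        return [value > self.threshold_ for value in values]


def load_or_fit(classifier_cls, values, cache_dir):
    """Return (fitted classifier, status), reusing a cached .pkl when allowed."""
    name = classifier_cls.__name__
    path = Path(cache_dir) / f"{name.lower()}.pkl"  # assumed naming scheme
    cacheable = name not in UNCACHED

    if cacheable and path.exists():
        with path.open("rb") as handle:
            return pickle.load(handle), "loaded"

    model = classifier_cls().fit(values)
    if cacheable:
        with path.open("wb") as handle:
            pickle.dump(model, handle)
    return model, "fitted"


with tempfile.TemporaryDirectory() as cache_dir:
    prices = [20.0, 25.0, 30.0, 400.0]
    _, first = load_or_fit(MeanThresholdClassifier, prices, cache_dir)
    model, second = load_or_fit(MeanThresholdClassifier, prices, cache_dir)
    flags = model.predict(prices)

print(first, second, flags)
```

The first call fits and writes the cache file; the second call finds the file and skips the fit step. A class whose name appears in UNCACHED would be refit on every call.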
Adapter
The Adapter (rosie/rosie/chamber_of_deputies/adapter.py) is responsible for:
- Calling serenata-toolbox to download companies and per-year reimbursement CSVs
- Merging the two datasets on cnpj_cpf/cnpj
- Normalizing column names, document types, and date fields
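The merge and normalization steps listed above can be sketched in plain Python. This is a simplified illustration, not the Adapter's implementation: it assumes a left join (reimbursements without a matching company are kept), and the issue_date and name columns are hypothetical. Only the cnpj_cpf and cnpj join keys come from the text.

```python
import re
from datetime import datetime


def only_digits(document):
    """Strip punctuation so '12.345.678/0001-95' matches '12345678000195'."""
    return re.sub(r"\D", "", document or "")


def document_type(document):
    """Classify a normalized document by length: CNPJ has 14 digits, CPF 11."""
    return {14: "cnpj", 11: "cpf"}.get(len(document), "unknown")


def merge_reimbursements(reimbursements, companies):
    """Join reimbursements to companies on cnpj_cpf == cnpj (assumed left join)."""
    by_cnpj = {only_digits(company["cnpj"]): company for company in companies}
    merged = []
    for row in reimbursements:
        document = only_digits(row["cnpj_cpf"])
        company = by_cnpj.get(document, {})
        merged.append({
            "cnpj_cpf": document,
            "document_type": document_type(document),
            # Hypothetical date column, parsed from an ISO string.
            "issue_date": datetime.strptime(row["issue_date"], "%Y-%m-%d").date(),
            "company_name": company.get("name"),
        })
    return merged


reimbursements = [
    {"cnpj_cpf": "12.345.678/0001-95", "issue_date": "2016-03-01"},
    {"cnpj_cpf": "52998224725", "issue_date": "2016-03-02"},
]
companies = [{"cnpj": "12345678000195", "name": "Example Restaurant Ltda"}]

rows = merge_reimbursements(reimbursements, companies)
```

Normalizing both keys to digits before joining avoids misses caused by punctuation differences between the two datasets.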
Classifiers
Each classifier lives in rosie/rosie/chamber_of_deputies/classifiers/ and inherits from scikit-learn’s TransformerMixin. The full set configured in settings.py:
| Key in suspicions.xz | Class |
|---|---|
| meal_price_outlier | MealPriceOutlierClassifier |
| over_monthly_subquota_limit | MonthlySubquotaLimitClassifier |
| suspicious_traveled_speed_day | TraveledSpeedsClassifier |
| invalid_cnpj_cpf | InvalidCnpjCpfClassifier |
| election_expenses | ElectionExpensesClassifier |
| irregular_companies_classifier | IrregularCompaniesClassifier |
InvalidCnpjCpfClassifier lives in rosie/rosie/core/classifiers/ because it is shared between the Chamber of Deputies and Federal Senate modules.
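The key-to-class mapping in the table suggests a simple loop: each classifier is fitted and its predictions become one column keyed by the suspicion name. The sketch below shows that idea under stated assumptions. Both classifiers, the net_value column, and run_classifiers are hypothetical, and the classes define fit/transform/predict directly instead of inheriting from scikit-learn's TransformerMixin so the example stays dependency-free.

```python
class AboveLimitClassifier:
    """Hypothetical classifier: flags rows whose net_value exceeds a limit."""

    def __init__(self, limit):
        self.limit = limit

    def fit(self, rows):
        return self  # nothing to learn for a fixed limit

    def transform(self, rows):
        return rows

    def predict(self, rows):
        return [row["net_value"] > self.limit for row in rows]


class MissingDocumentClassifier:
    """Hypothetical classifier: flags rows with an empty cnpj_cpf."""

    def fit(self, rows):
        return self

    def transform(self, rows):
        return rows

    def predict(self, rows):
        return [not row["cnpj_cpf"] for row in rows]


# Same shape as the table above: suspicion key -> classifier.
CLASSIFIERS = {
    "above_limit": AboveLimitClassifier(limit=100.0),
    "missing_document": MissingDocumentClassifier(),
}


def run_classifiers(rows, classifiers):
    """Return one dict per row holding one boolean per suspicion key."""
    suspicions = [{} for _ in rows]
    for key, classifier in classifiers.items():
        classifier.fit(rows)
        flags = classifier.predict(classifier.transform(rows))
        for suspicion, flag in zip(suspicions, flags):
            suspicion[key] = flag
    return suspicions


rows = [
    {"net_value": 50.0, "cnpj_cpf": "12345678000195"},
    {"net_value": 250.0, "cnpj_cpf": ""},
]
result = run_classifiers(rows, CLASSIFIERS)
```

Keeping the registry as data means a classifier can be added or removed by editing one mapping, without touching the loop.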
External dependency: serenata-toolbox
Rosie depends on serenata-toolbox (a separate pip package at github.com/okfn-brasil/serenata-toolbox) for all I/O against government data sources. Rosie itself contains no direct HTTP calls to government APIs.
Jarbas subsystem
Jarbas is a long-running web application. It serves a REST API and an Elm-compiled frontend dashboard, backed by PostgreSQL and Celery/RabbitMQ for background task processing.
Components
Django + DRF
The web layer. Django REST Framework handles all API serialization and filtering. Gunicorn serves the WSGI application. URL routing is defined in jarbas/urls.py.
Elm frontend
The dashboard UI is compiled from Elm source. Static assets are generated by Node.js (npm run assets) and collected via Django’s collectstatic.
Celery workers
Background task workers (tasks service) process long-running import jobs dispatched by management commands. A beat service handles scheduled tasks such as the search vector refresh.
PostgreSQL
The primary datastore. Jarbas requires PostgreSQL specifically because it uses Django’s JSONField and full-text search via SearchVector.
URL structure
URL routes are defined in jarbas/urls.py. The /api/chamber_of_deputies/ prefix hosts the reimbursement, subquota, applicant, and company endpoints documented in the API reference.
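As a rough illustration of how a client might address endpoints under that prefix, the helper below builds request URLs with the standard library. Only the prefix comes from the text. The host, the exact resource path segment, and the year filter are assumptions, so check the API reference for the real paths and parameters.

```python
from urllib.parse import urlencode, urljoin

API_PREFIX = "/api/chamber_of_deputies/"  # prefix named in the text above


def endpoint_url(base_url, resource, **filters):
    """Build a URL under the Chamber of Deputies prefix.

    The resource segment and filter names are illustrative assumptions.
    """
    url = urljoin(base_url, urljoin(API_PREFIX, f"{resource}/"))
    if filters:
        # Sort for a stable query string regardless of keyword order.
        url = f"{url}?{urlencode(sorted(filters.items()))}"
    return url


# Hypothetical local development host and filter.
url = endpoint_url("http://localhost:8000", "reimbursement", year=2016)
print(url)
```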
Data loading via management commands
Jarbas does not pull data directly from Rosie. Data is loaded manually using Django management commands.
Docker Compose topology
The docker-compose.yml defines seven services:
- Application services
- Infrastructure services
- Analysis service
| Service | Image | Role |
|---|---|---|
| django | serenata/django | Gunicorn WSGI server for the API and dashboard |
| tasks | serenata/django | Celery worker: processes import task queues |
| beat | serenata/django | Celery beat: runs scheduled jobs (e.g., search vector refresh) |
| elm | serenata/elm | Compiles and serves Elm frontend assets |
Dependency graph
rosie has no depends_on entries — it runs independently and shares a data volume (/tmp/serenata-data) with the host.
Running Rosie via Docker
The --rm flag removes the container after it exits. The volume mount makes suspicions.xz available on the host for subsequent import into Jarbas.
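Once the file is on the host, it can be inspected before import. The sketch below assumes that suspicions.xz is an xz-compressed CSV with one boolean column per suspicion key, serialized as the strings True and False, plus a document_id column. None of that is confirmed by this page, so the example writes its own small sample file rather than reading a real one.

```python
import csv
import lzma
import tempfile
from pathlib import Path


def count_suspicions(path, id_column="document_id"):
    """Count flagged rows per suspicion column in an xz-compressed CSV."""
    counts = {}
    with lzma.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        for row in csv.DictReader(handle):
            for column, value in row.items():
                if column == id_column:
                    continue
                counts[column] = counts.get(column, 0) + (value == "True")
    return counts


with tempfile.TemporaryDirectory() as data_dir:
    # Build a tiny sample file with the assumed layout.
    sample = Path(data_dir) / "suspicions.xz"
    with lzma.open(sample, mode="wt", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["document_id", "meal_price_outlier", "invalid_cnpj_cpf"])
        writer.writerow(["1", "True", "False"])
        writer.writerow(["2", "False", "False"])
        writer.writerow(["3", "True", "True"])

    counts = count_suspicions(sample)

print(counts)
```

For a real run, the path would point at the mounted data directory instead of a temporary one.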
Data flow summary
PostgreSQL is the only mandatory database for Jarbas. The JSONField usage and SearchVector full-text search are PostgreSQL-specific features, so other databases are not supported.