Rosie reads reimbursements filed under Brazil’s Quota for Exercising Parliamentary Activity (CEAP) and applies a pipeline of classifiers to detect irregularities. When a reimbursement looks suspicious, she records why and posts about it on Twitter as @RosieDaSerenata.

How Rosie Works

1. Fetch datasets

Rosie downloads reimbursement CSVs and the companies dataset from public government sources via the serenata-toolbox. The data goes back to 2009.

2. Merge and normalize

The Adapter class left-joins reimbursements with company registration data on CNPJ, renames columns to the Serenata standard, coerces dates, and normalizes category labels.

3. Run classifiers

The Core engine iterates over every classifier defined in the module’s settings.py. Each classifier receives the full dataset and returns one prediction per row: True or -1 for suspicious, False or 1 for normal.

4. Write suspicions

Results are collected into a suspicions DataFrame keyed by the module’s unique identifiers and written to a compressed CSV at /tmp/serenata-data/suspicions.xz.
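The merge, classify, and collect steps above can be sketched with plain Python structures. This is a minimal illustration, not Rosie's actual code: Rosie works on pandas DataFrames, and the classifier classes, field names (net_value, situation), and sample values below are all hypothetical. The point it shows is how a left join keeps unmatched reimbursements and how the two prediction styles (True/False and -1/1) can be folded into one boolean column per classifier.

```python
# Toy stand-ins for the real datasets; every value here is made up.
reimbursements = [
    {"document_id": 1, "cnpj": "11222333000181", "net_value": 40.0},
    {"document_id": 2, "cnpj": "00000000000000", "net_value": 950.0},
    {"document_id": 3, "cnpj": None, "net_value": 15.0},
]
companies = [{"cnpj": "11222333000181", "situation": "ACTIVE"}]


def left_join_on_cnpj(reimbursements, companies):
    """Attach company fields to each reimbursement; unmatched rows are kept."""
    by_cnpj = {company["cnpj"]: company for company in companies}
    merged = []
    for row in reimbursements:
        company = by_cnpj.get(row["cnpj"], {})
        merged.append({**row, "situation": company.get("situation")})
    return merged


class HighValueClassifier:
    """Hypothetical classifier that answers with True / False."""

    def predict(self, rows):
        return [row["net_value"] > 500 for row in rows]


class UnknownCompanyClassifier:
    """Hypothetical classifier that answers with -1 (suspicious) / 1 (normal)."""

    def predict(self, rows):
        return [
            -1 if row["cnpj"] and row["situation"] is None else 1
            for row in rows
        ]


def is_suspicious(prediction):
    """Fold both prediction styles into a single boolean."""
    # bool must be checked first: in Python, True == 1, which would
    # otherwise be read as "normal" under the -1 / 1 convention.
    if isinstance(prediction, bool):
        return prediction
    return prediction == -1


def run(rows, classifiers, unique_ids):
    """Build one suspicion record per row, with one column per classifier."""
    suspicions = [{key: row[key] for key in unique_ids} for row in rows]
    for name, classifier in classifiers.items():
        predictions = classifier.predict(rows)
        for suspicion, prediction in zip(suspicions, predictions):
            suspicion[name] = is_suspicious(prediction)
    return suspicions


# Hypothetical registry, playing the role of the list kept in settings.py.
CLASSIFIERS = {
    "high_value": HighValueClassifier(),
    "unknown_company": UnknownCompanyClassifier(),
}

merged = left_join_on_cnpj(reimbursements, companies)
suspicions = run(merged, CLASSIFIERS, unique_ids=["document_id"])
```

In this sketch only document 2 is flagged, and it is flagged by both classifiers. Document 3 has no CNPJ, so the join leaves its company fields empty without dropping the row.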

Modules

Rosie covers two legislative bodies, each with its own adapter and classifier settings:

Chamber of Deputies

The primary module. Uses six classifiers covering meal prices, travel speeds, election expenses, irregular companies, monthly subquota limits, and invalid tax IDs.

Unique IDs: applicant_id, year, document_id

Federal Senate

A lighter module that currently runs only the InvalidCnpjCpfClassifier against Senate reimbursements. Document types are normalized to unknown because the Senate data does not include a document type column.

Unique IDs: none (the full dataset is kept)
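To make the invalid tax ID check concrete, here is a minimal sketch of the standard check-digit rules for numeric CPF (11 digits) and CNPJ (14 digits). It is not the code of InvalidCnpjCpfClassifier, which relies on brutils for validation. The function names are hypothetical, and rejecting numbers made of one repeated digit is a common convention assumed here rather than something the source states.

```python
import re


def _check_digit(digits, weights):
    """Modulo-11 check digit shared by the CPF and CNPJ schemes."""
    remainder = sum(d * w for d, w in zip(digits, weights)) % 11
    return 0 if remainder < 2 else 11 - remainder


def _only_digits(value):
    """Drop punctuation such as '.', '/' and '-' and return a list of ints."""
    return [int(char) for char in re.sub(r"\D", "", str(value))]


def is_valid_cpf(value):
    digits = _only_digits(value)
    # Assumption: sequences like 111.111.111-11 are rejected by convention.
    if len(digits) != 11 or len(set(digits)) == 1:
        return False
    first = _check_digit(digits[:9], range(10, 1, -1))    # weights 10..2
    second = _check_digit(digits[:10], range(11, 1, -1))  # weights 11..2
    return digits[9:] == [first, second]


def is_valid_cnpj(value):
    digits = _only_digits(value)
    if len(digits) != 14 or len(set(digits)) == 1:
        return False
    weights = [6, 5, 4, 3, 2, 9, 8, 7, 6, 5, 4, 3, 2]
    first = _check_digit(digits[:12], weights[1:])
    second = _check_digit(digits[:13], weights)
    return digits[12:] == [first, second]


def is_valid_cnpj_or_cpf(value):
    """A reimbursement's tax ID may be either kind, so accept both."""
    return is_valid_cpf(value) or is_valid_cnpj(value)
```

For example, is_valid_cnpj("11.222.333/0001-81") and is_valid_cpf("529.982.247-25") both hold, while changing the last digit of either makes the check fail.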

Output

After a run finishes, Rosie writes a single compressed file:
/tmp/serenata-data/suspicions.xz
This is a UTF-8 CSV compressed with xz. Each row represents one reimbursement and each classifier column contains True (suspicious) or False (normal).
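A file in this format can be read with the Python standard library alone. The sketch below writes a small sample to a temporary directory and reads back the rows that at least one classifier flagged. The classifier column names and sample rows are hypothetical, since the real columns depend on the module's settings.py, and the temporary path stands in for /tmp/serenata-data/suspicions.xz.

```python
import csv
import lzma
import os
import tempfile

# Hypothetical classifier column names, used only for this illustration.
CLASSIFIER_COLUMNS = ["meal_price_outlier", "invalid_cnpj_cpf"]
FIELDNAMES = ["applicant_id", "year", "document_id"] + CLASSIFIER_COLUMNS


def write_sample(path):
    """Write a tiny xz-compressed UTF-8 CSV shaped like the output file."""
    rows = [
        [1, 2016, 10, False, False],
        [1, 2016, 11, True, False],
        [2, 2017, 12, False, True],
    ]
    with lzma.open(path, "wt", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(FIELDNAMES)
        writer.writerows(rows)  # booleans are written as "True" / "False"


def read_flagged(path, classifier_columns):
    """Return the rows where any classifier column holds the text 'True'."""
    with lzma.open(path, "rt", encoding="utf-8", newline="") as handle:
        return [
            row
            for row in csv.DictReader(handle)
            if any(row[column] == "True" for column in classifier_columns)
        ]


def demo():
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "suspicions.xz")
        write_sample(path)
        flagged = read_flagged(path, CLASSIFIER_COLUMNS)
        return [row["document_id"] for row in flagged]
```

The csv module returns every field as a string, so the comparison is against the text "True" rather than the boolean. With pandas, pd.read_csv on the same file would infer the compression from the .xz extension and parse those columns as booleans.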

Key Dependencies

Package           Role
scikit-learn      Classifier base classes (TransformerMixin) and KMeans clustering
pandas            DataFrame manipulation and CSV I/O
numpy             Numerical operations and array helpers
geopy             Geodesic distance calculation (Vincenty formula) for the travel speed classifier
brutils           Brazilian CPF and CNPJ validation
serenata-toolbox  Fetches reimbursement and company datasets from government sources
docopt            CLI argument parsing for rosie.py
scipy             Scientific computing (must be installed before scikit-learn)
scipy must appear before scikit-learn in requirements.txt so the wheel builds correctly inside Docker.
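As a rough illustration of the distance calculation that geopy supplies to the travel speed classifier, the sketch below uses the haversine formula. This is a deliberate substitute: haversine treats the Earth as a sphere, whereas the Vincenty formula named above works on an ellipsoid, so results differ slightly. The function name and the sample coordinates are assumptions made for this example, and nothing here reflects how the classifier itself turns distances into a verdict.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; a spherical approximation


def haversine_km(origin, destination):
    """Great-circle distance in km between two (latitude, longitude) pairs."""
    lat1, lon1 = (math.radians(value) for value in origin)
    lat2, lon2 = (math.radians(value) for value in destination)
    half_dlat = (lat2 - lat1) / 2
    half_dlon = (lon2 - lon1) / 2
    h = (
        math.sin(half_dlat) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin(half_dlon) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


# Approximate city-centre coordinates, used only as sample input.
BRASILIA = (-15.7939, -47.8828)
SAO_PAULO = (-23.5505, -46.6333)

distance_km = haversine_km(BRASILIA, SAO_PAULO)
```

For the two sample points the result is on the order of 870 km. Dividing such a distance by the time between two expenses gives an implied travel speed, which is the kind of quantity a speed-based check could compare against a plausible maximum.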
