The `research/src/` directory contains Python scripts that collect and prepare the open data used by Rosie and Jarbas. Each script is focused on a single data source or transformation step.
Dependencies
Scripts share a common set of libraries declared in `research/requirements.txt`:
The `serenata-toolbox` package is a pip-installable library that handles dataset versioning and remote storage. It is used across multiple scripts for uploading and retrieving datasets.
Reimbursement data
fetch_receipts.py
Downloads PDF receipt images from the Lower House server for every reimbursement in the local datasets.
- Scans all `.xz` dataset files in `research/data/` for receipt URLs.
- Downloads each PDF to `<target_directory>/<applicant_id>/<year>/<document_id>.pdf`.
- Skips receipts that have already been downloaded.
- Uses a multiprocessing pool (4 workers by default) for parallel downloads.
- Prints a progress report with total size, skipped files, and download errors.
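The path layout and skip-then-download flow above can be sketched as follows. This is an illustrative sketch, not the script's actual code: `receipt_path`, `download_receipt`, and `download_all` are hypothetical names, and the receipt URL list is assumed to come from the datasets.

```python
import os
from multiprocessing import Pool
from urllib.request import urlretrieve

def receipt_path(target_directory, applicant_id, year, document_id):
    """Build <target_directory>/<applicant_id>/<year>/<document_id>.pdf."""
    return os.path.join(str(target_directory), str(applicant_id),
                        str(year), "{}.pdf".format(document_id))

def download_receipt(task):
    """Download one receipt, skipping files that already exist."""
    url, path = task
    if os.path.exists(path):
        return "skipped"
    os.makedirs(os.path.dirname(path), exist_ok=True)
    urlretrieve(url, path)
    return "downloaded"

def download_all(tasks, workers=4):
    """Run downloads in parallel (4 workers by default, as in the script)."""
    with Pool(workers) as pool:
        return pool.map(download_receipt, tasks)
```

Because already-downloaded files are skipped, the script can be re-run safely after an interruption.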
group_receipts.py
Merges the separate year-based reimbursement datasets (current-year, last-year, previous-years) into a single consolidated `reimbursements.xz` file.
The output file is date-stamped (e.g., `2024-01-15-reimbursements.xz`) and stored in `research/data/`.
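A minimal sketch of this merge using only the standard library, assuming the year-based datasets are xz-compressed CSV files with identical headers (the function names are illustrative, not the script's):

```python
import csv
import lzma
from datetime import date

def stamped_filename(today=None):
    """Build the date-stamped output name, e.g. 2024-01-15-reimbursements.xz."""
    today = today or date.today()
    return "{}-reimbursements.xz".format(today.strftime("%Y-%m-%d"))

def merge_datasets(sources, target):
    """Concatenate several .xz CSV datasets, writing the header only once."""
    header_written = False
    with lzma.open(target, "wt", newline="") as out:
        writer = csv.writer(out)
        for source in sources:
            with lzma.open(source, "rt", newline="") as handle:
                reader = csv.reader(handle)
                header = next(reader)
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                for row in reader:
                    writer.writerow(row)
```

The real script can of course rely on pandas; the point is only that each yearly file contributes its rows once, under a single shared header.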
Company and supplier data
fetch_cnpj_info.py
Fetches company registration details from the Brazilian federal revenue service (Receita Federal) for every CNPJ that appears in the reimbursement datasets.
- Accepts one or more source dataset files as positional arguments.
- `-t`/`--threads`: number of concurrent fetch threads (default: 10).
- `-p`/`--proxies`: optional list of HTTP proxy addresses to distribute requests.
- Saves progress incrementally to `data/companies-partial.xz` every 100 records to avoid data loss on interruption.
- Translates Brazilian Portuguese column names to English and decomposes nested fields (main activity, secondary activities, partners list).
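The incremental-save pattern described above can be sketched as follows. `fetch` and `save_partial` are hypothetical callables standing in for the HTTP request to Receita Federal and the `.xz` serialization; only the checkpoint cadence mirrors the script.

```python
def fetch_with_checkpoints(cnpjs, fetch, save_partial, checkpoint_every=100):
    """Fetch company info for each CNPJ, flushing partial results
    every `checkpoint_every` records so an interrupted run loses
    at most one batch.
    """
    results = []
    for count, cnpj in enumerate(cnpjs, start=1):
        results.append(fetch(cnpj))
        if count % checkpoint_every == 0:
            save_partial(results)  # e.g. write data/companies-partial.xz
    save_partial(results)  # final flush, including the last partial batch
    return results
```

With this structure, restarting after a crash only re-fetches the records since the most recent checkpoint.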
geocode_addresses.py
Adds latitude/longitude coordinates to each company in `data/companies.xz` using the Google Maps Geocoding API.
- Reads company addresses from the `companies.xz` dataset.
- Calls the Google Maps Geocoder via `geopy` with up to 40 concurrent threads.
- Caches results as pickle files in `data/companies/` so interrupted runs can resume.
- Writes the final dataset back to `data/companies.xz` and removes the temporary cache directory.
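The pickle-cache resume behavior can be sketched like this. `geocode` is a hypothetical callable wrapping the geopy Google geocoder mentioned above, and the cache filename scheme is an assumption for illustration:

```python
import os
import pickle

def cached_geocode(company_id, address, geocode, cache_dir="data/companies"):
    """Geocode an address, reusing a per-company pickle cache so
    interrupted runs can resume without re-querying the API."""
    os.makedirs(cache_dir, exist_ok=True)
    cache_file = os.path.join(cache_dir, "{}.pkl".format(company_id))
    if os.path.exists(cache_file):  # cache hit: resume from a previous run
        with open(cache_file, "rb") as handle:
            return pickle.load(handle)
    coordinates = geocode(address)  # e.g. a (lat, lon) tuple, or None
    with open(cache_file, "wb") as handle:
        pickle.dump(coordinates, handle)
    return coordinates
```

Because each company's result lands in its own file, a crashed run picks up exactly where it stopped.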
Set `GOOGLE_API_KEY` in your `.env` file before running this script.
Congressperson data
fetch_congressperson_details.py
Fetches civil name, birth date, and gender for every congressperson from the Chamber of Deputies web service.
- Collects all unique `congressperson_id` values across the current-year, last-year, and previous-years datasets.
- Calls `ObterDetalhesDeputado` on the Chamber’s SOAP endpoint for each ID.
- Outputs a date-stamped file (e.g., `2024-01-15-congressperson-details.xz`) to `research/data/`.
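The first step, collecting the unique IDs across the yearly datasets, can be sketched with the standard library (assuming the datasets are xz-compressed CSVs with a `congressperson_id` column; the function name is illustrative):

```python
import csv
import lzma

def unique_congressperson_ids(paths, column="congressperson_id"):
    """Collect the set of unique congressperson IDs across the
    year-based .xz datasets (current-year, last-year, previous-years)."""
    ids = set()
    for path in paths:
        with lzma.open(path, "rt", newline="") as handle:
            for row in csv.DictReader(handle):
                value = row.get(column)
                if value:
                    ids.add(value)
    return ids
```

Each resulting ID is then looked up once against the SOAP endpoint, rather than once per reimbursement row.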
Campaign finance data
fetch_campaign_donations.py
Downloads campaign donation reports for federal elections from the Brazilian Electoral Court (TSE).
- Covers election years 2010, 2012, 2014, and 2016.
- Downloads ZIP archives from `agencia.tse.jus.br` for candidates, parties, and committees.
- Extracts and merges the text files into a single normalized dataset.
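The extract-and-merge step can be sketched as below. The function name and the latin-1 encoding of the TSE text files are assumptions for illustration; `zip_payloads` stands in for the raw bytes of each downloaded archive.

```python
import io
import zipfile

def merge_zip_archives(zip_payloads):
    """Extract every .txt file from a set of ZIP archives (candidates,
    parties, committees) and concatenate their lines into one list."""
    lines = []
    for payload in zip_payloads:
        with zipfile.ZipFile(io.BytesIO(payload)) as archive:
            for name in archive.namelist():
                if name.lower().endswith(".txt"):
                    with archive.open(name) as handle:
                        # Assumed encoding of the TSE exports
                        text = handle.read().decode("latin-1")
                        lines.extend(text.splitlines())
    return lines
```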
fetch_tse_data.py
Fetches additional electoral data from the TSE open data portal, complementing the campaign donation data with candidate and election metadata.
Sanctions data
fetch_federal_sanctions.py
Downloads the federal government sanctions lists published by the Office of the Comptroller General (CGU).
- Retrieves the most recently published version of each sanctions list by walking backwards from today’s date until a valid file is found.
- Downloads and extracts ZIP archives for each dataset.
- Translates column names from Portuguese to English.
- Removes temporary files after processing.
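The backwards date walk can be sketched as follows. `exists` is a hypothetical predicate standing in for the check that a remote file for a given date is valid (e.g., an HTTP request that succeeds); the window size is an assumption.

```python
from datetime import date, timedelta

def latest_available(exists, start=None, max_days=365):
    """Walk backwards from today's date until a published file is
    found, mirroring how the script locates the newest CGU list."""
    day = start or date.today()
    for _ in range(max_days):
        if exists(day):
            return day
        day -= timedelta(days=1)
    return None  # nothing published in the window
```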
Venue verification data
fetch_foursquare_info.py
Queries the Foursquare Places API for venue information at supplier addresses, used to verify whether a claimed expense location is consistent with the supplier category.
fetch_yelp_info.py
Queries the Yelp Fusion API for business information at supplier addresses, providing an independent venue verification source alongside the Foursquare data.
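The consistency check both scripts feed into could look roughly like this. This matching rule is an illustrative assumption, not the project's actual heuristic, and `category_matches` is a hypothetical name:

```python
def category_matches(supplier_category, venue_categories):
    """Rough consistency check between a supplier's declared category
    and the categories returned by a venue API (Foursquare or Yelp)."""
    wanted = supplier_category.strip().lower()
    normalized = [c.strip().lower() for c in venue_categories]
    # Substring match in either direction, e.g. "Restaurant" matches
    # "Brazilian Restaurant"
    return any(wanted in c or c in wanted for c in normalized)
```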
Utilities
backup_data.py
Uploads all locally generated datasets to the remote dataset repository via serenata-toolbox.
