All data loading commands write directly to PostgreSQL via the Django ORM — they do not require a Celery worker to be running. RabbitMQ and Celery are needed for the web application’s background tasks (such as receipt URL fetching), not for these import commands.
## Sample data files
The repository ships with small sample files in `contrib/data/` so you can seed a local database without running Rosie:
| File | Contents |
|---|---|
| `contrib/data/reimbursements_sample.csv` | Sample reimbursement records (CSV) |
| `contrib/data/companies_sample.xz` | Sample company records (LZMA-compressed) |
| `contrib/data/suspicions_sample.xz` | Sample suspicion scores from Rosie (LZMA-compressed) |
In the Docker setup these files are mounted at `/mnt/data/` inside the container.
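The compressed samples can be inspected with nothing but the Python standard library. A minimal sketch (the column names in the demo data below are made up for illustration):

```python
import csv
import io
import lzma

def read_xz_csv(path_or_file, limit=None):
    """Read rows from an LZMA-compressed CSV, as the .xz samples are stored."""
    with lzma.open(path_or_file, "rt") as handle:
        rows = list(csv.reader(handle))
    return rows[:limit] if limit else rows

# Self-contained demo: compress a tiny CSV in memory, then read it back.
raw = "cnpj,name\n00000000000191,ACME\n"
blob = io.BytesIO(lzma.compress(raw.encode()))
rows = read_xz_csv(blob)  # [['cnpj', 'name'], ['00000000000191', 'ACME']]
```

Passing a real path such as `contrib/data/companies_sample.xz` works the same way, since `lzma.open` accepts both paths and file objects.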
## Commands
### reimbursements
Loads the reimbursement dataset from a CSV file. This is always the first import to run — the other commands require reimbursement records to already exist.
Rows are inserted in batches; pass `--batch-size` to override the default batch size.
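The batching behaviour can be pictured as a simple chunking loop over the CSV reader. This is an illustrative sketch, not the command's actual implementation, and the column names in the demo data are hypothetical:

```python
import csv
import io

def batches(rows, batch_size=4096):
    """Yield lists of at most batch_size rows from any iterable of rows."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly short, batch
        yield batch

# Demo: three CSV rows imported in batches of two.
sample = io.StringIO("document_id,total_net_value\n1,10.0\n2,20.5\n3,7.25\n")
reader = csv.DictReader(sample)
chunks = list(batches(reader, batch_size=2))  # sizes: [2, 1]
```

In the real command each batch would be handed to the ORM (e.g. a bulk insert) instead of being collected into a list.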
### companies
Loads Brazilian company data from an LZMA-compressed file. Companies are referenced by their CNPJ in reimbursement records.
### suspicions
Loads Rosie’s suspicion output. This command matches each row in the suspicions file to an existing `Reimbursement` record by `document_id` and updates its `suspicions` (JSON) and `probability` (decimal) fields.
The suspicions file must be LZMA-compressed (`.xz`). This is the direct output of a Rosie analysis run.
Options:
| Flag | Default | Description |
|---|---|---|
| `--batch-size` / `-b` | 4096 | Rows per processing batch. |
| `--workers` / `-w` | 8 | Thread pool size for parallel updates. |
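Conceptually, the command does something like the following sketch; plain dicts stand in for `Reimbursement` model instances, and the specific suspicion names and values are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

# Existing records, keyed by document_id (stand-ins for Reimbursement rows).
reimbursements = {
    "1001": {"document_id": "1001", "suspicions": None, "probability": None},
    "1002": {"document_id": "1002", "suspicions": None, "probability": None},
}

# Parsed rows from the suspicions .xz file (illustrative values).
suspicion_rows = [
    {"document_id": "1001", "suspicions": {"meal_price_outlier": True}, "probability": 0.97},
    {"document_id": "9999", "suspicions": {}, "probability": 0.0},  # no match: skipped
]

def apply(row):
    """Match one suspicion row to a record and update its fields."""
    record = reimbursements.get(row["document_id"])
    if record is None:
        return False  # no Reimbursement with this document_id
    record["suspicions"] = row["suspicions"]
    record["probability"] = row["probability"]
    return True

# --workers controls the size of a pool like this one.
with ThreadPoolExecutor(max_workers=8) as pool:
    matched = sum(pool.map(apply, suspicion_rows))  # number of updated records
```

Rows without a matching `document_id` are simply skipped, which is why the reimbursements import must run first.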
### tweets
Fetches tweets from the @RosieDaSerenata timeline and links each tweet to the reimbursement it references. Tweet links appear in the dashboard alongside flagged reimbursements.
This command requires valid Twitter API credentials in your `.env` file (`TWITTER_CONSUMER_KEY`, `TWITTER_CONSUMER_SECRET`, `TWITTER_ACCESS_TOKEN`, `TWITTER_ACCESS_SECRET`). Without them the command exits with a warning and does nothing. Obtain credentials from the python-twitter getting started guide.

### searchvector
Builds PostgreSQL full-text search vectors for all reimbursement records. The dashboard search bar will not return results until this command has been run.
| Weight | Fields |
|---|---|
| A (highest) | `congressperson_name`, `supplier`, `cnpj_cpf`, `party` |
| B | `state`, `receipt_text` |
| C | `passenger`, `leg_of_the_trip` |
| D (lowest) | `subquota_description`, `subquota_group_description` |
By default, only records with a NULL search vector are indexed. Pass `--all` to rebuild every record:
| Flag | Default | Description |
|---|---|---|
| `--all` | off | Re-create search vectors for all records, not just new ones. |
| `--batch-size` / `-b` | 4096 | Records updated per database query. |
| `--silent` | off | Suppress progress output. |
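To see why the weights matter: PostgreSQL's `ts_rank` multiplies matches by per-weight factors (defaults `D=0.1`, `C=0.2`, `B=0.4`, `A=1.0`), so a hit in a weight-A field like `congressperson_name` outranks the same hit in a weight-C field like `passenger`. A toy scorer illustrating the idea (not the real ranking function):

```python
# Default per-weight multipliers used by PostgreSQL's ts_rank.
WEIGHTS = {"A": 1.0, "B": 0.4, "C": 0.2, "D": 0.1}

# The field-to-weight mapping from the table above.
FIELD_WEIGHT = {
    "congressperson_name": "A", "supplier": "A", "cnpj_cpf": "A", "party": "A",
    "state": "B", "receipt_text": "B",
    "passenger": "C", "leg_of_the_trip": "C",
    "subquota_description": "D", "subquota_group_description": "D",
}

def score(term, record):
    """Toy scorer: sum the weight multiplier of every field containing term."""
    return sum(WEIGHTS[FIELD_WEIGHT[field]]
               for field, value in record.items()
               if term.lower() in str(value).lower())

# "maria" matches a weight-A field (1.0) and a weight-C field (0.2).
record = {"congressperson_name": "Maria Silva", "passenger": "Maria Souza"}
```

Here `score("maria", record)` yields 1.2, dominated by the weight-A match, which mirrors how the dashboard search orders results.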
### update
A composite command that performs a full database refresh in one step. Useful for updating production after Rosie generates a new dataset.
It expects a target directory containing `reimbursements-YYYY.csv` files and a `suspicions.xz` file. It runs the following steps in order:
1. **Back up tweet links**: exports all `Tweet` records to `tweets.csv` in the target directory so they can be re-linked after the data is replaced.
2. **Drop and reload reimbursements**: deletes all `Reimbursement` rows, then imports every `reimbursements-*.csv` file found in the directory.
3. **Reload receipt texts**: downloads `2017-02-15-receipts-texts.xz` from the Serenata data space and imports it.

## Automated search vector scheduling
You can schedule `searchvector` to run automatically using Celery Beat. Set the following variables in `.env`:
| Variable | Default | Description |
|---|---|---|
| `SCHEDULE_SEARCHVECTOR` | `False` | Set to `True` to enable the scheduled task. |
| `SCHEDULE_SEARCHVECTOR_CRON_MINUTE` | `0` | Cron minute expression. |
| `SCHEDULE_SEARCHVECTOR_CRON_HOUR` | `2` | Cron hour expression. |
| `SCHEDULE_SEARCHVECTOR_CRON_DAY_OF_WEEK` | `*` | Cron day-of-week expression. |
| `SCHEDULE_SEARCHVECTOR_CRON_DAY_OF_MONTH` | `*/2` | Cron day-of-month expression (default: every two days). |
| `SCHEDULE_SEARCHVECTOR_CRON_MONTH_OF_YEAR` | `*` | Cron month-of-year expression. |
The `beat` container in `docker-compose.yml` runs `celery beat` and will pick up the schedule automatically when `SCHEDULE_SEARCHVECTOR=True`.
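With the defaults above, the task fires at 02:00 on every other day of the month (`*/2` in the day-of-month field). A sketch of how such a schedule could be assembled from the environment; the function and dict construction are assumptions for illustration, not Jarbas's actual code:

```python
def searchvector_schedule(env):
    """Build cron fields for the searchvector task from .env-style settings.

    Returns None when scheduling is disabled (the default), otherwise a dict
    of cron fields matching the defaults in the table above.
    """
    if env.get("SCHEDULE_SEARCHVECTOR", "False") != "True":
        return None  # scheduling is off unless explicitly enabled
    return {
        "minute": env.get("SCHEDULE_SEARCHVECTOR_CRON_MINUTE", "0"),
        "hour": env.get("SCHEDULE_SEARCHVECTOR_CRON_HOUR", "2"),
        "day_of_week": env.get("SCHEDULE_SEARCHVECTOR_CRON_DAY_OF_WEEK", "*"),
        "day_of_month": env.get("SCHEDULE_SEARCHVECTOR_CRON_DAY_OF_MONTH", "*/2"),
        "month_of_year": env.get("SCHEDULE_SEARCHVECTOR_CRON_MONTH_OF_YEAR", "*"),
    }

# Enabling the task with all other defaults: 02:00, every other day.
schedule = searchvector_schedule({"SCHEDULE_SEARCHVECTOR": "True"})
```

The field names mirror the keyword arguments of Celery's `crontab` schedule class, so a dict like this can be expanded directly into `crontab(**schedule)`.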
