Jarbas populates its PostgreSQL database through Django management commands. Each command reads a dataset file and writes it to the appropriate model.
All data loading commands write directly to PostgreSQL via Django ORM — they do not require a Celery worker to be running. RabbitMQ and Celery are needed for the web application’s background tasks (such as receipt URL fetching), not for these import commands.

Sample data files

The repository ships with small sample files in contrib/data/ so you can seed a local database without running Rosie:
| File | Contents |
| --- | --- |
| contrib/data/reimbursements_sample.csv | Sample reimbursement records (CSV) |
| contrib/data/companies_sample.xz | Sample company records (LZMA-compressed) |
| contrib/data/suspicions_sample.xz | Sample suspicion scores from Rosie (LZMA-compressed) |
For Docker, these files are mounted at /mnt/data/ inside the container.

Commands

reimbursements

Loads the reimbursement dataset from a CSV file. This is always the first import to run — the other commands require reimbursement records to already exist.
python manage.py reimbursements <path>
# Docker
docker-compose run --rm django python manage.py reimbursements /mnt/data/reimbursements_sample.csv
Records are inserted in batches of 4096 rows. Pass --batch-size to override:
python manage.py reimbursements reimbursements.csv --batch-size 2048
Full reimbursement datasets are produced by Rosie or can be downloaded with the serenata-toolbox. Files are named reimbursements-YYYY.csv by year.
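The batch behavior can be sketched in plain Python. This `chunked` helper is a hypothetical illustration of what "inserting in batches of 4096" means, not Jarbas's actual implementation; in Django each batch would typically be passed to a single `bulk_create` call:

```python
from itertools import islice

def chunked(rows, batch_size=4096):
    """Yield successive lists of at most batch_size rows from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# 10,000 rows split into batches of 4096 → sizes 4096, 4096, 1808.
# In the real command each batch would feed one bulk insert,
# e.g. Reimbursement.objects.bulk_create(batch).
batches = list(chunked(range(10_000), batch_size=4096))
```

A smaller `--batch-size` trades import speed for lower memory use and shorter database transactions.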

companies

Loads Brazilian company data from an LZMA-compressed file. Companies are referenced by their CNPJ in reimbursement records.
python manage.py companies <path>
# Docker
docker-compose run --rm django python manage.py companies /mnt/data/companies_sample.xz
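Because reimbursements reference companies by CNPJ, formatted and raw CNPJ values need to compare equal. The digits-only normalization below is an illustrative sketch of that idea, not code taken from Jarbas:

```python
import re

def normalize_cnpj(cnpj):
    """Strip punctuation so a formatted CNPJ matches its 14 raw digits."""
    return re.sub(r"\D", "", cnpj)

formatted = "12.345.678/0001-95"  # hypothetical CNPJ, standard print format
raw = "12345678000195"
```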

suspicions

Loads Rosie’s suspicion output. This command matches each row in the suspicions file to an existing Reimbursement record by document_id and updates its suspicions (JSON) and probability (decimal) fields.
python manage.py suspicions <path>
# Docker
docker-compose run --rm django python manage.py suspicions /mnt/data/suspicions_sample.xz
The file must be LZMA-compressed (.xz). This is the direct output of a Rosie analysis run. Options:
| Flag | Default | Description |
| --- | --- | --- |
| --batch-size / -b | 4096 | Rows per processing batch. |
| --workers / -w | 8 | Thread pool size for parallel updates. |
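The batch/worker interplay can be sketched with a standard thread pool. Here `apply_suspicion` and the in-memory `store` are hypothetical stand-ins for the real per-row ORM update keyed by document_id:

```python
from concurrent.futures import ThreadPoolExecutor

def apply_suspicion(row, store):
    # Stand-in for updating the Reimbursement matched by document_id:
    # set its suspicions (JSON) and probability (decimal) fields.
    store[row["document_id"]] = {
        "suspicions": row["suspicions"],
        "probability": row["probability"],
    }

rows = [
    {"document_id": i, "suspicions": {"meal_price_outlier": True}, "probability": 0.9}
    for i in range(100)
]
store = {}
with ThreadPoolExecutor(max_workers=8) as pool:  # mirrors --workers 8
    for row in rows:
        pool.submit(apply_suspicion, row, store)
# leaving the `with` block waits for all submitted updates to finish
```

Threads suit this workload because each update spends most of its time waiting on the database, not on the CPU.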

tweets

Fetches tweets from the @RosieDaSerenata timeline and links each tweet to the reimbursement it references. Tweet links appear in the dashboard alongside flagged reimbursements.
python manage.py tweets
# Docker
docker-compose run --rm django python manage.py tweets
This command requires valid Twitter API credentials in your .env file (TWITTER_CONSUMER_KEY, TWITTER_CONSUMER_SECRET, TWITTER_ACCESS_TOKEN, TWITTER_ACCESS_SECRET). Without them the command exits with a warning and does nothing. Obtain credentials from the python-twitter getting started guide.
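A minimal .env fragment for this command looks like the following (the variable names come from the list above; the values are placeholders you must replace with your own credentials):

```shell
TWITTER_CONSUMER_KEY=your-consumer-key
TWITTER_CONSUMER_SECRET=your-consumer-secret
TWITTER_ACCESS_TOKEN=your-access-token
TWITTER_ACCESS_SECRET=your-access-secret
```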

searchvector

Builds PostgreSQL full-text search vectors for all reimbursement records. The dashboard search bar will not return results until this command has been run.
python manage.py searchvector
# Docker
docker-compose run --rm django python manage.py searchvector
The search index covers the following fields with weighted relevance:
| Weight | Fields |
| --- | --- |
| A (highest) | congressperson_name, supplier, cnpj_cpf, party |
| B | state, receipt_text |
| C | passenger, leg_of_the_trip |
| D (lowest) | subquota_description, subquota_group_description |
By default, only records with a NULL search vector are indexed. Pass --all to rebuild every record:
python manage.py searchvector --all
Options:
| Flag | Default | Description |
| --- | --- | --- |
| --all | off | Re-create search vectors for all records, not just new ones. |
| --batch-size / -b | 4096 | Records updated per database query. |
| --silent | off | Suppress progress output. |
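To see what the weight labels mean for ranking: PostgreSQL's ts_rank multiplies matches by 0.1, 0.2, 0.4, and 1.0 for weights D, C, B, and A by default, so a hit on congressperson_name counts ten times a hit on subquota_description. The toy scorer below only illustrates that relative weighting; the real ts_rank formula also accounts for term frequency and optional length normalization:

```python
# PostgreSQL's default weight multipliers for the four tsvector labels.
PG_DEFAULT_WEIGHTS = {"D": 0.1, "C": 0.2, "B": 0.4, "A": 1.0}

def naive_rank(matched_weights):
    """Toy score: sum the multipliers of the weight labels a query matched."""
    return sum(PG_DEFAULT_WEIGHTS[w] for w in matched_weights)

# A single A-field match outranks matches on B, C, and D combined.
a_only = naive_rank(["A"])          # 1.0
b_c_d = naive_rank(["B", "C", "D"])  # 0.7
```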

update

A composite command that performs a full database refresh in one step. Useful for updating production after Rosie generates a new dataset.
python manage.py update <directory>
The command expects a directory containing reimbursements-YYYY.csv files and a suspicions.xz file. It runs the following steps in order:
1. Back up tweet links: exports all Tweet records to tweets.csv in the target directory so they can be re-linked after the data is replaced.
2. Drop and reload reimbursements: deletes all Reimbursement rows, then imports every reimbursements-*.csv file found in the directory.
3. Load suspicions: imports suspicions.xz from the directory.
4. Reload receipt texts: downloads 2017-02-15-receipts-texts.xz from the Serenata data space and imports it.
5. Rebuild the search vector: runs searchvector to re-index all records.
6. Restore tweet links: re-creates Tweet records from the backup CSV.
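The file discovery in the reimbursements step can be sketched with a glob over the target directory. This is an illustrative sketch, not the command's actual code; the `yearly_csvs` helper and the throwaway directory are hypothetical:

```python
import pathlib
import tempfile

def yearly_csvs(directory):
    """Return the reimbursements-YYYY.csv files in a directory, oldest first."""
    return sorted(pathlib.Path(directory).glob("reimbursements-*.csv"))

# Build a throwaway directory shaped like the one `update` expects.
tmp = tempfile.mkdtemp()
for name in ("reimbursements-2016.csv", "reimbursements-2015.csv",
             "suspicions.xz", "notes.txt"):
    pathlib.Path(tmp, name).touch()

found = [p.name for p in yearly_csvs(tmp)]
# only the reimbursements-*.csv files match, in year order
```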

Automated search vector scheduling

You can schedule searchvector to run automatically using Celery Beat. Set the following variables in .env:
| Variable | Default | Description |
| --- | --- | --- |
| SCHEDULE_SEARCHVECTOR | False | Set to True to enable the scheduled task. |
| SCHEDULE_SEARCHVECTOR_CRON_MINUTE | 0 | Cron minute expression. |
| SCHEDULE_SEARCHVECTOR_CRON_HOUR | 2 | Cron hour expression. |
| SCHEDULE_SEARCHVECTOR_CRON_DAY_OF_WEEK | * | Cron day-of-week expression. |
| SCHEDULE_SEARCHVECTOR_CRON_DAY_OF_MONTH | */2 | Cron day-of-month expression (default: every two days). |
| SCHEDULE_SEARCHVECTOR_CRON_MONTH_OF_YEAR | * | Cron month-of-year expression. |
The beat container in docker-compose.yml runs celery beat and will pick up the schedule automatically when SCHEDULE_SEARCHVECTOR=True.
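For example, to reindex nightly at 03:30 instead of the every-other-day default, a .env could read (variable names from the table above; the timing values are just an example):

```shell
SCHEDULE_SEARCHVECTOR=True
SCHEDULE_SEARCHVECTOR_CRON_MINUTE=30
SCHEDULE_SEARCHVECTOR_CRON_HOUR=3
SCHEDULE_SEARCHVECTOR_CRON_DAY_OF_MONTH=*
```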

Typical seeding sequence

Run these commands in order when setting up a fresh database:
# 1. Migrate
python manage.py migrate

# 2. Load core data
python manage.py reimbursements /mnt/data/reimbursements_sample.csv
python manage.py companies /mnt/data/companies_sample.xz
python manage.py suspicions /mnt/data/suspicions_sample.xz

# 3. Enrich with social data
python manage.py tweets

# 4. Build search index (must come last)
python manage.py searchvector
