All data loading commands write directly to PostgreSQL via the Django ORM — they do not require a Celery worker to be running. RabbitMQ and Celery are needed for the web application’s background tasks (such as receipt URL fetching), not for these import commands.
## Sample data files
The repository ships with small sample files in `contrib/data/` so you can seed a local database without running Rosie:
| File | Contents |
|---|---|
| `contrib/data/reimbursements_sample.csv` | Sample reimbursement records (CSV) |
| `contrib/data/companies_sample.xz` | Sample company records (LZMA-compressed) |
| `contrib/data/suspicions_sample.xz` | Sample suspicion scores from Rosie (LZMA-compressed) |
In the Docker setup these files are mounted at `/mnt/data/` inside the container.
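The compressed samples can be inspected with nothing but the Python standard library. A minimal sketch (the column names in the demo data below are made up for illustration):

```python
import csv
import io
import lzma

def read_xz_csv(path_or_file, limit=None):
    """Read rows from an LZMA-compressed CSV, as the .xz samples are stored."""
    with lzma.open(path_or_file, "rt") as handle:
        rows = list(csv.reader(handle))
    return rows[:limit] if limit else rows

# Self-contained demo: compress a tiny CSV in memory, then read it back.
raw = "cnpj,name\n00000000000191,ACME\n"
blob = io.BytesIO(lzma.compress(raw.encode()))
rows = read_xz_csv(blob)  # [['cnpj', 'name'], ['00000000000191', 'ACME']]
```

Passing a real path such as `contrib/data/companies_sample.xz` works the same way, since `lzma.open` accepts both paths and file objects.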
## Commands
### reimbursements
Loads the reimbursement dataset from a CSV file. This is always the first import to run — the other commands require reimbursement records to already exist.
Rows are inserted in batches; pass `--batch-size` to override the default batch size.
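The batching behaviour can be pictured as a simple chunking loop over the CSV reader. This is an illustrative sketch, not the command's actual implementation, and the column names in the demo data are hypothetical:

```python
import csv
import io

def batches(rows, batch_size=4096):
    """Yield lists of at most batch_size rows from any iterable of rows."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly short, batch
        yield batch

# Demo: three CSV rows imported in batches of two.
sample = io.StringIO("document_id,total_net_value\n1,10.0\n2,20.5\n3,7.25\n")
reader = csv.DictReader(sample)
chunks = list(batches(reader, batch_size=2))  # sizes: [2, 1]
```

In the real command each batch would be handed to the ORM (e.g. a bulk insert) instead of being collected into a list.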
### companies
Loads Brazilian company data from an LZMA-compressed file. Companies are referenced by their CNPJ in reimbursement records.
### suspicions
Loads Rosie’s suspicion output. This command matches each row in the suspicions file to an existing `Reimbursement` record by `document_id` and updates its `suspicions` (JSON) and `probability` (decimal) fields.
The suspicions file must be LZMA-compressed (`.xz`). This is the direct output of a Rosie analysis run.
Options:
| Flag | Default | Description |
|---|---|---|
| `--batch-size` / `-b` | 4096 | Rows per processing batch. |
| `--workers` / `-w` | 8 | Thread pool size for parallel updates. |
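Conceptually, the command does something like the following sketch; plain dicts stand in for `Reimbursement` model instances, and the specific suspicion names and values are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

# Existing records, keyed by document_id (stand-ins for Reimbursement rows).
reimbursements = {
    "1001": {"document_id": "1001", "suspicions": None, "probability": None},
    "1002": {"document_id": "1002", "suspicions": None, "probability": None},
}

# Parsed rows from the suspicions .xz file (illustrative values).
suspicion_rows = [
    {"document_id": "1001", "suspicions": {"meal_price_outlier": True}, "probability": 0.97},
    {"document_id": "9999", "suspicions": {}, "probability": 0.0},  # no match: skipped
]

def apply(row):
    """Match one suspicion row to a record and update its fields."""
    record = reimbursements.get(row["document_id"])
    if record is None:
        return False  # no Reimbursement with this document_id
    record["suspicions"] = row["suspicions"]
    record["probability"] = row["probability"]
    return True

# --workers controls the size of a pool like this one.
with ThreadPoolExecutor(max_workers=8) as pool:
    matched = sum(pool.map(apply, suspicion_rows))  # number of updated records
```

Rows without a matching `document_id` are simply skipped, which is why the reimbursements import must run first.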
### tweets
Fetches tweets from the @RosieDaSerenata timeline and links each tweet to the reimbursement it references. Tweet links appear in the dashboard alongside flagged reimbursements.
This command requires valid Twitter API credentials in your `.env` file (`TWITTER_CONSUMER_KEY`, `TWITTER_CONSUMER_SECRET`, `TWITTER_ACCESS_TOKEN`, `TWITTER_ACCESS_SECRET`). Without them the command exits with a warning and does nothing. Obtain credentials from the python-twitter getting started guide.

### searchvector
Builds PostgreSQL full-text search vectors for all reimbursement records. The dashboard search bar will not return results until this command has been run.
| Weight | Fields |
|---|---|
| A (highest) | `congressperson_name`, `supplier`, `cnpj_cpf`, `party` |
| B | `state`, `receipt_text` |
| C | `passenger`, `leg_of_the_trip` |
| D (lowest) | `subquota_description`, `subquota_group_description` |
By default, only records with a NULL search vector are indexed. Pass `--all` to rebuild every record:
| Flag | Default | Description |
|---|---|---|
| `--all` | off | Re-create search vectors for all records, not just new ones. |
| `--batch-size` / `-b` | 4096 | Records updated per database query. |
| `--silent` | off | Suppress progress output. |
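To see why the weights matter: PostgreSQL's `ts_rank` multiplies matches by per-weight factors (defaults `D=0.1`, `C=0.2`, `B=0.4`, `A=1.0`), so a hit in a weight-A field like `congressperson_name` outranks the same hit in a weight-C field like `passenger`. A toy scorer illustrating the idea (not the real ranking function):

```python
# Default per-weight multipliers used by PostgreSQL's ts_rank.
WEIGHTS = {"A": 1.0, "B": 0.4, "C": 0.2, "D": 0.1}

# The field-to-weight mapping from the table above.
FIELD_WEIGHT = {
    "congressperson_name": "A", "supplier": "A", "cnpj_cpf": "A", "party": "A",
    "state": "B", "receipt_text": "B",
    "passenger": "C", "leg_of_the_trip": "C",
    "subquota_description": "D", "subquota_group_description": "D",
}

def score(term, record):
    """Toy scorer: sum the weight multiplier of every field containing term."""
    return sum(WEIGHTS[FIELD_WEIGHT[field]]
               for field, value in record.items()
               if term.lower() in str(value).lower())

# "maria" matches a weight-A field (1.0) and a weight-C field (0.2).
record = {"congressperson_name": "Maria Silva", "passenger": "Maria Souza"}
```

Here `score("maria", record)` yields 1.2, dominated by the weight-A match, which mirrors how the dashboard search orders results.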
### update
A composite command that performs a full database refresh in one step. Useful for updating production after Rosie generates a new dataset.
It expects a target directory containing `reimbursements-YYYY.csv` files and a `suspicions.xz` file. It runs the following steps in order:
1. **Back up tweet links**: exports all `Tweet` records to `tweets.csv` in the target directory so they can be re-linked after the data is replaced.
2. **Drop and reload reimbursements**: deletes all `Reimbursement` rows, then imports every `reimbursements-*.csv` file found in the directory.
3. **Reload receipt texts**: downloads `2017-02-15-receipts-texts.xz` from the Serenata data space and imports it.

## Automated search vector scheduling
You can schedule `searchvector` to run automatically using Celery Beat. Set the following variables in `.env`:
| Variable | Default | Description |
|---|---|---|
| `SCHEDULE_SEARCHVECTOR` | `False` | Set to `True` to enable the scheduled task. |
| `SCHEDULE_SEARCHVECTOR_CRON_MINUTE` | `0` | Cron minute expression. |
| `SCHEDULE_SEARCHVECTOR_CRON_HOUR` | `2` | Cron hour expression. |
| `SCHEDULE_SEARCHVECTOR_CRON_DAY_OF_WEEK` | `*` | Cron day-of-week expression. |
| `SCHEDULE_SEARCHVECTOR_CRON_DAY_OF_MONTH` | `*/2` | Cron day-of-month expression (default: every two days). |
| `SCHEDULE_SEARCHVECTOR_CRON_MONTH_OF_YEAR` | `*` | Cron month-of-year expression. |
The `beat` container in `docker-compose.yml` runs `celery beat` and will pick up the schedule automatically when `SCHEDULE_SEARCHVECTOR=True`.
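With the defaults above, the task fires at 02:00 on every other day of the month (`*/2` in the day-of-month field). A sketch of how such a schedule could be assembled from the environment; the function and dict construction are assumptions for illustration, not Jarbas's actual code:

```python
def searchvector_schedule(env):
    """Build cron fields for the searchvector task from .env-style settings.

    Returns None when scheduling is disabled (the default), otherwise a dict
    of cron fields matching the defaults in the table above.
    """
    if env.get("SCHEDULE_SEARCHVECTOR", "False") != "True":
        return None  # scheduling is off unless explicitly enabled
    return {
        "minute": env.get("SCHEDULE_SEARCHVECTOR_CRON_MINUTE", "0"),
        "hour": env.get("SCHEDULE_SEARCHVECTOR_CRON_HOUR", "2"),
        "day_of_week": env.get("SCHEDULE_SEARCHVECTOR_CRON_DAY_OF_WEEK", "*"),
        "day_of_month": env.get("SCHEDULE_SEARCHVECTOR_CRON_DAY_OF_MONTH", "*/2"),
        "month_of_year": env.get("SCHEDULE_SEARCHVECTOR_CRON_MONTH_OF_YEAR", "*"),
    }

# Enabling the task with all other defaults: 02:00, every other day.
schedule = searchvector_schedule({"SCHEDULE_SEARCHVECTOR": "True"})
```

The field names mirror the keyword arguments of Celery's `crontab` schedule class, so a dict like this can be expanded directly into `crontab(**schedule)`.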
