Overview
The tool processes data in two distinct pipelines.

Extract pipeline
Transforms Twitter’s complex JSON archive into a simplified CSV format.

Input: Twitter archive JSON
Location: data/tweets/tweets.json
Format: tweets.json
Twitter archives contain 50+ fields per tweet. We only extract what we need: id_str and full_text.

Step 1: Parse JSON
Component: JSONParser (storage.py:35)
- Load entire JSON file into memory
- Iterate through array of tweet objects
- Extract id_str and full_text from each tweet object
- Create immutable Tweet objects

Output: List[Tweet]
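A minimal sketch of this step, assuming names like Tweet and parse_tweets that are illustrative rather than the exact storage.py API (archive entries are assumed to wrap each record in a "tweet" key, which is hedged in the code):

```python
import json
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)  # frozen=True makes Tweet immutable
class Tweet:
    id_str: str
    full_text: str

def parse_tweets(json_text: str) -> List[Tweet]:
    """Load the whole archive into memory and keep only the two fields we need."""
    data = json.loads(json_text)
    tweets = []
    for obj in data:
        # Twitter archive exports often wrap each entry in a "tweet" key;
        # fall back to the object itself if not.
        t = obj.get("tweet", obj)
        tweets.append(Tweet(id_str=t["id_str"], full_text=t["full_text"]))
    return tweets
```
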
Step 2: Write to CSV
Component: CSVWriter (storage.py:131)
- Create output directory if needed
- Open CSV file for writing
- Write header row: id,text
- Write one row per tweet
- Close file (automatic via context manager)
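The steps above can be sketched as follows (write_tweets_csv is an illustrative name, not necessarily the real CSVWriter interface):

```python
import csv
from pathlib import Path

def write_tweets_csv(tweets, out_path: str) -> None:
    """Write one id,text row per tweet to a CSV file."""
    path = Path(out_path)
    # Create the output directory if needed
    path.parent.mkdir(parents=True, exist_ok=True)
    # The context manager closes the file automatically
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "text"])  # header row
        for t in tweets:
            writer.writerow([t.id_str, t.full_text])  # one row per tweet
```
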
Output: Transformed CSV
Location: data/tweets/transformed/tweets.csv
Format: tweets.csv
Why this transformation?
- Simpler format (2 columns vs 50+ fields)
- Human-readable (can inspect in Excel)
- Faster to parse for analysis
- Smaller file size
Visual flow
tweets.json → JSONParser → List[Tweet] → CSVWriter → tweets.csv
Analysis pipeline
Processes tweets through Gemini AI and identifies deletion candidates.

Input: Transformed CSV
Location: data/tweets/transformed/tweets.csv
Same format as extract pipeline output.
Step 1: Load tweets and checkpoint
Component: CSVParser + Checkpoint (storage.py:61, storage.py:84)
- Parse entire CSV into List[Tweet]
- Load checkpoint (0 if first run, or saved index)
- Skip already-processed tweets
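A hedged sketch of the checkpoint-and-skip logic (function names are illustrative; the real Checkpoint class in storage.py may differ):

```python
from pathlib import Path

def load_checkpoint(path: str) -> int:
    """Return the saved tweet index, or 0 on the first run."""
    p = Path(path)
    if not p.exists():
        return 0
    return int(p.read_text().strip() or 0)

def pending(tweets, checkpoint: int):
    """Skip tweets that were already processed in a previous run."""
    return tweets[checkpoint:]
```
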
Step 2: Batch processing loop
Component: Application.analyze_tweets() (application.py:64)
- Batch 1: tweets 0-9 (batch_size=10)
- Batch 2: tweets 10-19, and so on
- Interruption: the checkpoint records the last completed index
- Resume: the next run continues from the saved index
Step 3: Filter retweets
Component: _is_retweet() (application.py:125)
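One common heuristic for spotting retweets in archive text is the "RT @" prefix; the real _is_retweet() in application.py may use a different check, so treat this as an assumption:

```python
def is_retweet(text: str) -> bool:
    """Heuristic: classic retweets are prefixed with 'RT @username:' in full_text."""
    return text.startswith("RT @")
```
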
Step 4: AI analysis
Component: Gemini.analyze() (analyzer.py:71)
API Request:
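The request body is not shown in this excerpt. As a hedged sketch of what surrounds the API call, here is one way the prompt could be built and the model's DELETE/KEEP answer parsed (build_prompt and parse_decision are illustrative names, not the real analyzer.py code):

```python
def build_prompt(tweet_text: str) -> str:
    """Ask the model for a single DELETE or KEEP decision."""
    return (
        "Decide whether this tweet should be deleted.\n"
        "Answer with exactly DELETE or KEEP, followed by a short reason.\n\n"
        f"Tweet: {tweet_text}"
    )

def parse_decision(response_text: str) -> str:
    """Normalize the model's free-text answer to DELETE or KEEP."""
    first = response_text.strip().split()[0].upper()
    return "DELETE" if first.startswith("DELETE") else "KEEP"
```
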
Step 5: Write results
Component: CSVWriter.write_result() (storage.py:172)
Only DELETE decisions are written to results. KEEP decisions are silently skipped. This keeps the output focused on actionable items.
- Check if decision is DELETE
- If yes, write row to results CSV
- Flush to disk immediately (ensures no data loss)
- If no, skip (don’t write KEEP decisions)
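A sketch of this filter-and-flush behavior (the exact column layout and URL format are assumptions; results.csv is described below as containing a tweet_url and a deleted column):

```python
import csv

def write_result(file_obj, tweet_id: str, decision: str, reason: str) -> None:
    """Append a row only for DELETE decisions; KEEP decisions are skipped."""
    if decision != "DELETE":
        return  # keep the output focused on actionable items
    writer = csv.writer(file_obj)
    url = f"https://twitter.com/i/web/status/{tweet_id}"  # assumed URL format
    writer.writerow([tweet_id, url, reason, "false"])  # 'deleted' starts as false
    file_obj.flush()  # flush immediately so no data is lost on interruption
```
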
Step 6: Update checkpoint
Component: Checkpoint.save() (storage.py:121)
- Seek to beginning of checkpoint file
- Truncate (clear existing content)
- Write new index
- Flush to disk
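These four steps amount to overwriting the file in place; a minimal sketch (save_checkpoint is an illustrative name for Checkpoint.save()):

```python
def save_checkpoint(file_obj, index: int) -> None:
    """Overwrite the checkpoint file in place with the latest index."""
    file_obj.seek(0)        # seek to beginning of checkpoint file
    file_obj.truncate()     # clear existing content
    file_obj.write(str(index))  # write new index
    file_obj.flush()        # flush to disk
```
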
Output: Results CSV
Location: data/tweets/processed/results.csv
Format: results.csv
tweet_url: Direct link to the tweet (clickable in spreadsheets)
deleted: Manual tracking column (initialized to false)
Why include the 'deleted' column?
Allows users to manually track deletion progress:
- Open results.csv in Excel/Google Sheets
- Click the tweet URL to review
- Delete tweet on Twitter
- Mark deleted as true in the spreadsheet
- Track completion progress
Visual flow
tweets.csv → CSVParser → batches → Gemini.analyze() → CSVWriter.write_result() → results.csv
Data transformations
Summary of how the data structure changes through the pipeline:
- Twitter JSON
- Python objects
- Transformed CSV
- Results CSV
Input format from Twitter. Size: ~5-10 KB per tweet (large).
Error handling flow
How errors propagate through the system.

Performance characteristics
Time analysis
For 5,000 tweets with default settings.

Memory usage
Extract phase
Memory: O(n), where n = number of tweets.
For 50,000 tweets × 200 bytes ≈ 10 MB.
Analysis phase
Memory: O(batch_size).
For batch_size=10 × 200 bytes ≈ 2 KB.
Next steps
Component details
Deep dive into each module’s implementation
Design decisions
Understand why the system works this way