The Tweet Audit Tool processes tweets in configurable batches to balance performance with recoverability. Understanding batch processing helps you optimize analysis speed and handle interruptions gracefully.
How batch processing works
Tweets are processed in sequential batches from your CSV file. After each batch completes, progress is saved to a checkpoint file.
```python
for i in range(start_index, len(tweets), settings.batch_size):
    batch = tweets[i : i + settings.batch_size]
    batch_num = (i // settings.batch_size) + 1
    total_batches = (len(tweets) + settings.batch_size - 1) // settings.batch_size

    logger.info(
        f"Processing batch {batch_num}/{total_batches} "
        f"(tweets {i + 1}-{min(i + len(batch), len(tweets))} of {len(tweets)})"
    )

    for tweet in batch:
        ...  # process each tweet

    checkpoint.save(i + len(batch))
    logger.info(f"Checkpoint saved at index {i + len(batch)}")
```
Batch lifecycle

1. **Load checkpoint.** On startup, read `data/checkpoint.txt` to find the resume point:

   ```python
   start_index = checkpoint.load()  # Returns 0 if no checkpoint exists
   ```

2. **Create batch slice.** Extract the next `batch_size` tweets starting from `start_index`:

   ```python
   batch = tweets[i : i + settings.batch_size]
   ```

3. **Process tweets sequentially.** Analyze each tweet in the batch, skipping retweets:

   ```python
   for tweet in batch:
       if _is_retweet(tweet):
           continue
       result = self.analyzer.analyze(tweet)
   ```

4. **Write results.** Append flagged tweets to `results.csv` as they’re analyzed:

   ```python
   if result.decision == Decision.DELETE:
       writer.write_result(result)
   ```

5. **Save checkpoint.** After the entire batch completes, save progress:

   ```python
   checkpoint.save(i + len(batch))
   ```

6. **Repeat.** Move to the next batch until all tweets are processed.
Configuration
Default batch size
The default batch size is 10 tweets per batch, configured in `src/config.py`:
```python
from dataclasses import dataclass

@dataclass
class Settings:
    # ... other settings
    batch_size: int = 10
```
Changing batch size
To modify the batch size, edit `src/config.py`:
```python
batch_size: int = 50  # Process 50 tweets per batch
```
Changing `batch_size` during an active analysis can cause unexpected resume behavior. Always complete or reset your analysis before changing this value.
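To see one way this shows up, trace the logging arithmetic from the first snippet: suppose a run checkpointed at index 20 with `batch_size=10`, and you raise `batch_size` to 50 before resuming. Judging from the loop shown above, no tweets are skipped or repeated (processing still starts at the saved index), but the batch numbering in the logs silently changes:

```python
# Reproducing the loop's arithmetic with a stale checkpoint (index 20)
# after batch_size was changed from 10 to 50
start_index, total, batch_size = 20, 1000, 50

batch_num = (start_index // batch_size) + 1             # 1, was 3 before the change
total_batches = (total + batch_size - 1) // batch_size  # 20, was 100
print(
    f"Processing batch {batch_num}/{total_batches} "
    f"(tweets {start_index + 1}-{start_index + batch_size} of {total})"
)
# Processing batch 1/20 (tweets 21-70 of 1000)
```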
Checkpointing system
Checkpoints enable reliable resume after interruptions from crashes, Ctrl+C, or API quota exhaustion.
The checkpoint file (`data/checkpoint.txt`) stores a single integer: the index of the next tweet to process.
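For example, after five full batches at the default size of 10, the file might contain:

```console
$ cat data/checkpoint.txt
50
```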
This means tweets 0-49 have been processed, and processing will resume at index 50.
Checkpoint timing
Checkpoints are saved after each complete batch, not after each tweet. This means if you interrupt processing mid-batch, the entire batch will be re-processed on resume: with `batch_size=10`, interrupting after the seventh tweet of a batch means all ten tweets of that batch are analyzed again.
```python
for i in range(start_index, len(tweets), settings.batch_size):
    batch = tweets[i : i + settings.batch_size]

    # Process all tweets in batch
    for tweet in batch:
        result = self.analyzer.analyze(tweet)
        # ...

    # Only save checkpoint after entire batch completes
    checkpoint.save(i + len(batch))
```
Resume behavior
When you restart analysis, the tool automatically resumes from the saved checkpoint:
```console
# First run (interrupted during batch 3)
$ python src/main.py analyze-tweets
Processing batch 1/100 (tweets 1-10 of 1000)
Checkpoint saved at index 10
Processing batch 2/100 (tweets 11-20 of 1000)
Checkpoint saved at index 20
Processing batch 3/100 (tweets 21-30 of 1000)
^C  # User interrupts

# Second run (resumes from batch 3)
$ python src/main.py analyze-tweets
Resuming from tweet index 20  # Last completed checkpoint
Processing batch 3/100 (tweets 21-30 of 1000)
```
Retweets are skipped during processing but still count toward batch size and checkpoint indices. This is by design to maintain consistent indexing.
Choosing optimal batch size
Batch size affects three key factors:
| Factor | Small Batches (1-10) | Medium Batches (10-50) | Large Batches (50-100+) |
|---|---|---|---|
| Checkpoint frequency | Very frequent | Moderate | Infrequent |
| Resume precision | Minimal re-work | Some re-work | Significant re-work |
| Processing overhead | Higher (more I/O) | Balanced | Lower (less I/O) |
| Memory usage | Minimal | Low | Higher |
Recommendation by archive size
Small archives (< 1,000 tweets): Use batch_size=10 (default)
- Fast overall processing
- Checkpoint overhead is negligible
Medium archives (1,000-5,000 tweets): Use batch_size=25
- Good balance of speed and recoverability
- ~5 minutes of re-work if interrupted
Large archives (5,000+ tweets): Use batch_size=50
- Reduces checkpoint file I/O overhead
- Accept ~10 minutes of re-work on interruption
- Critical for multi-day processing with quota limits
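If you script your runs, these thresholds are easy to encode. A sketch of a hypothetical helper (`recommended_batch_size` is not part of the tool) mapping archive size to the recommendations above:

```python
def recommended_batch_size(total_tweets: int) -> int:
    """Hypothetical helper encoding the recommendations above."""
    if total_tweets < 1_000:
        return 10  # default: checkpoint overhead is negligible
    if total_tweets <= 5_000:
        return 25  # balance speed and recoverability
    return 50      # cut checkpoint I/O for large archives
```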
Error handling
Batch processing stops immediately if any tweet fails analysis.
```python
for tweet in batch:
    if _is_retweet(tweet):
        continue
    try:
        result = self.analyzer.analyze(tweet)
        # ...
    except Exception as e:
        logger.error(
            f"Failed to analyze tweet {tweet.id}: {e}", exc_info=True
        )
        return Result(
            success=False,
            count=analyzed_count,
            error_type="analysis_failed",
            error_message=str(e),
        )
```
If analysis fails mid-batch, the checkpoint is not saved. The entire batch will be re-processed when you resume.
Recovery from errors
1. **Identify the error.** Check the logs to understand why analysis failed:

   ```bash
   tail -n 50 logs/analysis.log
   ```

2. **Fix the issue.** Common fixes:
   - Check your internet connection
   - Verify the API key in `.env`
   - Increase `RATE_LIMIT_SECONDS` if you are hitting quota limits
   - Check for malformed tweets in the CSV

3. **Resume the analysis:**

   ```bash
   python src/main.py analyze-tweets
   ```

   Processing resumes from the last successful checkpoint.
Retweet handling
Retweets (content starting with `RT @`) are automatically skipped but still affect batch indexing.
```python
def _is_retweet(tweet) -> bool:
    return tweet.content.startswith("RT @")

# In processing loop:
for tweet in batch:
    if _is_retweet(tweet):
        continue  # Skip but don't decrement batch count
```
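A quick sanity check of that predicate, using `SimpleNamespace` as a stand-in for the tweet object (which is assumed to expose a `.content` attribute):

```python
from types import SimpleNamespace

# Assumes _is_retweet from the snippet above is in scope
assert _is_retweet(SimpleNamespace(content="RT @someone: great post"))
assert not _is_retweet(SimpleNamespace(content="My original tweet"))
```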
Retweets are skipped for three reasons:
- **Not your content:** retweets are others’ words, not yours
- **Bulk deletion:** most users delete all retweets at once via Twitter’s UI
- **API efficiency:** skipping them saves API quota for analyzing your original tweets
If you want to analyze retweets, remove the `_is_retweet()` check in `src/application.py:95-96`. Be aware this increases API costs and processing time.
Result writing
Results are written incrementally using append mode, not in batches.
```python
with CSVWriter(settings.processed_results_path, append=True) as writer:
    for i in range(start_index, len(tweets), settings.batch_size):
        batch = tweets[i : i + settings.batch_size]
        for tweet in batch:
            # ...
            if result.decision == Decision.DELETE:
                writer.write_result(result)  # Written immediately
```
File append behavior
- **First run:** creates `data/tweets/processed/results.csv` with headers
- **Resume runs:** appends new results without duplicating headers
- **Crash recovery:** already-written results are preserved
Even if analysis crashes mid-batch, any tweets flagged as DELETE before the crash are already saved in results.csv. Only unprocessed tweets in the batch need re-analysis.
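A minimal sketch of how an append-mode writer with this header behavior might be implemented (the class shape and column names are illustrative assumptions, not the tool's actual code):

```python
import csv
import os

class CSVWriter:
    """Hypothetical append-mode writer: emits the header row only when
    creating a new file, so resumed runs don't duplicate it."""

    def __init__(self, path: str, append: bool = True):
        self.path = path
        self.append = append

    def __enter__(self):
        is_new = not (self.append and os.path.exists(self.path))
        self._fh = open(self.path, "a" if self.append else "w", newline="")
        self._writer = csv.writer(self._fh)
        if is_new:
            # Column names are an assumption for illustration
            self._writer.writerow(["tweet_id", "url", "decision", "reason"])
        return self

    def write_result(self, result) -> None:
        self._writer.writerow(
            [result.tweet_id, result.url, result.decision, result.reason]
        )
        self._fh.flush()  # flushed immediately, so rows survive a crash

    def __exit__(self, *exc) -> None:
        self._fh.close()
```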
Manual checkpoint management
Resetting progress
To start analysis from scratch:
```bash
rm data/checkpoint.txt
rm data/tweets/processed/results.csv
python src/main.py analyze-tweets
```
Skipping to a specific position
To resume from a specific tweet index:
echo "500" > data/checkpoint.txt
python src/main.py analyze-tweets
This starts processing at tweet 500, skipping tweets 0-499.
Manually editing checkpoints can cause duplicate results in `results.csv` if you resume before an already-processed index. Only do this if you’ve also deleted the corresponding results.
I/O overhead
Each checkpoint write involves:
- Opening `data/checkpoint.txt`
- Writing the new index
- Closing the file
For large archives, this overhead becomes significant:
```
10,000 tweets / batch_size=10 = 1,000 checkpoint writes
10,000 tweets / batch_size=50 =   200 checkpoint writes
```
Increasing batch size from 10 to 50 reduces checkpoint I/O by 80% for large archives.
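These counts are simply the number of batches, i.e. ceiling division of archive size by batch size. As a quick check:

```python
import math

total = 10_000
for batch_size in (10, 50):
    print(batch_size, "->", math.ceil(total / batch_size), "checkpoint writes")
# 10 -> 1000 checkpoint writes
# 50 -> 200 checkpoint writes
```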
Progress visibility
Smaller batches provide more granular progress logging:
```console
# batch_size=10
Processing batch 1/1000 (tweets 1-10 of 10000)
Processing batch 2/1000 (tweets 11-20 of 10000)
# Updates every ~10 seconds

# batch_size=100
Processing batch 1/100 (tweets 1-100 of 10000)
Processing batch 2/100 (tweets 101-200 of 10000)
# Updates every ~100 seconds
```
Choose based on your preference for feedback frequency vs. I/O efficiency.
Monitoring progress
Track analysis progress in real-time:
```bash
# Watch the checkpoint file
watch -n 5 cat data/checkpoint.txt

# Count processed results (includes the CSV header row)
wc -l data/tweets/processed/results.csv

# Calculate progress percentage (replace 10000 with your archive size)
python -c "print(f'{int(open(\"data/checkpoint.txt\").read()) / 10000 * 100:.1f}%')"
```
Best practices
- Use the default batch size (10) unless you have a specific reason to change it
- Increase the batch size for archives > 5,000 tweets to reduce I/O overhead
- Never edit `checkpoint.txt` or `results.csv` while analysis is running
- Back up the checkpoint before manual edits: `cp data/checkpoint.txt data/checkpoint.txt.bak`
- Monitor logs to understand batch processing patterns and timing
- Plan for interruptions by choosing a batch size that balances speed with acceptable re-work time
Troubleshooting
Duplicate results in CSV
**Symptom:** the same tweet URL appears multiple times in `results.csv`

**Cause:** the checkpoint was manually reset without deleting the results file

**Solution:**

```bash
rm data/tweets/processed/results.csv
rm data/checkpoint.txt
python src/main.py analyze-tweets
```
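If re-running the whole analysis is too expensive, a deduplication pass is an alternative. This hypothetical snippet assumes `results.csv` has a header row including a `url` column (adjust to the file’s actual columns):

```python
import csv

PATH = "data/tweets/processed/results.csv"

# Keep the first row seen for each tweet URL
with open(PATH, newline="") as f:
    reader = csv.DictReader(f)
    rows, seen = [], set()
    for row in reader:
        if row["url"] not in seen:
            seen.add(row["url"])
            rows.append(row)

# Rewrite the file without the duplicates
with open(PATH, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=reader.fieldnames)
    writer.writeheader()
    writer.writerows(rows)
```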
Analysis seems stuck
**Symptom:** no log output for several minutes

**Cause:** a large batch size combined with slow API responses

**Solution:** wait for the batch to complete; checkpoints are only saved after a full batch. If needed, press Ctrl+C to interrupt and then resume; at most one batch of work is lost.
Checkpoint not saving
**Symptom:** after a resume, processing starts from 0 instead of the last position

**Cause:** a write-permissions issue on `data/checkpoint.txt`

**Solution:**

```bash
chmod 600 data/checkpoint.txt
ls -la data/checkpoint.txt  # Verify permissions
```