Learn how the checkpoint system works and how to safely resume analysis after interruptions or errors
The Tweet Audit Tool includes an automatic checkpoint system that saves progress after each batch. This allows you to safely interrupt and resume analysis without reprocessing tweets.
From storage.py:84-129, the Checkpoint class manages state:
```python
class Checkpoint:
    def __init__(self, file_path: str) -> None:
        self.path = file_path
        self.file = None

    def load(self) -> int:
        """Load checkpoint position, returns 0 if file doesn't exist"""
        if not self.file:
            raise RuntimeError("Checkpoint file is not open")
        self.file.seek(0)
        content = self.file.read().strip()
        if not content:
            return 0
        try:
            return int(content)
        except ValueError as e:
            raise ValueError(
                f"Corrupted checkpoint file {self.path}: expected integer, got '{content}'"
            ) from e

    def save(self, tweet_index: int) -> None:
        """Save current position to checkpoint file"""
        if not self.file:
            raise RuntimeError("Checkpoint file is not open")
        self.file.seek(0)
        self.file.truncate()
        self.file.write(str(tweet_index))
        self.file.flush()
```
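The class is used as a context manager (`with Checkpoint(...)`), so storage.py presumably also defines `__enter__`/`__exit__` to open and close the checkpoint file; those methods are not shown in the excerpt above. A minimal sketch of what they might look like, assuming the file is created on first use:

```python
import os

class Checkpoint:
    """Sketch of the context-manager half of Checkpoint (assumed, not from storage.py)."""

    def __init__(self, file_path: str) -> None:
        self.path = file_path
        self.file = None

    def __enter__(self) -> "Checkpoint":
        # Create the file if it's missing, then open it for read/write
        mode = "r+" if os.path.exists(self.path) else "w+"
        self.file = open(self.path, mode)
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        if self.file:
            self.file.close()
            self.file = None

    def load(self) -> int:
        self.file.seek(0)
        content = self.file.read().strip()
        return int(content) if content else 0

    def save(self, tweet_index: int) -> None:
        self.file.seek(0)
        self.file.truncate()
        self.file.write(str(tweet_index))
        self.file.flush()
```

With this shape, the checkpoint file is guaranteed to be closed even if a batch raises mid-run, which is what makes interruption safe.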
Checkpoints are saved after every batch completes. From application.py:77-117:
```python
with Checkpoint(settings.checkpoint_path) as checkpoint:
    start_index = checkpoint.load()
    logger.info(f"Resuming from tweet index {start_index}")

    with CSVWriter(settings.processed_results_path, append=True) as writer:
        for i in range(start_index, len(tweets), settings.batch_size):
            batch = tweets[i : i + settings.batch_size]
            # ... process batch ...

            # Checkpoint saved after each batch
            checkpoint.save(i + len(batch))
            logger.info(f"Checkpoint saved at index {i + len(batch)}")
```
With the default batch size of 10, checkpoints are saved every 10 tweets.
The results CSV is opened in append mode during resume:
```python
with CSVWriter(settings.processed_results_path, append=True) as writer:
```
This means:
✅ Previous results are preserved
✅ New results are added to the end
✅ No duplicates (each tweet processed once)
✅ Header written only if file doesn’t exist
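The resume guarantee can be simulated end-to-end: process a few batches, "crash", then resume from the saved index and confirm each tweet is written exactly once. This standalone sketch stands in plain lists and a dict for the real CSVWriter and Checkpoint:

```python
def run(tweets, results, checkpoint, batch_size=10, crash_after_batches=None):
    """Process tweets starting at the checkpointed index; optionally 'crash' early."""
    start = checkpoint["index"]
    batches_done = 0
    for i in range(start, len(tweets), batch_size):
        batch = tweets[i : i + batch_size]
        results.extend(batch)                 # append mode: earlier results preserved
        checkpoint["index"] = i + len(batch)  # saved after every batch
        batches_done += 1
        if crash_after_batches and batches_done == crash_after_batches:
            return  # simulate an interruption mid-run

tweets = list(range(35))
results, checkpoint = [], {"index": 0}
run(tweets, results, checkpoint, crash_after_batches=2)  # interrupted at index 20
run(tweets, results, checkpoint)                         # resume: picks up at 20
assert results == tweets                   # every tweet present
assert len(set(results)) == len(results)   # no duplicates
```

Because the resumed loop starts at the checkpointed index, nothing before it is reprocessed, which is why appending never produces duplicate rows.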
From storage.py:172-181:
```python
def write_result(self, result: AnalysisResult) -> None:
    if not self.writer:
        raise RuntimeError("CSVWriter is not open")
    if not self.header_written:
        self.writer.writerow([RESULT_CSV_URL_COLUMN, RESULT_CSV_DELETED_COLUMN])
        self.header_written = True
    self.writer.writerow([result.tweet_url, CSV_BOOL_FALSE])
    self.file.flush()
```
Wait for “Checkpoint saved” log before interrupting
Don’t force-kill the process mid-batch
Use Ctrl+C for graceful shutdown
```shell
# ✅ Good: Wait for the log line
2024-01-15 12:30:10 - application - INFO - Checkpoint saved at index 120
# Now safe to press Ctrl+C

# ❌ Bad: Kill immediately
kill -9 <pid>  # May lose batch progress
```
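One way the "Ctrl+C is graceful" behavior can be implemented (an assumption, not shown in the source) is to trap SIGINT, finish the in-flight batch, save the checkpoint, and only then exit:

```python
import signal

class GracefulShutdown:
    """Set a flag on Ctrl+C instead of dying mid-batch (illustrative sketch)."""

    def __init__(self) -> None:
        self.requested = False
        signal.signal(signal.SIGINT, self._handler)

    def _handler(self, signum, frame) -> None:
        # Record the request; the batch loop checks it between batches,
        # so the interrupt never lands mid-batch
        self.requested = True

# Hypothetical usage in the batch loop:
# shutdown = GracefulShutdown()
# for i in range(start_index, len(tweets), batch_size):
#     process(tweets[i : i + batch_size])
#     checkpoint.save(i + batch_size)
#     if shutdown.requested:
#         break  # exit cleanly, after the checkpoint is saved
```

The key design point is that the signal handler only sets a flag; all real work (saving the checkpoint, closing files) stays in the main loop.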
Monitor progress regularly
Check checkpoint periodically during long runs:
```shell
# In another terminal while analysis runs
watch -n 30 'cat data/checkpoint.txt'

# Or check progress percentage
watch -n 30 'echo "scale=1; $(cat data/checkpoint.txt) * 100 / 1523" | bc'
```
Backup before manual edits
Before modifying checkpoint or results:
```shell
# Backup everything
cp data/checkpoint.txt checkpoint-backup.txt
cp data/tweets/processed/results.csv results-backup.csv

# Now safe to experiment
echo "500" > data/checkpoint.txt
```
Use appropriate batch sizes
Balance checkpoint frequency vs. performance:
```python
# In config.py:
batch_size: int = 10  # More frequent checkpoints
# vs
batch_size: int = 50  # Less frequent checkpoints
```
Trade-offs:
Smaller batches: More checkpoints, less re-work after an interruption, slower processing
Larger batches: Fewer checkpoints, more re-work after an interruption, faster processing
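The trade-off can be put in numbers: a crash redoes everything after the last checkpoint, so the worst case is one full batch of re-work, while every batch costs one checkpoint write. A rough sketch (the 1,523-tweet total from the monitoring example is reused for illustration):

```python
def checkpoint_tradeoff(total_tweets: int, batch_size: int) -> dict:
    """Checkpoint writes incurred vs. worst-case tweets reprocessed after a crash."""
    checkpoints = -(-total_tweets // batch_size)  # ceiling division: one save per batch
    return {
        "checkpoint_writes": checkpoints,
        "worst_case_reprocessed": min(batch_size, total_tweets),
    }

print(checkpoint_tradeoff(1523, 10))  # many saves, at most 10 tweets redone
print(checkpoint_tradeoff(1523, 50))  # fewer saves, up to 50 tweets redone
```

With the default batch size of 10, checkpoint overhead is 153 small file writes across the whole run, which is negligible next to the analysis itself, so smaller batches are usually the safer default.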
```shell
# Day 1: Process first 500 tweets
python src/main.py analyze-tweets
# Let it run to ~500, then Ctrl+C

# Day 2: Continue from 500
python src/main.py analyze-tweets
# Process another 500

# Day 3: Finish remaining tweets
python src/main.py analyze-tweets
```
The checkpoint system does NOT support parallel processing. Running multiple analysis processes simultaneously will cause conflicts.
If you need faster processing:
Split your CSV into multiple files manually
Create separate project directories for each
Run separate analysis processes
Combine results afterward
```shell
# Don't do this (will conflict):
python src/main.py analyze-tweets &  # Process 1
python src/main.py analyze-tweets &  # Process 2 (conflicts!)

# Instead, split manually:
head -n 500 tweets.csv > tweets-part1.csv
tail -n +501 tweets.csv > tweets-part2.csv
# Process each in separate directory
```
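For the "combine results afterward" step, a small script that merges the per-directory results CSVs, keeping a single header and dropping any duplicate URLs, might look like this (the file names and the assumption that the URL is the first column are illustrative):

```python
import csv

def combine_results(part_paths, out_path):
    """Merge several results CSVs: one header, deduplicated by URL (first column)."""
    seen = set()
    header = None
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for path in part_paths:
            with open(path, newline="") as f:
                reader = csv.reader(f)
                part_header = next(reader, None)  # each part has its own header row
                if header is None and part_header:
                    header = part_header
                    writer.writerow(header)       # write the header only once
                for row in reader:
                    if row and row[0] not in seen:
                        seen.add(row[0])
                        writer.writerow(row)

# Hypothetical usage after both runs finish:
# combine_results(
#     ["part1/data/tweets/processed/results.csv",
#      "part2/data/tweets/processed/results.csv"],
#     "combined-results.csv",
# )
```

Deduplicating by URL also guards against the edge case where the split ranges overlap by a row.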