The Tweet Audit Tool processes tweets in configurable batches to balance performance with recoverability. Understanding batch processing helps you optimize analysis speed and handle interruptions gracefully.
How batch processing works
Tweets are processed in sequential batches from your CSV file. After each batch completes, progress is saved to a checkpoint file.
```python
for i in range(start_index, len(tweets), settings.batch_size):
    batch = tweets[i : i + settings.batch_size]
    batch_num = (i // settings.batch_size) + 1
    total_batches = (len(tweets) + settings.batch_size - 1) // settings.batch_size

    logger.info(
        f"Processing batch {batch_num}/{total_batches} "
        f"(tweets {i + 1}-{min(i + len(batch), len(tweets))} of {len(tweets)})"
    )

    for tweet in batch:
        ...  # process each tweet

    checkpoint.save(i + len(batch))
    logger.info(f"Checkpoint saved at index {i + len(batch)}")
```
Batch lifecycle

1. **Load checkpoint.** On startup, read `data/checkpoint.txt` to find the resume point:

   ```python
   start_index = checkpoint.load()  # Returns 0 if no checkpoint exists
   ```

2. **Create batch slice.** Extract the next `batch_size` tweets starting from `start_index`:

   ```python
   batch = tweets[i : i + settings.batch_size]
   ```

3. **Process tweets sequentially.** Analyze each tweet in the batch, skipping retweets:

   ```python
   for tweet in batch:
       if _is_retweet(tweet):
           continue
       result = self.analyzer.analyze(tweet)
   ```

4. **Write results.** Append flagged tweets to `results.csv` as they’re analyzed:

   ```python
   if result.decision == Decision.DELETE:
       writer.write_result(result)
   ```

5. **Save checkpoint.** After the entire batch completes, save progress:

   ```python
   checkpoint.save(i + len(batch))
   ```

6. **Repeat.** Move to the next batch until all tweets are processed.
Configuration
Default batch size
The default batch size is 10 tweets per batch, configured in `src/config.py`:
```python
from dataclasses import dataclass

@dataclass
class Settings:
    # ... other settings
    batch_size: int = 10
```
Changing batch size
To modify the batch size, edit `src/config.py`:
```python
batch_size: int = 50  # Process 50 tweets per batch
```
Changing `batch_size` during an active analysis can cause unexpected resume behavior. Always complete or reset your analysis before changing this value.
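To see one way this shows up, trace the logging arithmetic from the first snippet: suppose a run checkpointed at index 20 with `batch_size=10`, and you raise `batch_size` to 50 before resuming. Judging from the loop shown above, no tweets are skipped or repeated (processing still starts at the saved index), but the batch numbering in the logs silently changes:

```python
# Reproducing the loop's arithmetic with a stale checkpoint (index 20)
# after batch_size was changed from 10 to 50
start_index, total, batch_size = 20, 1000, 50

batch_num = (start_index // batch_size) + 1             # 1, was 3 before the change
total_batches = (total + batch_size - 1) // batch_size  # 20, was 100
print(
    f"Processing batch {batch_num}/{total_batches} "
    f"(tweets {start_index + 1}-{start_index + batch_size} of {total})"
)
# Processing batch 1/20 (tweets 21-70 of 1000)
```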
Checkpointing system
Checkpoints enable reliable resume after interruptions from crashes, Ctrl+C, or API quota exhaustion.
The checkpoint file (`data/checkpoint.txt`) stores a single integer: the index of the next tweet to process.
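For example, after five full batches at the default size of 10, the file might contain:

```console
$ cat data/checkpoint.txt
50
```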
This means tweets 0-49 have been processed, and processing will resume at index 50.
Checkpoint timing
Checkpoints are saved after each complete batch, not after each tweet. This means if you interrupt processing mid-batch, the entire batch will be re-processed on resume: with `batch_size=10`, interrupting after the seventh tweet of a batch means all ten tweets of that batch are analyzed again.
```python
for i in range(start_index, len(tweets), settings.batch_size):
    batch = tweets[i : i + settings.batch_size]

    # Process all tweets in batch
    for tweet in batch:
        result = self.analyzer.analyze(tweet)
        # ...

    # Only save checkpoint after entire batch completes
    checkpoint.save(i + len(batch))
```
Resume behavior
When you restart analysis, the tool automatically resumes from the saved checkpoint:
```console
# First run (interrupted during batch 3)
$ python src/main.py analyze-tweets
Processing batch 1/100 (tweets 1-10 of 1000)
Checkpoint saved at index 10
Processing batch 2/100 (tweets 11-20 of 1000)
Checkpoint saved at index 20
Processing batch 3/100 (tweets 21-30 of 1000)
^C  # User interrupts

# Second run (resumes from batch 3)
$ python src/main.py analyze-tweets
Resuming from tweet index 20  # Last completed checkpoint
Processing batch 3/100 (tweets 21-30 of 1000)
```
Retweets are skipped during processing but still count toward batch size and checkpoint indices. This is by design to maintain consistent indexing.
Choosing optimal batch size
Batch size affects three key factors:
| Factor | Small Batches (1-10) | Medium Batches (10-50) | Large Batches (50-100+) |
|---|---|---|---|
| Checkpoint frequency | Very frequent | Moderate | Infrequent |
| Resume precision | Minimal re-work | Some re-work | Significant re-work |
| Processing overhead | Higher (more I/O) | Balanced | Lower (less I/O) |
| Memory usage | Minimal | Low | Higher |
Recommendation by archive size
Small archives (< 1,000 tweets): Use batch_size=10 (default)
- Fast overall processing
- Checkpoint overhead is negligible
Medium archives (1,000-5,000 tweets): Use batch_size=25
- Good balance of speed and recoverability
- ~5 minutes of re-work if interrupted
Large archives (5,000+ tweets): Use batch_size=50
- Reduces checkpoint file I/O overhead
- Accept ~10 minutes of re-work on interruption
- Critical for multi-day processing with quota limits
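If you script your runs, these thresholds are easy to encode. A sketch of a hypothetical helper (`recommended_batch_size` is not part of the tool) mapping archive size to the recommendations above:

```python
def recommended_batch_size(total_tweets: int) -> int:
    """Hypothetical helper encoding the recommendations above."""
    if total_tweets < 1_000:
        return 10  # default: checkpoint overhead is negligible
    if total_tweets <= 5_000:
        return 25  # balance speed and recoverability
    return 50      # cut checkpoint I/O for large archives
```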
Error handling
Batch processing stops immediately if any tweet fails analysis.
```python
for tweet in batch:
    if _is_retweet(tweet):
        continue
    try:
        result = self.analyzer.analyze(tweet)
        # ...
    except Exception as e:
        logger.error(
            f"Failed to analyze tweet {tweet.id}: {e}", exc_info=True
        )
        return Result(
            success=False,
            count=analyzed_count,
            error_type="analysis_failed",
            error_message=str(e),
        )
```
If analysis fails mid-batch, the checkpoint is not saved. The entire batch will be re-processed when you resume.
Recovery from errors
1. **Identify the error.** Check the logs to understand why analysis failed:

   ```bash
   tail -n 50 logs/analysis.log
   ```

2. **Fix the issue.** Common fixes:
   - Check your internet connection
   - Verify the API key in `.env`
   - Increase `RATE_LIMIT_SECONDS` if you are hitting quota limits
   - Check for malformed tweets in the CSV

3. **Resume the analysis:**

   ```bash
   python src/main.py analyze-tweets
   ```

   Processing resumes from the last successful checkpoint.
Retweet handling
Retweets (content starting with `RT @`) are automatically skipped but still affect batch indexing.
```python
def _is_retweet(tweet) -> bool:
    return tweet.content.startswith("RT @")

# In processing loop:
for tweet in batch:
    if _is_retweet(tweet):
        continue  # Skip but don't decrement batch count
```
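A quick sanity check of that predicate, using `SimpleNamespace` as a stand-in for the tweet object (which is assumed to expose a `.content` attribute):

```python
from types import SimpleNamespace

# Assumes _is_retweet from the snippet above is in scope
assert _is_retweet(SimpleNamespace(content="RT @someone: great post"))
assert not _is_retweet(SimpleNamespace(content="My original tweet"))
```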
Retweets are skipped for three reasons:
- **Not your content:** retweets are others’ words, not yours
- **Bulk deletion:** most users delete all retweets at once via Twitter’s UI
- **API efficiency:** skipping them saves API quota for analyzing your original tweets
If you want to analyze retweets, remove the `_is_retweet()` check in `src/application.py:95-96`. Be aware this increases API costs and processing time.
Result writing
Results are written incrementally using append mode, not in batches.
```python
with CSVWriter(settings.processed_results_path, append=True) as writer:
    for i in range(start_index, len(tweets), settings.batch_size):
        batch = tweets[i : i + settings.batch_size]
        for tweet in batch:
            # ...
            if result.decision == Decision.DELETE:
                writer.write_result(result)  # Written immediately
```
File append behavior
- **First run:** creates `data/tweets/processed/results.csv` with headers
- **Resume runs:** appends new results without duplicating headers
- **Crash recovery:** already-written results are preserved
Even if analysis crashes mid-batch, any tweets flagged as DELETE before the crash are already saved in results.csv. Only unprocessed tweets in the batch need re-analysis.
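A minimal sketch of how an append-mode writer with this header behavior might be implemented (the class shape and column names are illustrative assumptions, not the tool's actual code):

```python
import csv
import os

class CSVWriter:
    """Hypothetical append-mode writer: emits the header row only when
    creating a new file, so resumed runs don't duplicate it."""

    def __init__(self, path: str, append: bool = True):
        self.path = path
        self.append = append

    def __enter__(self):
        is_new = not (self.append and os.path.exists(self.path))
        self._fh = open(self.path, "a" if self.append else "w", newline="")
        self._writer = csv.writer(self._fh)
        if is_new:
            # Column names are an assumption for illustration
            self._writer.writerow(["tweet_id", "url", "decision", "reason"])
        return self

    def write_result(self, result) -> None:
        self._writer.writerow(
            [result.tweet_id, result.url, result.decision, result.reason]
        )
        self._fh.flush()  # flushed immediately, so rows survive a crash

    def __exit__(self, *exc) -> None:
        self._fh.close()
```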
Manual checkpoint management
Resetting progress
To start analysis from scratch:
```bash
rm data/checkpoint.txt
rm data/tweets/processed/results.csv
python src/main.py analyze-tweets
```
Skipping to a specific position
To resume from a specific tweet index:
echo "500" > data/checkpoint.txt
python src/main.py analyze-tweets
This starts processing at tweet 500, skipping tweets 0-499.
Manually editing checkpoints can cause duplicate results in `results.csv` if you resume before an already-processed index. Only do this if you’ve also deleted the corresponding results.
I/O overhead
Each checkpoint write involves:
- Opening `data/checkpoint.txt`
- Writing the new index
- Closing the file
For large archives, this overhead becomes significant:
```
10,000 tweets / batch_size=10 = 1,000 checkpoint writes
10,000 tweets / batch_size=50 =   200 checkpoint writes
```
Increasing batch size from 10 to 50 reduces checkpoint I/O by 80% for large archives.
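These counts are simply the number of batches, i.e. ceiling division of archive size by batch size. As a quick check:

```python
import math

total = 10_000
for batch_size in (10, 50):
    print(batch_size, "->", math.ceil(total / batch_size), "checkpoint writes")
# 10 -> 1000 checkpoint writes
# 50 -> 200 checkpoint writes
```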
Progress visibility
Smaller batches provide more granular progress logging:
```console
# batch_size=10
Processing batch 1/1000 (tweets 1-10 of 10000)
Processing batch 2/1000 (tweets 11-20 of 10000)
# Updates every ~10 seconds

# batch_size=100
Processing batch 1/100 (tweets 1-100 of 10000)
Processing batch 2/100 (tweets 101-200 of 10000)
# Updates every ~100 seconds
```
Choose based on your preference for feedback frequency vs. I/O efficiency.
Monitoring progress
Track analysis progress in real-time:
```bash
# Watch the checkpoint file
watch -n 5 cat data/checkpoint.txt

# Count processed results (includes the CSV header row)
wc -l data/tweets/processed/results.csv

# Calculate progress percentage (replace 10000 with your archive size)
python -c "print(f'{int(open(\"data/checkpoint.txt\").read()) / 10000 * 100:.1f}%')"
```
Best practices
- Use the default batch size (10) unless you have a specific reason to change it
- Increase the batch size for archives > 5,000 tweets to reduce I/O overhead
- Never edit `checkpoint.txt` or `results.csv` while analysis is running
- Back up the checkpoint before manual edits: `cp data/checkpoint.txt data/checkpoint.txt.bak`
- Monitor logs to understand batch processing patterns and timing
- Plan for interruptions by choosing a batch size that balances speed with acceptable re-work time
Troubleshooting
Duplicate results in CSV
**Symptom:** the same tweet URL appears multiple times in `results.csv`

**Cause:** the checkpoint was manually reset without deleting the results file

**Solution:**

```bash
rm data/tweets/processed/results.csv
rm data/checkpoint.txt
python src/main.py analyze-tweets
```
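If re-running the whole analysis is too expensive, a deduplication pass is an alternative. This hypothetical snippet assumes `results.csv` has a header row including a `url` column (adjust to the file’s actual columns):

```python
import csv

PATH = "data/tweets/processed/results.csv"

# Keep the first row seen for each tweet URL
with open(PATH, newline="") as f:
    reader = csv.DictReader(f)
    rows, seen = [], set()
    for row in reader:
        if row["url"] not in seen:
            seen.add(row["url"])
            rows.append(row)

# Rewrite the file without the duplicates
with open(PATH, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=reader.fieldnames)
    writer.writeheader()
    writer.writerows(rows)
```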
Analysis seems stuck
**Symptom:** no log output for several minutes

**Cause:** a large batch size combined with slow API responses

**Solution:** wait for the batch to complete; checkpoints are only saved after a full batch. If needed, press Ctrl+C to interrupt and then resume; at most one batch of work is lost.
Checkpoint not saving
**Symptom:** after a resume, processing starts from 0 instead of the last position

**Cause:** a write-permissions issue on `data/checkpoint.txt`

**Solution:**

```bash
chmod 600 data/checkpoint.txt
ls -la data/checkpoint.txt  # Verify permissions
```