Once you’ve extracted tweets from your archive, you can analyze them using Google’s Gemini AI to identify which tweets no longer align with your values.

Prerequisites

Before analyzing, ensure you have:
  • ✅ Extracted tweets to data/tweets/transformed/tweets.csv
  • ✅ Set GEMINI_API_KEY in your .env file
  • ✅ Configured analysis criteria in config.json (optional)
If you haven’t extracted tweets yet, see the Extracting tweets guide.
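The criteria fields read by the prompt builder (shown later in this guide) are topics_to_exclude, tone_requirements, forbidden_words, and additional_instructions. A hypothetical config.json shape, assuming the criteria live under a criteria key — check your actual file for the exact layout:

```json
{
  "criteria": {
    "topics_to_exclude": ["Outdated political opinions"],
    "tone_requirements": ["Professional language only", "Respectful communication"],
    "forbidden_words": ["crypto", "NFT", "web3"],
    "additional_instructions": "Flag any content that could harm professional reputation"
  }
}
```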

Running the analysis

Start the analysis with a single command:
python src/main.py analyze-tweets

What happens during analysis

The analysis process follows these steps from application.py:64-122:
1. Load extracted tweets

The tool reads your transformed CSV file:
logger.info(f"Loading tweets from {settings.transformed_tweets_path}")
parser = CSVParser(settings.transformed_tweets_path)
tweets = parser.parse()

if not tweets:
    logger.warning("No tweets found to analyze")
    return Result(success=True, count=0)

logger.info(f"Loaded {len(tweets)} tweets for analysis")
2. Initialize checkpoint system

The checkpoint allows resuming if interrupted:
with Checkpoint(settings.checkpoint_path) as checkpoint:
    start_index = checkpoint.load()
    logger.info(f"Resuming from tweet index {start_index}")
On first run, start_index is 0. On subsequent runs, it’s the last saved position.
3. Process tweets in batches

Tweets are analyzed in batches (default: 10 tweets per batch):
for i in range(start_index, len(tweets), settings.batch_size):
    batch = tweets[i : i + settings.batch_size]
    batch_num = (i // settings.batch_size) + 1
    total_batches = (len(tweets) + settings.batch_size - 1) // settings.batch_size

    logger.info(
        f"Processing batch {batch_num}/{total_batches} "
        f"(tweets {i + 1}-{min(i + len(batch), len(tweets))} of {len(tweets)})"
    )
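The ceiling division that computes total_batches can be sanity-checked against the 1,523-tweet run shown in the expected output:

```python
n_tweets = 1523
batch_size = 10  # default

# Ceiling division: any partial final batch still counts as a batch
total_batches = (n_tweets + batch_size - 1) // batch_size
print(total_batches)  # 153, matching "Processing batch 1/153" in the logs
```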
4. Skip retweets

Retweets are automatically skipped:
for tweet in batch:
    if _is_retweet(tweet):
        continue
From application.py:125-126:
def _is_retweet(tweet) -> bool:
    return tweet.content.startswith("RT @")
5. Analyze each tweet

Each tweet is sent to Gemini AI for analysis:
try:
    result = self.analyzer.analyze(tweet)
    logger.debug(f"Tweet {tweet.id}: {result.decision.value}")
    analyzed_count += 1

    if result.decision == Decision.DELETE:
        writer.write_result(result)
except Exception as e:
    logger.error(f"Failed to analyze tweet {tweet.id}: {e}")
    return Result(
        success=False,
        count=analyzed_count,
        error_type="analysis_failed",
        error_message=str(e),
    )
6. Save checkpoint

After each batch, progress is saved:
checkpoint.save(i + len(batch))
logger.info(f"Checkpoint saved at index {i + len(batch)}")

Expected output

During analysis, you’ll see progress logs like this:
Analyzing tweets...
2024-01-15 11:00:00 - application - INFO - Loading tweets from data/tweets/transformed/tweets.csv
2024-01-15 11:00:00 - application - INFO - Loaded 1523 tweets for analysis
2024-01-15 11:00:00 - application - INFO - Resuming from tweet index 0
2024-01-15 11:00:00 - __main__ - INFO - Gemini analyzer initialized
2024-01-15 11:00:01 - application - INFO - Processing batch 1/153 (tweets 1-10 of 1523)
2024-01-15 11:00:12 - application - INFO - Checkpoint saved at index 10
2024-01-15 11:00:12 - application - INFO - Processing batch 2/153 (tweets 11-20 of 1523)
2024-01-15 11:00:23 - application - INFO - Checkpoint saved at index 20
...

How the AI analysis works

The Gemini analyzer sends each tweet with a custom prompt built from your criteria.

Prompt construction

From analyzer.py:107-135, the prompt is built like this:
def _build_prompt(self, tweet: Tweet) -> str:
    criteria_parts = []

    # Add topics and tone requirements
    criteria_parts.extend(settings.criteria.topics_to_exclude)
    criteria_parts.extend(settings.criteria.tone_requirements)

    # Add forbidden words
    if settings.criteria.forbidden_words:
        words = ", ".join(settings.criteria.forbidden_words)
        criteria_parts.append(f"Contains any of these words: {words}")

    # Format as numbered list
    criteria_list = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(criteria_parts))

    # Add additional instructions
    additional = ""
    if settings.criteria.additional_instructions:
        additional = f"\n\nAdditional guidance: {settings.criteria.additional_instructions}"

    return f"""You are evaluating tweets for a professional's Twitter cleanup.

Tweet ID: {tweet.id}
Tweet: "{tweet.content}"

Mark for deletion if it violates any of these criteria:
{criteria_list}{additional}

Respond in JSON format:
{{
  "decision": "DELETE" or "KEEP",
  "reason": "brief explanation"
}}"""

Example prompt

With default criteria, a tweet gets this prompt:
You are evaluating tweets for a professional's Twitter cleanup.

Tweet ID: 1234567890
Tweet: "This crypto project is going to the moon! 🚀"

Mark for deletion if it violates any of these criteria:
1. Profanity or unprofessional language
2. Personal attacks or insults
3. Outdated political opinions
4. Professional language only
5. Respectful communication
6. Contains any of these words: crypto, NFT, web3

Additional guidance: Flag any content that could harm professional reputation

Respond in JSON format:
{
  "decision": "DELETE" or "KEEP",
  "reason": "brief explanation"
}

API request

The prompt is sent to Gemini from analyzer.py:70-80:
@retry_with_backoff(max_retries=3, initial_delay=1.0)
def analyze(self, tweet: Tweet) -> AnalysisResult:
    self._rate_limit()  # Enforce rate limiting before API call

    prompt = self._build_prompt(tweet)

    response = self.client.models.generate_content(
        model=self.model,
        contents=prompt,
        config=genai.types.GenerateContentConfigDict(response_mime_type="application/json"),
    )
The response_mime_type="application/json" ensures Gemini returns structured JSON we can parse.

Response parsing

Gemini returns a JSON decision that’s parsed and validated:
if not response.text:
    raise ValueError(f"Empty response from Gemini for tweet {tweet.id}")

try:
    data = json.loads(response.text)
except json.JSONDecodeError as e:
    raise ValueError(
        f"Invalid Gemini response for tweet {tweet.id}: {e}"
    ) from e

try:
    decision = Decision(data["decision"].upper())
except KeyError as e:
    raise ValueError(
        f"Missing decision field in Gemini response for tweet {tweet.id}"
    ) from e

return AnalysisResult(tweet_url=settings.tweet_url(tweet.id), decision=decision)

Rate limiting and retry logic

Rate limiting

To avoid hitting API limits, the analyzer enforces delays between requests from analyzer.py:64-68:
def _rate_limit(self) -> None:
    elapsed = time.time() - self.last_request_time
    if elapsed < self.min_request_interval:
        time.sleep(self.min_request_interval - elapsed)
    self.last_request_time = time.time()
The default interval is 1.0 seconds (configurable via RATE_LIMIT_SECONDS).

Automatic retries

Transient errors trigger automatic retries with exponential backoff from analyzer.py:11-49:
def retry_with_backoff(max_retries: int = 3, initial_delay: float = 1.0):
    """Retry decorator with exponential backoff for transient errors"""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            delay = initial_delay
            last_exception = None

            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    last_exception = e
                    error_str = str(e).lower()
                    is_retryable = any(
                        keyword in error_str
                        for keyword in [
                            "timeout",
                            "connection",
                            "rate limit",
                            "quota",
                            "503",
                            "429",
                            "temporarily unavailable",
                        ]
                    )

                    if not is_retryable or attempt == max_retries - 1:
                        raise

                    sleep_time = delay * (2**attempt) + (time.time() % 1)
                    time.sleep(sleep_time)

            raise last_exception

        return wrapper

    return decorator
Retryable errors include:
  • Timeouts
  • Connection issues
  • Rate limit errors (429)
  • Server errors (503)
  • Quota exceeded
The retry logic automatically handles temporary API issues. You don’t need to manually restart.

Analysis results

Output file

Tweets flagged for deletion are written to data/tweets/processed/results.csv:
tweet_url,deleted
https://x.com/username/status/1234567890,false
https://x.com/username/status/9876543210,false
https://x.com/username/status/5555555555,false
From storage.py:172-181:
def write_result(self, result: AnalysisResult) -> None:
    if not self.writer:
        raise RuntimeError("CSVWriter is not open")

    if not self.header_written:
        self.writer.writerow([RESULT_CSV_URL_COLUMN, RESULT_CSV_DELETED_COLUMN])
        self.header_written = True

    self.writer.writerow([result.tweet_url, CSV_BOOL_FALSE])
    self.file.flush()
Only tweets with Decision.DELETE are written to the results file. Tweets marked KEEP are not recorded.

The deleted column

The deleted column starts as false for all tweets. This is for manual tracking:
  1. Review each tweet URL in your browser
  2. Decide if you agree with the AI’s assessment
  3. If you delete it, update the row to deleted=true
  4. Track your cleanup progress
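Rather than editing the CSV by hand, you can flip the deleted flag with a short script. A sketch using Python's csv module — the column names match the output shown above, but the set of URLs you actually deleted is your own input:

```python
import csv
import io

def mark_deleted(csv_text: str, deleted_urls: set) -> str:
    """Set deleted=true for every row whose tweet_url is in deleted_urls."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        if row["tweet_url"] in deleted_urls:
            row["deleted"] = "true"
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["tweet_url", "deleted"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

results = """tweet_url,deleted
https://x.com/username/status/1234567890,false
https://x.com/username/status/9876543210,false
"""
updated = mark_deleted(results, {"https://x.com/username/status/1234567890"})
print(updated)
```

In practice you would read data/tweets/processed/results.csv, pass its contents through mark_deleted, and write the result back.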

Completion summary

When analysis finishes, you’ll see:
2024-01-15 11:45:30 - application - INFO - Analysis complete. Results written to data/tweets/processed/results.csv
Successfully analyzed 1523 tweets
Only original tweets are counted. Retweets (starting with “RT @”) are skipped and not included in the analyzed count. From application.py:94-96:
for tweet in batch:
    if _is_retweet(tweet):
        continue

Performance estimates

Processing time

With default settings:
  • Rate limit: 1 second per tweet
  • Batch size: 10 tweets
  • Checkpoint frequency: Every 10 tweets
For 1,000 tweets:
  • Time: ~17 minutes (1 second × 1,000 tweets)
  • API calls: ~1,000 requests
  • Cost: Free within Gemini’s daily limits
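The time estimate is just rate-limit arithmetic; a quick calculation under the default 1.0-second interval:

```python
n_tweets = 1000
rate_limit_seconds = 1.0  # default RATE_LIMIT_SECONDS

# Each tweet costs at least one rate-limited API call
total_seconds = n_tweets * rate_limit_seconds
print(f"~{total_seconds / 60:.0f} minutes")  # ~17 minutes
```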

Scaling considerations

For a typical personal archive (roughly 1,000–2,000 tweets):
  • Processing time: 10–20 minutes
  • API usage: well within the free tier
  • Recommendation: use the default settings

Monitoring progress

Log verbosity

Control detail level with LOG_LEVEL in .env:
# See every tweet's decision
LOG_LEVEL=DEBUG

# See batch progress only (recommended)
LOG_LEVEL=INFO

# See only warnings and errors
LOG_LEVEL=WARNING

Checking intermediate results

While analysis runs, you can check progress:
# Count flagged tweets so far
wc -l data/tweets/processed/results.csv

# View latest flagged tweets
tail -20 data/tweets/processed/results.csv

# Check checkpoint position
cat data/checkpoint.txt

Interrupting analysis

You can safely stop analysis at any time:
  • Ctrl+C: Graceful interruption
  • System crash: Progress saved after each batch
  • API quota exceeded: Automatically stops with error
To resume, just run the same command again:
python src/main.py analyze-tweets
See the Resume interrupted analysis guide for details on checkpoint behavior.

Troubleshooting analysis

No tweets found

Error: CSV file not found: data/tweets/transformed/tweets.csv
Solution: Run extraction first:
python src/main.py extract-tweets

Missing API key

Error: GEMINI_API_KEY is required
Solution: Add your API key to .env:
echo "GEMINI_API_KEY=your_key_here" >> .env

Rate limit exceeded

Failed to analyze tweet 123456: 429 Quota exceeded
Solution: The tool will retry automatically. If it persists:
  1. Increase rate limit delay:
    RATE_LIMIT_SECONDS=2.0
    
  2. Wait 24 hours for quota reset
  3. Or upgrade to paid Gemini API

Analysis fails on specific tweet

Error: Failed to analyze tweet 987654321: Invalid response
Solution: The analysis stops on errors. Check:
  1. Tweet content might have special characters causing parsing issues
  2. Review the tweet manually at: https://x.com/username/status/987654321
  3. Skip it by setting the checkpoint just past the failing tweet's index:
    echo "10" > data/checkpoint.txt  # e.g., if the failing tweet is at index 9
    

Next steps

After analysis completes:

Review results

Understand the results CSV and deletion workflow

Customize criteria

Fine-tune analysis criteria for better results

Resume analysis

Learn about checkpoint and resume features
