Before analyzing your entire Twitter archive, test your criteria configuration on a small sample to ensure the AI interprets your rules as intended. This prevents wasting API quota and time on poorly-tuned criteria.
Why test criteria?
AI interpretation of deletion criteria is nuanced and can produce unexpected results:
- Too strict: Flags tweets you want to keep
- Too lenient: Misses tweets you want to delete
- Ambiguous rules: Produces inconsistent decisions
- Wrong context: Doesn’t understand your professional domain
Testing on 10-20 tweets takes 2-3 minutes but can save hours of re-processing with refined criteria.
Creating a test sample
Extract a small subset of tweets that represent the diversity of your archive.
Run initial extraction
First, extract your full archive to CSV:
python src/main.py extract-tweets
This creates data/tweets/transformed/tweets.csv.
Create test subset
Select 10-20 diverse tweets covering:
- Different time periods (old and recent)
- Various topics you’ve tweeted about
- Different tones (professional, casual, humorous)
- Edge cases you’re unsure about
# Copy first 20 tweets to test file
head -n 21 data/tweets/transformed/tweets.csv > data/tweets/transformed/test_tweets.csv
# Or manually curate specific tweet IDs
head -n 1 data/tweets/transformed/tweets.csv > data/tweets/transformed/test_tweets.csv
grep -E "(1234567890|9876543210|1111111111)" data/tweets/transformed/tweets.csv >> data/tweets/transformed/test_tweets.csv
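If you prefer curating by ID in Python rather than grep, here is a minimal sketch using only the standard library. The IDs are the same placeholders used above, and the tweet ID is assumed to be the first CSV column — adjust the index if your extract uses a different layout:

```python
import csv

# Placeholder tweet IDs you want in the test sample.
WANTED_IDS = {"1234567890", "9876543210", "1111111111"}

def curate_subset(src_path, dst_path, wanted_ids):
    """Copy the header plus any row whose first column matches a wanted ID."""
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        writer.writerow(next(reader))  # keep the header row
        for row in reader:
            if row and row[0] in wanted_ids:
                writer.writerow(row)
```

Unlike the grep approach, this keeps the header intact in a single pass and won't match IDs that happen to appear inside tweet text.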
Update configuration temporarily
Edit src/config.py to use the test file:
@dataclass
class Settings:
    transformed_tweets_path: str = "data/tweets/transformed/test_tweets.csv"  # Temporarily use test file
    # ... other settings
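Before running the analyzer, it can help to confirm the test file exists and holds the expected 10-20 rows. A minimal standard-library sketch (the path mirrors the example above):

```python
import csv
import os

def sample_size(path):
    """Return the number of data rows (excluding the header) in a CSV,
    or 0 if the file does not exist."""
    if not os.path.exists(path):
        return 0
    with open(path, newline="", encoding="utf-8") as f:
        return max(sum(1 for _ in csv.reader(f)) - 1, 0)

n = sample_size("data/tweets/transformed/test_tweets.csv")
print(f"Test sample: {n} tweets")
```

If this prints 0, the analyzer would have nothing to do — fix the subset before spending API quota.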
Testing workflow
Iteratively test and refine your criteria until results match your expectations.
Initial test run
Configure baseline criteria
Start with conservative criteria in config.json:
{
  "criteria": {
    "forbidden_words": ["damn", "wtf"],
    "topics_to_exclude": [
      "Profanity or unprofessional language"
    ],
    "tone_requirements": [
      "Professional language only"
    ],
    "additional_instructions": "Flag content inappropriate for professional profile"
  }
}
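Because the criteria block is plain JSON, a typo in a key name silently disables a rule. A small validator can catch that before an analysis run. This sketch assumes the four keys shown in the example above; adjust EXPECTED_KEYS if your schema differs:

```python
import json

# Key names taken from the example config above; adjust if your schema differs.
EXPECTED_KEYS = {"forbidden_words", "topics_to_exclude",
                 "tone_requirements", "additional_instructions"}
LIST_KEYS = ("forbidden_words", "topics_to_exclude", "tone_requirements")

def check_criteria(config_text):
    """Return a list of warnings about the 'criteria' block of config.json."""
    warnings = []
    criteria = json.loads(config_text).get("criteria", {})
    for key in criteria:
        if key not in EXPECTED_KEYS:
            warnings.append(f"unknown key: {key}")
    for key in LIST_KEYS:
        value = criteria.get(key)
        if value is not None and not isinstance(value, list):
            warnings.append(f"{key} should be a list")
    return warnings
```

An empty return value means the keys at least look right; it says nothing about how the AI will interpret the rules themselves.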
Run analysis on test sample
python src/main.py analyze-tweets
This should complete in 1-2 minutes for 10-20 tweets.
Review results
Open data/tweets/processed/results.csv and manually check each flagged tweet:
tweet_url,deleted
https://x.com/username/status/1234567890,false
https://x.com/username/status/9876543210,false
For each URL:
- Visit the tweet
- Read the content
- Decide: Should this actually be deleted?
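To avoid eyeballing the raw CSV, a short helper can list just the flagged URLs for review. This assumes the results file uses the tweet_url and deleted columns shown above, with "true" marking a DELETE decision:

```python
import csv

def flagged_urls(results_path, flag_column="deleted"):
    """Collect tweet URLs the analyzer flagged.

    Assumes the results CSV has a 'tweet_url' column and a boolean-ish
    decision column ('deleted' here) whose value is 'true' for flagged tweets.
    """
    urls = []
    with open(results_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row.get(flag_column, "").strip().lower() == "true":
                urls.append(row["tweet_url"])
    return urls
```

Printing the returned list gives you a clickable checklist for the manual review pass.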
Refinement iteration
Identify misclassifications
Categorize errors:
- False positives: Flagged but should keep
- False negatives: Not flagged but should delete
- Correct decisions: Properly classified
Track patterns:
False positives:
- Tweet 1234567890: Casual tone but professional content
- Tweet 1111111111: Contains "damn" but in a quote
False negatives:
- Tweet 9999999999: Outdated political opinion not caught
- Tweet 8888888888: Sarcastic tone missed
Adjust criteria
Modify config.json based on patterns:
{
  "criteria": {
    "forbidden_words": ["damn", "wtf", "crypto"], // Added "crypto"
    "topics_to_exclude": [
      "Profanity or unprofessional language",
      "Outdated political opinions from 2020-2022" // More specific
    ],
    "tone_requirements": [
      "Professional language only",
      "No sarcasm or cynical humor" // Address false negative
    ],
    "additional_instructions": "Flag content inappropriate for professional profile. Ignore casual tone if content is substantive. Be strict on political content."
  }
}
Reset and re-test
# Clear previous test results
rm data/checkpoint.txt
rm data/tweets/processed/results.csv
# Run analysis again with refined criteria
python src/main.py analyze-tweets
Repeat until satisfied
Continue refining until you achieve acceptable accuracy on your test sample. Aim for:
- < 10% false positives: Very few good tweets flagged
- < 10% false negatives: Very few bad tweets missed
You’ll never achieve 100% accuracy with AI classification. The goal is “good enough” - when false positives/negatives are rare and tolerable.
Criteria patterns and examples
Pattern: Too broad
Problem: Rule catches more than intended
// Too broad
"topics_to_exclude": ["Politics"]
// Flags ALL political content, including neutral policy discussions
// Better
"topics_to_exclude": ["Partisan political opinions and endorsements"]
// Flags only opinionated political content
Pattern: Too vague
Problem: AI interprets rule differently than you intended
// Too vague
"tone_requirements": ["Be nice"]
// Unclear what "nice" means
// Better
"tone_requirements": [
"No personal attacks",
"No mocking or belittling others",
"Respectful disagreement only"
]
// Specific, actionable criteria
Pattern: Conflicting rules
Problem: Rules contradict each other
// Conflicting
"topics_to_exclude": ["Technical jargon"],
"additional_instructions": "Keep tweets about software engineering"
// Contradictory: tweets about software engineering necessarily use technical jargon
// Better
"topics_to_exclude": ["Excessive unexplained acronyms"],
"additional_instructions": "Keep tweets about software engineering. Technical terms are fine if used professionally."
Pattern: Missing context
Problem: AI doesn’t understand your domain
// Missing context
"forbidden_words": ["kill"]
// Flags "kill the bug" in software context
// Better
"forbidden_words": ["kill"],
"additional_instructions": "Ignore software development terminology like 'kill the process' or 'kill the bug'"
Advanced testing techniques
Stratified sampling
Ensure your test sample covers all important categories:
# Extract tweets from different years
grep "^[^,]*,2018" data/tweets/transformed/tweets.csv | head -n 5 > test_tweets.csv
grep "^[^,]*,2020" data/tweets/transformed/tweets.csv | head -n 5 >> test_tweets.csv
grep "^[^,]*,2024" data/tweets/transformed/tweets.csv | head -n 5 >> test_tweets.csv
Edge case focus
Manually curate tweets you’re uncertain about:
- Search your archive for borderline content
- Extract those specific tweet IDs to test file
- Test criteria specifically on these edge cases
- Refine rules to handle them correctly
Edge cases often reveal the real quality of your criteria. A rule that works on obvious examples but fails on nuanced content needs refinement.
A/B testing criteria
Compare two different criteria configurations:
# Test configuration A
cp config_strict.json config.json
python src/main.py analyze-tweets
mv data/tweets/processed/results.csv results_strict.csv
# Test configuration B
cp config_lenient.json config.json
rm data/checkpoint.txt
python src/main.py analyze-tweets
mv data/tweets/processed/results.csv results_lenient.csv
# Compare results
diff results_strict.csv results_lenient.csv
Choose the configuration with better accuracy on your test sample.
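A plain diff is noisy because it reports any textual difference, including row order. A small script that reports only the URLs where the two runs disagree is easier to read (column names follow the results format shown earlier):

```python
import csv

def decisions(path):
    """Map tweet_url -> decision value from a results CSV
    (assumes 'tweet_url' and 'deleted' columns)."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row["tweet_url"]: row["deleted"] for row in csv.DictReader(f)}

def disagreements(path_a, path_b):
    """Return URLs present in both runs where the decisions differ."""
    a, b = decisions(path_a), decisions(path_b)
    return sorted(url for url in a.keys() & b.keys() if a[url] != b[url])
```

The disagreement set is exactly where the strict and lenient configurations draw the line differently, so it is the most informative place to spend manual review time.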
Interpreting AI decisions
The Gemini model returns both a decision (DELETE or KEEP) and a reason. Enable debug logging to see reasons:
export LOG_LEVEL=DEBUG
python src/main.py analyze-tweets
The current implementation doesn’t expose the reason field from Gemini’s response. To see reasons during testing, temporarily modify src/analyzer.py to log them:
try:
    data = json.loads(response.text)
    decision = Decision(data["decision"].upper())
    reason = data.get("reason", "No reason provided")
    logger.debug(f"Tweet {tweet.id}: {decision.value} - {reason}")  # Add this line
except ...
Common AI interpretation patterns
| Your rule | AI often interprets as | Example |
|---|---|---|
| "Profanity" | Explicit curse words only | Flags: "damn", "hell". Misses: "crap", "sucks" |
| "Unprofessional" | Casual/informal tone | Flags: "lol", "gonna". Misses: sarcasm |
| "Political" | Partisan opinions | Flags: "Vote for X". Misses: neutral policy discussion |
| "Offensive" | Explicit insults | Flags: "you're an idiot". Misses: subtle mockery |
If the AI consistently misinterprets a rule, make it more explicit in additional_instructions with specific examples of what should/shouldn’t be flagged.
Validation metrics
Track metrics across test iterations to measure improvement:
Iteration 1:
- Test sample: 20 tweets
- Flagged: 8 tweets
- False positives: 3 (37.5%)
- False negatives: 2 (10% of 20)
- Accuracy: 75%
Iteration 2 (refined criteria):
- Test sample: 20 tweets
- Flagged: 6 tweets
- False positives: 1 (16.7%)
- False negatives: 1 (5% of 20)
- Accuracy: 90%
Accuracy formula:
Accuracy = (Correct decisions) / (Total tweets) * 100
= (20 - 1 false positive - 1 false negative) / 20 * 100
= 90%
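The same arithmetic as a small helper, following the convention used above (false-positive rate relative to flagged tweets, false-negative rate and accuracy relative to the whole sample):

```python
def accuracy_metrics(total, flagged, false_positives, false_negatives):
    """Compute per-iteration validation metrics.

    FP rate is a fraction of flagged tweets; FN rate and accuracy are
    fractions of the whole test sample, matching the worked example above.
    """
    correct = total - false_positives - false_negatives
    return {
        "false_positive_pct": round(false_positives / flagged * 100, 1),
        "false_negative_pct": round(false_negatives / total * 100, 1),
        "accuracy_pct": round(correct / total * 100, 1),
    }
```

For iteration 1 above, accuracy_metrics(20, 8, 3, 2) reproduces the 37.5% / 10% / 75% figures; for iteration 2, accuracy_metrics(20, 6, 1, 1) gives 16.7% / 5% / 90%.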
When to stop testing
You’re ready for full archive analysis when:
- False positive rate < 10%: Rarely flags good tweets
- False negative rate < 10%: Rarely misses bad tweets
- Consistent across samples: Test on multiple samples with similar accuracy
- Edge cases handled: Known borderline tweets classified correctly
- You trust the pattern: Understand why each decision was made
Perfection is impossible with AI. Accept that you’ll manually review flagged tweets anyway, so 90% accuracy is often sufficient.
Full archive analysis preparation
Restore full dataset
Revert src/config.py to use the full CSV:
transformed_tweets_path: str = "data/tweets/transformed/tweets.csv"
Clear test artifacts
rm data/checkpoint.txt
rm data/tweets/processed/results.csv
rm data/tweets/transformed/test_tweets.csv
Finalize criteria
Save your tested config.json configuration.
Run full analysis
python src/main.py analyze-tweets
Now with confidence in your criteria!
Best practices
- Always test first: Never run full analysis without testing criteria
- Use real tweets: Don’t test on synthetic examples - use actual archive tweets
- Document reasoning: Keep notes on why you adjusted each rule
- Test incrementally: Make one change at a time to understand its impact
- Expect manual review: Plan to manually check flagged tweets anyway
- Update over time: If results on full archive differ from test, refine and re-run
Troubleshooting
Too many false positives
Cause: Criteria too strict
Solution: Relax rules, add context to additional_instructions:
"additional_instructions": "Only flag clearly problematic content. When in doubt, choose KEEP."
Too many false negatives
Cause: Criteria too lenient
Solution: Add more specific rules, include forbidden_words list
Inconsistent results
Cause: Ambiguous or conflicting criteria
Solution: Rewrite rules to be more specific and unambiguous. Remove conflicting instructions.
Results don’t match expectations
Cause: AI interprets rules differently than you intended
Solution: Add explicit examples to additional_instructions:
"additional_instructions": "Flag tweets like 'I hate [topic]' but keep tweets like 'I disagree with [topic] because [reason]'"