This guide provides detailed information about each component in the Tweet Audit Tool CLI architecture.

Models (models.py)

Defines immutable data structures used throughout the application.

Tweet

Represents a single tweet from the archive:
models.py
@dataclass(frozen=True)
class Tweet:
    id: str
    content: str

    def __repr__(self) -> str:
        preview = self.content[:50] + "..." if len(self.content) > 50 else self.content
        return f"Tweet(id={self.id!r}, content={preview!r})"
Design decisions:
frozen=True makes the dataclass immutable:
  • Prevents accidental modification
  • Makes bugs obvious (crash instead of silent corruption)
  • Thread-safe by design (though we don’t use threads)
  • Easier to reason about data flow
Truncates long tweet content for readable logging:
print(Tweet(id="123", content="x" * 200))
# Tweet(id='123', content='xxxx...')  (preview keeps the first 50 characters, then "...")

Decision

Enum representing analysis decisions:
models.py
class Decision(enum.StrEnum):
    DELETE = "DELETE"
    KEEP = "KEEP"
Why StrEnum?
  • Serializes cleanly to JSON/CSV without custom converters
  • Type-safe (can’t accidentally use invalid string)
  • IDE autocomplete support
# StrEnum automatically converts to string
decision = Decision.DELETE
print(decision)  # "DELETE"
json.dumps({"decision": decision})  # {"decision": "DELETE"}

AnalysisResult

Represents the outcome of analyzing a single tweet:
models.py
@dataclass(frozen=True)
class AnalysisResult:
    tweet_url: str
    decision: Decision = Decision.KEEP
Why store URL instead of tweet ID?
  • Results CSV is action-oriented (ready for deletion)
  • URLs are directly clickable in spreadsheets
  • User doesn’t need to manually construct URLs
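For illustration, URL construction from the `Settings` values likely follows the standard status-URL format (`build_tweet_url` is a hypothetical helper, not shown in the source):

```python
# Hypothetical helper: the repo's actual URL construction isn't shown here.
def build_tweet_url(base_url: str, username: str, tweet_id: str) -> str:
    """Construct a clickable tweet URL in the standard x.com format."""
    return f"{base_url}/{username}/status/{tweet_id}"

print(build_tweet_url("https://x.com", "iamuchihadan", "1234567890"))
# https://x.com/iamuchihadan/status/1234567890
```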

Result

Wrapper for operation results with error handling:
models.py
@dataclass(frozen=True)
class Result:
    success: bool
    count: int = 0
    error_type: str = ""
    error_message: str = ""
Usage pattern:
# Success case
result = Result(success=True, count=150)

# Error case
result = Result(
    success=False,
    error_type="file_not_found",
    error_message="tweets.json not found"
)
This allows application layer to return errors without raising exceptions, and CLI layer to handle them appropriately.
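A sketch of the calling side (standalone, with `Result` repeated so it runs on its own; `handle` is a hypothetical helper):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Result:
    success: bool
    count: int = 0
    error_type: str = ""
    error_message: str = ""

def handle(result: Result) -> str:
    # The CLI layer branches on success instead of catching exceptions.
    if result.success:
        return f"Processed {result.count} tweets"
    return f"Error ({result.error_type}): {result.error_message}"

print(handle(Result(success=True, count=150)))
# Processed 150 tweets
print(handle(Result(success=False, error_type="file_not_found",
                    error_message="tweets.json not found")))
# Error (file_not_found): tweets.json not found
```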

Configuration (config.py)

Manages application settings with two-tier configuration system.

Settings class

config.py
@dataclass
class Settings:
    tweets_archive_path: str = "data/tweets/tweets.json"
    transformed_tweets_path: str = "data/tweets/transformed/tweets.csv"
    checkpoint_path: str = "data/checkpoint.txt"
    processed_results_path: str = "data/tweets/processed/results.csv"
    base_twitter_url: str = "https://x.com"
    x_username: str = "iamuchihadan"
    gemini_api_key: str = ""
    gemini_model: str = "gemini-2.5-flash"
    batch_size: int = 10
    rate_limit_seconds: float = 1.0
    criteria: Criteria = field(default_factory=Criteria)

Two-tier configuration

The first tier, the .env file, holds secrets and user-specific values (non-secret options live in the second tier, config.json):
.env
GEMINI_API_KEY=your_api_key_here
X_USERNAME=your_twitter_handle
RATE_LIMIT_SECONDS=1.0
  • API keys (must not be committed)
  • Usernames (vary per user)
  • Rate limits (tuning parameters)
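As a minimal sketch of how the .env tier could override the dataclass defaults (the project's actual loading code isn't shown here; `load_from_env` and the trimmed-down `Settings` are illustrative):

```python
import os
from dataclasses import dataclass

@dataclass
class Settings:
    x_username: str = "iamuchihadan"
    rate_limit_seconds: float = 1.0

def load_from_env(defaults: Settings) -> Settings:
    # Environment variables (from .env) override the dataclass defaults.
    return Settings(
        x_username=os.environ.get("X_USERNAME", defaults.x_username),
        rate_limit_seconds=float(
            os.environ.get("RATE_LIMIT_SECONDS", defaults.rate_limit_seconds)
        ),
    )

os.environ["X_USERNAME"] = "someone_else"
print(load_from_env(Settings()).x_username)  # someone_else
```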

Caching with lru_cache

config.py
@lru_cache(maxsize=1)
def load_settings() -> Settings:
    configure_logging()
    # Load from .env and config.json
    return Settings(...)

settings: Settings = load_settings()
Why cache with maxsize=1? Settings should be loaded once and reused. The cache prevents re-parsing .env and config.json on every access. With maxsize=1, we cache exactly one configuration (the current one).
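A quick demonstration of the caching behavior (standalone sketch with a trimmed-down `Settings`):

```python
from dataclasses import dataclass
from functools import lru_cache

@dataclass
class Settings:
    batch_size: int = 10

@lru_cache(maxsize=1)
def load_settings() -> Settings:
    print("loading...")  # the body runs only on the first call
    return Settings()

a = load_settings()  # prints "loading..."
b = load_settings()  # cache hit: body does not run again
print(a is b)  # True: both names point at the same Settings object
```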

Default criteria

config.py
def _default_criteria() -> Criteria:
    return Criteria(
        forbidden_words=[],
        topics_to_exclude=[
            "Profanity or unprofessional language",
            "Personal attacks or insults",
            "Outdated political opinions",
        ],
        tone_requirements=[
            "Professional language only",
            "Respectful communication",
        ],
        additional_instructions="Flag any content that could harm professional reputation",
    )
Benefits:
  • Lowers barrier to entry (users can start immediately)
  • Provides sensible defaults for professional cleanup
  • Can be customized later via config.json
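For example, a config.json override could be parsed into `Criteria` roughly like this (the field names mirror the defaults above; the real schema may differ):

```python
import json
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Criteria:
    forbidden_words: list[str] = field(default_factory=list)
    topics_to_exclude: list[str] = field(default_factory=list)
    tone_requirements: list[str] = field(default_factory=list)
    additional_instructions: str = ""

# Hypothetical config.json contents; unspecified fields keep their defaults.
raw = '{"criteria": {"forbidden_words": ["synergy"]}}'
data = json.loads(raw)
criteria = Criteria(**data["criteria"])
print(criteria.forbidden_words)  # ['synergy']
```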

Storage (storage.py)

Handles all file I/O operations with context managers for resource safety.

JSONParser

Parses Twitter archive JSON format:
storage.py
class JSONParser(Parser):
    def parse(self) -> list[Tweet]:
        with open(self.file_path, encoding=FILE_ENCODING) as file:
            data = json.load(file)
            return [
                Tweet(
                    id=item["tweet"][TWITTER_ARCHIVE_ID_FIELD],
                    content=item["tweet"][TWITTER_ARCHIVE_TEXT_FIELD],
                )
                for item in data
            ]
Expected format:
[
  {
    "tweet": {
      "id_str": "1234567890",
      "full_text": "Tweet content here"
    }
  }
]
Raises descriptive errors for missing files, invalid JSON, or missing required fields.

CSVParser

Parses transformed tweets CSV:
storage.py
class CSVParser(Parser):
    def parse(self) -> list[Tweet]:
        with open(self.file_path, encoding=FILE_ENCODING) as file:
            reader = csv.DictReader(file)
            return [
                Tweet(id=row[TWEET_CSV_ID_COLUMN], content=row[TWEET_CSV_TEXT_COLUMN])
                for row in reader
            ]
Expected format:
id,text
1234567890,Tweet content here

CSVWriter

Writes tweets or results to CSV with context manager:
storage.py
class CSVWriter:
    def __enter__(self) -> "CSVWriter":
        # Create directories if needed
        os.makedirs(dir_path, mode=PRIVATE_DIR_MODE, exist_ok=True)
        
        # Open file for writing
        self.file = open(self.file_path, mode, encoding=FILE_ENCODING, newline="")
        self.writer = csv.writer(self.file)
        
        # Set restrictive permissions
        os.chmod(self.file_path, PRIVATE_FILE_MODE)
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> bool:
        if self.file:
            self.file.close()
        return False
Why context managers?
  • Guarantees file closure (prevents resource leaks)
  • Automatic cleanup on exceptions
  • Explicit lifecycle management
Usage:
with CSVWriter(path) as writer:
    writer.write_tweets(tweets)
# File automatically closed, even if exception occurs

File permissions

storage.py
PRIVATE_FILE_MODE = 0o600  # Owner read/write only
PRIVATE_DIR_MODE = 0o750   # Owner rwx, group rx
Security: Your tweets are personal data. These permissions ensure only you can read them (no group/other access on shared systems).
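You can verify the effective mode with `os.stat`; a small standalone check (POSIX semantics assumed):

```python
import os
import stat
import tempfile

PRIVATE_FILE_MODE = 0o600

# Create a file and restrict it the same way CSVWriter does.
fd, path = tempfile.mkstemp()
os.close(fd)
os.chmod(path, PRIVATE_FILE_MODE)

# S_IMODE strips the file-type bits, leaving just the permission bits.
mode = stat.S_IMODE(os.stat(path).st_mode)
print(oct(mode))  # 0o600: owner read/write, no group/other access
os.remove(path)
```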

Checkpoint

Manages progress tracking for resume functionality:
storage.py
class Checkpoint:
    def load(self) -> int:
        """Returns 0 if file empty/missing, saved index otherwise"""
        self.file.seek(0)
        content = self.file.read().strip()
        return int(content) if content else 0
    
    def save(self, tweet_index: int) -> None:
        """Atomically updates checkpoint"""
        self.file.seek(0)
        self.file.truncate()
        self.file.write(str(tweet_index))
        self.file.flush()  # Force write to disk
Checkpoint format:
500
Single integer representing the next tweet index to process. Resume behavior:
  • Process stops at tweet 500 → checkpoint saves 500
  • On restart → load returns 500 → skip tweets 0-499
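The load/save logic round-trips like this (standalone sketch of the class above, backed by an in-memory file):

```python
import io

class Checkpoint:
    """Minimal sketch of the load/save logic, backed by any seekable file."""
    def __init__(self, file):
        self.file = file

    def load(self) -> int:
        self.file.seek(0)
        content = self.file.read().strip()
        return int(content) if content else 0

    def save(self, tweet_index: int) -> None:
        self.file.seek(0)
        self.file.truncate()
        self.file.write(str(tweet_index))
        self.file.flush()

cp = Checkpoint(io.StringIO())
print(cp.load())  # 0: an empty file means start from the beginning
cp.save(500)
print(cp.load())  # 500: on restart, tweets 0-499 are skipped
```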

Analyzer (analyzer.py)

Handles AI integration with Gemini API, including rate limiting and retry logic.

Lazy initialization pattern

application.py
class Application:
    def __init__(self):
        self._analyzer = None  # Not created yet
    
    @property
    def analyzer(self) -> Gemini:
        if self._analyzer is None:
            self._analyzer = Gemini()  # Create on first use
        return self._analyzer
Why lazy initialization? Creating the Gemini client requires an API key. If you run extract-tweets (which doesn’t need AI), we shouldn’t validate the API key or hit Google’s servers. Lazy init defers that cost until actually needed.

Rate limiting

analyzer.py
def _rate_limit(self) -> None:
    elapsed = time.time() - self.last_request_time
    if elapsed < self.min_request_interval:
        time.sleep(self.min_request_interval - elapsed)
    self.last_request_time = time.time()
Enforces minimum time between requests to prevent hitting API rate limits:
Request 1 ──(1.0s)──> Request 2 ──(1.0s)──> Request 3
Simple but effective - no token buckets or sliding windows needed.
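The same logic, wrapped in a small standalone class to show the enforced gaps (interval shortened for the demo):

```python
import time

class RateLimiter:
    """Sketch of the _rate_limit logic with a configurable interval."""
    def __init__(self, min_request_interval: float):
        self.min_request_interval = min_request_interval
        self.last_request_time = 0.0

    def wait(self) -> None:
        elapsed = time.time() - self.last_request_time
        if elapsed < self.min_request_interval:
            time.sleep(self.min_request_interval - elapsed)
        self.last_request_time = time.time()

limiter = RateLimiter(min_request_interval=0.1)
start = time.time()
for _ in range(3):  # first call passes immediately, the next two are delayed
    limiter.wait()
elapsed = time.time() - start
print(elapsed >= 0.2)  # True: two enforced gaps between three requests
```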

Retry with exponential backoff

analyzer.py
@retry_with_backoff(max_retries=3, initial_delay=1.0)
def analyze(self, tweet: Tweet) -> AnalysisResult:
    ...  # API call here
Retries transient errors with increasing delays:
  1. Attempt 1: immediate execution
  2. Attempt 2 (if needed): wait ~1 second before retry
  3. Attempt 3 (if needed): wait ~2 seconds before retry
  4. Failure: raise exception after max retries
Retryable errors:
  • Timeout
  • Connection errors
  • Rate limit (429)
  • Service unavailable (503)
  • “temporarily unavailable”
Why exponential? Gives the API service time to recover. If temporarily overloaded, backing off reduces load.
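A sketch of what such a decorator typically looks like (the project's actual `retry_with_backoff` isn't shown and also filters for the retryable error types listed above; the sketch retries any exception):

```python
import functools
import time

def retry_with_backoff(max_retries: int = 3, initial_delay: float = 1.0):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = initial_delay
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_retries - 1:
                        raise  # out of retries
                    time.sleep(delay)
                    delay *= 2  # exponential backoff: 1s, 2s, 4s, ...
        return wrapper
    return decorator

calls = []

@retry_with_backoff(max_retries=3, initial_delay=0.01)
def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("transient")  # fails twice, then succeeds
    return "ok"

result = flaky()
print(result, len(calls))  # ok 3
```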

Prompt construction

analyzer.py
def _build_prompt(self, tweet: Tweet) -> str:
    criteria_list = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(criteria_parts))
    
    return f"""You are evaluating tweets for a professional's Twitter cleanup.

Tweet ID: {tweet.id}
Tweet: "{tweet.content}"

Mark for deletion if it violates any of these criteria:
{criteria_list}{additional}

Respond in JSON format:
{{
  "decision": "DELETE" or "KEEP",
  "reason": "brief explanation"
}}"""
Prompt engineering principles:
  1. Provides context: What we’re evaluating (professional cleanup)
  2. Lists criteria: Explicit numbered list from config
  3. Demands structure: JSON output with decision and reason
  4. Shows examples: Mentions “DELETE” or “KEEP” format
Why JSON response?
  • Machine-parsable (no regex to extract decision)
  • Gemini supports response_mime_type="application/json"
  • Structured = easier error handling
  • Type-safe parsing
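Parsing the structured response could look like this (standalone sketch; `Decision(str, Enum)` stands in for `StrEnum`, and the raw payload and URL are hypothetical):

```python
import json
from dataclasses import dataclass
from enum import Enum

class Decision(str, Enum):  # str + Enum stands in for StrEnum here
    DELETE = "DELETE"
    KEEP = "KEEP"

@dataclass(frozen=True)
class AnalysisResult:
    tweet_url: str
    decision: Decision = Decision.KEEP

# Hypothetical raw model output; the actual shape is dictated by the prompt.
raw = '{"decision": "DELETE", "reason": "personal attack"}'
payload = json.loads(raw)
result = AnalysisResult(
    tweet_url="https://x.com/iamuchihadan/status/1234567890",
    decision=Decision(payload["decision"]),  # raises ValueError on anything else
)
print(result.decision.value)  # DELETE
```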

Application (application.py)

The orchestrator that connects all components and implements business logic.

Extract pipeline

application.py
def extract_tweets(self) -> Result:
    try:
        # 1. Parse JSON archive
        parser = JSONParser(settings.tweets_archive_path)
        tweets = parser.parse()
        
        # 2. Write to CSV
        with CSVWriter(settings.transformed_tweets_path) as writer:
            writer.write_tweets(tweets)
        
        return Result(success=True, count=len(tweets))
    except Exception as e:
        return self._build_error_result(e, context="extraction")
Flow:
tweets.json → JSONParser → List[Tweet] → CSVWriter → tweets.csv

Analysis pipeline

application.py
def analyze_tweets(self) -> Result:
    with Checkpoint(checkpoint_path) as checkpoint:
        start_index = checkpoint.load()
        
        for i in range(start_index, len(tweets), batch_size):
            batch = tweets[i:i+batch_size]
            
            for tweet in batch:
                if _is_retweet(tweet):
                    continue
                
                result = self.analyzer.analyze(tweet)
                
                if result.decision == Decision.DELETE:
                    writer.write_result(result)
            
            checkpoint.save(i + len(batch))  # Save after batch
Flow:
tweets.csv → CSVParser → List[Tweet] → Gemini AI → results.csv
                              ↓                ↓
                        Checkpoint ←──────────┘

Batch processing

Batch size (default: 10) is the resume granularity:
  • 10,000 tweets with batch_size=10 → 1,000 checkpoints
  • Interrupted at tweet 5,234 → resume at tweet 5,230 (start of batch)
Smaller batches = more frequent checkpoints = more I/O overhead. Larger batches = less frequent checkpoints = more rework on failure. The default of 10 balances these concerns.
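The resume arithmetic, since the checkpoint is only written after a full batch:

```python
batch_size = 10
interrupted_at = 5234

# The last checkpoint written is the start of the batch containing
# the interruption, so up to batch_size - 1 tweets get re-analyzed.
resume_index = (interrupted_at // batch_size) * batch_size
print(resume_index)  # 5230

# With 10,000 tweets, one checkpoint write per batch:
print(10_000 // batch_size)  # 1000
```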

Retweet filtering

application.py
def _is_retweet(tweet) -> bool:
    return tweet.content.startswith("RT @")
Why skip retweets?
  • Retweets aren’t your original content
  • Deleting a retweet doesn’t affect your original tweets
  • Focus analysis on what you actually wrote

Error mapping

application.py
@staticmethod
def _build_error_result(e: Exception, context: str = "") -> Result:
    if isinstance(e, FileNotFoundError):
        return Result(success=False, error_type="file_not_found", ...)
    elif isinstance(e, ValueError):
        return Result(success=False, error_type="invalid_format", ...)
    # ...
Maps Python exceptions to domain-specific error types. CLI can show user-friendly messages based on error_type instead of raw stack traces.
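A standalone sketch of the mapping (the branches elided with `...` in the real method handle more exception types, and the exact message format is an assumption):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Result:
    success: bool
    count: int = 0
    error_type: str = ""
    error_message: str = ""

def build_error_result(e: Exception, context: str = "") -> Result:
    # Map Python exceptions to domain-specific error types.
    if isinstance(e, FileNotFoundError):
        error_type = "file_not_found"
    elif isinstance(e, ValueError):
        error_type = "invalid_format"
    else:
        error_type = "unexpected"
    message = f"{context}: {e}" if context else str(e)
    return Result(success=False, error_type=error_type, error_message=message)

r = build_error_result(FileNotFoundError("tweets.json not found"), context="extraction")
print(r.error_type)     # file_not_found
print(r.error_message)  # extraction: tweets.json not found
```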

CLI (main.py)

Thin interface layer that handles user interaction.
main.py
def main() -> None:
    parser = argparse.ArgumentParser(...)
    args = parser.parse_args()
    
    app = Application()
    
    if args.command == "extract-tweets":
        result = app.extract_tweets()
        if not result.success:
            print(f"Error: {result.error_message}", file=sys.stderr)
            sys.exit(1)
        print(f"Successfully extracted {result.count} tweets")
Responsibilities:
  1. Parse command-line arguments
  2. Call application methods
  3. Format output for humans
  4. Exit with appropriate status code
Separation benefit: Later, you could add web API (Flask/FastAPI), GUI (Tkinter/Qt), or automated pipeline (cron job) - all would reuse the Application class without modification.

Next steps

Data flow

See how data moves between components

Design decisions

Understand key trade-offs and alternatives
