```json
{
  "criteria": {
    "topics_to_exclude": [
      "Profanity or unprofessional language",
      "Personal attacks or insults"
    ],
    "tone_requirements": ["Professional language only"],
    "forbidden_words": ["badword1", "badword2"],
    "additional_instructions": "Flag any content that could harm professional reputation"
  }
}
```
```python
from functools import lru_cache

@lru_cache(maxsize=1)
def load_settings() -> Settings:
    configure_logging()
    # Load from .env and config.json
    return Settings(...)

settings: Settings = load_settings()
```
Why cache with maxsize=1? Settings should be loaded once and reused. The cache prevents re-parsing .env and config.json on every access. With maxsize=1, we cache exactly one configuration (the current one).
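The caching behavior is easy to demonstrate with a toy stand-in for the settings loader (the counter is illustrative, not part of the source):

```python
from functools import lru_cache

calls = 0

@lru_cache(maxsize=1)
def load_value() -> int:
    global calls
    calls += 1  # the "expensive" load runs only once
    return 42

load_value()
load_value()
load_value()
# calls is 1: the second and third calls hit the cache
```

The same principle applies to `load_settings()`: every caller gets the same `Settings` object without touching disk again.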
```python
def _default_criteria() -> Criteria:
    return Criteria(
        forbidden_words=[],
        topics_to_exclude=[
            "Profanity or unprofessional language",
            "Personal attacks or insults",
            "Outdated political opinions",
        ],
        tone_requirements=[
            "Professional language only",
            "Respectful communication",
        ],
        additional_instructions="Flag any content that could harm professional reputation",
    )
```
Benefits:
Lowers barrier to entry (users can start immediately)
Provides sensible defaults for professional cleanup
```python
class JSONParser(Parser):
    def parse(self) -> list[Tweet]:
        with open(self.file_path, encoding=FILE_ENCODING) as file:
            data = json.load(file)
            return [
                Tweet(
                    id=item["tweet"][TWITTER_ARCHIVE_ID_FIELD],
                    content=item["tweet"][TWITTER_ARCHIVE_TEXT_FIELD],
                )
                for item in data
            ]
```
```python
class CSVParser(Parser):
    def parse(self) -> list[Tweet]:
        with open(self.file_path, encoding=FILE_ENCODING) as file:
            reader = csv.DictReader(file)
            return [
                Tweet(id=row[TWEET_CSV_ID_COLUMN], content=row[TWEET_CSV_TEXT_COLUMN])
                for row in reader
            ]
```
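Because both parsers expose the same `parse()` interface, a small factory can pick one by file extension. This is a self-contained sketch with minimal stand-ins for `Tweet` and the parsers (the `make_parser` factory and the column names `id`/`text` are illustrative assumptions, not the source's exact constants):

```python
import csv
from dataclasses import dataclass
from pathlib import Path
from tempfile import TemporaryDirectory

@dataclass
class Tweet:
    id: str
    content: str

class CSVParser:
    def __init__(self, file_path: Path):
        self.file_path = file_path

    def parse(self) -> list[Tweet]:
        with open(self.file_path, encoding="utf-8") as file:
            return [
                Tweet(id=row["id"], content=row["text"])
                for row in csv.DictReader(file)
            ]

def make_parser(path: Path):
    # Hypothetical factory: choose a parser class by file extension.
    parsers = {".csv": CSVParser}
    try:
        return parsers[path.suffix](path)
    except KeyError:
        raise ValueError(f"Unsupported file type: {path.suffix}")

with TemporaryDirectory() as tmp:
    path = Path(tmp) / "tweets.csv"
    path.write_text("id,text\n1,hello\n", encoding="utf-8")
    tweets = make_parser(path).parse()
```

Adding a new archive format then means registering one more class, with no changes to callers.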
```python
class Application:
    def __init__(self):
        self._analyzer = None  # Not created yet

    @property
    def analyzer(self) -> Gemini:
        if self._analyzer is None:
            self._analyzer = Gemini()  # Create on first use
        return self._analyzer
```
Why lazy initialization? Creating the Gemini client requires an API key. If you run extract-tweets (which doesn’t need AI), we shouldn’t validate the API key or hit Google’s servers. Lazy init defers that cost until actually needed.
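A toy demonstration of the pattern, with a counting `FakeClient` standing in for the real Gemini client (the counter is illustrative):

```python
class FakeClient:
    instances = 0  # counts how many clients have been constructed

    def __init__(self):
        FakeClient.instances += 1

class Application:
    def __init__(self):
        self._analyzer = None  # Not created yet

    @property
    def analyzer(self) -> FakeClient:
        if self._analyzer is None:
            self._analyzer = FakeClient()  # Create on first use
        return self._analyzer

app = Application()
created_before = FakeClient.instances  # nothing built at construction time
_ = app.analyzer
_ = app.analyzer
created_after = FakeClient.instances   # built once, then reused
```

Constructing `Application` costs nothing; the client exists only after the first command that needs it touches `app.analyzer`.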
```python
def _build_prompt(self, tweet: Tweet) -> str:
    criteria_list = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(criteria_parts))
    return f"""You are evaluating tweets for a professional's Twitter cleanup.

Tweet ID: {tweet.id}
Tweet: "{tweet.content}"

Mark for deletion if it violates any of these criteria:
{criteria_list}
{additional}

Respond in JSON format:
{{
  "decision": "DELETE" or "KEEP",
  "reason": "brief explanation"
}}"""
```
Prompt engineering principles:
Provides context: What we’re evaluating (professional cleanup)
Lists criteria: Explicit numbered list from config
Demands structure: JSON output with decision and reason
Shows examples: Mentions “DELETE” or “KEEP” format
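The structured output the prompt demands is what makes the response machine-readable. A sketch of parsing it, where the field names `decision` and `reason` come from the prompt above and the `Decision` enum is a stand-in for the one in the source:

```python
import json
from enum import Enum

class Decision(Enum):
    DELETE = "DELETE"
    KEEP = "KEEP"

def parse_response(raw: str) -> tuple[Decision, str]:
    payload = json.loads(raw)
    # Decision(...) raises ValueError if the model returns anything
    # other than "DELETE" or "KEEP", surfacing malformed output early.
    return Decision(payload["decision"]), payload.get("reason", "")

decision, reason = parse_response('{"decision": "DELETE", "reason": "profanity"}')
```

Validating against the enum means an off-script answer fails loudly instead of silently keeping or deleting a tweet.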
```python
def analyze_tweets(self) -> Result:
    with Checkpoint(checkpoint_path) as checkpoint:
        start_index = checkpoint.load()
        for i in range(start_index, len(tweets), batch_size):
            batch = tweets[i:i + batch_size]
            for tweet in batch:
                if _is_retweet(tweet):
                    continue
                result = self.analyzer.analyze(tweet)
                if result.decision == Decision.DELETE:
                    writer.write_result(result)
            checkpoint.save(i + len(batch))  # Save after batch
```
Batch size (default: 10) is the resume granularity:
10,000 tweets with batch_size=10 → 1,000 checkpoints
Interrupted at tweet 5,234 → resume at tweet 5,230 (start of batch)
Smaller batches = more frequent checkpoints = more I/O overhead
Larger batches = less frequent checkpoints = more rework on failure

The default of 10 balances these concerns.
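A minimal `Checkpoint` sketch consistent with the loop above: the resume index is persisted after each batch, so a restart begins at the start of the interrupted batch. The plain-text file format and class shape here are assumptions, not the source's exact implementation:

```python
from pathlib import Path
from tempfile import TemporaryDirectory

class Checkpoint:
    def __init__(self, path: Path):
        self.path = path

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False  # never swallow exceptions; the saved index survives a crash

    def load(self) -> int:
        # A fresh run (no checkpoint file yet) starts at index 0.
        return int(self.path.read_text()) if self.path.exists() else 0

    def save(self, index: int) -> None:
        self.path.write_text(str(index))

with TemporaryDirectory() as tmp:
    path = Path(tmp) / "checkpoint.txt"
    with Checkpoint(path) as cp:
        start = cp.load()    # first run: 0
        cp.save(start + 10)  # one batch of 10 processed
    with Checkpoint(path) as cp:
        resumed = cp.load()  # a restart picks up at 10
```

Because `save` runs only after a batch completes, at most one batch of work is repeated after a failure.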
Thin interface layer that handles user interaction.
main.py
```python
def main() -> None:
    parser = argparse.ArgumentParser(...)
    args = parser.parse_args()
    app = Application()

    if args.command == "extract-tweets":
        result = app.extract_tweets()
        if not result.success:
            print(f"Error: {result.error_message}", file=sys.stderr)
            sys.exit(1)
        print(f"Successfully extracted {result.count} tweets")
```
Responsibilities:
Parse command-line arguments
Call application methods
Format output for humans
Exit with appropriate status code
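The responsibilities above can be sketched with argparse subcommands. This assumes an `analyze` command and a `--batch-size` flag for illustration; only `extract-tweets` appears in the source:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="tweet-cleanup")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("extract-tweets", help="Parse the archive into tweets")
    analyze = sub.add_parser("analyze", help="Run AI analysis")  # hypothetical command
    analyze.add_argument("--batch-size", type=int, default=10)
    return parser

args = build_parser().parse_args(["analyze", "--batch-size", "20"])
```

`dest="command"` is what lets `main()` dispatch on `args.command`, keeping the dispatch table in one place.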
Separation benefit: Later, you could add a web API (Flask/FastAPI), a GUI (Tkinter/Qt), or an automated pipeline (cron job); all would reuse the Application class without modification.