Language: Python vs alternatives
Decision
Use Python as the implementation language.
Rationale
Advantages
- Fast prototyping: Dynamic typing allows quick iteration
- AI ecosystem: Google’s generativeai SDK is Python-native
- Rich libraries: Excellent support for CSV, JSON, HTTP
- Low barrier: Most developers know Python
- Rapid experimentation: Easy to test different prompts and approaches
Disadvantages
- Runtime performance: Slower than compiled languages
- Dependency management: Requires Poetry/pip setup
- No single binary: Can’t distribute as standalone executable
- Type safety: Runtime errors instead of compile-time
- Memory usage: Higher overhead than Go/Rust
Alternatives considered
- Go
- TypeScript/Node.js
- Rust
Go, the strongest alternative, illustrates the trade-off:
Pros:
- Single binary distribution
- Fast execution
- Low memory footprint
- Excellent concurrency support
Cons:
- Unofficial Gemini SDK (lower quality)
- More boilerplate code
- Slower iteration on prompts
- Steeper learning curve
Conclusion: Python’s AI ecosystem maturity and rapid prototyping capabilities outweigh performance concerns for this I/O-bound, single-user tool.
Processing: Sequential vs concurrent
Decision
Process tweets sequentially (one at a time) instead of concurrently.
Rationale
Two options were considered: sequential (chosen) and concurrent (alternative).
Sequential processing sends one request at a time, waiting for each API response before starting the next.
Advantages:
- Natural rate limiting (no complex throttling)
- Simple error handling (pause on first failure)
- Predictable behavior (linear execution)
- Lower memory usage (one request in flight)
- Easy to debug (clear execution trace)
Disadvantages:
- Slower (5,000 tweets = ~1.5 hours at 1 req/sec)
- Doesn’t utilize API concurrency limits
- Idle CPU during API waits
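The sequential loop described above can be sketched as follows. This is a minimal illustration, not the tool's actual code; the `analyze` callable stands in for whatever function calls the AI API.

```python
import time

def process_sequentially(tweets, analyze, rate_limit_seconds=1.0):
    """Analyze tweets one at a time, sleeping between API calls
    to provide natural rate limiting."""
    results = []
    for tweet in tweets:
        results.append(analyze(tweet))  # blocks until the API responds
        time.sleep(rate_limit_seconds)  # no throttling logic needed
    return results
```

Because the loop is linear, the execution trace is trivially readable, and an error at any point leaves an unambiguous record of which tweet failed.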
Trade-off analysis
Sequential processing trades speed for simplicity.
Storage: CSV vs database
Decision
Use CSV files for storage instead of a database.
Rationale
Options considered: CSV (chosen), SQLite (alternative), PostgreSQL (alternative).
CSV advantages:
- Zero setup (no database installation)
- Human-readable (open in Excel)
- Easy to share (email a file)
- Version control friendly
- Sufficient for up to 100K tweets
- Works offline completely
Disadvantages:
- No query capabilities
- Must load entire file to read
- No referential integrity
- Limited data types
- No concurrent writes
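A minimal sketch of the CSV approach, using only the standard library. The column names here are illustrative assumptions, not the tool's actual schema.

```python
import csv
from pathlib import Path

# Hypothetical result columns, for illustration only.
FIELDS = ["tweet_id", "text", "decision", "reason"]

def append_rows(path, rows):
    """Append result rows to a CSV file, writing the header only once."""
    new_file = not Path(path).exists()
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerows(rows)
```

Zero setup: the output is a plain file that opens in Excel and diffs cleanly in version control.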
Decision: CSV simplicity and human-readability align perfectly with the tool’s personal-use target. If users need advanced queries, they can import CSV into any database.
Architecture: Layered vs flat
Decision
Use three-layer architecture (CLI → Application → Infrastructure) instead of flat structure.
Rationale
Layered (chosen)
Structure: three layers (CLI, application, infrastructure), each depending only on the layer below it.
Advantages:
- Clear separation of concerns
- Easy to test (mock dependencies)
- Reusable application layer
- Swappable components (e.g., change AI provider)
- Easier to understand (clear boundaries)
Flat (alternative)
Structure: a single module containing all the logic.
Advantages:
- Fewer files (easier to navigate initially)
- No import management
- Faster to write initially
Disadvantages:
- God object (1000+ lines)
- Hard to test (mock what?)
- Tight coupling (can’t swap components)
- Difficult to understand
- Changes cascade everywhere
Decision: Layered architecture provides testability and maintainability at the cost of a few extra files. For a tool with multiple workflows (extract, analyze), this pays off immediately.
Error handling: Exceptions vs Result type
Decision
Use Result type at application layer, exceptions at infrastructure layer.
Rationale
Advantages of the hybrid approach:
- Clear boundaries: Exceptions stay in infrastructure, Results flow up
- Explicit errors: CLI knows possible error types
- No surprises: Result type forces error handling
- Easy testing: Can assert on Result fields
- User-friendly: Map error_type to helpful messages
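A sketch of the hybrid pattern: the infrastructure client raises exceptions, and the application layer catches them at the boundary and returns an explicit `Result`. The field names and the `client.generate` call are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Result:
    """Explicit success/failure value returned by the application layer."""
    ok: bool
    value: Optional[str] = None
    error_type: Optional[str] = None

def analyze_tweet(text: str, client) -> Result:
    """Catch infrastructure exceptions here; only Results flow upward."""
    try:
        return Result(ok=True, value=client.generate(text))
    except ConnectionError:
        return Result(ok=False, error_type="network")
    except ValueError:
        return Result(ok=False, error_type="invalid_input")
```

The CLI can then map each `error_type` to a helpful message without ever seeing a raw traceback.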
Decision: Hybrid approach balances ergonomics (exceptions for infrastructure) with explicitness (Result for business logic).
Checkpointing: Per-tweet vs per-batch
Decision
Checkpoint after each batch (default: 10 tweets) instead of after each tweet.
Rationale
| Aspect | Per-Tweet | Per-Batch (10) |
|---|---|---|
| I/O overhead | High (1 write/tweet) | Low (1 write/10 tweets) |
| Resume granularity | Exact tweet | Batch start (lose up to 9) |
| Disk wear | Higher | Lower |
| Checkpoint operations | 5,000 for 5K tweets | 500 for 5K tweets |
| Rework on interrupt | 0 tweets | 0-9 tweets |
Decision: Per-batch checkpointing reduces I/O by 10x with minimal rework penalty. Users can adjust batch_size to balance these concerns:
- batch_size=1 → per-tweet (no rework)
- batch_size=100 → less I/O, more rework
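The batching logic can be sketched as below. `save_checkpoint` stands in for whatever function persists progress to disk; it is a placeholder, not the tool's real API.

```python
def process_with_checkpoints(tweets, analyze, save_checkpoint, batch_size=10):
    """Analyze tweets, persisting progress once per batch instead of per tweet."""
    buffer = []
    for index, tweet in enumerate(tweets, start=1):
        buffer.append(analyze(tweet))
        if index % batch_size == 0:
            save_checkpoint(index, buffer)  # one write per batch_size tweets
            buffer = []
    if buffer:  # flush the final partial batch
        save_checkpoint(len(tweets), buffer)
```

On interrupt, the resume point is the last saved index, so at most `batch_size - 1` tweets are re-analyzed.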
Configuration: Environment vs file vs CLI args
Decision
Use two-tier configuration: environment variables (secrets) + optional JSON file (criteria).
Rationale
- Environment (.env)
- Config file (config.json)
- CLI args (not used)
Environment variables (.env) hold secrets and runtime settings.
Advantages:
- Secrets never committed to git
- Easy to override (export RATE_LIMIT_SECONDS=0.5)
- Standard practice (12-factor app)
- Works with .env file or actual env vars
Stored in the environment:
- API keys
- Usernames
- Rate limits
- Model selection
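A sketch of reading these settings at startup. The variable names here (other than `RATE_LIMIT_SECONDS`, which appears above) are illustrative assumptions, not the tool's actual names.

```python
import os

def load_settings():
    """Read configuration from environment variables, with safe defaults.
    Variable names are hypothetical, for illustration only."""
    api_key = os.environ.get("GEMINI_API_KEY")
    if not api_key:
        raise RuntimeError("GEMINI_API_KEY is not set")
    return {
        "api_key": api_key,
        "username": os.environ.get("TWITTER_USERNAME", ""),
        "rate_limit_seconds": float(os.environ.get("RATE_LIMIT_SECONDS", "1.0")),
        "model": os.environ.get("MODEL_NAME", ""),
    }
```

Because everything comes from the environment, the same code works with a local .env file (loaded by a tool such as python-dotenv) or with variables exported in CI.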
Retry logic: Exponential backoff vs fixed delay
Decision
Use exponential backoff for retry delays instead of fixed delays.
Rationale
Consider two scenarios: server overload and rate limiting.
Problem: Gemini API returns 503 (service unavailable).
A fixed delay retries at the same interval no matter how overloaded the server is; exponential backoff doubles the wait after each failure, giving the server progressively more time to recover.
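A sketch of the backoff logic, using the jitter formula described below. The `request` callable is a placeholder for the real API call.

```python
import time

def call_with_backoff(request, max_retries=5, base_delay=1.0):
    """Retry a flaky call with exponentially growing delays plus jitter."""
    for attempt in range(max_retries):
        try:
            return request()
        except ConnectionError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error
            delay = base_delay * (2 ** attempt) + (time.time() % 1)  # jitter
            time.sleep(delay)
```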
In analyzer.py, jitter (time.time() % 1) adds randomness to each delay to prevent a thundering herd (many clients retrying at the exact same time).
Immutability: Frozen dataclasses vs mutable
Decision
Use @dataclass(frozen=True) for all data models.
Rationale
Prevents accidental modification
Makes bugs obvious
With mutable data, bugs are silent; with frozen data, they crash immediately.
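A minimal demonstration of the frozen behavior (the `Tweet` fields are illustrative):

```python
import dataclasses
from dataclasses import dataclass

@dataclass(frozen=True)
class Tweet:
    id: str
    text: str

t = Tweet(id="1", text="hello")
try:
    t.text = "oops"  # an accidental modification...
except dataclasses.FrozenInstanceError:
    pass             # ...fails loudly instead of silently corrupting state
```

To "modify" a frozen instance, create a new one with `dataclasses.replace(t, text="new")`.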
Thread-safe by design
Immutable objects can be safely shared across threads; mutable objects require synchronization (locks).
Trade-off: Creating new objects instead of modifying requires more memory. For our use case (tweets are small, processed one at a time), this is negligible.
Output: All tweets vs deletion candidates only
Decision
Results CSV contains only tweets marked for deletion, not all analyzed tweets.
Rationale
| Aspect | Deletion-Only | All-Tweets |
|---|---|---|
| File size | Small (5-10% of tweets) | Large (100% of tweets) |
| Focus | Action items only | Complete audit trail |
| Review time | Fast (50 tweets to review) | Slow (5,000 tweets to review) |
| Audit trail | Lost (can’t see KEEP decisions) | Complete (can review later) |
| Use case | "What should I delete?" | "Why did you KEEP tweet X?" |
For example, a deletion-only output might contain 50 rows where an all-tweets output would contain 5,000.
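The filtering step can be sketched as below; the column names and the `decision` field are illustrative assumptions.

```python
import csv

def write_deletion_candidates(results, path):
    """Write only DELETE-decision rows to the results CSV."""
    candidates = [r for r in results if r["decision"] == "DELETE"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["tweet_id", "decision", "reason"])
        writer.writeheader()
        writer.writerows(candidates)
    return len(candidates)
```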
Decision: Deletion-only output aligns with the tool’s purpose (cleanup). Users want a focused checklist, not a full audit trail. If they need to review KEEP decisions, they can re-run the analysis.
Summary of key trade-offs
Speed vs simplicity
Chose: Simplicity (sequential processing)
Trade-off: 10x slower, but easier to debug and maintain
Features vs deployment
Chose: Deployment (CSV over database)
Trade-off: Fewer features, but zero setup required
Performance vs safety
Chose: Safety (immutable data, frequent checkpoints)
Trade-off: Higher memory and I/O, but no data loss
Flexibility vs clarity
Chose: Clarity (layered architecture)
Trade-off: More files, but clearer responsibilities
When to revisit these decisions
These decisions are appropriate for the current use case (personal cleanup tool, occasional use, up to 50K tweets). Consider alternatives if:
- Processing >100K tweets regularly → Use concurrent processing, database storage
- Multiple users → Add job queue, user management
- Real-time requirements → Stream processing, WebSocket updates
- Production SaaS → All of the above + monitoring, rate limiting per user, etc.
Next steps
- Architecture overview: Review the high-level system design
- Component details: Dive into implementation specifics