System layers
The architecture separates concerns into three distinct layers.

CLI layer (main.py)
The presentation layer handles user interaction:

- Parses command-line arguments
- Delegates work to the application layer
- Formats results for human consumption
- Manages exit codes
The CLI layer is deliberately thin - it contains no business logic. This allows the application layer to be reused in other contexts (web API, GUI, automated pipeline) without modification.
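A minimal sketch of this split, with a hypothetical `run_analysis` standing in for the application layer (the real entry point in main.py may differ):

```python
import argparse
import sys

def run_analysis(input_path: str, batch_size: int) -> int:
    """Stand-in for the application layer; returns the number of tweets processed.

    The real logic (storage, analyzer, checkpoints) lives in application.py.
    """
    return 0

def main(argv=None) -> int:
    # Presentation only: parse arguments, delegate, format output, set exit code.
    parser = argparse.ArgumentParser(description="Analyze archived tweets")
    parser.add_argument("input", help="path to the tweet archive")
    parser.add_argument("--batch-size", type=int, default=10)
    args = parser.parse_args(argv)
    try:
        count = run_analysis(args.input, args.batch_size)
    except Exception as exc:
        print(f"error: {exc}", file=sys.stderr)
        return 1
    print(f"Analyzed {count} tweets")
    return 0

# In main.py's entry point: sys.exit(main())
```

Because `main` contains no business logic, a web API or pipeline could call `run_analysis` directly.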
Application layer (application.py)
The orchestration layer contains business logic:

- Coordinates between storage and analyzer components
- Implements checkpoint/resume workflow
- Handles batch processing
- Manages error recovery
Infrastructure layer
The foundation layer provides specialized services:

- Storage (storage.py): File operations (JSON parsing, CSV I/O, checkpointing)
- Analyzer (analyzer.py): AI integration (Gemini client, prompt engineering, rate limiting)
- Config (config.py): Configuration management (environment variables, criteria loading)
- Models (models.py): Data structures (Tweet, AnalysisResult, Decision)
Core design principles
Separation of concerns
Each module has a single responsibility:

- Easy to test (mock one layer without affecting others)
- Easy to swap implementations (e.g., switch AI providers)
- Easy to understand (clear boundaries between components)
Fail-safe defaults
The tool prioritizes not losing progress over speed:

- Checkpoints after every batch (default: 10 tweets)
- Conservative rate limiting (1 req/sec default)
- Retry logic for transient failures
- All-or-nothing batch processing
Trade-off: Slower execution, but you can interrupt (Ctrl+C) anytime and resume exactly where you left off.
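The checkpoint/resume workflow above can be sketched as follows. This is a hypothetical illustration, assuming a JSON checkpoint file and an `analyze` stub in place of the real Gemini-backed analyzer:

```python
import json
from pathlib import Path

def analyze(tweet):
    # Stub standing in for the real analyzer component.
    return {"tweet": tweet, "decision": "keep"}

def process_with_checkpoints(tweets, checkpoint_path, batch_size=10):
    """Process tweets in batches, writing a checkpoint after each batch."""
    path = Path(checkpoint_path)
    # Resume: pick up from the last completed index, if a checkpoint exists.
    start = json.loads(path.read_text())["done"] if path.exists() else 0
    results = []
    for i in range(start, len(tweets), batch_size):
        batch = tweets[i:i + batch_size]
        results.extend(analyze(t) for t in batch)  # may be interrupted mid-run
        # All-or-nothing: record progress only once the whole batch succeeded,
        # so Ctrl+C loses at most one batch of work.
        path.write_text(json.dumps({"done": i + len(batch)}))
    return results
```

Interrupting and re-running picks up at the last recorded index rather than starting over.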
Sequential processing
Processes tweets one at a time (with batching for checkpoints), not concurrently. Why sequential?

- Natural rate limiting (no complex throttling needed)
- Simple error handling (one failure = pause and preserve state)
- Predictable behavior (easier debugging)
- Lower memory usage
Trade-off: Takes longer than parallel processing. Concurrent processing would be 5-10x faster but adds retry coordination, partial failure handling, and complex state management.
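The sequential loop can be sketched like this (a simplified illustration, not the project's exact code), showing how spacing requests gives natural rate limiting with no throttling machinery:

```python
import time

def analyze_sequentially(tweets, analyze, min_interval=1.0):
    """Call the analyzer one tweet at a time, at most one request per interval."""
    results = []
    last_call = 0.0
    for tweet in tweets:
        # Sleep only if the previous call finished less than min_interval ago,
        # enforcing the 1 req/sec default without any concurrency primitives.
        wait = min_interval - (time.monotonic() - last_call)
        if wait > 0:
            time.sleep(wait)
        last_call = time.monotonic()
        results.append(analyze(tweet))  # one failure here = stop with state intact
    return results
```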
Technology choices
Python
Why Python?

- Fast prototyping and iteration
- First-class AI/ML library support
- Google’s generativeai SDK is Python-native
- Dynamic typing simplifies prompt engineering experiments

Trade-offs:

- Runtime performance overhead
- Dependency management complexity (Poetry required)
- No single-binary distribution
CSV for storage
Why CSV instead of a database?

- Simpler deployment (no DB setup required)
- Human-readable output (open in Excel/Google Sheets)
- Easy to version control or share
- Sufficient for typical dataset size (up to 100K tweets)

Trade-offs:

- No query capabilities
- Must load the entire file to read
- Limited type safety
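A sketch of CSV-backed storage using only the standard library. The column names here are hypothetical; the real storage.py defines its own schema:

```python
import csv

# Hypothetical column layout for illustration.
FIELDS = ["tweet_id", "text", "decision"]

def write_results(path, rows):
    """Write analysis results as a plain CSV that opens in any spreadsheet."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)

def read_results(path):
    """Load the whole file back; CSV has no index, so reads are all-or-nothing."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```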
Immutable data structures
All data models use @dataclass(frozen=True):
- Prevents accidental modification
- Makes bugs obvious (crash instead of silent corruption)
- Thread-safe by design
- Easier to reason about data flow
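For example (with hypothetical fields; the real models.py defines Tweet, AnalysisResult, and Decision), any mutation attempt raises immediately instead of silently corrupting data:

```python
import dataclasses
from dataclasses import dataclass

@dataclass(frozen=True)
class Tweet:
    tweet_id: str
    text: str
    created_at: str

tweet = Tweet("1", "hello", "2024-01-01")
try:
    tweet.text = "edited"  # any assignment to a frozen instance...
except dataclasses.FrozenInstanceError:
    pass                   # ...raises FrozenInstanceError at the call site
```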
Execution model
The tool runs as a long-running process that completes all batches in one execution:

- Simple for users (no cron jobs or manual re-runs)
- Automatic resume on interruption
- Single execution context (easier debugging)

Trade-offs:

- Longer-running processes (harder to monitor)
- Can’t schedule batches with different resource constraints
- No built-in distributed processing
Error handling philosophy
Three-tier error handling strategy:

1. Retry at boundary (Analyzer)

Transient API errors get automatic retry with exponential backoff.

2. Graceful degradation (Application)
- Single tweet failure → abort batch, preserve checkpoint
- File errors → return error Result, don’t crash
- Empty tweet list → success with count=0
3. User feedback (CLI)
- Success → positive message + count
- Failure → error message + appropriate exit code
- No silent failures
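The retry-at-boundary tier can be sketched as below. This is a generic illustration; the real analyzer would catch the Gemini client's specific transient error types rather than bare `Exception`:

```python
import random
import time

def with_retry(call, attempts=3, base_delay=1.0):
    """Retry a flaky call with exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the application layer
            # Delays of base, 2*base, 4*base, ... plus jitter so that
            # simultaneous clients don't retry in lockstep.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```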
Security considerations
Secrets management
- API keys stored in .env (gitignored)
- No secrets in code or config.json
- Validation on first use (fail fast)
File permissions
All files are created with restrictive permissions.

Data privacy
- All processing is local (tweets never sent elsewhere except Gemini API)
- Results stored locally (you control deletion)
- No analytics or telemetry
- No external dependencies beyond Gemini API
Performance characteristics
Time complexity
- Extract: O(n) - read JSON once, write CSV once
- Analyze: O(n) - one API call per tweet
Space complexity
- Extract: O(n) - must hold all tweets in memory
- Analyze: O(batch_size) - only current batch in memory
Bottlenecks
API latency (500-2000ms per request):

- Cannot be eliminated (external API constraint)
- Mitigated by batch checkpoints (pause/resume anytime)
- Handled by retry logic with exponential backoff
- Checkpoint mechanism preserves progress

Memory usage:

- Current implementation loads all tweets for analysis
- Could be improved with a streaming CSV parser (future enhancement)
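The streaming enhancement mentioned above could look like this: `csv.DictReader` is already lazy, so yielding rows keeps memory flat regardless of file size (a sketch, not the project's current code):

```python
import csv

def iter_tweets(path):
    """Yield tweets one row at a time instead of loading the whole file.

    Memory stays constant in the number of rows, since only the
    current row is ever materialized.
    """
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            yield row
```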
Scalability limits
Works well for
- 1,000 - 50,000 tweets
- Single user on laptop/desktop
- Gemini free tier API limits (15 RPM)
- Occasional cleanup tasks
Struggles with
- 100,000+ tweets (hours of processing)
- Multiple concurrent users (no concurrency support)
- Real-time requirements (sequential = slow)
- High-throughput scenarios (1 req/sec max by default)
To scale beyond these limits
- Implement async/await for concurrent API calls
- Use Redis for distributed checkpoints
- Deploy as microservice with job queue
- Use batch API if available from provider
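The async/await path might look like the sketch below, using a semaphore to bound in-flight requests. Note what it omits: per-tweet retry and checkpointing of partial failures, which is exactly the complexity the sequential design avoids:

```python
import asyncio

async def analyze_concurrently(tweets, analyze_one, max_in_flight=5):
    """Run up to max_in_flight analyses at once (hypothetical scaling sketch)."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def bounded(tweet):
        # The semaphore caps concurrency, acting as a crude rate limiter.
        async with semaphore:
            return await analyze_one(tweet)

    # gather preserves input order even though calls overlap in time.
    return await asyncio.gather(*(bounded(t) for t in tweets))
```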
Next steps
- Component details: Deep dive into each module’s responsibilities and implementation
- Data flow: Understand how data moves through the system
- Design decisions: Learn about key trade-offs and why they were made