Overview
JARVIS (formerly SPECTER) is a real-time person intelligence platform built for the Browser Use + YC web agents hackathon. The system processes visual input from cameras or uploads, identifies individuals through facial recognition, and autonomously gathers comprehensive intelligence through a coordinated agent swarm.

High-Level Architecture
Core Components
1. Capture Pipeline
Location: backend/pipeline.py
Purpose: Process uploaded images/videos, detect faces, and initiate the identification and research pipeline.
Key Responsibilities:
- Frame extraction from videos using FFmpeg
- Face detection using MediaPipe
- Face embedding generation with ArcFace
- Pipeline orchestration through all stages
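The four responsibilities above chain together in order. A minimal sketch of that chaining, with hypothetical stand-in stage functions (the real stages wrap FFmpeg, MediaPipe, and InsightFace ArcFace respectively):

```python
from dataclasses import dataclass, field

@dataclass
class Capture:
    frames: list = field(default_factory=list)
    faces: list = field(default_factory=list)
    embeddings: list = field(default_factory=list)

def extract_frames(video_path):     # FFmpeg frame extraction in the real pipeline
    return [f"frame-{i}" for i in range(3)]

def detect_faces(frame):            # MediaPipe face detection in the real pipeline
    return [f"{frame}/face-0"]

def embed_face(face):               # ArcFace 512-dim embedding in the real pipeline
    return [0.0] * 512

def run_pipeline(video_path):
    # Orchestrate all stages: frames -> faces -> embeddings
    cap = Capture()
    cap.frames = extract_frames(video_path)
    for frame in cap.frames:
        for face in detect_faces(frame):
            cap.faces.append(face)
            cap.embeddings.append(embed_face(face))
    return cap

cap = run_pipeline("upload.mp4")
```

The embeddings produced here are what the identification system searches against.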
2. Identification System
Location: backend/identification/
Purpose: Identify individuals from face images using multi-tiered face search.
Strategy:
- Tier 1: PimEyes (purpose-built face search)
- Tier 2: Reverse image search (Google, Yandex, Bing) as fallback
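The tiered strategy is a simple fallback chain: only if the purpose-built search misses do the general reverse-image engines get tried. A hedged sketch, with hypothetical search functions that return a name on a hit or None on a miss:

```python
def pimeyes_search(face):      # Tier 1: purpose-built face search
    return None                # simulate a miss to exercise the fallback

def google_search(face):       # Tier 2 engines, tried in order
    return None

def yandex_search(face):
    return "Jane Doe"

def bing_search(face):
    return None

def identify(face):
    result = pimeyes_search(face)          # Tier 1 first
    if result:
        return result, "pimeyes"
    for engine in (google_search, yandex_search, bing_search):
        result = engine(face)              # Tier 2 fallback chain
        if result:
            return result, engine.__name__
    return None, None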
3. Research Orchestrator
Location: backend/agents/orchestrator.py
Purpose: Coordinate parallel research across multiple platforms and data sources.
Two-Phase Architecture:
Phase 1: Static Agents + API Enrichment
- LinkedIn Agent (Voyager API)
- Twitter/X Agent (GraphQL reverse engineering)
- Instagram Agent (browser scraping)
- Google Agent (web search)
- OSINT Agent (aggregated searches)
- Social Agent (username enumeration)
- Exa API (structured person search)
Phase 2: Dynamic Scrapers
- Spawned for high-value URLs discovered in Phase 1
- Skips domains already covered by static agents
- Limited to 3 concurrent scrapers to manage resources
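The Phase 2 constraints (skip covered domains, cap at 3 concurrent scrapers) map naturally onto an asyncio semaphore. A minimal sketch, assuming a hypothetical `scrape` coroutine standing in for a Browser Use agent run:

```python
import asyncio

SEM_LIMIT = 3                      # cap on concurrent dynamic scrapers
STATIC_DOMAINS = {"linkedin.com", "twitter.com", "instagram.com"}

async def scrape(url, sem):
    async with sem:                # at most SEM_LIMIT scrapers run at once
        await asyncio.sleep(0)     # stand-in for a Browser Use agent run
        return {"url": url}

async def phase2(urls):
    # Skip domains already covered by Phase 1 static agents
    targets = [u for u in urls if not any(d in u for d in STATIC_DOMAINS)]
    sem = asyncio.Semaphore(SEM_LIMIT)
    return await asyncio.gather(*(scrape(u, sem) for u in targets))

results = asyncio.run(phase2([
    "https://linkedin.com/in/jane",   # covered by static agent: skipped
    "https://example.com/bio",        # new high-value URL: scraped
    "https://blog.example.org/post",  # new high-value URL: scraped
]))
```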
4. Synthesis Engine
Location: backend/synthesis/
Purpose: Aggregate all gathered intelligence into a coherent, structured dossier.
Dual-Engine Strategy:
- Primary: Anthropic Claude (configurable)
- Fallback: Google Gemini 2.0 Flash (cheaper, faster)
Connection Detection:
- Shared employers (colleagues)
- Shared educational institutions (classmates)
- Mutual connections on social platforms
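The dual-engine strategy is a try/except around the primary call. A hedged sketch; the engine functions here are stand-ins for the actual Claude and Gemini API calls:

```python
def claude_synthesize(fragments):
    raise RuntimeError("rate limited")     # simulate a primary-engine failure

def gemini_synthesize(fragments):
    return {"dossier": " | ".join(fragments), "engine": "gemini"}

def synthesize(fragments):
    try:
        return claude_synthesize(fragments)    # primary: Anthropic Claude
    except Exception:
        return gemini_synthesize(fragments)    # fallback: Gemini 2.0 Flash

dossier = synthesize(["works at Acme", "studied at MIT"])
```

Because the fallback catches any primary failure (rate limits, outages), synthesis always produces a dossier.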
5. Real-Time Data Layer
Location: frontend/convex/
Purpose: Power real-time UI updates as intelligence streams in from agents.
Why Convex?
- Zero WebSocket boilerplate
- Automatic real-time subscriptions
- Built-in state management
- Perfect for the “streaming corkboard” effect
Why MongoDB?
- Raw image storage via GridFS
- Capture metadata and embeddings
- Historical data and audit logs
6. Corkboard Frontend
Location: frontend/
Tech Stack:
- Next.js 14 (App Router)
- Framer Motion (animations)
- Tailwind CSS (styling)
- Convex React hooks (real-time subscriptions)
End-to-End Data Flow
Capture
Camera input → Frame extraction → Face detection → MongoDB (raw image) + Convex (capture record)
Identify
PimEyes face search → Name extraction → Convex (person record, status: “identified”)
Frontend: Paper spawns on corkboard with photo + name
Research
Orchestrator spawns 6+ agents in parallel → Each agent extracts data → Convex (intel fragments stream in)
Frontend: Paper updates in real-time as data arrives
Synthesize
LLM aggregates all fragments → Structured dossier → Convex (person.dossier updated, status: “complete”)
Frontend: Paper fully revealed, connections drawn
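The four-stage flow above can be sketched as a sequence of status writes to the person record. Everything here is illustrative: `store` stands in for the Convex record the frontend subscribes to, and the stage results are hardcoded placeholders:

```python
def run_person(image_bytes, store):
    store["status"] = "captured"                     # Capture
    store["name"] = "Jane Doe"                       # Identify (PimEyes match)
    store["status"] = "identified"                   # paper spawns on corkboard
    store["fragments"] = ["linkedin", "twitter"]     # Research (agents stream in)
    store["dossier"] = {"name": store["name"],       # Synthesize
                        "sources": store["fragments"]}
    store["status"] = "complete"                     # paper fully revealed
    return store

record = run_person(b"...", {})
```

Each status write would trigger a real-time UI update through the frontend's subscription.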
Technology Stack
| Layer | Technology | Rationale |
|---|---|---|
| Backend | Python FastAPI | Browser Use SDK is Python, async support |
| Face Detection | MediaPipe | 5-10ms/frame, 95%+ accuracy, zero C++ deps |
| Face Embeddings | InsightFace ArcFace | 512-dim embeddings, state-of-the-art matching |
| Agent Orchestration | Browser Use API | Hackathon sponsor, required for demo |
| LLM (Primary) | Anthropic Claude | High-quality synthesis, structured output |
| LLM (Fallback) | Gemini 2.0 Flash | 25x cheaper than GPT-4o, 2x faster |
| Real-Time DB | Convex | Zero WebSocket boilerplate, auto-subscriptions |
| Persistent DB | MongoDB Atlas | Free cluster, GridFS for images |
| Research API | Exa | 200ms person search, structured results |
| Frontend | Next.js + Framer Motion | Fast dev, great animations, Vercel deploy |
| Observability | Laminar | Agent tracing, accuracy verification |
| Memory | SuperMemory | Cross-session dossier caching |
Design Decisions
Why a Python backend instead of a full JavaScript stack?
The Browser Use SDK is Python-native, and the existing codebase was Python. Rewriting would waste hackathon time. Python’s async/await support makes it ideal for coordinating parallel agents.
Why Convex over raw WebSockets?
Convex eliminates WebSocket boilerplate entirely. Real-time subscriptions work out of the box. In a 24-hour hackathon, speed of development trumps flexibility. The streaming corkboard effect requires instant updates as each agent completes.
Why multiple LLM engines?
The dual-engine strategy provides resilience. If the primary engine fails (rate limits, API issues), the fallback engine ensures synthesis always completes. Gemini 2.0 Flash is 25x cheaper than GPT-4o, making it well suited to bulk synthesis.
Performance Characteristics
Typical Timeline (per person):
- Face detection: 5-10ms per frame
- Face search (PimEyes): 10-30 seconds
- Static agents (Phase 1): 20-60 seconds (parallel)
- Dynamic scrapers (Phase 2): 10-30 seconds (parallel)
- Synthesis: 5-10 seconds
- Total: 45-130 seconds from capture to complete dossier
Concurrency and Failure Handling:
- Multiple faces per frame: processed in sequence (not parallelized)
- Multiple people: enrichment runs in parallel (error-isolated)
- Agent failures: logged but never block pipeline completion
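The "error-isolated, never block" behavior maps onto `asyncio.gather` with `return_exceptions=True`. A minimal sketch with a hypothetical `enrich` coroutine standing in for per-person enrichment:

```python
import asyncio

async def enrich(person):
    if person == "broken":
        raise RuntimeError("agent failed")     # simulated agent failure
    return f"dossier:{person}"

async def enrich_all(people):
    # return_exceptions=True isolates failures: one bad agent is
    # collected as an exception instead of cancelling its siblings
    results = await asyncio.gather(*(enrich(p) for p in people),
                                   return_exceptions=True)
    done = [r for r in results if not isinstance(r, Exception)]
    failed = [r for r in results if isinstance(r, Exception)]
    return done, failed

done, failed = asyncio.run(enrich_all(["alice", "broken", "bob"]))
```

The failed list would be logged (and traced via Laminar), while the completed dossiers proceed through the pipeline.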
Observability
Laminar Tracing: Every LLM call and agent run is traced:
- Real-time agent status
- LLM call latency and token usage
- Error tracking and retry logic
- Accuracy verification for synthesis
Next Steps
Capture Pipeline
Deep dive into frame extraction, face detection, and pipeline orchestration
Identification
How PimEyes and reverse image search identify people from faces
Agent Swarm
Two-phase research orchestration with parallel agent execution
Real-Time Streaming
Convex subscriptions powering the live corkboard interface