
Overview

JARVIS (formerly SPECTER) is a real-time person intelligence platform built for the Browser Use + YC web agents hackathon. The system processes visual input from cameras or uploads, identifies individuals through facial recognition, and autonomously gathers comprehensive intelligence through a coordinated agent swarm.

High-Level Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                        JARVIS ARCHITECTURE                          │
│                                                                     │
│  ┌──────────┐    ┌──────────────┐    ┌───────────────────────────┐ │
│  │  CAMERA  │───▶│  CAPTURE     │───▶│    IDENTIFICATION         │ │
│  │  INPUT   │    │  PIPELINE    │    │    (Face Search + LLM)    │ │
│  └──────────┘    └──────────────┘    └───────────┬───────────────┘ │
│                                                   │                 │
│                                                   ▼                 │
│                                      ┌───────────────────────────┐ │
│                                      │   RESEARCH ORCHESTRATOR   │ │
│                                      │   (Agent Swarm Manager)   │ │
│                                      └───────────┬───────────────┘ │
│                                                   │                 │
│                         ┌─────────────────────────┼────────────┐   │
│                         ▼            ▼            ▼            ▼   │
│                    ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌────────┐│
│                    │LinkedIn │ │X/Twitter│ │Instagram│ │  Exa   ││
│                    │ Agent   │ │  Agent  │ │  Agent  │ │  API   ││
│                    └────┬────┘ └────┬────┘ └────┬────┘ └───┬────┘│
│                         │           │           │          │      │
│                         └─────────┬─┴───────────┴──────────┘      │
│                                   ▼                                │
│                      ┌────────────────────────┐                    │
│                      │  SYNTHESIS ENGINE      │                    │
│                      │  (LLM Aggregation)     │                    │
│                      └──────────┬─────────────┘                    │
│                                 │                                   │
│                    ┌────────────┼────────────┐                     │
│                    ▼            ▼            ▼                     │
│              ┌──────────┐ ┌──────────┐ ┌──────────────┐           │
│              │ Convex   │ │ MongoDB  │ │  Corkboard   │           │
│              │ Realtime │ │ Storage  │ │  Frontend    │           │
│              └──────────┘ └──────────┘ └──────────────┘           │
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │  OBSERVABILITY: Laminar (tracing) + HUD (agent debugging)   │  │
│  └──────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────┘

Core Components

1. Capture Pipeline

Location: backend/pipeline.py
Purpose: Process uploaded images/videos, detect faces, and initiate the identification and research pipeline.
Key Responsibilities:
  • Frame extraction from videos using FFmpeg
  • Face detection using MediaPipe
  • Face embedding generation with ArcFace
  • Pipeline orchestration through all stages
Implementation:
class CapturePipeline:
    async def process(
        self,
        capture_id: str,
        data: bytes,
        content_type: str,
        source: str = "manual_upload",
        person_name: str | None = None,
    ) -> PipelineResult:
        # 1. Extract frames from the uploaded media
        frames = extract_frames(data, content_type)
        
        # 2. Detect faces in each frame
        for frame_bytes in frames:
            detection_result = await self._detector.detect_faces(request)
            
            # 3. Generate embeddings for each face
            embedding = self._embedder.embed(face, frame_bytes)
            
            # 4. Identify via face search
            resolved_name = await self._identify_face(embedding, face_image)
            
            # 5. Create person record
            await self._db.store_person(person_id, person_data)
        
        # 6. Enrich each identified person (parallel, error-isolated)
        results = await asyncio.gather(*enrichment_tasks, return_exceptions=True)
See Capture Pipeline for detailed implementation.
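The FFmpeg-based frame extraction can be sketched with FFmpeg's pipe interface. The `sample_fps` rate and the PNG-to-stdout flags below are illustrative assumptions, not the pipeline's exact invocation:

```python
import subprocess

def build_frame_extraction_cmd(sample_fps: float = 1.0) -> list[str]:
    """Build an FFmpeg command that reads media from stdin and emits
    PNG-encoded frames on stdout at the given sampling rate."""
    return [
        "ffmpeg",
        "-i", "pipe:0",              # read the uploaded bytes from stdin
        "-vf", f"fps={sample_fps}",  # sample frames at the given rate
        "-f", "image2pipe",          # stream frames instead of writing files
        "-vcodec", "png",            # PNG keeps frames lossless for detection
        "pipe:1",
    ]

def extract_frame_stream(data: bytes, sample_fps: float = 1.0) -> bytes:
    """Run FFmpeg over the raw upload and return the concatenated PNG stream."""
    proc = subprocess.run(
        build_frame_extraction_cmd(sample_fps),
        input=data,
        capture_output=True,
        check=True,
    )
    return proc.stdout
```

Piping through stdin/stdout avoids writing uploads to disk, which matters when frames arrive faster than a temp-file round trip.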

2. Identification System

Location: backend/identification/
Purpose: Identify individuals from face images using multi-tiered face search.
Strategy:
  • Tier 1: PimEyes (purpose-built face search)
  • Tier 2: Reverse image search (Google, Yandex, Bing) as fallback
Implementation:
class FaceSearchManager:
    async def search_face(self, request: FaceSearchRequest) -> FaceSearchResult:
        # Try PimEyes first
        pimeyes_result = await self._pimeyes.search_face(request)
        if pimeyes_result.success and pimeyes_result.matches:
            return pimeyes_result
        
        # Fallback to reverse image search
        return await self._reverse.search_face(request)
See Identification for detailed implementation.
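The ArcFace matching behind step 4 of the capture pipeline reduces to cosine similarity over the 512-dim embeddings. A minimal sketch (the 0.5 threshold is an assumed value, not the system's tuned one):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embeddings; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def best_match(query: list[float],
               known: dict[str, list[float]],
               threshold: float = 0.5):
    """Return (person_id, score) of the closest known embedding,
    or None if nothing clears the threshold."""
    scored = [(pid, cosine_similarity(query, emb)) for pid, emb in known.items()]
    if not scored:
        return None
    pid, score = max(scored, key=lambda t: t[1])
    return (pid, score) if score >= threshold else None
```

This local match is what lets the system skip an external face search when a person has already been seen in a previous capture.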

3. Research Orchestrator

Location: backend/agents/orchestrator.py
Purpose: Coordinate parallel research across multiple platforms and data sources.
Two-Phase Architecture:
Phase 1: Static Agents + API Enrichment
  • LinkedIn Agent (Voyager API)
  • Twitter/X Agent (GraphQL reverse engineering)
  • Instagram Agent (browser scraping)
  • Google Agent (web search)
  • OSINT Agent (aggregated searches)
  • Social Agent (username enumeration)
  • Exa API (structured person search)
Phase 2: Dynamic URL Scrapers
  • Spawned for high-value URLs discovered in Phase 1
  • Skips domains already covered by static agents
  • Limited to 3 concurrent scrapers to manage resources
Implementation:
class ResearchOrchestrator:
    async def research_person(self, request: ResearchRequest) -> OrchestratorResult:
        # Phase 1: Launch all static agents in parallel
        tasks = {agent.agent_name: asyncio.create_task(agent.run(request)) 
                 for agent in self._static_agents}
        
        # Wait for completion with timeout
        done, pending = await asyncio.wait(tasks.values(), timeout=timeout)
        
        # Phase 2: Spawn dynamic scrapers for discovered URLs
        dynamic_results = await self._run_dynamic_scrapers(request, exa_result.hits, remaining_time)
See Agent Swarm for detailed implementation.
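The 3-scraper concurrency cap from Phase 2 can be enforced with a semaphore. A sketch under assumed names (`scrape` and `skip_domains` are illustrative, not the orchestrator's real interface):

```python
import asyncio

MAX_DYNAMIC_SCRAPERS = 3  # cap described in the Phase 2 notes above

async def run_dynamic_scrapers(urls, scrape, skip_domains=frozenset()):
    """Scrape discovered URLs with at most MAX_DYNAMIC_SCRAPERS in flight,
    skipping domains already covered by static agents."""
    sem = asyncio.Semaphore(MAX_DYNAMIC_SCRAPERS)

    async def bounded(url):
        async with sem:          # at most 3 scrapers hold the semaphore
            return await scrape(url)

    targets = [u for u in urls if not any(d in u for d in skip_domains)]
    # return_exceptions=True keeps one failed scraper from sinking the batch
    return await asyncio.gather(*(bounded(u) for u in targets),
                                return_exceptions=True)
```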

4. Synthesis Engine

Location: backend/synthesis/
Purpose: Aggregate all gathered intelligence into a coherent, structured dossier.
Dual-Engine Strategy:
  • Primary: Anthropic Claude (configurable)
  • Fallback: Google Gemini 2.0 Flash (cheaper, faster)
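The fallback behavior amounts to a try/except around the primary engine. A minimal sketch, assuming each engine is exposed as an async callable:

```python
async def synthesize_with_fallback(fragments, primary, fallback):
    """Try the primary synthesis engine; on any failure (rate limit,
    API outage) fall back to the cheaper engine so a dossier is
    always produced."""
    try:
        return await primary(fragments)
    except Exception:
        return await fallback(fragments)
```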
Output Structure:
from dataclasses import dataclass

@dataclass
class DossierReport:
    summary: str
    occupation: str | None
    organization: str | None
    work_history: list[WorkExperience]
    education: list[Education]
    social_profiles: dict[str, str]
    notable_activity: list[str]
    conversation_hooks: list[str]
    risk_flags: list[str]
    confidence_score: float
Connection Detection: After synthesis, the system automatically detects relationships between people:
  • Shared employers (colleagues)
  • Shared educational institutions (classmates)
  • Mutual connections on social platforms
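Connection detection is essentially a pairwise set intersection over dossier fields. A sketch, assuming hypothetical `employers`/`schools` keys rather than the real dossier layout:

```python
def detect_connections(people: dict[str, dict]) -> list[tuple[str, str, str]]:
    """Pairwise scan of synthesized dossiers for shared employers and
    schools; returns (person_a, person_b, relationship_type) tuples."""
    connections = []
    ids = sorted(people)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if set(people[a].get("employers", [])) & set(people[b].get("employers", [])):
                connections.append((a, b, "colleagues"))
            if set(people[a].get("schools", [])) & set(people[b].get("schools", [])):
                connections.append((a, b, "classmates"))
    return connections
```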

5. Real-Time Data Layer

Location: frontend/convex/
Purpose: Power real-time UI updates as intelligence streams in from agents.
Why Convex?
  • Zero WebSocket boilerplate
  • Automatic real-time subscriptions
  • Built-in state management
  • Perfect for the “streaming corkboard” effect
Schema:
// Convex schema (schema.ts)
export default defineSchema({
  persons: defineTable({
    name: v.string(),
    photoUrl: v.string(),
    confidence: v.number(),
    status: v.union(
      v.literal("identified"),
      v.literal("researching"),
      v.literal("synthesizing"),
      v.literal("complete")
    ),
    boardPosition: v.object({ x: v.number(), y: v.number() }),
    dossier: v.optional(v.object({...})),
    createdAt: v.number(),
    updatedAt: v.number(),
  }),
  
  intelFragments: defineTable({
    personId: v.id("persons"),
    source: v.string(),
    content: v.string(),
    timestamp: v.number(),
  }).index("by_person", ["personId"]),
  
  connections: defineTable({
    personAId: v.id("persons"),
    personBId: v.id("persons"),
    relationshipType: v.string(),
    description: v.string(),
  }),
});
MongoDB for Persistence:
  • Raw image storage via GridFS
  • Capture metadata and embeddings
  • Historical data and audit logs

6. Corkboard Frontend

Location: frontend/
Tech Stack:
  • Next.js 14 (App Router)
  • Framer Motion (animations)
  • Tailwind CSS (styling)
  • Convex React hooks (real-time subscriptions)
Real-Time Features:
// Frontend real-time subscriptions
const persons = useQuery(api.persons.listAll);
const intel = useQuery(api.intel.byPerson, { personId });
const connections = useQuery(api.connections.byPerson, { personId });
See Real-Time Streaming for detailed implementation.

End-to-End Data Flow

1. Capture
   Camera input → Frame extraction → Face detection → MongoDB (raw image) + Convex (capture record)

2. Identify
   PimEyes face search → Name extraction → Convex (person record, status: “identified”)
   Frontend: Paper spawns on corkboard with photo + name

3. Research
   Orchestrator spawns 6+ agents in parallel → Each agent extracts data → Convex (intel fragments stream in)
   Frontend: Paper updates in real-time as data arrives

4. Synthesize
   LLM aggregates all fragments → Structured dossier → Convex (person.dossier updated, status: “complete”)
   Frontend: Paper fully revealed, connections drawn

5. Notify
   Complete dossier available → Live activity feed updated
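The Convex `status` union acts as a simple linear state machine over these stages; a sketch of the progression:

```python
# Ordered lifecycle statuses mirrored by the Convex `status` union
STATUSES = ["identified", "researching", "synthesizing", "complete"]

def advance_status(current: str) -> str:
    """Move a person record to the next pipeline stage; 'complete' is terminal."""
    i = STATUSES.index(current)
    return STATUSES[min(i + 1, len(STATUSES) - 1)]
```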

Technology Stack

Layer | Technology | Rationale
----- | ---------- | ---------
Backend | Python FastAPI | Browser Use SDK is Python, async support
Face Detection | MediaPipe | 5-10ms/frame, 95%+ accuracy, zero C++ deps
Face Embeddings | InsightFace ArcFace | 512-dim embeddings, state-of-the-art matching
Agent Orchestration | Browser Use API | Hackathon sponsor, required for demo
LLM (Primary) | Anthropic Claude | High-quality synthesis, structured output
LLM (Fallback) | Gemini 2.0 Flash | 25x cheaper than GPT-4o, 2x faster
Real-Time DB | Convex | Zero WebSocket boilerplate, auto-subscriptions
Persistent DB | MongoDB Atlas | Free cluster, GridFS for images
Research API | Exa | 200ms person search, structured results
Frontend | Next.js + Framer Motion | Fast dev, great animations, Vercel deploy
Observability | Laminar | Agent tracing, accuracy verification
Memory | SuperMemory | Cross-session dossier caching

Design Decisions

Why Python backend instead of full JavaScript stack?
The Browser Use SDK is Python-native, and the existing codebase was Python. Rewriting would waste hackathon time. Python’s async/await support makes it ideal for coordinating parallel agents.

Why Convex over raw WebSockets?
Convex eliminates WebSocket boilerplate entirely. Real-time subscriptions work out of the box. In a 24-hour hackathon, speed of development trumps flexibility. The streaming corkboard effect requires instant updates as each agent completes.

Why multiple LLM engines?
Dual-engine strategy provides resilience. If the primary engine fails (rate limits, API issues), the fallback engine ensures synthesis always completes. Gemini 2.0 Flash is 25x cheaper than GPT-4o, making it perfect for bulk synthesis.

Hackathon Trade-offs
This architecture prioritizes demo impact over production scalability:
  • No authentication/authorization
  • Limited error recovery
  • Partial results acceptable
  • 3-minute hard timeout per person
  • Streaming partial results > waiting for completion
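The 3-minute hard timeout with partial-results fallback can be sketched with `asyncio.wait_for`; the function names here are illustrative:

```python
import asyncio

PER_PERSON_TIMEOUT = 180.0  # 3-minute hard cap from the trade-offs above

async def research_with_deadline(research, fragments_so_far,
                                 timeout=PER_PERSON_TIMEOUT):
    """Enforce the hard timeout; on expiry return whatever intel
    fragments have already streamed in."""
    try:
        return await asyncio.wait_for(research(), timeout=timeout)
    except asyncio.TimeoutError:
        return fragments_so_far  # partial results beat waiting for completion
```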

Performance Characteristics

Typical Timeline (per person):
  • Face detection: 5-10ms per frame
  • Face search (PimEyes): 10-30 seconds
  • Static agents (Phase 1): 20-60 seconds (parallel)
  • Dynamic scrapers (Phase 2): 10-30 seconds (parallel)
  • Synthesis: 5-10 seconds
  • Total: 45-130 seconds from capture to complete dossier
Scaling:
  • Multiple faces per frame: processed in sequence (not parallelized)
  • Multiple people: enrichment runs in parallel (error-isolated)
  • Agent failures: logged but never block pipeline completion
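The error isolation described above is what `asyncio.gather(..., return_exceptions=True)` provides; a sketch with assumed names:

```python
import asyncio

async def enrich_all(person_ids, enrich):
    """Enrich every identified person in parallel; one failure never
    blocks the others, matching the pipeline's error isolation."""
    results = await asyncio.gather(
        *(enrich(pid) for pid in person_ids),
        return_exceptions=True,  # exceptions come back as values, not raises
    )
    succeeded = {pid: r for pid, r in zip(person_ids, results)
                 if not isinstance(r, BaseException)}
    failed = [pid for pid, r in zip(person_ids, results)
              if isinstance(r, BaseException)]
    return succeeded, failed
```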

Observability

Laminar Tracing: Every LLM call and agent run is traced:
from observability.laminar import traced

@traced("pipeline.identify_face")
async def _identify_face(self, embedding: list[float], image_data: bytes):
    # Traced execution
    search_result = await self._face_searcher.search_face(search_request)
Dashboard Features:
  • Real-time agent status
  • LLM call latency and token usage
  • Error tracking and retry logic
  • Accuracy verification for synthesis
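Conceptually, a tracing decorator like `@traced` records a span name and wall-clock duration around each call. The stand-in below is an illustration of that idea, not Laminar's actual implementation:

```python
import functools
import time

TRACE_LOG: list[tuple[str, float]] = []  # (span_name, duration_seconds)

def traced(span_name: str):
    """Conceptual stand-in for a tracing decorator: records the span
    name and duration of each call, even when the call raises."""
    def decorator(fn):
        @functools.wraps(fn)
        async def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return await fn(*args, **kwargs)
            finally:
                TRACE_LOG.append((span_name, time.monotonic() - start))
        return wrapper
    return decorator
```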

Next Steps

Capture Pipeline

Deep dive into frame extraction, face detection, and pipeline orchestration

Identification

How PimEyes and reverse image search identify people from faces

Agent Swarm

Two-phase research orchestration with parallel agent execution

Real-Time Streaming

Convex subscriptions powering the live corkboard interface
