Overview
JARVIS (formerly SPECTER) is a real-time person intelligence platform built for the Browser Use + YC web agents hackathon. The system processes visual input from cameras or uploads, identifies individuals through facial recognition, and autonomously gathers comprehensive intelligence through a coordinated agent swarm.

High-Level Architecture
Core Components
1. Capture Pipeline
Location: backend/pipeline.py
Purpose: Process uploaded images/videos, detect faces, and initiate the identification and research pipeline.
Key Responsibilities:
- Frame extraction from videos using FFmpeg
- Face detection using MediaPipe
- Face embedding generation with ArcFace
- Pipeline orchestration through all stages
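The four responsibilities above chain together in order. A minimal sketch of that chaining, with hypothetical stand-in stage functions (the real stages wrap FFmpeg, MediaPipe, and InsightFace ArcFace respectively):

```python
from dataclasses import dataclass, field

@dataclass
class Capture:
    frames: list = field(default_factory=list)
    faces: list = field(default_factory=list)
    embeddings: list = field(default_factory=list)

def extract_frames(video_path):     # FFmpeg frame extraction in the real pipeline
    return [f"frame-{i}" for i in range(3)]

def detect_faces(frame):            # MediaPipe face detection in the real pipeline
    return [f"{frame}/face-0"]

def embed_face(face):               # ArcFace 512-dim embedding in the real pipeline
    return [0.0] * 512

def run_pipeline(video_path):
    # Orchestrate all stages: frames -> faces -> embeddings
    cap = Capture()
    cap.frames = extract_frames(video_path)
    for frame in cap.frames:
        for face in detect_faces(frame):
            cap.faces.append(face)
            cap.embeddings.append(embed_face(face))
    return cap

cap = run_pipeline("upload.mp4")
```

The embeddings produced here are what the identification system searches against.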
2. Identification System
Location: backend/identification/
Purpose: Identify individuals from face images using multi-tiered face search.
Strategy:
- Tier 1: PimEyes (purpose-built face search)
- Tier 2: Reverse image search (Google, Yandex, Bing) as fallback
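The tiered strategy is a simple fallback chain: only if the purpose-built search misses do the general reverse-image engines get tried. A hedged sketch, with hypothetical search functions that return a name on a hit or None on a miss:

```python
def pimeyes_search(face):      # Tier 1: purpose-built face search
    return None                # simulate a miss to exercise the fallback

def google_search(face):       # Tier 2 engines, tried in order
    return None

def yandex_search(face):
    return "Jane Doe"

def bing_search(face):
    return None

def identify(face):
    result = pimeyes_search(face)          # Tier 1 first
    if result:
        return result, "pimeyes"
    for engine in (google_search, yandex_search, bing_search):
        result = engine(face)              # Tier 2 fallback chain
        if result:
            return result, engine.__name__
    return None, None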
3. Research Orchestrator
Location: backend/agents/orchestrator.py
Purpose: Coordinate parallel research across multiple platforms and data sources.
Two-Phase Architecture:
Phase 1: Static Agents + API Enrichment
- LinkedIn Agent (Voyager API)
- Twitter/X Agent (GraphQL reverse engineering)
- Instagram Agent (browser scraping)
- Google Agent (web search)
- OSINT Agent (aggregated searches)
- Social Agent (username enumeration)
- Exa API (structured person search)
Phase 2: Dynamic Scrapers
- Spawned for high-value URLs discovered in Phase 1
- Skips domains already covered by static agents
- Limited to 3 concurrent scrapers to manage resources
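The Phase 2 constraints (skip covered domains, cap at 3 concurrent scrapers) map naturally onto an asyncio semaphore. A minimal sketch, assuming a hypothetical `scrape` coroutine standing in for a Browser Use agent run:

```python
import asyncio

SEM_LIMIT = 3                      # cap on concurrent dynamic scrapers
STATIC_DOMAINS = {"linkedin.com", "twitter.com", "instagram.com"}

async def scrape(url, sem):
    async with sem:                # at most SEM_LIMIT scrapers run at once
        await asyncio.sleep(0)     # stand-in for a Browser Use agent run
        return {"url": url}

async def phase2(urls):
    # Skip domains already covered by Phase 1 static agents
    targets = [u for u in urls if not any(d in u for d in STATIC_DOMAINS)]
    sem = asyncio.Semaphore(SEM_LIMIT)
    return await asyncio.gather(*(scrape(u, sem) for u in targets))

results = asyncio.run(phase2([
    "https://linkedin.com/in/jane",   # covered by static agent: skipped
    "https://example.com/bio",        # new high-value URL: scraped
    "https://blog.example.org/post",  # new high-value URL: scraped
]))
```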
4. Synthesis Engine
Location: backend/synthesis/
Purpose: Aggregate all gathered intelligence into a coherent, structured dossier.
Dual-Engine Strategy:
- Primary: Anthropic Claude (configurable)
- Fallback: Google Gemini 2.0 Flash (cheaper, faster)
Connection Detection:
- Shared employers (colleagues)
- Shared educational institutions (classmates)
- Mutual connections on social platforms
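The dual-engine strategy is a try/except around the primary call. A hedged sketch; the engine functions here are stand-ins for the actual Claude and Gemini API calls:

```python
def claude_synthesize(fragments):
    raise RuntimeError("rate limited")     # simulate a primary-engine failure

def gemini_synthesize(fragments):
    return {"dossier": " | ".join(fragments), "engine": "gemini"}

def synthesize(fragments):
    try:
        return claude_synthesize(fragments)    # primary: Anthropic Claude
    except Exception:
        return gemini_synthesize(fragments)    # fallback: Gemini 2.0 Flash

dossier = synthesize(["works at Acme", "studied at MIT"])
```

Because the fallback catches any primary failure (rate limits, outages), synthesis always produces a dossier.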
5. Real-Time Data Layer
Location: frontend/convex/
Purpose: Power real-time UI updates as intelligence streams in from agents.
Why Convex?
- Zero WebSocket boilerplate
- Automatic real-time subscriptions
- Built-in state management
- Perfect for the “streaming corkboard” effect
Why MongoDB?
- Raw image storage via GridFS
- Capture metadata and embeddings
- Historical data and audit logs
6. Corkboard Frontend
Location: frontend/
Tech Stack:
- Next.js 14 (App Router)
- Framer Motion (animations)
- Tailwind CSS (styling)
- Convex React hooks (real-time subscriptions)
End-to-End Data Flow
Capture
Camera input → Frame extraction → Face detection → MongoDB (raw image) + Convex (capture record)
Identify
PimEyes face search → Name extraction → Convex (person record, status: “identified”)
Frontend: Paper spawns on corkboard with photo + name
Research
Orchestrator spawns 6+ agents in parallel → Each agent extracts data → Convex (intel fragments stream in)
Frontend: Paper updates in real-time as data arrives
Synthesize
LLM aggregates all fragments → Structured dossier → Convex (person.dossier updated, status: “complete”)
Frontend: Paper fully revealed, connections drawn
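The four-stage flow above can be sketched as a sequence of status writes to the person record. Everything here is illustrative: `store` stands in for the Convex record the frontend subscribes to, and the stage results are hardcoded placeholders:

```python
def run_person(image_bytes, store):
    store["status"] = "captured"                     # Capture
    store["name"] = "Jane Doe"                       # Identify (PimEyes match)
    store["status"] = "identified"                   # paper spawns on corkboard
    store["fragments"] = ["linkedin", "twitter"]     # Research (agents stream in)
    store["dossier"] = {"name": store["name"],       # Synthesize
                        "sources": store["fragments"]}
    store["status"] = "complete"                     # paper fully revealed
    return store

record = run_person(b"...", {})
```

Each status write would trigger a real-time UI update through the frontend's subscription.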
Technology Stack
| Layer | Technology | Rationale |
|---|---|---|
| Backend | Python FastAPI | Browser Use SDK is Python, async support |
| Face Detection | MediaPipe | 5-10ms/frame, 95%+ accuracy, zero C++ deps |
| Face Embeddings | InsightFace ArcFace | 512-dim embeddings, state-of-the-art matching |
| Agent Orchestration | Browser Use API | Hackathon sponsor, required for demo |
| LLM (Primary) | Anthropic Claude | High-quality synthesis, structured output |
| LLM (Fallback) | Gemini 2.0 Flash | 25x cheaper than GPT-4o, 2x faster |
| Real-Time DB | Convex | Zero WebSocket boilerplate, auto-subscriptions |
| Persistent DB | MongoDB Atlas | Free cluster, GridFS for images |
| Research API | Exa | 200ms person search, structured results |
| Frontend | Next.js + Framer Motion | Fast dev, great animations, Vercel deploy |
| Observability | Laminar | Agent tracing, accuracy verification |
| Memory | SuperMemory | Cross-session dossier caching |
Design Decisions
Why a Python backend instead of a full JavaScript stack?
The Browser Use SDK is Python-native, and the existing codebase was Python. Rewriting would waste hackathon time. Python’s async/await support makes it ideal for coordinating parallel agents.
Why Convex over raw WebSockets?
Convex eliminates WebSocket boilerplate entirely. Real-time subscriptions work out of the box. In a 24-hour hackathon, speed of development trumps flexibility. The streaming corkboard effect requires instant updates as each agent completes.
Why multiple LLM engines?
The dual-engine strategy provides resilience. If the primary engine fails (rate limits, API issues), the fallback engine ensures synthesis always completes. Gemini 2.0 Flash is 25x cheaper than GPT-4o, making it well suited to bulk synthesis.
Performance Characteristics
Typical Timeline (per person):
- Face detection: 5-10ms per frame
- Face search (PimEyes): 10-30 seconds
- Static agents (Phase 1): 20-60 seconds (parallel)
- Dynamic scrapers (Phase 2): 10-30 seconds (parallel)
- Synthesis: 5-10 seconds
- Total: 45-130 seconds from capture to complete dossier
Concurrency and Failure Handling:
- Multiple faces per frame: processed in sequence (not parallelized)
- Multiple people: enrichment runs in parallel (error-isolated)
- Agent failures: logged but never block pipeline completion
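The "error-isolated, never block" behavior maps onto `asyncio.gather` with `return_exceptions=True`. A minimal sketch with a hypothetical `enrich` coroutine standing in for per-person enrichment:

```python
import asyncio

async def enrich(person):
    if person == "broken":
        raise RuntimeError("agent failed")     # simulated agent failure
    return f"dossier:{person}"

async def enrich_all(people):
    # return_exceptions=True isolates failures: one bad agent is
    # collected as an exception instead of cancelling its siblings
    results = await asyncio.gather(*(enrich(p) for p in people),
                                   return_exceptions=True)
    done = [r for r in results if not isinstance(r, Exception)]
    failed = [r for r in results if isinstance(r, Exception)]
    return done, failed

done, failed = asyncio.run(enrich_all(["alice", "broken", "bob"]))
```

The failed list would be logged (and traced via Laminar), while the completed dossiers proceed through the pipeline.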
Observability
Laminar Tracing: Every LLM call and agent run is traced:
- Real-time agent status
- LLM call latency and token usage
- Error tracking and retry logic
- Accuracy verification for synthesis
Next Steps
Capture Pipeline
Deep dive into frame extraction, face detection, and pipeline orchestration
Identification
How PimEyes and reverse image search identify people from faces
Agent Swarm
Two-phase research orchestration with parallel agent execution
Real-Time Streaming
Convex subscriptions powering the live corkboard interface