POST /api/capture

Upload an image or video file for face detection, identification, and enrichment processing.

Authentication

No authentication required (for hackathon demo).

Request

file (file, required)
Image or video file to process. Supports JPEG, PNG, MP4, and other common formats.

source (string, default: "manual_upload")
Source identifier for tracking. Common values:
  • manual_upload - Web interface upload
  • glasses_stream - Meta glasses camera
  • telegram - Telegram bot
  • api_identify - Programmatic identification

person_name (string, optional)
Pre-identified person name to associate with the capture.

Response

Returns a queued capture object.
capture_id (string)
Unique identifier for this capture session.

filename (string)
Original filename of the uploaded file.

content_type (string)
MIME type of the uploaded file (e.g., image/jpeg, video/mp4).

status (string)
Always queued on a successful upload; processing happens asynchronously.

source (string)
Echo of the source parameter.

Example Request

curl -X POST https://api.jarvis.local/api/capture \
  -F "[email protected]" \
  -F "source=manual_upload" \
  -F "person_name=John Smith"
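
The same request can be built from Python. A minimal sketch, assuming the documented field names; the helper below only constructs the multipart form fields, and the `requests` call shown in the comment is illustrative.

```python
# Build the multipart form fields for POST /api/capture.
# Field names (file, source, person_name) come from the docs above;
# the helper itself is a hypothetical client-side convenience.

def build_capture_request(file_bytes, filename, content_type,
                          source="manual_upload", person_name=None):
    """Return (files, data) suitable for a multipart/form-data POST."""
    files = {"file": (filename, file_bytes, content_type)}
    data = {"source": source}
    if person_name is not None:
        data["person_name"] = person_name
    return files, data

files, data = build_capture_request(
    b"\xff\xd8\xff", "photo.jpg", "image/jpeg", person_name="John Smith")
# With the requests library installed:
#   requests.post("https://api.jarvis.local/api/capture",
#                 files=files, data=data)
```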

Example Response

{
  "capture_id": "cap_a1b2c3d4e5f6",
  "filename": "photo.jpg",
  "content_type": "image/jpeg",
  "status": "queued",
  "source": "manual_upload"
}

Status Codes

  • 200 - File queued for processing
  • 400 - Invalid file format or missing required fields
  • 413 - File too large (typically >10MB)
  • 500 - Server error during upload

Processing Pipeline

After upload, the capture goes through:
  1. Detection - MediaPipe face detection extracts face bounding boxes
  2. Embedding - ArcFace generates 512-dimensional face embeddings
  3. Identification - Face search using PimEyes and reverse image search
  4. Enrichment - Exa API fast-pass research
  5. Deep Research - Browser Use agent swarm (LinkedIn, Twitter, Google, Crunchbase)
  6. Synthesis - Claude/Gemini generates comprehensive dossier
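
The stage ordering above can be sketched as a simple sequential driver. The stage names match the docs; the stub outputs are hypothetical placeholders, not the actual MediaPipe/ArcFace/Exa service implementations.

```python
# Illustrative sketch of the capture processing pipeline ordering.
# Each real stage is an external service; here they are stubbed so the
# control flow (strict stage order, results threaded forward) is visible.

PIPELINE_STAGES = [
    "detection",       # MediaPipe face bounding boxes
    "embedding",       # ArcFace 512-dimensional embeddings
    "identification",  # PimEyes / reverse image search
    "enrichment",      # Exa API fast-pass research
    "deep_research",   # Browser Use agent swarm
    "synthesis",       # Claude/Gemini dossier generation
]

def run_pipeline(capture_id, stages=PIPELINE_STAGES):
    """Run each stage in order, accumulating results in a dict."""
    results = {"capture": capture_id}
    for stage in stages:
        # A real implementation would pass prior results into each stage.
        results[stage] = f"<{stage} output for {capture_id}>"
    return results
```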

Notes

  • Processing is asynchronous; use WebSocket or Convex subscriptions to receive real-time updates
  • Multiple faces in a single image will create separate person records
  • Video files are sampled at 1fps for face detection
  • Failed identifications will still create a person record with partial data
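
The 1 fps video sampling noted above can be sketched as timestamp-based selection. This is a client-side illustration only; the real sampler runs server-side and is not part of the public API.

```python
# Keep at most one frame per elapsed second of video, given frame
# timestamps in milliseconds. Mirrors the "sampled at 1fps" note above.

def sample_at_1fps(frame_timestamps_ms):
    """Return the subset of timestamps spaced at least 1000 ms apart."""
    kept, next_cutoff = [], 0
    for ts in frame_timestamps_ms:
        if ts >= next_cutoff:
            kept.append(ts)
            next_cutoff = ts + 1000
    return kept

sample_at_1fps([0, 33, 66, 999, 1000, 1033, 2500])
# → [0, 1000, 2500]
```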

POST /api/capture/frame

Process a single frame from a live video stream (optimized for glasses/camera streaming).

Authentication

No authentication required.

Request

frame (string, required)
Base64-encoded JPEG image data.

timestamp (integer, required)
Client-side timestamp in milliseconds since epoch.

source (string, default: "glasses_stream")
Source identifier for tracking.

target (boolean, default: false)
Set to true when the user is explicitly targeting someone for identification (e.g., center-frame focus).
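
Building the JSON body can be sketched in Python. The field names match the docs above; the helper assumes you already have JPEG bytes in hand.

```python
import base64
import time

# Construct the request body for POST /api/capture/frame.
# Field names (frame, timestamp, source, target) come from the docs.

def build_frame_payload(jpeg_bytes, source="glasses_stream",
                        target=False, timestamp_ms=None):
    """Return a dict ready to send as application/json."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)  # ms since epoch
    return {
        "frame": base64.b64encode(jpeg_bytes).decode("ascii"),
        "timestamp": timestamp_ms,
        "source": source,
        "target": target,
    }

payload = build_frame_payload(b"\xff\xd8\xff\xe0",
                              timestamp_ms=1709654400000)
# Send with Content-Type: application/json, e.g.
#   requests.post("https://api.jarvis.local/api/capture/frame", json=payload)
```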

Response

capture_id (string)
Unique identifier for this frame.

detections (array)
Array of face detections in the frame.

new_persons (integer)
Number of new person records created from this frame.

timestamp (integer)
Echo of the request timestamp.

source (string)
Echo of the source parameter.

Example Request

curl -X POST https://api.jarvis.local/api/capture/frame \
  -H "Content-Type: application/json" \
  -d '{
    "frame": "/9j/4AAQSkZJRgABAQEA...",
    "timestamp": 1709654400000,
    "source": "glasses_stream",
    "target": false
  }'

Example Response

{
  "capture_id": "frame_abc123xyz",
  "detections": [
    {
      "bbox": [120, 80, 280, 240],
      "confidence": 0.94,
      "track_id": 1
    },
    {
      "bbox": [400, 100, 560, 260],
      "confidence": 0.88,
      "track_id": 2
    }
  ],
  "new_persons": 1,
  "timestamp": 1709654400000,
  "source": "glasses_stream"
}
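
A typical client-side use of this response is picking the most confident detection. A minimal sketch, assuming the bbox format is [x1, y1, x2, y2] as the example values suggest (the docs do not state the format explicitly).

```python
# Select the highest-confidence face from a /api/capture/frame response
# and derive its box size. bbox format [x1, y1, x2, y2] is an assumption.

response = {
    "detections": [
        {"bbox": [120, 80, 280, 240], "confidence": 0.94, "track_id": 1},
        {"bbox": [400, 100, 560, 260], "confidence": 0.88, "track_id": 2},
    ],
    "new_persons": 1,
}

best = max(response["detections"], key=lambda d: d["confidence"])
x1, y1, x2, y2 = best["bbox"]
width, height = x2 - x1, y2 - y1
# best["track_id"] == 1, width == 160, height == 160
```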

Status Codes

  • 200 - Frame processed successfully
  • 400 - Invalid base64 data or missing required fields
  • 500 - Server error during processing

Tracking Behavior

  • YOLO assigns persistent track_id values across frames
  • Same person tracked across frames shares the same track_id
  • Identification is triggered once per track (not every frame)
  • Setting target=true prioritizes that frame for identification
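
The once-per-track rule above can be sketched as a small dedup helper. One plausible reading (an assumption, not confirmed by the docs) is that `target=true` forces identification even for an already-seen track; the class below is a hypothetical illustration of that behavior.

```python
# Trigger identification once per track_id, with an explicit-target
# override. Illustrative only; the real logic runs server-side.

class TrackIdentifier:
    def __init__(self):
        self.identified = set()  # track_ids already sent to identification

    def should_identify(self, track_id, target=False):
        """True if this frame should trigger identification for the track."""
        if target or track_id not in self.identified:
            self.identified.add(track_id)
            return True
        return False

ti = TrackIdentifier()
ti.should_identify(1)               # → True  (first sighting of track 1)
ti.should_identify(1)               # → False (already identified)
ti.should_identify(1, target=True)  # → True  (explicit targeting override)
```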

Performance

  • Average processing time: 50-100ms per frame
  • Supports 10-30 fps streaming
  • Face detection is cached for 500ms per track to reduce load
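
The 500 ms per-track detection cache mentioned above can be sketched as a TTL map. Time is passed in explicitly (milliseconds) to keep the example deterministic; the real cache lives server-side.

```python
# Per-track detection cache with a 500 ms time-to-live, mirroring the
# performance note above. Hypothetical sketch, not the server code.

class DetectionCache:
    TTL_MS = 500

    def __init__(self):
        self._cache = {}  # track_id -> (stored_at_ms, detection)

    def get(self, track_id, now_ms):
        """Return the cached detection if still fresh, else None."""
        entry = self._cache.get(track_id)
        if entry and now_ms - entry[0] < self.TTL_MS:
            return entry[1]  # fresh: reuse cached detection
        return None          # stale or missing: re-run detection

    def put(self, track_id, detection, now_ms):
        self._cache[track_id] = (now_ms, detection)

cache = DetectionCache()
cache.put(1, {"bbox": [120, 80, 280, 240]}, now_ms=0)
cache.get(1, now_ms=400)  # → cached detection (within 500 ms)
cache.get(1, now_ms=600)  # → None (expired)
```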