Face Identification

Overview

The identification system takes a face image and determines who the person is by searching across multiple face recognition and reverse image search engines. It implements a waterfall strategy with automatic fallback.

Location: backend/identification/

Multi-Tiered Search Strategy

Face Image Input
  │
  ├─▶ Tier 1: PimEyes (purpose-built face search)
  │     └─▶ 10-30 second search
  │     └─▶ Returns: URLs where face appears
  │     └─▶ Success: Return results ✓
  │     └─▶ Failure: Continue to Tier 2 ↓
  │
  └─▶ Tier 2: Reverse Image Search (Google, Yandex, Bing)
        └─▶ 5-10 second search across 3 engines
        └─▶ Returns: Similar images + context
        └─▶ Success: Return results ✓
        └─▶ Failure: Return no match ✗

Implementation

Face Search Manager

Location: backend/identification/search_manager.py

class FaceSearchManager:
    """Orchestrates face search across PimEyes and reverse image engines."""
    
    def __init__(self, settings: Settings):
        self._pimeyes = PimEyesSearcher(settings)
        self._reverse = ReverseImageSearcher()
    
    async def search_face(self, request: FaceSearchRequest) -> FaceSearchResult:
        # Tier 1: PimEyes (purpose-built face search)
        logger.info("Face search: trying PimEyes first")
        pimeyes_result = await self._pimeyes.search_face(request)
        
        if pimeyes_result.success and pimeyes_result.matches:
            logger.info("PimEyes found {} matches, skipping reverse search",
                        len(pimeyes_result.matches))
            return pimeyes_result
        
        # Tier 2: Reverse image search (Google, Yandex, Bing)
        logger.info("PimEyes returned no matches, falling back to reverse search")
        reverse_result = await self._reverse.search_face(request)
        
        if reverse_result.success and reverse_result.matches:
            logger.info("Reverse search found {} matches", len(reverse_result.matches))
            return reverse_result
        
        # Both failed — merge errors
        return FaceSearchResult(
            matches=[],
            success=False,
            error=f"PimEyes: {pimeyes_result.error} | ReverseSearch: {reverse_result.error}",
        )
    
    def best_name_from_results(self, result: FaceSearchResult) -> str | None:
        """Extract most likely person name using frequency analysis."""
        if not result.matches:
            return None
        
        name_counts = {}
        for match in result.matches:
            if match.person_name:
                name = match.person_name.strip()
                name_counts[name] = name_counts.get(name, 0) + 1
        
        if not name_counts:
            return None
        
        # Return the most frequent name
        return max(name_counts, key=name_counts.get)

Data Models

@dataclass
class FaceSearchRequest:
    embedding: list[float]  # 512-dim ArcFace embedding
    image_data: bytes       # Raw face image for visual search

@dataclass
class FaceMatch:
    url: str                   # URL where face was found
    person_name: str | None    # Extracted person name
    thumbnail_url: str | None  # Thumbnail image
    source: str               # "pimeyes" | "google" | "yandex" | "bing"
    confidence: float         # 0-1 similarity score

@dataclass
class FaceSearchResult:
    matches: list[FaceMatch]
    success: bool
    error: str | None = None

Tier 1: PimEyes Search

Location: backend/identification/pimeyes.py

PimEyes is a specialized facial recognition search engine. It’s the most accurate for face-based searches but requires careful anti-bot measures.

Implementation Strategy

class PimEyesSearcher:
    """Searches PimEyes for face matches using Browser Use."""
    
    def __init__(self, settings: Settings):
        self._accounts = self._load_accounts(settings.pimeyes_accounts)
        self._current_idx = 0
        self._lock = asyncio.Lock()
        self._gemini = genai.GenerativeModel("gemini-2.0-flash")
    
    async def search_face(self, request: FaceSearchRequest) -> FaceSearchResult:
        account = await self._rotate_account()
        
        # Browser Use agent to navigate PimEyes
        browser = Browser(config=BrowserConfig(headless=True))
        agent = Agent(
            task=f"""
            1. Go to https://pimeyes.com
            2. Log in with email: {account['email']}
            3. Upload the provided face image for search
            4. Wait for results to load (up to 30 seconds)
            5. Take a screenshot of the full results page
            """,
            llm="gemini-2.0-flash",
            browser=browser,
        )
        
        result = await agent.run()
        screenshot = result.screenshot()
        await browser.close()
        
        # Parse screenshot with Gemini
        identity = await self._parse_results(screenshot)
        return self._to_face_search_result(identity)
    
    async def _parse_results(self, screenshot_bytes: bytes) -> dict:
        """Use Gemini 2.0 Flash to extract identity from PimEyes results."""
        image_b64 = base64.b64encode(screenshot_bytes).decode()
        
        response = self._gemini.generate_content([
            {"mime_type": "image/png", "data": image_b64},
            """
            Analyze this PimEyes search results page. Extract:
            1. The most likely name of the person
            2. All URLs shown in results (social media, websites, articles)
            3. Context clues (company, title, location) visible in results
            4. Confidence level (high/medium/low) based on match quality
            
            Return ONLY valid JSON:
            {
                "name": "string or null",
                "urls": ["url1", "url2"],
                "context": {"company": "", "title": "", "location": ""},
                "confidence": "high|medium|low",
                "match_count": number
            }
            """
        ])
        
        return json.loads(response.text)

Account Rotation & Rate Limiting

Anti-Bot Strategy

PimEyes actively blocks automated searches. The system uses account pooling and rate limiting to avoid detection:

Pool of 3-5 PimEyes accounts
Round-robin selection per search
Minimum 30 seconds between uses per account
Exponential backoff on detected blocks

class AccountPool:
    """Manages PimEyes account rotation with rate limiting."""
    
    MIN_INTERVAL = 30  # seconds between uses per account
    
    async def get_account(self) -> dict:
        async with self._lock:
            now = time.time()
            for acc in self._accounts:
                if not acc["blocked"] and (now - acc["last_used"]) > self.MIN_INTERVAL:
                    acc["last_used"] = now
                    return acc
            
            # All accounts busy — wait for shortest cooldown
            wait_times = [
                self.MIN_INTERVAL - (now - acc["last_used"])
                for acc in self._accounts if not acc["blocked"]
            ]
            if wait_times:
                await asyncio.sleep(min(wait_times))
                return await self.get_account()
            
            raise RuntimeError("All PimEyes accounts blocked")
    
    def mark_blocked(self, email: str):
        for acc in self._accounts:
            if acc["email"] == email:
                acc["blocked"] = True

Vision LLM Parsing

Why Gemini 2.0 Flash?

25x cheaper than GPT-4o Vision
2x faster response time
Excellent at screenshot OCR and layout understanding
JSON output reliability

Example Extraction:

Input: PimEyes results page screenshot

Output:

{
  "name": "Jane Smith",
  "urls": [
    "https://linkedin.com/in/janesmith",
    "https://twitter.com/janesmith",
    "https://company.com/team/jane-smith"
  ],
  "context": {
    "company": "TechCorp",
    "title": "Senior Engineer",
    "location": "San Francisco"
  },
  "confidence": "high",
  "match_count": 47
}
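Vision models sometimes wrap their JSON in a Markdown code fence or surround it with stray prose, in which case calling json.loads directly on the reply fails. The following is a minimal sketch of a more tolerant parser, not the project's actual code: the helper name parse_identity_json and the REQUIRED_KEYS check are assumptions made for illustration, based only on the JSON shape shown above.

```python
from __future__ import annotations

import json
import re

FENCE = "`" * 3  # Markdown code-fence marker, built indirectly for readability
REQUIRED_KEYS = {"name", "urls", "context", "confidence", "match_count"}


def parse_identity_json(text: str) -> dict | None:
    """Hypothetical helper: extract the identity object from an LLM reply."""
    # If the model wrapped its answer in a code fence, look inside the fence.
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, text, re.DOTALL)
    candidate = fenced.group(1) if fenced else text

    # Keep only the outermost {...} span in case of leading/trailing prose.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        return None

    try:
        data = json.loads(candidate[start:end + 1])
    except json.JSONDecodeError:
        return None

    # Reject replies that are valid JSON but do not follow the requested shape.
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return None
    return data
```

Returning None instead of raising lets the caller treat an unparseable reply the same way as an empty result.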

Tier 2: Reverse Image Search

Location: backend/identification/reverse_search.py

Fallback strategy using multiple reverse image search engines in parallel.

Multi-Engine Search

class ReverseImageSearcher:
    """Searches face across Google, Yandex, and Bing reverse image engines."""
    
    async def search_face(self, request: FaceSearchRequest) -> FaceSearchResult:
        # Launch all engines in parallel
        tasks = [
            self._search_google(request.image_data),
            self._search_yandex(request.image_data),
            self._search_bing(request.image_data),
        ]
        
        results = await asyncio.gather(*tasks, return_exceptions=True)
        
        # Merge results from all engines
        all_matches = []
        for result in results:
            if isinstance(result, Exception):
                logger.warning("Reverse search engine failed: {}", result)
                continue
            if result:
                all_matches.extend(result)
        
        return FaceSearchResult(
            matches=self._deduplicate(all_matches),
            success=len(all_matches) > 0,
            error=None if all_matches else "No matches across any engine",
        )
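The _deduplicate helper called above is not shown in this page. One plausible sketch, assuming that two matches are duplicates when their URLs are equal after dropping a trailing slash, and that the higher-confidence entry should win. Both rules are assumptions for illustration, and FaceMatch is redeclared here only to keep the example self-contained.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class FaceMatch:
    url: str
    person_name: str | None
    thumbnail_url: str | None
    source: str
    confidence: float


def deduplicate(matches: list[FaceMatch]) -> list[FaceMatch]:
    """Keep one match per URL, preferring the highest confidence."""
    best: dict[str, FaceMatch] = {}
    for match in matches:
        key = match.url.rstrip("/")  # assumed normalization rule
        current = best.get(key)
        if current is None or match.confidence > current.confidence:
            best[key] = match
    # dict preserves insertion order, so results stay in first-seen order.
    return list(best.values())
```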

Google Lens Integration

async def _search_google(self, image_data: bytes) -> list[FaceMatch]:
    """Search via Google Lens using Browser Use."""
    browser = Browser(config=BrowserConfig(headless=True))
    agent = Agent(
        task="""
        1. Go to https://lens.google.com
        2. Upload the provided face image
        3. Look for matches with names or social profiles
        4. Extract: URL, person name (if visible), thumbnail
        5. Return as JSON array
        """,
        llm="gemini-2.0-flash",
        browser=browser,
    )
    
    result = await agent.run()
    await browser.close()
    
    parsed = json.loads(result.final_result())
    return [
        FaceMatch(
            url=m["url"],
            person_name=m.get("name"),
            thumbnail_url=m.get("thumbnail"),
            source="google",
            confidence=0.7,  # Google doesn't provide scores
        )
        for m in parsed
    ]

Engine Characteristics

Engine	Speed	Accuracy	Rate Limits	Best For
Google Lens	5-10s	High	None	Social profiles, public figures
Yandex Images	3-8s	Medium	None	Russian/European content
Bing Visual	5-12s	Medium	None	General web content

Why not use all engines for every search?

PimEyes is optimized specifically for faces and typically returns better results than generic reverse image search. Running all engines sequentially would add 15-30 seconds per search with diminishing returns. The waterfall strategy provides the best balance of speed and accuracy.
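Because Tier 2 launches its engines with asyncio.gather, its wall-clock time is roughly that of the slowest engine rather than the sum of all three. The sketch below illustrates this with stand-in coroutines; fake_engine, run_parallel, and the scaled-down delays are hypothetical and are not measurements of the real engines.

```python
from __future__ import annotations

import asyncio
import time


async def fake_engine(name: str, delay: float) -> str:
    """Stand-in for one engine call; it sleeps instead of searching."""
    await asyncio.sleep(delay)
    return name


async def run_parallel(delays: dict[str, float]) -> tuple[list[str], float]:
    """Run all stand-in engines concurrently and time the whole batch."""
    start = time.perf_counter()
    names = await asyncio.gather(
        *(fake_engine(name, delay) for name, delay in delays.items())
    )
    return list(names), time.perf_counter() - start


if __name__ == "__main__":
    delays = {"google": 0.10, "yandex": 0.08, "bing": 0.12}
    names, elapsed = asyncio.run(run_parallel(delays))
    total = sum(delays.values())
    print(names, f"{elapsed:.2f}s in parallel vs {total:.2f}s one by one")
```

asyncio.gather returns results in the order the awaitables were passed, regardless of which one finishes first.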

Name Extraction

Frequency-Based Selection

When multiple search results contain different names, the system uses frequency analysis:

def best_name_from_results(self, result: FaceSearchResult) -> str | None:
    """Extract most likely person name using frequency analysis."""
    name_counts = {}
    for match in result.matches:
        if match.person_name:
            name = match.person_name.strip()
            name_counts[name] = name_counts.get(name, 0) + 1
    
    if not name_counts:
        return None
    
    # Return the most frequent name
    return max(name_counts, key=name_counts.get)

Example: Search returns:

15 URLs mention “Jane Smith”
3 URLs mention “J. Smith”
1 URL mentions “Jane S.”

→ System selects “Jane Smith” (highest frequency)
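The worked example above can be reproduced with collections.Counter. This is a standalone sketch rather than the method used in the code base; the function name most_frequent_name is an assumption, and it takes plain strings instead of FaceMatch objects to stay self-contained.

```python
from __future__ import annotations

from collections import Counter


def most_frequent_name(names: list[str | None]) -> str | None:
    """Return the most common non-empty name; ties go to the first seen."""
    cleaned = [n.strip() for n in names if n and n.strip()]
    if not cleaned:
        return None
    return Counter(cleaned).most_common(1)[0][0]


names = ["Jane Smith"] * 15 + ["J. Smith"] * 3 + ["Jane S."]
print(most_frequent_name(names))  # Jane Smith
```

Note that the three spellings are counted as three different names. Exact-string counting does not merge variants such as "J. Smith" into "Jane Smith".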

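Social Profile Extraction

from urllib.parse import urlparse

def profile_urls_from_results(self, result: FaceSearchResult) -> list[str]:
    """Extract social profile URLs from search results."""
    social_domains = {
        "linkedin.com", "twitter.com", "x.com", "instagram.com",
        "facebook.com", "github.com", "tiktok.com",
    }
    urls = []
    seen = set()
    for match in result.matches:
        if not match.url or match.url in seen:
            continue
        # Compare the hostname, not the whole URL: a substring test would
        # also accept unrelated hosts such as "netflix.com" (contains "x.com").
        host = (urlparse(match.url).hostname or "").lower()
        if any(host == d or host.endswith("." + d) for d in social_domains):
            urls.append(match.url)
            seen.add(match.url)
    return urls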

Purpose: Feed discovered social profiles directly to the research orchestrator, skipping the agent’s own discovery phase.

Integration with Pipeline

Usage in Capture Pipeline

# From backend/pipeline.py

@traced("pipeline.identify_face")
async def _identify_face(
    self, embedding: list[float], image_data: bytes
) -> str | None:
    if not self._face_searcher:
        return None
    
    try:
        search_request = FaceSearchRequest(
            embedding=embedding,
            image_data=image_data,
        )
        search_result = await self._face_searcher.search_face(search_request)
        
        if not search_result.success:
            logger.warning("Face search failed: {}", search_result.error)
            return None
        
        name = self._face_searcher.best_name_from_results(search_result)
        if name:
            logger.info("Face search identified person as: {}", name)
        else:
            logger.info("Face search found {} matches but no name", 
                       len(search_result.matches))
        
        return name
    
    except Exception as exc:
        logger.error("Face search crashed: {}", exc)
        return None

Error Handling

Graceful Degradation

Identification failures never block the pipeline:

Face search timeout → Person created with “Unknown” name
All engines fail → Person marked as “detected” not “identified”
Partial results → Best available name used
Empty results → Pipeline continues with manual enrichment
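The timeout and failure cases in the list above can be sketched with asyncio.wait_for. This is an illustration under stated assumptions, not the pipeline's actual code: identify_with_timeout, the status strings, and the 45-second default (the sum of the two configured timeouts, 30 + 15) are all hypothetical.

```python
from __future__ import annotations

import asyncio
from typing import Awaitable, Callable

UNKNOWN_NAME = "Unknown"


async def identify_with_timeout(
    search: Callable[[], Awaitable[str | None]],
    timeout_seconds: float = 45.0,  # assumed budget: 30s + 15s
) -> tuple[str, str]:
    """Return (name, status). Never raises, so the caller can continue."""
    try:
        name = await asyncio.wait_for(search(), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        # Search took too long: keep the person, but without a name.
        return UNKNOWN_NAME, "detected"
    except Exception:
        # Any other failure degrades the same way instead of propagating.
        return UNKNOWN_NAME, "detected"
    if name:
        return name, "identified"
    return UNKNOWN_NAME, "detected"
```

Returning a status alongside the name keeps the "detected" versus "identified" distinction explicit for whatever consumes the result.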

Performance Characteristics

Typical Timeline:

Scenario	Duration	Success Rate
PimEyes hit	10-30s	60-70%
Reverse search hit	15-35s	20-30%
No match found	35-45s	10-20%

Success Rate Factors:

Public figures: 80-90% success
Active social media users: 60-70% success
Private individuals: 20-30% success
Children/obscured faces: Less than 10% success

Observability

Laminar Tracing:

pipeline.identify_face
├── face_search_manager.search_face
│   ├── pimeyes.search_face
│   │   ├── browser_use.navigate
│   │   ├── browser_use.upload_image
│   │   ├── browser_use.screenshot
│   │   └── gemini.parse_results
│   └── reverse_search.search_face (on PimEyes failure)
│       ├── reverse_search.google
│       ├── reverse_search.yandex
│       └── reverse_search.bing
└── extract_name

Metrics Tracked:

Search duration per engine
Match count per engine
Name extraction confidence
Error rates and types
Account rotation status

Configuration

Environment Variables:

# config.py
class Settings(BaseSettings):
    # PimEyes account pool (JSON array)
    pimeyes_accounts: str = "[]"
    # Example: '[{"email":"user@example.com","password":"xxx"}]'
    
    # Face search timeouts
    pimeyes_timeout_seconds: int = 30
    reverse_search_timeout_seconds: int = 15
    
    # Vision LLM for parsing
    google_gemini_api_key: str

Account Setup:

Create PimEyes Accounts

Configure Account Pool

Add to .env:

PIMEYES_ACCOUNTS='[
  {"email":"user1@example.com","password":"password1"},
  {"email":"user2@example.com","password":"password2"}
]'

Test Search

Run a test search to verify accounts work:

python -m identification.test_search --image test_face.jpg
