
Overview

The identification system takes a face image and determines who the person is by searching across multiple face recognition and reverse image search engines. It uses a waterfall strategy with automatic fallback: PimEyes is tried first, and reverse image search runs only if PimEyes returns no matches.

Location: backend/identification/

Multi-Tiered Search Strategy

Face Image Input

  ├─▶ Tier 1: PimEyes (purpose-built face search)
  │     └─▶ 10-30 second search
  │     └─▶ Returns: URLs where face appears
  │     └─▶ Success: Return results ✓
  │     └─▶ Failure: Continue to Tier 2 ↓

  └─▶ Tier 2: Reverse Image Search (Google, Yandex, Bing)
        └─▶ 5-10 second search across 3 engines
        └─▶ Returns: Similar images + context
        └─▶ Success: Return results ✓
        └─▶ Failure: Return no match ✗

Implementation

Face Search Manager

Location: backend/identification/search_manager.py
class FaceSearchManager:
    """Orchestrates face search across PimEyes and reverse image engines."""
    
    def __init__(self, settings: Settings):
        self._pimeyes = PimEyesSearcher(settings)
        self._reverse = ReverseImageSearcher()
    
    async def search_face(self, request: FaceSearchRequest) -> FaceSearchResult:
        # Tier 1: PimEyes (purpose-built face search)
        logger.info("Face search: trying PimEyes first")
        pimeyes_result = await self._pimeyes.search_face(request)
        
        if pimeyes_result.success and pimeyes_result.matches:
            logger.info("PimEyes found {} matches, skipping reverse search",
                        len(pimeyes_result.matches))
            return pimeyes_result
        
        # Tier 2: Reverse image search (Google, Yandex, Bing)
        logger.info("PimEyes returned no matches, falling back to reverse search")
        reverse_result = await self._reverse.search_face(request)
        
        if reverse_result.success and reverse_result.matches:
            logger.info("Reverse search found {} matches", len(reverse_result.matches))
            return reverse_result
        
        # Both failed — merge errors
        return FaceSearchResult(
            matches=[],
            success=False,
            error=f"PimEyes: {pimeyes_result.error} | ReverseSearch: {reverse_result.error}",
        )
    
    def best_name_from_results(self, result: FaceSearchResult) -> str | None:
        """Extract most likely person name using frequency analysis."""
        if not result.matches:
            return None
        
        name_counts = {}
        for match in result.matches:
            if match.person_name:
                name = match.person_name.strip()
                name_counts[name] = name_counts.get(name, 0) + 1
        
        if not name_counts:
            return None
        
        # Return the most frequent name
        return max(name_counts, key=name_counts.get)

Data Models

@dataclass
class FaceSearchRequest:
    embedding: list[float]  # 512-dim ArcFace embedding
    image_data: bytes       # Raw face image for visual search

@dataclass
class FaceMatch:
    url: str                   # URL where face was found
    person_name: str | None    # Extracted person name
    thumbnail_url: str | None  # Thumbnail image
    source: str               # "pimeyes" | "google" | "yandex" | "bing"
    confidence: float         # 0-1 similarity score

@dataclass
class FaceSearchResult:
    matches: list[FaceMatch]
    success: bool
    error: str | None = None

Location: backend/identification/pimeyes.py

PimEyes is a specialized facial recognition search engine. It is the most accurate option for face-based searches but requires careful anti-bot measures.

Implementation Strategy

class PimEyesSearcher:
    """Searches PimEyes for face matches using Browser Use."""
    
    def __init__(self, settings: Settings):
        self._accounts = self._load_accounts(settings.pimeyes_accounts)
        self._current_idx = 0
        self._lock = asyncio.Lock()
        self._gemini = genai.GenerativeModel("gemini-2.0-flash")
    
    async def search_face(self, request: FaceSearchRequest) -> FaceSearchResult:
        account = await self._rotate_account()
        
        # Browser Use agent to navigate PimEyes
        browser = Browser(config=BrowserConfig(headless=True))
        agent = Agent(
            task=f"""
            1. Go to https://pimeyes.com
            2. Log in with email: {account['email']}
            3. Upload the provided face image for search
            4. Wait for results to load (up to 30 seconds)
            5. Take a screenshot of the full results page
            """,
            llm="gemini-2.0-flash",
            browser=browser,
        )
        
        result = await agent.run()
        screenshot = result.screenshot()
        await browser.close()
        
        # Parse screenshot with Gemini
        identity = await self._parse_results(screenshot)
        return self._to_face_search_result(identity)
    
    async def _parse_results(self, screenshot_bytes: bytes) -> dict:
        """Use Gemini 2.0 Flash to extract identity from PimEyes results."""
        image_b64 = base64.b64encode(screenshot_bytes).decode()
        
        # Use the async variant so the event loop is not blocked during the call
        response = await self._gemini.generate_content_async([
            {"mime_type": "image/png", "data": image_b64},
            """
            Analyze this PimEyes search results page. Extract:
            1. The most likely name of the person
            2. All URLs shown in results (social media, websites, articles)
            3. Context clues (company, title, location) visible in results
            4. Confidence level (high/medium/low) based on match quality
            
            Return ONLY valid JSON:
            {
                "name": "string or null",
                "urls": ["url1", "url2"],
                "context": {"company": "", "title": "", "location": ""},
                "confidence": "high|medium|low",
                "match_count": number
            }
            """
        ])
        
        return json.loads(response.text)

Account Rotation & Rate Limiting

Anti-Bot Strategy

PimEyes actively blocks automated searches. The system uses account pooling and rate limiting to avoid detection:
  • Pool of 3-5 PimEyes accounts
  • Round-robin selection per search
  • Minimum 30 seconds between uses per account
  • Exponential backoff on detected blocks
class AccountPool:
    """Manages PimEyes account rotation with rate limiting."""
    
    MIN_INTERVAL = 30  # seconds between uses per account
    
    async def get_account(self) -> dict:
        async with self._lock:
            now = time.time()
            for acc in self._accounts:
                if not acc["blocked"] and (now - acc["last_used"]) > self.MIN_INTERVAL:
                    acc["last_used"] = now
                    return acc
            
            # All accounts busy — wait for shortest cooldown
            wait_times = [
                self.MIN_INTERVAL - (now - acc["last_used"])
                for acc in self._accounts if not acc["blocked"]
            ]
            if wait_times:
                await asyncio.sleep(min(wait_times))
                return await self.get_account()
            
            raise RuntimeError("All PimEyes accounts blocked")
    
    def mark_blocked(self, email: str):
        for acc in self._accounts:
            if acc["email"] == email:
                acc["blocked"] = True

Vision LLM Parsing

Why Gemini 2.0 Flash?
  • 25x cheaper than GPT-4o Vision
  • 2x faster response time
  • Excellent at screenshot OCR and layout understanding
  • Reliable JSON output
Example Extraction:
Input: PimEyes results page screenshot
Output:
{
  "name": "Jane Smith",
  "urls": [
    "https://linkedin.com/in/janesmith",
    "https://twitter.com/janesmith",
    "https://company.com/team/jane-smith"
  ],
  "context": {
    "company": "TechCorp",
    "title": "Senior Engineer",
    "location": "San Francisco"
  },
  "confidence": "high",
  "match_count": 47
}
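Vision models sometimes wrap JSON in a Markdown code fence even when asked for JSON only, in which case a bare `json.loads(response.text)` would raise. The helper below is a hedged sketch of a more tolerant parser; `parse_llm_json` is a hypothetical name and is not part of the codebase described here.

```python
import json

# Three backticks, built indirectly so the marker does not appear literally
FENCE = "`" * 3


def parse_llm_json(text: str) -> dict:
    """Parse a JSON object from model output, tolerating a surrounding code fence."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line (which may carry a language tag)
        first_newline = cleaned.find("\n")
        cleaned = cleaned[first_newline + 1:] if first_newline != -1 else ""
        # Drop the closing fence if present
        if cleaned.rstrip().endswith(FENCE):
            cleaned = cleaned.rstrip()[: -len(FENCE)]
    data = json.loads(cleaned)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return data


fenced = FENCE + 'json\n{"name": null, "urls": []}\n' + FENCE
parsed = parse_llm_json(fenced)
```

Malformed output still raises `json.JSONDecodeError`, so the caller should decide whether that counts as a failed search or warrants a retry.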

Location: backend/identification/reverse_search.py

Fallback strategy using multiple reverse image search engines in parallel.
class ReverseImageSearcher:
    """Searches face across Google, Yandex, and Bing reverse image engines."""
    
    async def search_face(self, request: FaceSearchRequest) -> FaceSearchResult:
        # Launch all engines in parallel
        tasks = [
            self._search_google(request.image_data),
            self._search_yandex(request.image_data),
            self._search_bing(request.image_data),
        ]
        
        results = await asyncio.gather(*tasks, return_exceptions=True)
        
        # Merge results from all engines
        all_matches = []
        for result in results:
            if isinstance(result, Exception):
                logger.warning("Reverse search engine failed: {}", result)
                continue
            if result:
                all_matches.extend(result)
        
        return FaceSearchResult(
            matches=self._deduplicate(all_matches),
            success=len(all_matches) > 0,
            error=None if all_matches else "No matches across any engine",
        )
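The `_deduplicate` helper is not shown above. One plausible approach, sketched here under the assumption that two matches are duplicates when their URLs are equal after removing a trailing slash, keeps the highest-confidence match per URL. `Match` is a hypothetical stand-in for `FaceMatch`.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """Hypothetical stand-in for FaceMatch with only the fields used here."""
    url: str
    source: str
    confidence: float


def deduplicate(matches: list[Match]) -> list[Match]:
    """Keep one match per URL, preferring the highest confidence."""
    best: dict[str, Match] = {}
    for m in matches:
        key = m.url.rstrip("/")  # assumption: a trailing slash is not significant
        if key not in best or m.confidence > best[key].confidence:
            best[key] = m
    # dict preserves insertion order, so results keep first-seen URL order
    return list(best.values())


merged = deduplicate([
    Match("https://a.com/x", "google", 0.7),
    Match("https://a.com/x/", "bing", 0.9),
    Match("https://b.com", "yandex", 0.7),
])
```

Here the two `a.com` entries collapse into the higher-confidence one from `bing`, and the `b.com` entry is kept unchanged.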

Google Lens Integration

async def _search_google(self, image_data: bytes) -> list[FaceMatch]:
    """Search via Google Lens using Browser Use."""
    browser = Browser(config=BrowserConfig(headless=True))
    agent = Agent(
        task="""
        1. Go to https://lens.google.com
        2. Upload the provided face image
        3. Look for matches with names or social profiles
        4. Extract: URL, person name (if visible), thumbnail
        5. Return as JSON array
        """,
        llm="gemini-2.0-flash",
        browser=browser,
    )
    
    result = await agent.run()
    await browser.close()
    
    parsed = json.loads(result.final_result())
    return [
        FaceMatch(
            url=m["url"],
            person_name=m.get("name"),
            thumbnail_url=m.get("thumbnail"),
            source="google",
            confidence=0.7,  # Google doesn't provide scores
        )
        for m in parsed
    ]

Engine Characteristics

Engine        | Speed | Accuracy | Rate Limits | Best For
Google Lens   | 5-10s | High     | None        | Social profiles, public figures
Yandex Images | 3-8s  | Medium   | None        | Russian/European content
Bing Visual   | 5-12s | Medium   | None        | General web content

Why not use all engines for every search?

PimEyes is optimized specifically for faces and typically returns better results than generic reverse image search. Running all engines sequentially would add 15-30 seconds per search with diminishing returns. The waterfall strategy provides the best balance of speed and accuracy.

Name Extraction

Frequency-Based Selection

When multiple search results contain different names, the system uses frequency analysis:
def best_name_from_results(self, result: FaceSearchResult) -> str | None:
    """Extract most likely person name using frequency analysis."""
    name_counts = {}
    for match in result.matches:
        if match.person_name:
            name = match.person_name.strip()
            name_counts[name] = name_counts.get(name, 0) + 1
    
    if not name_counts:
        return None
    
    # Return the most frequent name
    return max(name_counts, key=name_counts.get)
Example: Search returns:
  • 15 URLs mention “Jane Smith”
  • 3 URLs mention “J. Smith”
  • 1 URL mentions “Jane S.”
→ System selects “Jane Smith” (highest frequency)

Social Profile Extraction

def profile_urls_from_results(self, result: FaceSearchResult) -> list[str]:
    """Extract social profile URLs from search results."""
    social_domains = {
        "linkedin.com", "twitter.com", "x.com", "instagram.com",
        "facebook.com", "github.com", "tiktok.com",
    }
    urls = []
    seen = set()
    for match in result.matches:
        if match.url and match.url not in seen:
            if any(domain in match.url for domain in social_domains):
                urls.append(match.url)
                seen.add(match.url)
    return urls
Purpose: Feed discovered social profiles directly to the research orchestrator, skipping the agent’s own discovery phase.
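One caveat about the check `domain in match.url`: it is a substring test, so "x.com" also matches a URL such as https://netflix.com/, and any URL that merely contains a social domain in its query string passes too. A stricter variant, sketched here with the hypothetical helper `is_social_profile`, compares the parsed hostname instead.

```python
from urllib.parse import urlparse

SOCIAL_DOMAINS = {
    "linkedin.com", "twitter.com", "x.com", "instagram.com",
    "facebook.com", "github.com", "tiktok.com",
}


def is_social_profile(url: str) -> bool:
    """True if the URL's host is a listed domain or a subdomain of one."""
    # hostname is None for URLs without a scheme, which are treated as non-matching
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in SOCIAL_DOMAINS)


print(is_social_profile("https://www.linkedin.com/in/janesmith"))  # True
print(is_social_profile("https://netflix.com/title/1"))            # False
```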

Integration with Pipeline

Usage in Capture Pipeline

# From backend/pipeline.py

@traced("pipeline.identify_face")
async def _identify_face(
    self, embedding: list[float], image_data: bytes
) -> str | None:
    if not self._face_searcher:
        return None
    
    try:
        search_request = FaceSearchRequest(
            embedding=embedding,
            image_data=image_data,
        )
        search_result = await self._face_searcher.search_face(search_request)
        
        if not search_result.success:
            logger.warning("Face search failed: {}", search_result.error)
            return None
        
        name = self._face_searcher.best_name_from_results(search_result)
        if name:
            logger.info("Face search identified person as: {}", name)
        else:
            logger.info("Face search found {} matches but no name", 
                       len(search_result.matches))
        
        return name
    
    except Exception as exc:
        logger.error("Face search crashed: {}", exc)
        return None

Error Handling

Graceful Degradation

Identification failures never block the pipeline:
  • Face search timeout → Person created with “Unknown” name
  • All engines fail → Person marked as “detected” not “identified”
  • Partial results → Best available name used
  • Empty results → Pipeline continues with manual enrichment

Performance Characteristics

Typical Timeline:
Scenario           | Duration | Success Rate
PimEyes hit        | 10-30s   | 60-70%
Reverse search hit | 15-35s   | 20-30%
No match found     | 35-45s   | 10-20%
Success Rate Factors:
  • Public figures: 80-90% success
  • Active social media users: 60-70% success
  • Private individuals: 20-30% success
  • Children/obscured faces: Less than 10% success

Observability

Laminar Tracing:
pipeline.identify_face
├── face_search_manager.search_face
│   ├── pimeyes.search_face
│   │   ├── browser_use.navigate
│   │   ├── browser_use.upload_image
│   │   ├── browser_use.screenshot
│   │   └── gemini.parse_results
│   └── reverse_search.search_face (on PimEyes failure)
│       ├── reverse_search.google
│       ├── reverse_search.yandex
│       └── reverse_search.bing
└── extract_name
Metrics Tracked:
  • Search duration per engine
  • Match count per engine
  • Name extraction confidence
  • Error rates and types
  • Account rotation status

Configuration

Environment Variables:
# config.py
class Settings(BaseSettings):
    # PimEyes account pool (JSON array)
    pimeyes_accounts: str = "[]"
    # Example: '[{"email":"[email protected]","password":"xxx"}]'
    
    # Face search timeouts
    pimeyes_timeout_seconds: int = 30
    reverse_search_timeout_seconds: int = 15
    
    # Vision LLM for parsing
    google_gemini_api_key: str
Account Setup:
1. Create PimEyes Accounts

Register 3-5 PimEyes accounts with different email addresses.

2. Configure Account Pool

Add to .env:
PIMEYES_ACCOUNTS='[
  {"email":"[email protected]","password":"password1"},
  {"email":"[email protected]","password":"password2"}
]'

3. Test Search

Run a test search to verify the accounts work:
python -m identification.test_search --image test_face.jpg

Next Steps

Capture Pipeline

How identification integrates with frame extraction and embedding

Agent Swarm

How identified names feed into research orchestration
