## Overview
The face recognition system works in two distinct phases:

- Enrollment phase: Users select images and label them with person names. The app extracts face embeddings and stores them in a local vector database.
- Recognition phase: The camera captures frames in real-time, extracts face embeddings, and matches them against the stored database.
## Face recognition workflow

### Step-by-step process
#### 1. Face detection
When a user selects an image or the camera captures a frame, the app uses either MLKit or Mediapipe to detect faces:

- MLKit FaceDetector: Google’s on-device face detection with fast and accurate modes
- Mediapipe FaceDetector: Uses the BlazeFace short-range model for lightweight detection
The app validates that bounding boxes fit within image dimensions before cropping to prevent errors.
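This validation step can be sketched as follows; the `Box` type and function name are illustrative, not the app's actual API:

```kotlin
// Sketch of bounding-box validation before cropping (names are illustrative).
// Returns null when the detected box cannot be clamped to a non-empty region
// inside the image, so the caller can skip the face instead of crashing.
data class Box(val left: Int, val top: Int, val right: Int, val bottom: Int)

fun clampToImage(box: Box, imageWidth: Int, imageHeight: Int): Box? {
    val left = box.left.coerceIn(0, imageWidth)
    val top = box.top.coerceIn(0, imageHeight)
    val right = box.right.coerceIn(0, imageWidth)
    val bottom = box.bottom.coerceIn(0, imageHeight)
    // Reject boxes that collapse to zero area after clamping.
    return if (right > left && bottom > top) Box(left, top, right, bottom) else null
}
```

Boxes that extend past an edge are clipped to the image, and boxes entirely outside the image are rejected rather than cropped.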
#### 2. Face embedding generation
Each detected face is cropped and processed through the FaceNet model:

- Input: 160×160 RGB face image
- Processing: Image is normalized (pixel values divided by 255)
- Output: 512-dimensional embedding vector (or 128D with the alternate model)
Formally, the embedding is e = M(I), where M represents the FaceNet model function and I is the preprocessed face image.
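The preprocessing step can be sketched as below, assuming packed ARGB pixels (the real app resizes the crop to 160×160 with Android APIs before this point; the function name is illustrative):

```kotlin
// Sketch of FaceNet input preprocessing: each packed ARGB pixel is unpacked
// into three floats (R, G, B) and normalized to [0, 1] by dividing by 255,
// matching the normalization described above.
fun toModelInput(pixels: IntArray, width: Int, height: Int): FloatArray {
    require(pixels.size == width * height)
    val input = FloatArray(width * height * 3)
    for (i in pixels.indices) {
        val p = pixels[i]
        input[i * 3]     = ((p shr 16) and 0xFF) / 255f  // R channel
        input[i * 3 + 1] = ((p shr 8) and 0xFF) / 255f   // G channel
        input[i * 3 + 2] = (p and 0xFF) / 255f           // B channel
    }
    return input
}
```

The resulting flat float array is what gets fed to the model as the 160×160×3 input tensor.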
#### 3. Storage (enrollment)
During enrollment, each embedding is stored in ObjectBox with metadata (such as the person's name).

#### 4. Vector search (recognition)
When recognizing faces in camera frames:

- Extract embedding from detected face (query vector)
- Search vector database for nearest neighbor
- Retrieve top candidate with highest similarity
- Re-compute cosine similarity for precision
- Apply threshold to determine match
The app re-computes cosine similarity because ObjectBox performs lossy compression on embeddings, making the returned distance an estimate.
#### 5. Similarity comparison
The system uses cosine similarity to compare embeddings:

- Higher threshold (e.g., 0.5): Fewer false positives, more false negatives
- Lower threshold (e.g., 0.2): More false positives, fewer false negatives
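The comparison and thresholding can be sketched as follows (function names and the default threshold are illustrative, not the app's actual API):

```kotlin
import kotlin.math.sqrt

// Sketch of the cosine-similarity comparison described above:
// cos(a, b) = a·b / (|a||b|). Values near 1 mean "likely the same person".
fun cosineSimilarity(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size)
    var dot = 0f; var normA = 0f; var normB = 0f
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (sqrt(normA) * sqrt(normB))
}

// A match is declared only when similarity clears the chosen threshold.
fun isMatch(query: FloatArray, stored: FloatArray, threshold: Float = 0.5f): Boolean =
    cosineSimilarity(query, stored) >= threshold
```

With a threshold of 0.5, identical embeddings match (similarity 1.0) while orthogonal ones do not (similarity 0.0); raising or lowering the threshold trades false positives against false negatives as described above.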
#### 6. Spoof detection (optional)
For matched faces, the system optionally runs anti-spoofing detection using MiniFASNet:

- Processes face at two different scales (2.7× and 4.0×)
- Detects whether the face is real or a photo/video spoof
- Combines outputs using softmax averaging
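The softmax-averaging step can be sketched like this (illustrative names, not MiniFASNet's actual API): each scale produces raw logits, softmax turns them into probabilities, and the two distributions are averaged before picking real vs. spoof.

```kotlin
import kotlin.math.exp

// Numerically stabilized softmax: subtracting the max logit before
// exponentiating avoids overflow without changing the result.
fun softmax(logits: FloatArray): FloatArray {
    val max = logits.maxOrNull()!!
    val exps = FloatArray(logits.size) { exp(logits[it] - max) }
    val sum = exps.sum()
    return FloatArray(exps.size) { exps[it] / sum }
}

// Average the per-scale probability distributions (2.7x and 4.0x crops).
fun combineScales(logitsScale27: FloatArray, logitsScale40: FloatArray): FloatArray {
    val p27 = softmax(logitsScale27)
    val p40 = softmax(logitsScale40)
    return FloatArray(p27.size) { (p27[it] + p40[it]) / 2f }
}
```

The combined vector is still a valid probability distribution, and the class with the highest averaged probability decides real vs. spoof.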
## Performance metrics
The app tracks latency for each operation:

- Face detection: ~20-50ms per frame
- Embedding generation: ~30-100ms per face
- Vector search: ~5-20ms (ANN) or 50-200ms (flat search)
- Spoof detection: ~40-80ms per face
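Per-operation latency tracking can be sketched with a small timing wrapper (the helper name and metrics map are illustrative, not the app's actual API):

```kotlin
// Sketch of per-operation latency tracking: wrap each pipeline stage in
// timed(...) and record how long it took, in milliseconds, under a label.
inline fun <T> timed(label: String, metrics: MutableMap<String, Long>, block: () -> T): T {
    val start = System.nanoTime()
    val result = block()
    metrics[label] = (System.nanoTime() - start) / 1_000_000  // ns -> ms
    return result
}
```

A stage would then be wrapped as `val faces = timed("faceDetection", metrics) { detector.detect(frame) }`, and the collected map can be shown in the UI or logged.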
## Search modes

### Approximate Nearest Neighbor (default)
- Uses ObjectBox’s HNSW index
- Fast but may not return the true nearest neighbor
- Good for real-time applications with many stored embeddings
### Flat search (precise)
- Linear scan through all embeddings
- Guarantees true nearest neighbor
- Parallelized across 4 coroutines for better performance
- Recommended for higher accuracy requirements
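A flat search parallelized across 4 coroutines can be sketched as below; the types, names, and chunking scheme are illustrative rather than the app's actual implementation, and the sketch assumes the `kotlinx.coroutines` dependency:

```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.runBlocking
import kotlin.math.sqrt

data class Candidate(val label: String, val similarity: Float)

fun cosine(a: FloatArray, b: FloatArray): Float {
    var dot = 0f; var na = 0f; var nb = 0f
    for (i in a.indices) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i] }
    return dot / (sqrt(na) * sqrt(nb))
}

// Exhaustive scan split into at most 4 chunks; each chunk finds its local
// best match concurrently, then the global best is taken. Unlike ANN, this
// is guaranteed to return the true nearest neighbor.
fun flatSearch(query: FloatArray, db: List<Pair<String, FloatArray>>): Candidate? = runBlocking {
    if (db.isEmpty()) return@runBlocking null
    val chunkSize = (db.size + 3) / 4
    db.chunked(chunkSize)
        .map { chunk ->
            async(Dispatchers.Default) {
                chunk.map { (label, emb) -> Candidate(label, cosine(query, emb)) }
                    .maxByOrNull { it.similarity }
            }
        }
        .awaitAll()
        .filterNotNull()
        .maxByOrNull { it.similarity }
}
```

Because cosine similarity per embedding is independent work, the linear scan parallelizes cleanly; the coroutine overhead only pays off once the database is reasonably large.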
Enable flat search in `FaceDetectionOverlay.kt` by setting `flatSearch = true`. This is slower but provides better recognition accuracy, especially with larger databases.

## Mathematical foundation
FaceNet is trained using triplet loss, which pushes an anchor image closer to a positive (same person) than to a negative (different person) by at least a margin:

‖f(xᵃ) − f(xᵖ)‖² + α < ‖f(xᵃ) − f(xⁿ)‖²

where:

- f(x) is the embedding function
- Anchor and positive are the same person
- Anchor and negative are different people
- α is the margin that enforces separation
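The triplet loss described above can be computed for a single (anchor, positive, negative) triple as follows; this is a sketch for intuition, not FaceNet's training code, and the default margin value is an assumption:

```kotlin
import kotlin.math.max

// Squared Euclidean distance between two embedding vectors.
fun squaredDistance(a: FloatArray, b: FloatArray): Float {
    var s = 0f
    for (i in a.indices) { val d = a[i] - b[i]; s += d * d }
    return s
}

// Triplet loss for one triple: max(d(a, p) - d(a, n) + alpha, 0).
// The loss is zero once the negative is at least alpha farther away
// (in squared distance) than the positive.
fun tripletLoss(
    anchor: FloatArray,
    positive: FloatArray,
    negative: FloatArray,
    alpha: Float = 0.2f  // margin; the default here is an illustrative choice
): Float =
    max(squaredDistance(anchor, positive) - squaredDistance(anchor, negative) + alpha, 0f)
```

During training this loss is averaged over many triples, which is what drives embeddings of the same person together and embeddings of different people apart.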