# What are face embeddings?

A face embedding is a mathematical representation of a face as a vector of numbers. FaceNet produces either:

- 128-dimensional embedding: `facenet.tflite` model
- 512-dimensional embedding: `facenet_512.tflite` model (default)
## Why embeddings matter

Raw face images cannot be easily compared because of:

- Different lighting conditions
- Various angles and poses
- Changing facial expressions
- Different image resolutions

Embeddings solve these problems by:

- Normalizing variations into a consistent space
- Capturing invariant facial features
- Enabling fast mathematical comparison
- Compressing images into compact vectors
A 512-dimensional embedding (2 KB) is much smaller than a 160×160 RGB image (76 KB), yet contains the essential identity information.
## FaceNet model

The app uses FaceNet, a deep convolutional neural network trained with triplet loss.

### Model specifications
| Property | Value |
|---|---|
| Input size | 160 × 160 × 3 (RGB) |
| Output size | 512 floats (or 128) |
| Format | TFLite with FP16 quantization |
| Source | deepface library |
| Architecture | Inception ResNet v1 |
| File size | ~23 MB (512D), ~23 MB (128D) |
### Triplet loss training

FaceNet is trained using triplet loss to learn discriminative embeddings:

L = max(‖f(a) − f(p)‖² − ‖f(a) − f(n)‖² + α, 0)

where:

- f(x) = embedding function
- a = anchor image
- p = positive image (same person as anchor)
- n = negative image (different person)
- α = margin (separation between positive and negative pairs)

Minimizing this loss ensures that:

- Embeddings of the same person are close together
- Embeddings of different people are far apart
- A minimum margin α separates positive and negative pairs
The model is pre-trained and is not modified by the app; all learning was done by the original authors during training.
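To make the objective concrete, here is a minimal Kotlin sketch of the loss for a single triple of already-computed embeddings. This is illustration only, not part of the app (which never trains the model):

```kotlin
// Illustrative implementation of the triplet loss for one
// (anchor, positive, negative) triple of embeddings.
fun tripletLoss(
    anchor: FloatArray,
    positive: FloatArray,
    negative: FloatArray,
    margin: Float = 0.2f  // α; 0.2 is the margin used in the FaceNet paper
): Float {
    // Squared Euclidean distance ‖x − y‖²
    fun sqDist(x: FloatArray, y: FloatArray): Float {
        var s = 0f
        for (i in x.indices) {
            val d = x[i] - y[i]
            s += d * d
        }
        return s
    }
    // max(‖f(a) − f(p)‖² − ‖f(a) − f(n)‖² + α, 0)
    return maxOf(sqDist(anchor, positive) - sqDist(anchor, negative) + margin, 0f)
}
```

The loss is zero whenever the positive is already closer to the anchor than the negative by at least the margin, so training only pushes on triples that still violate the separation.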
## Implementation

The `FaceNet` class wraps the TFLite model.

### Initialization

The model is loaded once when the app starts.

### Hardware acceleration
The app supports multiple acceleration options:

- GPU Delegate: Runs inference on GPU if available (~3-5× faster)
- NNAPI: Uses Android Neural Networks API for hardware acceleration
- XNNPACK: Optimized CPU inference for ARM processors
- CPU-only: Falls back to 4 threads if no acceleration available
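A fallback chain like the one above might be configured roughly as follows with the TensorFlow Lite Interpreter API. This is a hedged sketch, not the app's actual initialization code; the `buildInterpreter` function name is an assumption:

```kotlin
import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.gpu.CompatibilityList
import org.tensorflow.lite.gpu.GpuDelegate
import java.nio.MappedByteBuffer

// Illustrative sketch: pick the best available accelerator, then fall back to CPU.
fun buildInterpreter(model: MappedByteBuffer): Interpreter {
    val options = Interpreter.Options()
    if (CompatibilityList().isDelegateSupportedOnThisDevice) {
        options.addDelegate(GpuDelegate())  // GPU: typically ~3-5× faster
    } else {
        options.setUseNNAPI(true)           // try NNAPI hardware acceleration
        options.setNumThreads(4)            // CPU fallback: 4 threads (XNNPACK is used by default on ARM)
    }
    return Interpreter(model, options)
}
```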
## Generating embeddings

The main method processes a face bitmap and returns an embedding.

### Step-by-step process

1. Image preprocessing: Convert the cropped face bitmap to a tensor. The `imageTensorProcessor` applies:

- Resize: scale to 160×160 using bilinear interpolation
- Normalize: divide pixel values by 255 (0-255 → 0.0-1.0)
Normalization matters because:

- FaceNet was trained on normalized images
- It ensures a consistent input distribution
- It improves numerical stability
2. Model inference: The TFLite interpreter:

- Takes the preprocessed image buffer as input
- Runs a forward pass through the neural network
- Returns a 512-dimensional float array
3. Storage: The output is converted to a `FloatArray` and stored in ObjectBox.
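The steps above can be sketched with the TFLite Support Library's image operators. This is an illustrative reconstruction under stated assumptions, not the app's actual source; `getFaceEmbedding` is a hypothetical name:

```kotlin
import android.graphics.Bitmap
import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.support.common.ops.NormalizeOp
import org.tensorflow.lite.support.image.ImageProcessor
import org.tensorflow.lite.support.image.TensorImage
import org.tensorflow.lite.support.image.ops.ResizeOp

// Illustrative sketch of the preprocessing + inference pipeline described above.
val imageTensorProcessor = ImageProcessor.Builder()
    .add(ResizeOp(160, 160, ResizeOp.ResizeMethod.BILINEAR))  // step 1a: resize to 160×160
    .add(NormalizeOp(0f, 255f))                               // step 1b: (x - 0) / 255 → 0.0-1.0
    .build()

fun getFaceEmbedding(interpreter: Interpreter, faceBitmap: Bitmap): FloatArray {
    val input = imageTensorProcessor.process(TensorImage.fromBitmap(faceBitmap))
    val output = Array(1) { FloatArray(512) }  // [1, 512] output tensor
    interpreter.run(input.buffer, output)      // step 2: forward pass
    return output[0]                           // the 512-D embedding, ready for storage
}
```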
## Embedding properties

### Dimensionality

Embeddings live in a 512-dimensional space (or 128-dimensional for the smaller model).

### Normalization

While not L2-normalized by default, embeddings have bounded magnitude due to the network architecture.

### Similarity metric
The app uses cosine similarity to compare embeddings:

- 1.0: Identical vectors (same person, identical image)
- 0.6-0.8: Very similar (same person, different images)
- 0.3-0.5: Somewhat similar (threshold region)
- <0.3: Different people
The app uses a threshold of 0.3 to determine matches: cosine similarity above 0.3 is treated as the same person.
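The comparison described above can be sketched in a few lines of Kotlin. This is a minimal illustration; the function and constant names are assumptions, not the app's actual identifiers:

```kotlin
import kotlin.math.sqrt

// Illustrative threshold matching the documentation above.
const val MATCH_THRESHOLD = 0.3f

fun cosineSimilarity(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "Embeddings must have the same dimension" }
    var dot = 0f
    var normA = 0f
    var normB = 0f
    for (i in a.indices) {
        dot += a[i] * b[i]    // numerator: dot product
        normA += a[i] * a[i]  // squared magnitude of a
        normB += b[i] * b[i]  // squared magnitude of b
    }
    return dot / (sqrt(normA) * sqrt(normB))
}

// Two embeddings are considered the same person when similarity exceeds 0.3.
fun isSamePerson(a: FloatArray, b: FloatArray): Boolean =
    cosineSimilarity(a, b) > MATCH_THRESHOLD
```

Because cosine similarity divides out the magnitudes, it compares only the direction of the two vectors, which is why it works even though the embeddings are not L2-normalized.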
## Switching models

To use the 128-dimensional model instead:

1. Change the model path in FaceNet.kt
2. Update the embedding dimension
3. Update the database schema in DataModels.kt
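A hedged sketch of what those edits might look like; the constant names below are hypothetical, introduced only for illustration:

```kotlin
// Hypothetical constants in FaceNet.kt; names are assumptions, not the real code.
private const val MODEL_PATH = "facenet.tflite"  // step 1: was "facenet_512.tflite"
private const val EMBEDDING_DIM = 128            // step 2: was 512; mirror this change in DataModels.kt
```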
## Performance characteristics

### Latency

Typical embedding generation times:

| Device | GPU | CPU (4 threads) |
|---|---|---|
| High-end | 25-35ms | 80-100ms |
| Mid-range | 35-50ms | 100-150ms |
| Low-end | 50-80ms | 150-250ms |
### Memory

Model memory footprint:

- Loaded model: ~90 MB in RAM
- Intermediate tensors: ~15 MB during inference
- Single embedding: 2 KB (512 floats × 4 bytes)
### Accuracy

128D vs 512D models:

- 512D: Better accuracy, especially with large databases (>100 people)
- 128D: Slightly faster inference, smaller storage, good for small databases
## Quality factors

Embedding quality depends on the input image.

Good inputs:

- Frontal face view (±15° rotation)
- Good lighting (evenly lit face)
- Minimal occlusions (no sunglasses/masks)
- Clear image (not blurry)
- Neutral or slight expression

Poor inputs:

- Profile views (>45° rotation)
- Harsh shadows or backlighting
- Partial occlusions
- Motion blur
- Extreme expressions
## Embedding storage

Embeddings are stored in ObjectBox with HNSW indexing, which provides:

- Fast approximate nearest-neighbor search
- Sublinear query time complexity
- Efficient storage with lossy compression
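An entity with an HNSW-indexed vector field might look roughly like this. The entity and field names are assumptions for illustration; `@HnswIndex` is ObjectBox's vector-search annotation, available in ObjectBox 4.0 and later:

```kotlin
import io.objectbox.annotation.Entity
import io.objectbox.annotation.HnswIndex
import io.objectbox.annotation.Id

// Illustrative sketch of an ObjectBox entity storing a face embedding
// with an HNSW index for approximate nearest-neighbor search.
@Entity
data class FaceRecord(
    @Id var id: Long = 0,
    var personName: String = "",
    @HnswIndex(dimensions = 512)  // must match the embedding dimension (512 or 128)
    var embedding: FloatArray = FloatArray(512)
)
```

A lookup would then query the generated property for nearest neighbors, e.g. something like `box.query(FaceRecord_.embedding.nearestNeighbors(queryEmbedding, 10))`; the exact query API depends on the ObjectBox version in use.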