Overview

Face detection is the critical first stage of the vital signs monitoring pipeline. The system identifies the facial region in each video frame and extracts a stable Region of Interest (ROI) that serves as input to the EVM processing chain.
The system supports four detection backends (Haar Cascade, MTCNN, YOLO, MediaPipe) behind a unified interface, with built-in temporal stabilization to reduce detection jitter.

FaceDetector Architecture

The FaceDetector class provides a unified interface to multiple detection models with integrated stabilization:
class FaceDetector:
    """
    Unified face detector manager with ROI stabilization.
    
    Attributes:
        MODELS (dict): Available face detection models.
        detector (BaseFaceDetector): Current detector instance.
        roi_history (deque): History of recent ROI detections for stabilization.
        stabilized_roi (tuple or None): Current stabilized ROI coordinates.
    """
    
    MODELS = {
        "haar": HaarCascadeDetector,        # Haar Cascade classifier
        "mtcnn": MTCNNDetector,             # Multi-task Cascaded CNN
        "yolo": YOLODetector,               # YOLO-based detector
        "mediapipe": MediaPipeDetector      # MediaPipe Face Detection
    }
Source: src/face_detector/manager.py:13-41

Detection Backends

Haar Cascade

Type: Classical computer vision
Pros:
  • Extremely fast (CPU-only)
  • No dependencies on deep learning frameworks
  • Works well for frontal faces
Cons:
  • Lower accuracy on rotated faces
  • Sensitive to lighting conditions
  • More false positives
Best for: Resource-constrained devices, real-time processing

MTCNN

Type: Multi-stage CNN
Pros:
  • High accuracy
  • Handles rotation and scale variations
  • Includes facial landmark detection
Cons:
  • Slower than Haar
  • Requires more CPU/GPU resources
Best for: Accuracy-critical applications

YOLO

Type: Single-shot detector
Pros:
  • Excellent speed/accuracy trade-off
  • Robust to occlusions
  • Supports YOLOv8 and YOLOv12 models
Cons:
  • Requires model weights file
  • Higher memory footprint
Best for: Production deployments

MediaPipe

Type: Google’s ML solution
Pros:
  • Optimized for real-time performance
  • Cross-platform support
  • Built-in face mesh capability
Cons:
  • Requires MediaPipe framework
  • Less customizable
Best for: Mobile and web applications

Model Configuration

YOLO models require pre-trained weights specified in configuration:
# YOLO model paths
YOLO_MODELS = {
    "yolov8n": "src/weights_models/yolov8n-face.pt",   # YOLOv8 nano
    "yolov12n": "src/weights_models/yolov12n-face.pt"  # YOLOv12 nano
}
Source: src/config.py:25-29
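A small lookup helper (hypothetical; not part of the source tree) shows how a weights path might be resolved from this mapping before handing it to the YOLO backend:

```python
# Mirrors the YOLO_MODELS mapping from src/config.py
YOLO_MODELS = {
    "yolov8n": "src/weights_models/yolov8n-face.pt",   # YOLOv8 nano
    "yolov12n": "src/weights_models/yolov12n-face.pt"  # YOLOv12 nano
}

def resolve_yolo_weights(name):
    """Hypothetical helper: look up a weights path, failing fast on unknown names."""
    if name not in YOLO_MODELS:
        raise ValueError(f"Unknown YOLO model '{name}'; choose from {sorted(YOLO_MODELS)}")
    return YOLO_MODELS[name]

print(resolve_yolo_weights("yolov8n"))  # src/weights_models/yolov8n-face.pt
```

Failing fast on an unknown model name surfaces configuration mistakes at startup rather than mid-capture.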

ROI Stabilization

Raw face detection results often exhibit frame-to-frame jitter due to detection uncertainty, head motion, and algorithmic noise. The stabilization system addresses this through temporal smoothing.

Stabilization Buffer

def __init__(self, model_type="haar", **kwargs):
    # Initialize the selected detector
    self.detector = self.MODELS[model_type](**kwargs)
    
    # ROI stabilization buffers
    self.roi_history = deque(maxlen=5)  # Store last 5 detections
    self.stabilized_roi = None          # Current stabilized ROI
Source: src/face_detector/manager.py:43-55
The system maintains a sliding window of the 5 most recent detections using a deque for efficient O(1) append and pop operations.
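A minimal sketch of the sliding-window behavior, using hypothetical ROI tuples:

```python
from collections import deque

roi_history = deque(maxlen=5)  # same buffer shape as FaceDetector uses
for i in range(7):
    roi_history.append((100 + i, 80, 64, 64))  # hypothetical (x, y, w, h) detections

# maxlen=5 silently evicts the oldest entry, keeping only the last 5
print(len(roi_history))   # 5
print(roi_history[0])     # (102, 80, 64, 64) -- detections 0 and 1 were evicted
```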

Weighted Averaging

The stabilization algorithm applies weighted averaging with linearly increasing weights, giving recent frames more influence:
def stabilize_roi(self, new_roi):
    """
    Apply temporal smoothing to ROI using weighted averaging.
    
    Uses a weighted average of recent ROIs to reduce jitter in detection.
    Recent detections are given higher weight according to ROI_WEIGHTS.
    """
    if new_roi is None:
        # Return previous stabilized ROI if no new detection
        return self.stabilized_roi

    # Add new detection to history
    self.roi_history.append(new_roi)

    # Need minimum history for stabilization
    if len(self.roi_history) < 3:
        return new_roi
    
    # Calculate weighted average of historical ROIs
    weights = ROI_WEIGHTS[:len(self.roi_history)]
    weights = [w / sum(weights) for w in weights]  # Normalize weights

    # Weighted average calculation for each dimension
    x_roi = int(sum(r[0] * w for r, w in zip(self.roi_history, weights)))
    y_roi = int(sum(r[1] * w for r, w in zip(self.roi_history, weights)))
    w_roi = int(sum(r[2] * w for r, w in zip(self.roi_history, weights)))
    h_roi = int(sum(r[3] * w for r, w in zip(self.roi_history, weights)))

    # Update and return stabilized ROI
    self.stabilized_roi = (x_roi, y_roi, w_roi, h_roi)
    return self.stabilized_roi
Source: src/face_detector/manager.py:57-93
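A worked example with three hypothetical detections shows how the renormalized weights pull the stabilized box toward recent frames (ROI_WEIGHTS as defined in src/config.py):

```python
ROI_WEIGHTS = [0.1, 0.15, 0.2, 0.25, 0.3]  # from src/config.py

# Three hypothetical detections, oldest first; only x drifts rightward
history = [(100, 80, 64, 64), (110, 80, 64, 64), (120, 80, 64, 64)]

# With 3 entries, the first 3 weights are taken and renormalized to sum to 1
weights = ROI_WEIGHTS[:len(history)]
weights = [w / sum(weights) for w in weights]  # ~[0.22, 0.33, 0.44]

x = int(sum(r[0] * w for r, w in zip(history, weights)))
print(x)  # 112 -- pulled toward the newest detection (120), not the plain mean (110)
```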

Stabilization Weights

The weighting scheme prioritizes recent detections while still considering historical context:
ROI_WEIGHTS = [0.1, 0.15, 0.2, 0.25, 0.3]  # Weights for ROI smoothing
Source: src/config.py:33
The weights [0.1, 0.15, 0.2, 0.25, 0.3] provide:
  • Responsiveness: 30% weight on newest frame allows tracking of head movement
  • Stability: 70% weight on historical frames smooths out detection jitter
  • Linear increase: Simple pattern balances past and present
Alternative schemes considered:
  • Exponential weights: [0.05, 0.10, 0.15, 0.25, 0.45] (more responsive, less stable)
  • Uniform weights: [0.2, 0.2, 0.2, 0.2, 0.2] (more stable, less responsive)
  • The current linear scheme provides the best balance for vital signs monitoring

Change Detection Threshold

To avoid unnecessary updates for minimal changes, the system includes a significance threshold:
def is_significant_change(self, old_roi, new_roi):
    """
    Check if ROI change exceeds threshold to avoid unnecessary updates.
    
    Returns:
        bool: True if change is significant, False otherwise.
    """
    if old_roi is None or new_roi is None:
        return True

    # Extract coordinates
    x1, y1, w1, h1 = old_roi
    x2, y2, w2, h2 = new_roi

    # Calculate absolute differences
    dx = abs(x1 - x2)
    dy = abs(y1 - y2)
    dw = abs(w1 - w2)
    dh = abs(h1 - h2)

    # Check if any dimension change exceeds threshold
    return (dx > ROI_CHANGE_THRESHOLD or dy > ROI_CHANGE_THRESHOLD or 
            dw > ROI_CHANGE_THRESHOLD or dh > ROI_CHANGE_THRESHOLD)
Source: src/face_detector/manager.py:132-158
The threshold is configured to ignore small fluctuations:
ROI_CHANGE_THRESHOLD = 20  # Pixels
Source: src/config.py:32
A 20-pixel threshold means that detection variations smaller than 20 pixels in any dimension (x, y, width, height) are ignored, preventing micro-jitter while still tracking genuine head movements.
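The threshold logic condenses to a standalone sketch (an equivalent per-dimension check, applied here to hypothetical ROIs):

```python
ROI_CHANGE_THRESHOLD = 20  # pixels, from src/config.py

def is_significant_change(old_roi, new_roi, threshold=ROI_CHANGE_THRESHOLD):
    """Compact equivalent of the per-dimension check above."""
    if old_roi is None or new_roi is None:
        return True  # a missing ROI always forces an update
    return any(abs(a - b) > threshold for a, b in zip(old_roi, new_roi))

print(is_significant_change((100, 80, 64, 64), (110, 85, 64, 64)))  # False: micro-jitter
print(is_significant_change((100, 80, 64, 64), (130, 80, 64, 64)))  # True: dx=30 > 20
```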

Face Detection Pipeline

The complete detection process with stabilization:
def detect_face(self, frame):
    """
    Detect face in frame with stabilization and bounds checking.
    
    Returns:
        tuple or None: Stabilized ROI coordinates (x, y, w, h) or None.
    """
    # Get detection from underlying detector
    roi = self.detector.detect(frame)
    if roi is None:
        return None

    # Skip stabilization if change is not significant
    if self.stabilized_roi and not self.is_significant_change(self.stabilized_roi, roi):
        return self.stabilized_roi
    
    # Apply stabilization to new detection
    stable = self.stabilize_roi(roi)
    if stable:
        sx, sy, sw, sh = stable
        
        # Ensure ROI stays within frame boundaries
        sx = max(0, sx)
        sy = max(0, sy)
        sw = min(frame.shape[1] - sx, sw)
        sh = min(frame.shape[0] - sy, sh)
        
        return (sx, sy, sw, sh)
    
    return None
Source: src/face_detector/manager.py:95-130
1. Raw Detection: Call the underlying detector’s detect() method to get the initial ROI.
2. Significance Check: Compare the new detection with the previous stable ROI. If the change is below the threshold, return the previous ROI without updating.
3. Weighted Stabilization: If the change is significant, add the detection to the history buffer and compute the weighted average.
4. Boundary Clipping: Clamp the ROI coordinates to the frame boundaries to prevent out-of-bounds errors.

ROI Extraction and Sizing

Once the face is detected, the ROI is extracted and optionally resized for processing:
# ROI sizing configuration
TARGET_ROI_SIZE = (320, 240)  # Width × Height
ROI_PADDING = 10               # Pixels added around detected face
Source: src/config.py:17-23
Rationale for downsampling:
  1. Computational efficiency: Lower resolution → faster pyramid construction and filtering
  2. Memory footprint: 320×240 = 76,800 pixels vs. 1920×1080 = 2,073,600 pixels (27× reduction)
  3. Signal quality: EVM operates on spatial averages; high resolution provides diminishing returns
  4. Physiological frequency preservation: Temporal resolution (FPS) matters more than spatial resolution
Trade-offs:
  • Loss of fine spatial detail (acceptable for vital signs)
  • Faster processing (critical for real-time performance)
  • Reduced noise (larger effective averaging area)
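The padding and clipping steps can be sketched as follows, assuming a NumPy BGR frame and the ROI_PADDING value above; the extract_padded_roi helper is hypothetical, and resizing to TARGET_ROI_SIZE would follow via cv2.resize:

```python
import numpy as np

ROI_PADDING = 10  # from src/config.py

def extract_padded_roi(frame, roi, padding=ROI_PADDING):
    """Hypothetical helper: crop ROI plus padding, clipped to frame bounds."""
    x, y, w, h = roi
    y0, x0 = max(0, y - padding), max(0, x - padding)
    y1 = min(frame.shape[0], y + h + padding)
    x1 = min(frame.shape[1], x + w + padding)
    return frame[y0:y1, x0:x1]

frame = np.zeros((480, 640, 3), dtype=np.uint8)  # hypothetical 640x480 BGR frame
crop = extract_padded_roi(frame, (5, 5, 64, 64))
print(crop.shape)  # (79, 79, 3): top-left padding is clipped at the frame edge
```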

Detector Base Interface

All detector backends implement the BaseFaceDetector abstract class:
class BaseFaceDetector(ABC):
    """
    Abstract base class for face detection implementations.
    """
    
    @abstractmethod
    def detect(self, frame):
        """
        Detect faces in the provided frame.
        
        Args:
            frame (numpy.ndarray): Input image in BGR format.
            
        Returns:
            tuple or None: Face coordinates as (x, y, w, h) or None if no face detected.
        """
        pass

    @abstractmethod
    def close(self):
        """
        Release any resources used by the detector.
        """
        pass
Source: src/face_detector/base.py:1-34
This abstraction allows seamless swapping of detection backends without changing downstream code.
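A hypothetical stub backend illustrates the contract: any class implementing detect() and close() can be swapped in without touching downstream code.

```python
from abc import ABC, abstractmethod

class BaseFaceDetector(ABC):
    """Same contract as src/face_detector/base.py."""
    @abstractmethod
    def detect(self, frame): ...
    @abstractmethod
    def close(self): ...

class CenterStubDetector(BaseFaceDetector):
    """Hypothetical backend: always returns a centered box (useful in tests)."""
    def detect(self, frame):
        h, w = frame.shape[:2] if hasattr(frame, "shape") else (480, 640)
        return (w // 4, h // 4, w // 2, h // 2)  # quarter-inset box
    def close(self):
        pass  # nothing to release

det = CenterStubDetector()
print(det.detect(None))  # (160, 120, 320, 240) with the 640x480 fallback
det.close()
```

A stub like this is handy for exercising the stabilization pipeline without loading any real model.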

Performance Characteristics

Detection Speed (Approximate)

Backend       | FPS (CPU) | FPS (GPU) | Latency | Accuracy
------------- | --------- | --------- | ------- | --------
Haar Cascade  | 60-100    | N/A       | ~10ms   | Medium
MTCNN         | 10-20     | 30-50     | ~50ms   | High
YOLO (v8n)    | 30-45     | 100-200   | ~20ms   | High
MediaPipe     | 40-60     | 80-120    | ~15ms   | High
Performance varies significantly based on hardware. Values shown are for 640×480 input on typical hardware (Intel i5, NVIDIA GTX 1060).

Stabilization Impact

Without Stabilization:
  • ROI jitter: ±5-20 pixels per frame
  • Signal noise: High-frequency artifacts in temporal signal
  • Detection failures: Occasional frame drops create signal discontinuities
With Stabilization:
  • ROI jitter: ±1-3 pixels per frame (80-90% reduction)
  • Signal noise: Smoother temporal signals improve FFT quality
  • Detection failures: Historical buffer maintains ROI through brief failures
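One simple way to quantify the improvement is the mean absolute frame-to-frame displacement of the ROI origin; a sketch with hypothetical ROI sequences:

```python
def mean_jitter(rois):
    """Mean absolute frame-to-frame change of the ROI's x origin, in pixels."""
    diffs = [abs(b[0] - a[0]) for a, b in zip(rois, rois[1:])]
    return sum(diffs) / len(diffs)

# Hypothetical sequences: raw detections oscillate, stabilized ones drift gently
raw = [(100, 80, 64, 64), (115, 80, 64, 64), (100, 80, 64, 64), (115, 80, 64, 64)]
stabilized = [(100, 80, 64, 64), (102, 80, 64, 64), (104, 80, 64, 64), (106, 80, 64, 64)]

print(mean_jitter(raw))         # 15.0
print(mean_jitter(stabilized))  # 2.0
```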

Usage Example

Basic Usage

from src.face_detector.manager import FaceDetector
import cv2

# Initialize detector (default: Haar Cascade)
detector = FaceDetector(model_type="haar")

# Process frame
frame = cv2.imread("face.jpg")
roi = detector.detect_face(frame)

if roi:
    x, y, w, h = roi
    face_region = frame[y:y+h, x:x+w]
    print(f"Face detected at ({x}, {y}) with size {w}×{h}")
else:
    print("No face detected")

# Cleanup
detector.close()

Advanced: Multiple Backends

# Compare different backends
backends = ["haar", "mtcnn", "yolo", "mediapipe"]
results = {}

for backend in backends:
    detector = FaceDetector(model_type=backend)
    roi = detector.detect_face(frame)
    results[backend] = roi
    detector.close()

print("Detection results:", results)

Integration with EVM

from src.face_detector.manager import FaceDetector
from src.evm.evm_manager import process_video_evm_vital_signs
import cv2

detector = FaceDetector(model_type="yolo")
video_frames = []

# Capture video and extract ROIs
cap = cv2.VideoCapture(0)
for _ in range(90):  # 3 seconds at 30 FPS
    ret, frame = cap.read()
    if ret:
        roi = detector.detect_face(frame)
        if roi:
            x, y, w, h = roi
            roi_frame = frame[y:y+h, x:x+w]
            roi_frame = cv2.resize(roi_frame, (320, 240))
            video_frames.append(roi_frame)

cap.release()
detector.close()

# Process with EVM
results = process_video_evm_vital_signs(video_frames, verbose=True)
print(f"Heart Rate: {results['heart_rate']} BPM")
print(f"Respiratory Rate: {results['respiratory_rate']} RPM")

Troubleshooting

Face Not Detected

Possible causes:
  • Poor lighting conditions
  • Face too small or too large in frame
  • Extreme head pose (profile view)
  • Occlusions (glasses, mask, hair)
Solutions:
  • Ensure adequate, even lighting
  • Position camera 0.5-2 meters from subject
  • Use frontal or near-frontal poses
  • Try different detector backend (MTCNN or YOLO for challenging cases)

Unstable or Jittery ROI

Possible causes:
  • Detector confidence fluctuations
  • Subject movement
  • Insufficient stabilization history
Solutions:
  • Wait for 3-5 frames to build stabilization buffer
  • Increase ROI_WEIGHTS for older frames
  • Reduce ROI_CHANGE_THRESHOLD for more aggressive stabilization

Multiple Faces in Frame

Behavior: Most detectors return only the largest or most confident face.
If this causes issues:
  • Modify detector to track specific person (custom logic needed)
  • Ensure only one person in frame
  • Use detector with face recognition capability (e.g., MTCNN with embedding comparison)

Related Pages

  • Eulerian Video Magnification: Learn how ROI frames are processed for vital signs
  • Signal Processing: Understand how a stable ROI improves signal quality
  • System Overview: See face detection in the full pipeline
  • API Reference: Explore the FaceDetector API documentation
