Overview
EmoChat is built as a client-server web application that performs real-time emotion recognition. The system integrates computer vision, machine learning, and generative AI to provide an empathetic emotional analysis experience.
Architecture Diagram
System Components
1. Frontend (JavaScript + HTML)
Files: `index.html`, `main.js`
The web-based user interface handles:
Webcam Access
Frame Capture
Captures frames every 1 second:
Image Encoding
Session Recording
Tracks emotions during 30-second sessions:
Frontend Responsibilities:
- Camera permissions and access
- Real-time video streaming
- Frame extraction and encoding
- UI updates and user feedback
- Session management
- Gemini AI result display
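The client performs frame encoding in JavaScript (via the canvas API), but the wire format is easy to illustrate. A minimal Python sketch of the same data-URL round trip the server must undo; the function names are illustrative, not the project's:

```python
import base64

def encode_frame(jpeg_bytes: bytes) -> str:
    """Wrap raw JPEG bytes in the data-URL format the client sends."""
    return "data:image/jpeg;base64," + base64.b64encode(jpeg_bytes).decode("ascii")

def decode_frame(data_url: str) -> bytes:
    """Strip the data-URL prefix and recover the raw JPEG bytes."""
    _, _, payload = data_url.partition(",")
    return base64.b64decode(payload)

# Round trip: raw bytes -> data URL -> raw bytes
frame = b"\xff\xd8\xff\xe0fake-jpeg"
assert decode_frame(encode_frame(frame)) == frame
```

The `data:image/jpeg;base64,` prefix is what `canvas.toDataURL("image/jpeg", 0.8)` produces on the client, so the server only needs to split on the first comma before decoding.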
2. Backend (Flask Server)
File: `app.py`
The Python Flask server provides two main endpoints:
/predict Endpoint
Handles real-time emotion detection:
/analyze_session Endpoint
Handles session analysis with Gemini AI:
Backend Responsibilities:
- HTTP request/response handling
- Image decoding and preprocessing
- Facial landmark detection
- ML model inference
- Gemini AI integration
- Error handling and validation
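The two routes can be sketched as a minimal Flask app. This is a sketch, not the real `app.py`: `detect_emotion` and `analyze_with_gemini` are placeholder names standing in for the image pipeline and the Gemini call.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def detect_emotion(image_b64: str) -> str:
    """Placeholder for the real pipeline: decode, detect face, predict."""
    return "happy"

def analyze_with_gemini(context: str, emotions: list) -> str:
    """Placeholder for the Gemini API call."""
    return f"Analysis of {len(emotions)} readings for: {context}"

@app.route("/predict", methods=["POST"])
def predict():
    data = request.get_json(silent=True)
    if not data or "image" not in data:
        return jsonify({"error": "missing image"}), 400
    return jsonify({"emotion": detect_emotion(data["image"])})

@app.route("/analyze_session", methods=["POST"])
def analyze_session():
    data = request.get_json(silent=True)
    if not data or "emotions" not in data:
        return jsonify({"error": "missing emotions"}), 400
    analysis = analyze_with_gemini(data.get("context", ""), data["emotions"])
    return jsonify({"analysis": analysis})
```

Validating the JSON body before touching the pipeline keeps malformed requests from reaching the model code, which is the "request validation" responsibility listed above.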
3. Computer Vision Module
File: `utils.py`
Provides core facial analysis functionality:
4. Machine Learning Pipeline
Training Files: `prepare_data.py`, `train_model.py`
Inference: model loaded in `app.py`
5. External Integration
Gemini AI: Google's generative AI for empathetic analysis
Data Flow
Emotion Detection Flow
1. Webcam Capture (Client)
   - Browser requests camera access
   - Video stream starts at native resolution
   - Canvas element captures frames
2. Frame Processing (Client)
   - Every 1 second, draw the video frame to the canvas
   - Convert the canvas to JPEG at 80% quality
   - Encode as a Base64 data URL
3. Network Transfer (Client → Server)
   - HTTP POST to `/predict`
   - JSON payload with Base64 image
   - Async fetch request
4. Image Decoding (Server)
   - Parse the Base64 string
   - Decode to a NumPy array
   - Convert to OpenCV BGR format
5. Face Detection (Server)
   - Haar Cascade scans the grayscale image
   - Returns bounding boxes of detected faces
   - Only the first detected face is processed
6. Landmark Extraction (Server)
   - LBF model fits 68 points to the face
   - Returns raw (x, y) coordinates
7. Feature Normalization (Server)
   - Calculate the face bounding box
   - Normalize each coordinate to [0, 1]
   - Return a flat 136-element array
8. Model Prediction (Server)
   - Pass features to the Random Forest
   - 200 trees vote on the classification
   - Return the majority class (0 or 1)
9. Response (Server → Client)
   - Map the integer to an emotion label
   - Return JSON with the emotion string
10. UI Update (Client)
    - Display the emotion in the overlay
    - Update CSS classes for styling
    - Track the emotion if a session is being recorded
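Step 7, feature normalization, is a pure function and easy to sketch. Assuming the landmarks arrive as 68 (x, y) pairs, a minimal version (function name is illustrative):

```python
import numpy as np

def normalize_landmarks(points: np.ndarray) -> np.ndarray:
    """Scale (x, y) landmarks into [0, 1] relative to their bounding box
    and flatten them; 68 points yield a 136-element feature vector."""
    points = np.asarray(points, dtype=np.float64)
    mins = points.min(axis=0)            # top-left corner of the bounding box
    spans = points.max(axis=0) - mins    # width and height of the box
    spans[spans == 0] = 1.0              # guard against a degenerate box
    return ((points - mins) / spans).flatten()
```

Normalizing to the face's own bounding box makes the features invariant to where the face sits in the frame and how large it appears, which is why the classifier can work on raw webcam frames at any resolution.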
Session Analysis Flow
1. User Input (Client)
   - User provides context (what they'll talk about)
   - Clicks the "Grabar Análisis (30s)" ("Record Analysis (30s)") button
2. Recording Session (Client)
   - A 30-second countdown starts
   - Each detected emotion is appended to an array
   - Timer displays the remaining seconds
3. Session Completion (Client)
   - After 30 seconds, recording stops
   - Emotion array and context are packaged
4. API Request (Client → Server)
   - HTTP POST to `/analyze_session`
   - JSON with context text and emotions array
5. Prompt Construction (Server)
   - Format the user context
   - Include the emotion timeline
   - Add an empathetic instruction
6. Gemini API Call (Server → External)
   - Send the prompt to Gemini 2.5 Flash
   - Wait for the AI-generated response
7. Response (Server → Client)
   - Return Gemini's analysis text
   - Display in the UI results section
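Step 5, prompt construction, amounts to a string template. The exact wording the server sends is not shown here, so the instruction text below is illustrative:

```python
from collections import Counter

def build_prompt(context: str, emotions: list) -> str:
    """Combine the user's context and the recorded emotion timeline
    into a single prompt for the generative model."""
    timeline = ", ".join(emotions)
    summary = ", ".join(f"{e}: {n}" for e, n in Counter(emotions).most_common())
    return (
        f"The user said they would talk about: {context}\n"
        f"Emotions detected during the 30-second session: {timeline}\n"
        f"Summary: {summary}\n"
        "Please respond with a brief, empathetic analysis of how the user "
        "may have felt while speaking."
    )
```

Including both the raw timeline and an aggregated summary gives the model ordering information (when emotions shifted) as well as overall proportions.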
File Structure
Component Responsibilities
index.html
Purpose: Web application UI structure
Key Elements:
- Video element for webcam stream
- Canvas for frame capture (hidden)
- Control buttons (Start, Stop, Record)
- Context input textarea
- Results display areas
- Emotion information cards
main.js
Purpose: Client-side interaction logic
Responsibilities:
- Webcam initialization and control
- Frame capture and encoding
- API communication
- Session recording management
- UI updates based on responses
- Error handling and user feedback
Key Functions:
- `sendFrameForPrediction()` - Sends frames to `/predict`
- `startRecording()` - Begins 30s emotion tracking
- `stopRecordingAndAnalyze()` - Sends results to `/analyze_session`
app.py
Purpose: HTTP server and API layer
Responsibilities:
- Flask application setup
- Route handling (`/`, `/predict`, `/analyze_session`)
- Request validation
- Image decoding
- Model orchestration
- Response formatting
- Error handling
utils.py
Purpose: Computer vision utilities
Responsibilities:
- Model download and caching
- Face detection configuration
- Landmark extraction
- Feature normalization
- Optional visualization
train_model.py
Purpose: ML model training
Responsibilities:
- Load training data from `data.txt`
- Split into train/test sets (80/20)
- Train Random Forest classifier
- Evaluate accuracy and confusion matrix
- Serialize model to disk
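A condensed sketch of this training flow, assuming `data.txt` holds one row per image with 136 landmark features followed by an integer label (the column layout and output filename are assumptions, not confirmed by the source):

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

def train(data: np.ndarray, n_trees: int = 200) -> RandomForestClassifier:
    """Split features/labels 80/20, fit a Random Forest, report metrics."""
    X, y = data[:, :-1], data[:, -1].astype(int)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    model = RandomForestClassifier(n_estimators=n_trees, random_state=42)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, preds))
    print(confusion_matrix(y_test, preds))
    return model

# Example usage (filenames are assumed):
# data = np.loadtxt("data.txt")        # as produced by prepare_data.py
# model = train(data)
# with open("model.pkl", "wb") as f:   # serialize to disk
#     pickle.dump(model, f)
```

Fixing `random_state` makes the split and the forest reproducible across runs, which keeps the reported accuracy comparable between training sessions.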
prepare_data.py
Purpose: Training data preprocessing
Responsibilities:
- Read images from emotion folders
- Extract facial landmarks from each image
- Assign integer labels (alphabetical order)
- Save as NumPy text file
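The alphabetical label assignment is worth making explicit, since it determines which emotion becomes class 0. A sketch (the folder names here are examples, not the project's actual dataset):

```python
def assign_labels(folder_names):
    """Map emotion folder names to integer labels in alphabetical order,
    so the mapping is deterministic regardless of directory listing order."""
    return {name: i for i, name in enumerate(sorted(folder_names))}

labels = assign_labels(["sad", "happy"])
# "happy" sorts first alphabetically, so it becomes class 0
assert labels == {"happy": 0, "sad": 1}
```

Because the mapping is derived from sorted names rather than filesystem order, retraining on another machine produces the same integer labels, which the inference code in `app.py` depends on when mapping predictions back to emotion strings.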
test_model.py
Purpose: Real-time model testing
Responsibilities:
- Open webcam stream
- Process frames in real-time
- Display landmarks and predictions
- Verify model before deployment
Technology Stack
Frontend
HTML5
Semantic structure, video element, canvas API
JavaScript (ES6+)
Async/await, MediaDevices API, Fetch API
CSS3
Modern styling, animations, responsive design
Backend
Python 3
Core language for all backend logic
Flask
Lightweight web framework for API
OpenCV
Computer vision and facial analysis
Machine Learning
scikit-learn
Random Forest classifier, metrics
NumPy
Numerical computing and arrays
Pickle
Model serialization
External Services
Google Gemini AI
Generative AI for empathetic session analysis (gemini-2.5-flash model)
Deployment Considerations
Local Development
Must be accessed via `http://` (not `file://`) for webcam permissions to work.
Production Deployment
Environment Variables
Dependencies
Performance Characteristics
Latency Breakdown
Typical processing time for one frame:

| Stage | Time | Notes |
|---|---|---|
| Network transfer | 10-50ms | Depends on connection |
| Base64 decode | 5-10ms | Image size dependent |
| Face detection | 10-30ms | Varies with image complexity |
| Landmark extraction | 20-40ms | Fixed 68 points |
| Normalization | <1ms | Simple math operations |
| Model prediction | <1ms | Random Forest inference |
| Response encoding | <1ms | Small JSON payload |
| Total | 50-130ms | Well under the 1-second budget |
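Per-stage numbers like those above can be collected by wrapping each pipeline stage in a simple timer. A minimal sketch (the stage name and sleep are stand-ins for real work):

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage: str):
    """Record the elapsed wall-clock time of a pipeline stage in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = (time.perf_counter() - start) * 1000

with timed("decode"):
    time.sleep(0.005)  # stand-in for Base64 decoding work

print(f"decode: {timings['decode']:.1f} ms")
```

`time.perf_counter()` is monotonic and high-resolution, which makes it the right clock for sub-millisecond stage timings; `time.time()` can jump if the system clock is adjusted.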
Scalability
Current Limitations:
- Synchronous processing (one request at a time)
- No request queuing
- Single-threaded Flask server
- No caching
Recommended Improvements:
- Use a production WSGI server with multiple workers
- Implement request queuing (Celery, RQ)
- Cache model in shared memory
- Use GPU for OpenCV operations (if available)
- CDN for static assets
Security Considerations
Data Privacy
From the UI (index.html:207):
Error Handling
Client-Side Errors
Server-Side Errors
Next Steps
- Emotion Recognition: Deep dive into facial landmark detection
- ML Model: Understand the Random Forest classifier

