Overview
VoicePact’s voice processing pipeline transforms spoken business agreements into structured, machine-readable contract terms. The system uses Whisper for speech-to-text transcription, combined with custom NLP extraction logic that identifies key contract elements such as parties, products, pricing, and delivery terms.
The voice processor achieves high accuracy on agricultural supply agreements, which are common in East African informal markets. The confidence scoring system flags extractions that need manual review.
Processing Pipeline
The voice-to-contract flow follows these stages:
1. Audio Acquisition
The system supports two input methods:
- URL-based: Downloads audio from Africa’s Talking voice recording URLs
- File-based: Processes local audio files for testing
See voice_processor.py:68-78 for the acquisition logic.
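The two input methods can be sketched as one helper that always hands back a local path. This is a hypothetical acquire_audio() function, not the code at voice_processor.py:68-78. It assumes recording URLs are plain HTTP(S) links, uses only the standard library, and blocks while downloading, so an async service would call it through an executor.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def acquire_audio(source: str, timeout: float = 30.0) -> Path:
    """Return a local path for `source`, downloading it first if it is an HTTP(S) URL.

    Hypothetical helper mirroring the URL-based and file-based input methods.
    """
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        # Keep the URL's extension (if any) so later format checks still see it.
        suffix = Path(parsed.path).suffix
        with urllib.request.urlopen(source, timeout=timeout) as response:
            with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
                shutil.copyfileobj(response, tmp)
        return Path(tmp.name)
    # Anything else is treated as a local file path (e.g. test recordings).
    return Path(source)
```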
2. Audio Validation
Before processing, the system validates:
- File existence and accessibility
- File size limits (configurable via settings.max_audio_file_size)
- Format support (WAV, MP3, M4A, etc.)
See voice_processor.py:57-66 for the validation logic.
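The three checks above can be expressed as a single guard that raises before any expensive work starts. This is a hypothetical validate_audio_file(); the extension set is an assumption (the page lists only "WAV, MP3, M4A, etc."), and max_size stands in for settings.max_audio_file_size. VoiceProcessingError is defined locally so the block runs on its own.

```python
import os
from pathlib import Path

# Assumed extension list; the page only names WAV, MP3, M4A "etc."
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".ogg", ".flac"}


class VoiceProcessingError(Exception):
    """Critical voice-processing failure (class name taken from this page)."""


def validate_audio_file(path: Path, max_size: int) -> None:
    """Raise VoiceProcessingError if `path` is missing, unreadable, too large, or unsupported."""
    if not path.is_file():
        raise VoiceProcessingError(f"audio file not found: {path}")
    if not os.access(path, os.R_OK):
        raise VoiceProcessingError(f"audio file is not readable: {path}")
    size = path.stat().st_size
    if size == 0 or size > max_size:
        raise VoiceProcessingError(f"audio file size {size} bytes outside (0, {max_size}]")
    # Extension check only; it does not inspect the file's actual encoding.
    if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        raise VoiceProcessingError(f"unsupported audio format: {path.suffix or '(none)'}")
```

Checking the extension is cheap but trusts the filename; sniffing the container header would be stricter at the cost of more code.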
3. Speech Transcription
VoicePact runs OpenAI Whisper locally for privacy and cost efficiency:
- Async processing via an executor to avoid blocking the event loop
- English language default (multi-language support planned)
- Returns full transcript as plain text
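A minimal sketch of the executor pattern, assuming the open-source openai-whisper package (whisper.load_model() and model.transcribe(), which returns a dict with a "text" key). The WhisperTranscriber class and its injectable loader are hypothetical; the loader hook only exists so the class can be exercised without downloading a model.

```python
import asyncio
import functools
from typing import Any, Callable, Optional


class WhisperTranscriber:
    """Loads a Whisper model once and runs transcription in a worker thread (hypothetical class)."""

    def __init__(self, model_name: str = "base",
                 loader: Optional[Callable[[str], Any]] = None) -> None:
        self.model_name = model_name
        self._loader = loader  # injectable for tests; defaults to whisper.load_model
        self._model: Any = None

    def load(self) -> Any:
        """Load the model on first use and cache it; call this at application startup."""
        if self._model is None:
            loader = self._loader
            if loader is None:
                import whisper  # openai-whisper package, imported lazily
                loader = whisper.load_model
            self._model = loader(self.model_name)
        return self._model

    async def transcribe(self, audio_path: str, language: str = "en") -> str:
        """Transcribe audio without blocking the event loop; returns plain text."""
        model = self.load()
        loop = asyncio.get_running_loop()
        # model.transcribe is synchronous and compute-heavy, so run it in the default executor.
        result = await loop.run_in_executor(
            None, functools.partial(model.transcribe, audio_path, language=language)
        )
        return result["text"].strip()
```

Calling load() once during startup gives the load-once, reuse-across-requests behaviour this page describes; each request then only dispatches the blocking call to the thread pool.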
Whisper models are loaded once at startup and reused across requests, so no request pays the model-loading cost. The base model offers a good balance of speed and accuracy for production use.
4. Contract Term Extraction
The extract_contract_terms() method (voice_processor.py:108-124) uses pattern matching to extract:
| Field | Example Phrase | Extractor (voice_processor.py lines) |
|---|---|---|
| Product | “50 bags of maize” | _extract_product() (126-141) |
| Quantity | “100 bags”, “2 tons” | _extract_quantity() (143-155) |
| Unit Price | “KES 3000 per bag” | _extract_unit_price() (166-182) |
| Total Amount | “Total KES 150,000” | _extract_total_amount() (184-200) |
| Delivery Location | “deliver to Nairobi” | _extract_location() (212-226) |
| Delivery Deadline | “by March 15th” | _extract_deadline() (228-245) |
| Quality | “Grade A”, “dry maize” | _extract_quality() (247-261) |
| Payment Terms | “30% upfront” | _extract_payment_terms() (288-300) |
See voice_processor.py:166-182 for the unit price extractor.
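To illustrate the pattern-matching approach, here is a compact regex-based extractor covering most fields in the table. The patterns, field names, and unit list are assumptions chosen to match the example phrases above; they are not copied from voice_processor.py, do not cover parties, and will miss many real phrasings.

```python
import re

# Hypothetical patterns for illustration; the real _extract_* methods may differ.
_NUM = r"(\d[\d,]*(?:\.\d+)?)"
_UNITS = r"(bags?|tons?|tonnes?|kg|crates?|litres?)"
_MONTHS = "January|February|March|April|May|June|July|August|September|October|November|December"

QUANTITY_RE = re.compile(rf"{_NUM}\s*{_UNITS}\b(?:\s+of\s+([a-z]+))?", re.IGNORECASE)
UNIT_PRICE_RE = re.compile(rf"(?:KES|KSh)\.?\s*{_NUM}\s*(?:per|a|each)\s+[a-z]+", re.IGNORECASE)
TOTAL_RE = re.compile(rf"\btotal\s*(?:of|is|:)?\s*(?:KES|KSh)\.?\s*{_NUM}", re.IGNORECASE)
LOCATION_RE = re.compile(r"\b[Dd]eliver(?:ed|y)?\s+(?:it\s+|them\s+)?to\s+([A-Z][a-zA-Z]+)")
DEADLINE_RE = re.compile(rf"\bby\s+((?:{_MONTHS})\s+\d{{1,2}}(?:st|nd|rd|th)?)", re.IGNORECASE)
QUALITY_RE = re.compile(r"\bgrade\s+([A-D])\b", re.IGNORECASE)
PAYMENT_RE = re.compile(r"(\d{1,3})\s*%\s*(upfront|advance|deposit|on delivery)", re.IGNORECASE)


def _to_number(text: str) -> float:
    """Convert '150,000'-style numbers to float."""
    return float(text.replace(",", ""))


def extract_contract_terms(transcript: str) -> dict:
    """Return only the fields whose patterns matched; missing fields are simply absent."""
    terms: dict = {}
    if m := QUANTITY_RE.search(transcript):
        terms["quantity"] = _to_number(m.group(1))
        terms["unit"] = m.group(2).lower()
        if m.group(3):
            terms["product"] = m.group(3).lower()
    if m := UNIT_PRICE_RE.search(transcript):
        terms["unit_price"] = _to_number(m.group(1))
    if m := TOTAL_RE.search(transcript):
        terms["total_amount"] = _to_number(m.group(1))
    if m := LOCATION_RE.search(transcript):
        terms["delivery_location"] = m.group(1)
    if m := DEADLINE_RE.search(transcript):
        terms["delivery_deadline"] = m.group(1)
    if m := QUALITY_RE.search(transcript):
        terms["quality"] = f"Grade {m.group(1).upper()}"
    if m := PAYMENT_RE.search(transcript):
        terms["payment_terms"] = f"{m.group(1)}% {m.group(2).lower()}"
    return terms
```

Leaving unmatched fields out of the result, rather than setting them to defaults, is what lets a downstream completeness score tell "not mentioned" apart from "mentioned".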
5. Confidence Scoring
The system calculates a completeness confidence score (0.0 to 1.0) based on which fields were extracted. Contracts with confidence scores below 0.6 should be flagged for manual review. Pricing fields carry higher weight because they’re essential for escrow.
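One way to compute such a weighted completeness score is to sum the weights of the fields that were found and divide by the total weight. The weights below are hypothetical (the page only says pricing fields weigh more); the 0.6 review threshold is the one stated above.

```python
# Hypothetical weights (sum to 100): pricing fields weigh the most, as the page describes.
FIELD_WEIGHTS = {
    "product": 15,
    "quantity": 15,
    "unit_price": 20,
    "total_amount": 20,
    "delivery_location": 10,
    "delivery_deadline": 10,
    "quality": 5,
    "payment_terms": 5,
}
REVIEW_THRESHOLD = 0.6  # threshold stated on this page


def confidence_score(terms: dict) -> float:
    """Fraction of total weight covered by fields that were extracted (0.0 to 1.0)."""
    total = sum(FIELD_WEIGHTS.values())
    found = sum(w for name, w in FIELD_WEIGHTS.items() if terms.get(name) not in (None, ""))
    return found / total


def needs_manual_review(terms: dict) -> bool:
    """True when the completeness score falls below the review threshold."""
    return confidence_score(terms) < REVIEW_THRESHOLD
```

With these weights, a transcript that yields product, quantity, unit price, and total scores 0.7 and passes, while the two pricing fields alone score 0.4 and are flagged.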
Complete Processing Example
The full processing flow is implemented in voice_processor.py:302-329.
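The five stages compose into a single coroutine. The sketch below is a hypothetical orchestration function, not the code at voice_processor.py:302-329: each stage is passed in as a callable so the control flow stands on its own, and any concrete acquisition, validation, Whisper, extraction, and scoring functions can be plugged in.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class VoiceContractResult:
    transcript: str
    terms: dict
    confidence: float
    needs_review: bool


async def process_voice_contract(
    source: str,
    *,
    acquire: Callable[[str], str],
    validate: Callable[[str], None],
    transcribe: Callable[[str], Awaitable[str]],
    extract: Callable[[str], dict],
    score: Callable[[dict], float],
    review_threshold: float = 0.6,
) -> VoiceContractResult:
    """Run acquisition -> validation -> transcription -> extraction -> scoring."""
    audio_path = acquire(source)               # 1. download from URL or use local file
    validate(audio_path)                       # 2. existence, size, and format checks
    transcript = await transcribe(audio_path)  # 3. Whisper call, kept off the event loop
    terms = extract(transcript)                # 4. pattern-based term extraction
    confidence = score(terms)                  # 5. completeness score in [0.0, 1.0]
    return VoiceContractResult(
        transcript=transcript,
        terms=terms,
        confidence=confidence,
        needs_review=confidence < review_threshold,
    )


if __name__ == "__main__":
    async def fake_transcribe(path: str) -> str:
        return "50 bags of maize at KES 3,000 per bag"

    result = asyncio.run(process_voice_contract(
        "call.wav",
        acquire=lambda s: s,
        validate=lambda p: None,
        transcribe=fake_transcribe,
        extract=lambda text: {"product": "maize"},
        score=lambda terms: 0.15,
    ))
    print(result)
```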
Performance Benchmarks
| Metric | Target | Notes |
|---|---|---|
| Transcription (5 min audio) | 15-30s | Depends on Whisper model size |
| Term Extraction | < 100ms | Pattern matching is fast |
| End-to-end (5 min audio) | 20-35s | Dominated by transcription |
| Confidence (high-quality audio) | > 0.75 | Clear speech, structured conversation |
Integration with Contract Generation
The voice processor is typically used in conjunction with the contract generator.
Error Handling
The processor raises VoiceProcessingError for critical failures:
- Audio download failures
- Invalid audio format
- Transcription errors
- Model initialization failures
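A common way to make sure only VoiceProcessingError escapes the processor is to wrap each stage and chain the original exception as its cause. The processing_stage() context manager below is a hypothetical helper built on contextlib.contextmanager; VoiceProcessingError is redefined locally so the block runs on its own.

```python
from contextlib import contextmanager
from typing import Iterator


class VoiceProcessingError(Exception):
    """Critical voice-processing failure (class name taken from this page)."""


@contextmanager
def processing_stage(name: str) -> Iterator[None]:
    """Re-raise any other exception inside the block as VoiceProcessingError."""
    try:
        yield
    except VoiceProcessingError:
        raise  # already the public error type; pass it through unchanged
    except Exception as exc:
        # Keep the original exception as __cause__ for logs and debugging.
        raise VoiceProcessingError(f"{name} failed: {exc}") from exc
```

A caller would wrap each step, for example with processing_stage("transcription"), and catch only VoiceProcessingError at the API boundary.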
Configuration
Key settings are defined in app.core.config, including settings.max_audio_file_size, the maximum accepted audio file size.
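For local experimentation, these settings can be modeled as a frozen dataclass read from environment variables. Only max_audio_file_size comes from this page; the whisper_model and confidence_review_threshold fields, the environment variable names, and the 25 MB default are placeholders, and the real app.core.config module may be structured differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class VoiceSettings:
    # Name from this page; the 25 MB default is a placeholder.
    max_audio_file_size: int = 25 * 1024 * 1024
    # Hypothetical fields for illustration.
    whisper_model: str = "base"
    confidence_review_threshold: float = 0.6

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "VoiceSettings":
        """Build settings from (hypothetical) environment variables, falling back to defaults."""
        env = os.environ if env is None else env
        return cls(
            max_audio_file_size=int(env.get("MAX_AUDIO_FILE_SIZE", cls.max_audio_file_size)),
            whisper_model=env.get("WHISPER_MODEL", cls.whisper_model),
            confidence_review_threshold=float(
                env.get("CONFIDENCE_REVIEW_THRESHOLD", cls.confidence_review_threshold)
            ),
        )
```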
Best Practices
For high accuracy:
- Structured conversations - Guide parties to state terms clearly
- Audio quality - Use good connections; avoid background noise
- Clear speech - Speak slowly and enunciate numbers
- Kenyan English - Whisper handles accents well, but clarity helps
- Confirmation - Always review extracted terms before finalizing
Related
- Contract Lifecycle - How contracts progress from draft to completion
- Verification - Multi-modal party confirmation
- Voice API Endpoint - REST API for voice processing