
Overview

VoicePact’s voice processing pipeline transforms spoken business agreements into structured, machine-readable contract terms. The system uses Whisper for speech-to-text transcription combined with custom NLP extraction logic to identify key contract elements like parties, products, pricing, and delivery terms.
The voice processor targets agricultural supply agreements, which are common in East African informal markets. Each result includes a confidence score (0.0 to 1.0) that indicates when manual review is needed.

Processing Pipeline

The voice-to-contract flow follows these stages:
┌─────────────────┐    ┌──────────────────┐    ┌────────────────────┐
│  Audio Input    │───▶│  Transcription   │───▶│  Term Extraction   │
│ (URL or File)   │    │  (Whisper AI)    │    │  (NLP Rules)       │
└─────────────────┘    └──────────────────┘    └────────────────────┘
         │                      │                        │
         ▼                      ▼                        ▼
┌─────────────────┐    ┌──────────────────┐    ┌────────────────────┐
│  Audio Download │    │  Language Model  │    │  Confidence Score  │
│  & Validation   │    │  Processing      │    │  Calculation       │
└─────────────────┘    └──────────────────┘    └────────────────────┘

1. Audio Acquisition

The system supports two input methods:
  • URL-based: Downloads audio from Africa’s Talking voice recording URLs
  • File-based: Processes local audio files for testing
Implementation (voice_processor.py:68-78):
async def download_audio(self, audio_url: str) -> str:
    try:
        response = await self.http_client.get(audio_url)
        response.raise_for_status()
        
        with tempfile.NamedTemporaryFile(delete=False, suffix='.wav') as temp_file:
            temp_file.write(response.content)
            return temp_file.name
    except Exception as e:
        logger.error(f"Failed to download audio: {e}")
        raise VoiceProcessingError(f"Audio download failed: {e}")
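Because download_audio creates its temporary file with delete=False, the file stays on disk until something removes it. A minimal sketch of one way a caller might clean up, assuming a hypothetical process_and_cleanup helper and a stand-in transcribe callable (neither is part of VoicePact):

```python
import os
import tempfile
from typing import Callable


def process_and_cleanup(audio_bytes: bytes, transcribe: Callable[[str], str]) -> str:
    """Write audio to a temp file, run `transcribe` on it, then delete the file.

    Hypothetical helper for illustration; `transcribe` stands in for whatever
    consumes the downloaded file path.
    """
    with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as temp_file:
        temp_file.write(audio_bytes)
        temp_path = temp_file.name
    try:
        return transcribe(temp_path)
    finally:
        # delete=False means nothing else removes the file for us.
        if os.path.exists(temp_path):
            os.remove(temp_path)
```

The try/finally ensures the file is removed even when transcription raises.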

2. Audio Validation

Before processing, the system validates:
  • File existence and accessibility
  • File size limits (configurable via settings.max_audio_file_size)
  • Format support (WAV, MP3, M4A, and FLAC by default; configurable via settings.supported_audio_formats)
See voice_processor.py:57-66 for validation logic.
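The three checks above can be sketched as follows. This is an illustration, not the code at voice_processor.py:57-66: the function name validate_audio_file and the use of ValueError are assumptions, and the defaults are copied from the Configuration section.

```python
from pathlib import Path
from typing import Sequence

# Defaults mirror the Configuration section; the function itself is hypothetical.
MAX_AUDIO_FILE_SIZE = 50_000_000
SUPPORTED_AUDIO_FORMATS = ("wav", "mp3", "m4a", "flac")


def validate_audio_file(
    path: str,
    max_size: int = MAX_AUDIO_FILE_SIZE,
    formats: Sequence[str] = SUPPORTED_AUDIO_FORMATS,
) -> None:
    """Raise ValueError if the file is missing, too large, or an unsupported format."""
    file_path = Path(path)
    if not file_path.is_file():
        raise ValueError(f"Audio file not found: {path}")
    size = file_path.stat().st_size
    if size > max_size:
        raise ValueError(f"Audio file too large: {size} bytes (limit {max_size})")
    # Compare extensions case-insensitively and without the leading dot.
    extension = file_path.suffix.lower().lstrip(".")
    if extension not in formats:
        raise ValueError(f"Unsupported audio format: {extension or 'none'}")
```

Note that this checks only the file extension; it does not inspect the audio data itself.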

3. Speech Transcription

VoicePact uses OpenAI Whisper running locally for privacy and cost efficiency:
def _initialize_model(self):
    try:
        model_size = settings.whisper_model_size  # e.g., "base", "medium"
        logger.info(f"Loading Whisper model: {model_size}")
        self.model = whisper.load_model(model_size)
        logger.info(f"Whisper model {model_size} loaded successfully")
    except Exception as e:
        logger.error(f"Failed to load Whisper model: {e}")
        raise VoiceProcessingError(f"Model initialization failed: {e}")
Key features:
  • Async processing via executor to avoid blocking
  • English language default (multi-language support planned)
  • Returns full transcript as plain text
The Whisper model is loaded once at startup and reused across requests, so individual requests do not pay the model-loading cost. The base model offers a good balance of speed and accuracy for production use.
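The "async processing via executor" point can be illustrated with a small sketch. Whisper's transcription call is blocking, so it is handed to a thread pool to keep the event loop free. The names transcribe_in_executor and fake_transcribe are hypothetical; fake_transcribe stands in for the real model call so the example runs without Whisper installed.

```python
import asyncio
import time
from typing import Callable


async def transcribe_in_executor(
    blocking_transcribe: Callable[[str], str], audio_path: str
) -> str:
    """Run a blocking transcription call in the default thread pool."""
    loop = asyncio.get_running_loop()
    # None selects the loop's default ThreadPoolExecutor.
    return await loop.run_in_executor(None, blocking_transcribe, audio_path)


def fake_transcribe(audio_path: str) -> str:
    # Stand-in for a blocking call such as model.transcribe(audio_path)["text"].
    time.sleep(0.05)
    return f"transcript of {audio_path}"


if __name__ == "__main__":
    print(asyncio.run(transcribe_in_executor(fake_transcribe, "call.wav")))
```

While the worker thread is busy, the event loop remains free to serve other requests.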

4. Contract Term Extraction

The extract_contract_terms() method (voice_processor.py:108-124) uses pattern matching to extract:
| Field | Example Pattern | Code Reference |
| --- | --- | --- |
| Product | "50 bags of maize" | _extract_product() (126-141) |
| Quantity | "100 bags", "2 tons" | _extract_quantity() (143-155) |
| Unit Price | "KES 3000 per bag" | _extract_unit_price() (166-182) |
| Total Amount | "Total KES 150,000" | _extract_total_amount() (184-200) |
| Delivery Location | "deliver to Nairobi" | _extract_location() (212-226) |
| Delivery Deadline | "by March 15th" | _extract_deadline() (228-245) |
| Quality | "Grade A", "dry maize" | _extract_quality() (247-261) |
| Payment Terms | "30% upfront" | _extract_payment_terms() (288-300) |
Example extraction (voice_processor.py:166-182):
def _extract_unit_price(self, text: str) -> Optional[Decimal]:
    patterns = [
        r"(?:kes|ksh)\s*(\d+(?:,\d{3})*(?:\.\d{2})?)\s*(?:per\s+bag|each)",
        r"(\d+(?:,\d{3})*)\s*(?:per\s+bag|each)",
        r"price.*?(\d+(?:,\d{3})*(?:\.\d{2})?)",
    ]
    
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            try:
                price_str = match.group(1).replace(',', '')
                return Decimal(price_str)
            except Exception:  # a bare except would also swallow KeyboardInterrupt and SystemExit
                continue
    
    return None
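To see how the three patterns behave on sample phrases, here is a standalone sketch that reuses them outside the class. It assumes the text is lowercased before matching, since the patterns are written in lowercase; the function name extract_unit_price and the sample sentences are illustrative.

```python
import re
from decimal import Decimal
from typing import Optional

# Patterns copied from _extract_unit_price, tried in order of specificity.
UNIT_PRICE_PATTERNS = [
    r"(?:kes|ksh)\s*(\d+(?:,\d{3})*(?:\.\d{2})?)\s*(?:per\s+bag|each)",
    r"(\d+(?:,\d{3})*)\s*(?:per\s+bag|each)",
    r"price.*?(\d+(?:,\d{3})*(?:\.\d{2})?)",
]


def extract_unit_price(text: str) -> Optional[Decimal]:
    """Return the first unit price matched by the patterns, or None."""
    lowered = text.lower()  # assumption: normalize case because the patterns are lowercase
    for pattern in UNIT_PRICE_PATTERNS:
        match = re.search(pattern, lowered)
        if match:
            # Strip thousands separators before converting.
            return Decimal(match.group(1).replace(",", ""))
    return None


print(extract_unit_price("KES 3,000 per bag"))     # 3000
print(extract_unit_price("the price is 2500.50"))  # 2500.50
print(extract_unit_price("deliver to Nairobi"))    # None
```

The third pattern is a loose fallback: it takes the first number after the word "price", so a phrase like "price agreed for 50 bags" yields 50, which is a quantity and not a price. This is one reason extracted terms should be reviewed.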

5. Confidence Scoring

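The system calculates a completeness confidence score (0.0 to 1.0) based on which fields were extracted. The score reflects how complete the extraction is, not how accurate the transcription was: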
def _calculate_confidence(self, terms: ContractTerms) -> float:
    score = 0.0
    max_score = 8.0
    
    if terms.product:
        score += 1.0
    if terms.quantity:
        score += 1.0
    if terms.unit_price or terms.total_amount:
        score += 1.5  # Pricing is critical
    if terms.currency:
        score += 0.5
    if terms.delivery_location:
        score += 1.0
    if terms.delivery_deadline:
        score += 1.0
    if terms.quality_requirements:
        score += 1.0
    if terms.payment_terms or terms.upfront_payment:
        score += 1.0
    
    return min(score / max_score, 1.0)
Contracts with confidence scores below 0.6 should be flagged for manual review. Pricing fields carry higher weight as they’re essential for escrow.
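As a worked example of the weighting, the sketch below applies the same weights to a partially extracted agreement. The Terms dataclass is a simplified, hypothetical stand-in for ContractTerms, and REVIEW_THRESHOLD is just the 0.6 guideline expressed as a constant.

```python
from dataclasses import dataclass
from typing import Optional

REVIEW_THRESHOLD = 0.6  # guideline value from the text above


@dataclass
class Terms:
    """Simplified stand-in for ContractTerms (hypothetical, for illustration only)."""
    product: Optional[str] = None
    quantity: Optional[str] = None
    unit_price: Optional[str] = None
    total_amount: Optional[str] = None
    currency: Optional[str] = None
    delivery_location: Optional[str] = None
    delivery_deadline: Optional[str] = None
    quality_requirements: Optional[str] = None
    payment_terms: Optional[str] = None
    upfront_payment: Optional[str] = None


def calculate_confidence(terms: Terms) -> float:
    """Apply the same weights as _calculate_confidence (8.0 points in total)."""
    score = 0.0
    score += 1.0 if terms.product else 0.0
    score += 1.0 if terms.quantity else 0.0
    score += 1.5 if (terms.unit_price or terms.total_amount) else 0.0
    score += 0.5 if terms.currency else 0.0
    score += 1.0 if terms.delivery_location else 0.0
    score += 1.0 if terms.delivery_deadline else 0.0
    score += 1.0 if terms.quality_requirements else 0.0
    score += 1.0 if (terms.payment_terms or terms.upfront_payment) else 0.0
    return min(score / 8.0, 1.0)


# Product, quantity, price, and currency only: (1.0 + 1.0 + 1.5 + 0.5) / 8.0 = 0.5
partial = Terms(product="maize", quantity="50 bags", unit_price="3000", currency="KES")
score = calculate_confidence(partial)
needs_review = score < REVIEW_THRESHOLD
print(score, needs_review)  # 0.5 True
```

An agreement that states what is being sold and for how much, but nothing about delivery, quality, or payment, therefore falls below the review threshold.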

Complete Processing Example

Here’s the full processing flow (voice_processor.py:302-329):
async def process_voice_to_contract(
    self, 
    audio_source: str,
    is_url: bool = True
) -> Dict[str, Any]:
    try:
        # Step 1: Get transcript
        if is_url:
            transcript = await self.transcribe_from_url(audio_source)
        else:
            transcript = await self.transcribe_audio(audio_source)
        
        # Step 2: Extract terms
        terms = self.extract_contract_terms(transcript)
        
        # Step 3: Return structured result
        return {
            "transcript": transcript,
            "terms": terms.dict(),
            "processing_status": "completed",
            "word_count": len(transcript.split()),
            "confidence_score": self._calculate_confidence(terms)
        }
    except Exception as e:
        logger.error(f"Voice to contract processing failed: {e}")
        return {
            "transcript": "",
            "terms": {},
            "processing_status": "failed",
            "error": str(e)
        }
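Since the method returns a dictionary in both the success and failure cases, callers need to inspect it before using the terms. A minimal sketch of one possible routing step, assuming the 0.6 review guideline from the previous section (the function name route_result and the three labels are hypothetical):

```python
from typing import Any, Dict

REVIEW_THRESHOLD = 0.6  # guideline value from the Confidence Scoring section


def route_result(result: Dict[str, Any]) -> str:
    """Classify a process_voice_to_contract() result as 'failed', 'review', or 'accepted'."""
    if result.get("processing_status") != "completed":
        return "failed"
    # Failed results carry no confidence_score key, hence the default.
    if result.get("confidence_score", 0.0) < REVIEW_THRESHOLD:
        return "review"
    return "accepted"


print(route_result({"processing_status": "failed", "error": "Audio download failed"}))  # failed
print(route_result({"processing_status": "completed", "confidence_score": 0.5}))        # review
print(route_result({"processing_status": "completed", "confidence_score": 0.8125}))     # accepted
```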

Performance Benchmarks

| Metric | Target | Notes |
| --- | --- | --- |
| Transcription (5 min audio) | 15-30s | Depends on Whisper model size |
| Term Extraction | < 100ms | Pattern matching is fast |
| End-to-end (5 min audio) | 20-35s | Dominated by transcription |
| Confidence (high-quality audio) | > 0.75 | Clear speech, structured conversation |
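To compare a deployment against these targets, each stage can be timed separately. The helper below is a generic sketch using time.perf_counter; the name timed is hypothetical and nothing here is part of VoicePact.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any) -> Tuple[Any, float]:
    """Call fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Example with a stand-in workload; in practice fn might be the term extractor.
word_count, seconds = timed(lambda text: len(text.split()), "fifty bags of maize")
print(word_count, f"{seconds * 1000:.3f} ms")
```

For instance, timed(processor.extract_contract_terms, transcript) would measure the term extraction stage against the 100 ms target.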

Integration with Contract Generation

The voice processor is typically used in conjunction with the contract generator:
from app.services.voice_processor import get_voice_processor
from app.services.contract_generator import get_contract_generator

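# Process voice (the awaits below must run inside an async function)
processor = await get_voice_processor()
result = await processor.process_voice_to_contract(
    audio_source="https://voice.africastalking.com/recording/abc123"
)

# A failed run returns empty terms, so check the status before generating
if result["processing_status"] != "completed":
    raise RuntimeError(f"Voice processing failed: {result.get('error')}")

# Generate contract (ContractTerms and parties are assumed to be in scope)
generator = get_contract_generator()
contract = generator.create_contract(
    transcript=result["transcript"],
    terms=ContractTerms(**result["terms"]),
    parties=parties,
    contract_type="agricultural_supply"
)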

Error Handling

The processor raises VoiceProcessingError for critical failures:
  • Audio download failures
  • Invalid audio format
  • Transcription errors
  • Model initialization failures
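All errors are logged with context for debugging. Note that process_voice_to_contract() catches these exceptions and returns a result with processing_status set to "failed" and an error message instead of re-raising, so callers of that method should check the status field.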
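The wrap-and-re-raise pattern used for these failures can be sketched in isolation. VoiceProcessingError is redefined here only so the example runs on its own, and load_model with its loader argument is a hypothetical stand-in for the model initialization step.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("voice_processor_example")


class VoiceProcessingError(Exception):
    """Stand-in for the processor's exception type, defined so the sketch runs alone."""


def load_model(loader: Callable[[str], Any], model_size: str) -> Any:
    """Wrap a model loader so any failure surfaces as VoiceProcessingError."""
    try:
        return loader(model_size)
    except Exception as e:
        logger.error("Failed to load model %s: %s", model_size, e)
        # `from e` keeps the original exception attached as __cause__.
        raise VoiceProcessingError(f"Model initialization failed: {e}") from e


def broken_loader(model_size: str) -> Any:
    raise ValueError(f"unknown model size: {model_size}")


try:
    load_model(broken_loader, "gigantic")
except VoiceProcessingError as err:
    print(err)                  # Model initialization failed: unknown model size: gigantic
    print(repr(err.__cause__))  # ValueError('unknown model size: gigantic')
```

Chaining with `from e` is a design choice that preserves the root cause for debugging while giving callers a single exception type to catch.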

Configuration

Key settings (defined in app.core.config):
whisper_model_size: str = "base"  # or "tiny", "small", "medium", "large"
max_audio_file_size: int = 50_000_000  # 50 MB
supported_audio_formats: List[str] = ["wav", "mp3", "m4a", "flac"]
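How these settings are loaded is not shown here. As a rough sketch only, the same three values could be modelled with a standard-library dataclass that rejects obviously invalid input early. The class name VoiceSettings, the function settings_from_env, and the environment variable names are all assumptions, not the contents of app.core.config.

```python
import os
from dataclasses import dataclass, field
from typing import List, Mapping

KNOWN_MODEL_SIZES = ("tiny", "base", "small", "medium", "large")  # sizes listed above


@dataclass
class VoiceSettings:
    """Hypothetical container mirroring the three settings shown above."""
    whisper_model_size: str = "base"
    max_audio_file_size: int = 50_000_000  # 50 MB
    supported_audio_formats: List[str] = field(
        default_factory=lambda: ["wav", "mp3", "m4a", "flac"]
    )

    def __post_init__(self) -> None:
        if self.whisper_model_size not in KNOWN_MODEL_SIZES:
            raise ValueError(f"Unknown Whisper model size: {self.whisper_model_size}")
        if self.max_audio_file_size <= 0:
            raise ValueError("max_audio_file_size must be positive")


def settings_from_env(env: Mapping[str, str] = os.environ) -> VoiceSettings:
    """Build settings from environment variables (variable names are assumptions)."""
    return VoiceSettings(
        whisper_model_size=env.get("WHISPER_MODEL_SIZE", "base"),
        max_audio_file_size=int(env.get("MAX_AUDIO_FILE_SIZE", "50000000")),
    )


print(settings_from_env({"WHISPER_MODEL_SIZE": "small"}).whisper_model_size)  # small
```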

Best Practices

For high accuracy:
  1. Structured conversations - Guide parties to state terms clearly
  2. Audio quality - Use good connections; avoid background noise
  3. Clear speech - Speak slowly and enunciate numbers
  4. Kenyan English - Whisper handles accents well, but clarity helps
  5. Confirmation - Always review extracted terms before finalizing
