Voice Processing

Overview

VoicePact’s voice processing pipeline transforms spoken business agreements into structured, machine-readable contract terms. The system uses Whisper for speech-to-text transcription combined with custom NLP extraction logic to identify key contract elements like parties, products, pricing, and delivery terms.

The extraction patterns target agricultural supply agreements (for example, bags of maize priced in KES), which are common in East African informal markets. Each result carries a confidence score that signals when manual review is needed.

Processing Pipeline

The voice-to-contract flow follows these stages:

┌─────────────────┐    ┌──────────────────┐    ┌────────────────────┐
│  Audio Input    │───▶│  Transcription   │───▶│  Term Extraction   │
│ (URL or File)   │    │  (Whisper AI)    │    │  (NLP Rules)       │
└─────────────────┘    └──────────────────┘    └────────────────────┘
         │                      │                        │
         ▼                      ▼                        ▼
┌─────────────────┐    ┌──────────────────┐    ┌────────────────────┐
│  Audio Download │    │  Language Model  │    │  Confidence Score  │
│  & Validation   │    │  Processing      │    │  Calculation       │
└─────────────────┘    └──────────────────┘    └────────────────────┘

1. Audio Acquisition

The system supports two input methods:

URL-based: Downloads audio from Africa’s Talking voice recording URLs
File-based: Processes local audio files for testing

Implementation (voice_processor.py:68-78):

async def download_audio(self, audio_url: str) -> str:
    try:
        response = await self.http_client.get(audio_url)
        response.raise_for_status()
        
        with tempfile.NamedTemporaryFile(delete=False, suffix='.wav') as temp_file:
            temp_file.write(response.content)
            return temp_file.name
    except Exception as e:
        logger.error(f"Failed to download audio: {e}")
        raise VoiceProcessingError(f"Audio download failed: {e}")

2. Audio Validation

Before processing, the system validates:

File existence and accessibility
File size limits (configurable via settings.max_audio_file_size)
Format support (WAV, MP3, M4A, and FLAC by default; configurable via settings.supported_audio_formats)

See voice_processor.py:57-66 for validation logic.
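
The three checks listed above can be sketched as a standalone function. This is a minimal illustration, not the code in voice_processor.py: the name validate_audio_file, the AudioValidationError class, and the extension-only format check are assumptions, and the default limits are copied from the Configuration section.

```python
import os
from pathlib import Path
from typing import Sequence


class AudioValidationError(Exception):
    """Hypothetical error type; the real code raises VoiceProcessingError."""


def validate_audio_file(
    path: str,
    max_size: int = 50_000_000,
    supported_formats: Sequence[str] = ("wav", "mp3", "m4a", "flac"),
) -> None:
    """Raise AudioValidationError if the file fails any of the three checks."""
    file_path = Path(path)

    # 1. Existence and accessibility
    if not file_path.is_file():
        raise AudioValidationError(f"Audio file not found: {path}")
    if not os.access(file_path, os.R_OK):
        raise AudioValidationError(f"Audio file is not readable: {path}")

    # 2. Size limit
    size = file_path.stat().st_size
    if size > max_size:
        raise AudioValidationError(
            f"Audio file too large: {size} bytes (limit {max_size})"
        )

    # 3. Format, judged by file extension only (the content is not inspected)
    extension = file_path.suffix.lower().lstrip(".")
    if extension not in supported_formats:
        raise AudioValidationError(f"Unsupported audio format: {extension or 'none'}")
```

Checking the extension is cheap but does not prove the file is decodable; a corrupt file would still pass this sketch and fail later during transcription.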

3. Speech Transcription

VoicePact uses OpenAI Whisper running locally for privacy and cost efficiency:

def _initialize_model(self):
    try:
        model_size = settings.whisper_model_size  # e.g., "base", "medium"
        logger.info(f"Loading Whisper model: {model_size}")
        self.model = whisper.load_model(model_size)
        logger.info(f"Whisper model {model_size} loaded successfully")
    except Exception as e:
        logger.error(f"Failed to load Whisper model: {e}")
        raise VoiceProcessingError(f"Model initialization failed: {e}")

Key features:

Async processing via executor to avoid blocking
English language default (multi-language support planned)
Returns full transcript as plain text

Whisper models are loaded once at startup and reused across requests for optimal performance. The base model offers a good balance of speed and accuracy for production use.
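
The "async processing via executor" point can be illustrated with a short sketch. It assumes the blocking work is handed to the event loop's default thread pool; blocking_transcribe is a hypothetical stand-in for the Whisper call so that the example runs without the model installed.

```python
import asyncio
import time
from functools import partial
from typing import List


def blocking_transcribe(audio_path: str, language: str = "en") -> str:
    """Stand-in for a blocking transcription call (hypothetical)."""
    time.sleep(0.05)  # simulate slow, blocking work
    return f"transcript of {audio_path} ({language})"


async def transcribe_audio(audio_path: str) -> str:
    loop = asyncio.get_running_loop()
    # run_in_executor only forwards positional arguments,
    # so keyword arguments are bound with functools.partial
    call = partial(blocking_transcribe, audio_path, language="en")
    # None selects the loop's default ThreadPoolExecutor
    return await loop.run_in_executor(None, call)


async def main() -> List[str]:
    # Both calls are submitted at once; the event loop stays free meanwhile
    return await asyncio.gather(
        transcribe_audio("a.wav"),
        transcribe_audio("b.wav"),
    )


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Without the executor, a multi-second transcription would block every other request handled by the same event loop.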

4. Contract Term Extraction

The extract_contract_terms() method (voice_processor.py:108-124) uses pattern matching to extract:

Field	Example Pattern	Code Reference
Product	"50 bags of maize"	`_extract_product()` (126-141)
Quantity	"100 bags", "2 tons"	`_extract_quantity()` (143-155)
Unit Price	"KES 3000 per bag"	`_extract_unit_price()` (166-182)
Total Amount	"Total KES 150,000"	`_extract_total_amount()` (184-200)
Delivery Location	"deliver to Nairobi"	`_extract_location()` (212-226)
Delivery Deadline	"by March 15th"	`_extract_deadline()` (228-245)
Quality	"Grade A", "dry maize"	`_extract_quality()` (247-261)
Payment Terms	"30% upfront"	`_extract_payment_terms()` (288-300)

Example extraction (voice_processor.py:166-182):

def _extract_unit_price(self, text: str) -> Optional[Decimal]:
    patterns = [
        r"(?:kes|ksh)\s*(\d+(?:,\d{3})*(?:\.\d{2})?)\s*(?:per\s+bag|each)",
        r"(\d+(?:,\d{3})*)\s*(?:per\s+bag|each)",
        r"price.*?(\d+(?:,\d{3})*(?:\.\d{2})?)",
    ]
    
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            try:
                price_str = match.group(1).replace(',', '')
                return Decimal(price_str)
            except InvalidOperation:  # requires: from decimal import InvalidOperation
                continue
    
    return None
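
To see how the three patterns behave, the sketch below runs them as a standalone function against a few sample phrases. The patterns are copied from the excerpt above; the function name, the sample phrases, and the use of re.IGNORECASE are assumptions (the original may instead lowercase the transcript before matching).

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional

UNIT_PRICE_PATTERNS = [
    r"(?:kes|ksh)\s*(\d+(?:,\d{3})*(?:\.\d{2})?)\s*(?:per\s+bag|each)",
    r"(\d+(?:,\d{3})*)\s*(?:per\s+bag|each)",
    r"price.*?(\d+(?:,\d{3})*(?:\.\d{2})?)",
]


def extract_unit_price(text: str) -> Optional[Decimal]:
    """Try each pattern in order and return the first price that parses."""
    for pattern in UNIT_PRICE_PATTERNS:
        # IGNORECASE is an assumption made for this sketch
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            try:
                return Decimal(match.group(1).replace(",", ""))
            except InvalidOperation:
                continue
    return None


if __name__ == "__main__":
    samples = [
        "KES 3000 per bag",          # pattern 1: currency + amount + unit
        "fifty bags at 3,500 each",  # pattern 2: amount + unit, no currency
        "the price is 2,750.50",     # pattern 3: the word "price", then a number
        "Total KES 150,000",         # no unit and no "price": not a unit price
        "the price covers 50 bags",  # pattern 3 is loose and picks up 50
    ]
    for sample in samples:
        print(f"{sample!r} -> {extract_unit_price(sample)}")
```

The last sample shows why ordering matters: the third pattern accepts the first number after the word "price", so it is tried only after the stricter patterns fail, and its matches are worth confirming with the parties.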

5. Confidence Scoring

The system calculates a completeness confidence score (0.0 to 1.0) based on extracted fields:

def _calculate_confidence(self, terms: ContractTerms) -> float:
    score = 0.0
    max_score = 8.0
    
    if terms.product:
        score += 1.0
    if terms.quantity:
        score += 1.0
    if terms.unit_price or terms.total_amount:
        score += 1.5  # Pricing is critical
    if terms.currency:
        score += 0.5
    if terms.delivery_location:
        score += 1.0
    if terms.delivery_deadline:
        score += 1.0
    if terms.quality_requirements:
        score += 1.0
    if terms.payment_terms or terms.upfront_payment:
        score += 1.0
    
    return min(score / max_score, 1.0)

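Contracts with confidence scores below 0.6 should be flagged for manual review. Pricing carries the highest weight (1.5 of the 8.0 available points) because it is essential for escrow. The score measures how many fields were extracted, not whether the extracted values are correct.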
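
A worked example makes the weighting concrete. The sketch below restates the same weights in table form and scores two sets of terms; the ContractTerms dataclass is a simplified stand-in for the real model, and the sample values are invented for illustration.

```python
from dataclasses import dataclass, replace
from decimal import Decimal
from typing import Optional


@dataclass
class ContractTerms:
    """Simplified stand-in for the real ContractTerms model."""
    product: Optional[str] = None
    quantity: Optional[str] = None
    unit_price: Optional[Decimal] = None
    total_amount: Optional[Decimal] = None
    currency: Optional[str] = None
    delivery_location: Optional[str] = None
    delivery_deadline: Optional[str] = None
    quality_requirements: Optional[str] = None
    payment_terms: Optional[str] = None
    upfront_payment: Optional[Decimal] = None


def calculate_confidence(terms: ContractTerms) -> float:
    """Same weights as the excerpt above, written as (present, weight) pairs."""
    checks = [
        (bool(terms.product), 1.0),
        (bool(terms.quantity), 1.0),
        (bool(terms.unit_price or terms.total_amount), 1.5),
        (bool(terms.currency), 0.5),
        (bool(terms.delivery_location), 1.0),
        (bool(terms.delivery_deadline), 1.0),
        (bool(terms.quality_requirements), 1.0),
        (bool(terms.payment_terms or terms.upfront_payment), 1.0),
    ]
    score = sum(weight for present, weight in checks if present)
    return min(score / 8.0, 1.0)


# Product, quantity, price and currency only: (1 + 1 + 1.5 + 0.5) / 8 = 0.5
partial = ContractTerms(
    product="maize", quantity="50 bags", unit_price=Decimal("3000"), currency="KES"
)
# Adding a delivery location raises the score to 5 / 8 = 0.625
with_delivery = replace(partial, delivery_location="Nairobi")

if __name__ == "__main__":
    for label, terms in [("partial", partial), ("with delivery", with_delivery)]:
        score = calculate_confidence(terms)
        print(f"{label}: {score} (manual review: {score < 0.6})")
```

Because the checks rely on truthiness, a field that was extracted as zero or as an empty string counts as missing.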

Complete Processing Example

Here’s the full processing flow (voice_processor.py:302-329):

async def process_voice_to_contract(
    self, 
    audio_source: str,
    is_url: bool = True
) -> Dict[str, Any]:
    try:
        # Step 1: Get transcript
        if is_url:
            transcript = await self.transcribe_from_url(audio_source)
        else:
            transcript = await self.transcribe_audio(audio_source)
        
        # Step 2: Extract terms
        terms = self.extract_contract_terms(transcript)
        
        # Step 3: Return structured result
        return {
            "transcript": transcript,
            "terms": terms.dict(),
            "processing_status": "completed",
            "word_count": len(transcript.split()),
            "confidence_score": self._calculate_confidence(terms)
        }
    except Exception as e:
        logger.error(f"Voice to contract processing failed: {e}")
        return {
            "transcript": "",
            "terms": {},
            "processing_status": "failed",
            "error": str(e)
        }
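
Because the method returns a status instead of raising, callers need to inspect the result before using it. The following sketch shows one way to do that; route_result and its three return labels are hypothetical, and the 0.6 threshold comes from the Confidence Scoring section.

```python
from typing import Any, Dict

REVIEW_THRESHOLD = 0.6  # threshold suggested in the Confidence Scoring section


def route_result(result: Dict[str, Any], threshold: float = REVIEW_THRESHOLD) -> str:
    """Return 'failed', 'needs_review', or 'ready' for a processing result."""
    if result.get("processing_status") != "completed":
        return "failed"
    # A failed result carries no confidence_score key, hence the default
    if result.get("confidence_score", 0.0) < threshold:
        return "needs_review"
    return "ready"


if __name__ == "__main__":
    completed = {
        "transcript": "fifty bags of maize at kes 3000 per bag",
        "terms": {"product": "maize"},
        "processing_status": "completed",
        "word_count": 9,
        "confidence_score": 0.5,
    }
    failed = {
        "transcript": "",
        "terms": {},
        "processing_status": "failed",
        "error": "Audio download failed",
    }
    print(route_result(completed))
    print(route_result(failed))
```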

Performance Benchmarks

Metric	Target	Notes
Transcription (5 min audio)	15-30s	Depends on Whisper model size
Term Extraction	< 100ms	Pattern matching is fast
End-to-end (5 min audio)	20-35s	Dominated by transcription
Confidence (high-quality audio)	> 0.75	Clear speech, structured conversation
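
To compare a deployment against these targets, each synchronous stage can be wrapped in a small timer. This is only a sketch: timed is a hypothetical helper, and fake_extract stands in for extract_contract_terms() so the example runs on its own.

```python
import time
from typing import Any, Callable, Dict, Tuple

TERM_EXTRACTION_TARGET_S = 0.1  # the "< 100ms" target from the table above


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def fake_extract(transcript: str) -> Dict[str, int]:
    """Stand-in for extract_contract_terms(); it only counts words."""
    return {"word_count": len(transcript.split())}


if __name__ == "__main__":
    result, seconds = timed(fake_extract, "fifty bags of maize at kes 3000 per bag")
    within_target = seconds < TERM_EXTRACTION_TARGET_S
    print(f"extraction took {seconds * 1000:.3f} ms, within target: {within_target}")
```

time.perf_counter() is used because it is a monotonic, high-resolution clock intended for measuring short intervals.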

Integration with Contract Generation

The voice processor is typically used together with the contract generator:

from app.services.voice_processor import get_voice_processor
from app.services.contract_generator import get_contract_generator

# ContractTerms and parties are assumed to be imported and defined elsewhere

# Process voice
processor = await get_voice_processor()
result = await processor.process_voice_to_contract(
    audio_source="https://voice.africastalking.com/recording/abc123"
)

# Generate a contract only if processing succeeded; a failed result has empty terms
if result["processing_status"] == "completed":
    generator = get_contract_generator()
    contract = generator.create_contract(
        transcript=result["transcript"],
        terms=ContractTerms(**result["terms"]),
        parties=parties,
        contract_type="agricultural_supply"
    )

Error Handling

The processor raises VoiceProcessingError for critical failures:

Audio download failures
Invalid audio format
Transcription errors
Model initialization failures

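All errors are logged with context for debugging. Note that process_voice_to_contract() catches these exceptions itself and returns a result with processing_status set to "failed" and an error message, so callers of that method should check the status instead of relying on an exception.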
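
The log, wrap, and report pattern used throughout the processor can be sketched as follows. The VoiceProcessingError class here is a stand-in definition, and read_audio_bytes and safe_read are hypothetical helpers written only for this illustration.

```python
import logging
from typing import Any, Dict

logger = logging.getLogger("voice_processor_example")


class VoiceProcessingError(Exception):
    """Stand-in definition; the real class lives in the VoicePact codebase."""


def read_audio_bytes(path: str) -> bytes:
    """Hypothetical step that wraps a low-level failure in VoiceProcessingError."""
    try:
        with open(path, "rb") as handle:
            return handle.read()
    except OSError as e:
        logger.error("Failed to read audio %s: %s", path, e)
        # "from e" keeps the original exception attached as __cause__
        raise VoiceProcessingError(f"Audio read failed: {e}") from e


def safe_read(path: str) -> Dict[str, Any]:
    """Convert the exception into a status dict, as the processor's entry point does."""
    try:
        data = read_audio_bytes(path)
        return {"status": "completed", "size": len(data)}
    except VoiceProcessingError as e:
        return {"status": "failed", "error": str(e)}


if __name__ == "__main__":
    print(safe_read("/nonexistent/dir/audio.wav")["status"])
```

Chaining with "from e" is a design choice in this sketch: it preserves the underlying traceback for debugging while callers only need to handle one exception type.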

Configuration

Key settings (defined in app.core.config):

whisper_model_size: str = "base"  # or "tiny", "small", "medium", "large"
max_audio_file_size: int = 50_000_000  # 50 MB
supported_audio_formats: List[str] = ["wav", "mp3", "m4a", "flac"]

Best Practices

For high accuracy:

Structured conversations - Guide parties to state terms clearly
Audio quality - Use good connections; avoid background noise
Clear speech - Speak slowly and enunciate numbers
Kenyan English - Whisper handles accents well, but clarity helps
Confirmation - Always review extracted terms before finalizing

Related

Contract Lifecycle - How contracts progress from draft to completion
Verification - Multi-modal party confirmation
Voice API Endpoint - REST API for voice processing
