Why interruptions matter
Humans naturally interrupt each other in conversation:

- Barge-in - “Actually, I need to—”
- Backchannel - “Uh-huh”, “mm-hmm”, “I see”
- Clarification - “Wait, what was that last part?”
- Correction - “No, that’s not my address”
A voice agent needs to classify each incoming utterance and react appropriately:

- Noise (ignore)
- Backchannels (acknowledge but keep talking)
- Real interruptions (stop and listen)
Configuration overview
Interruption settings live in the agent configuration.

Turn-by-turn mode

The simplest approach: strict turn-taking with no interruptions allowed.

Enable strict turn-taking
- `true` - Agent speaks, waits for silence, then listens
- `false` - Users can interrupt mid-speech (barge-in enabled)
When agent is interrupted, include what it was saying in context
- `true` - AI knows what was cut off
- `false` - AI only sees what was actually spoken
- `null` - Use system default
Turn-by-turn mode works best for:

- Formal interactions (legal disclosures, compliance scripts)
- Noisy environments where false interruptions are common
- Simple IVR-style menus
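As a rough sketch, a turn-by-turn setup might look like this in a YAML-style agent configuration (the field names are illustrative, not an official schema; only the true/false/null semantics come from this guide):

```yaml
Interruptions:
  StrictTurnTaking: true          # agent speaks, waits for silence, then listens
  IncludeInterruptedSpeech: null  # moot under strict turn-taking; null = system default
```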
Turn end detection
Determines when the user has finished speaking so the agent can respond.

VAD (Voice Activity Detection)
Type: VAD
Uses signal processing to detect speech vs. silence.
- Minimum milliseconds of speech to register as “user started talking”
- Milliseconds of silence to register as “user finished talking”
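A hypothetical configuration sketch for VAD turn end detection (Type: VAD appears in this guide; the threshold field names are illustrative):

```yaml
TurnEndDetection:
  Type: VAD
  MinSpeechMs: 200   # minimum speech to register "user started talking"
  SilenceMs: 300     # silence required to register "user finished talking"
```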
Pros:

- Fastest response time (no API calls)
- Deterministic and predictable
- Works offline
Cons:

- May cut off slow speakers
- Can’t distinguish between pause and completion
- Sensitive to noise
STT (Speech-to-Text)
Type: STT
Uses your STT provider’s endpointing logic.
Configuration example:
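A hypothetical sketch (Type: STT comes from this guide; endpointing itself is delegated to the STT provider, so the tuning field shown is illustrative):

```yaml
TurnEndDetection:
  Type: STT
  EndpointingMs: 500  # illustrative knob passed through to the provider's endpointing
```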
Pros:

- More accurate than VAD
- Provider-optimized algorithms
- Language-aware
Cons:

- Slightly slower than VAD
- Depends on provider quality
- Requires network round-trip
ML (Machine Learning)
Type: ML
Uses a specialized ML model trained to predict turn completion.
- Minimum speech duration before the ML model activates
- Minimum silence before the ML model evaluates
- Maximum wait time before forcing turn end
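A hypothetical sketch mapping the three thresholds above to illustrative field names (only Type: ML comes from this guide):

```yaml
TurnEndDetection:
  Type: ML
  MinSpeechMs: 200    # minimum speech duration before the model activates
  MinSilenceMs: 300   # minimum silence before the model evaluates
  MaxWaitMs: 2000     # force turn end after this long regardless of the model
```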
Pros:

- Best at distinguishing pauses from completion
- Adapts to speaking patterns
- Reduces false triggers
Cons:

- Adds latency (model inference time)
- Requires ML infrastructure
- May need tuning per language
AI (LLM-based)
Type: AI
Uses an LLM to analyze if the user’s statement is complete.
UseAgentLLM:

- `true` - Use the agent’s configured LLM
- `false` - Use a dedicated LLM (specify in `LLMIntegration`)
Custom LLM configuration (if UseAgentLLM: false)

Pros:

- Semantic understanding of completion
- Best for complex, multi-turn exchanges
- Context-aware decisions
Cons:

- Highest latency (LLM API call)
- Non-deterministic
- Higher cost
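A hypothetical sketch with a dedicated completion-check model (Type: AI, UseAgentLLM, and LLMIntegration appear in this guide; the nested provider fields are illustrative):

```yaml
TurnEndDetection:
  Type: AI
  UseAgentLLM: false
  LLMIntegration:       # illustrative provider settings
    Provider: openai
    Model: gpt-4o-mini  # a small, fast model keeps the completion check cheap
```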
Pause trigger
Determines when to pause the agent’s speech if the user starts talking (barge-in detection).

Enable pause trigger (null = disabled)
- `VAD` - Voice activity detection
- `STT` - Speech-to-text based
VAD pause trigger
Milliseconds of speech detected to trigger pause
STT pause trigger
Number of words transcribed to trigger pause
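The two variants might be sketched as follows (the VAD/STT type values come from this guide; the threshold field names are illustrative):

```yaml
# Option A: VAD-based barge-in - pause after N ms of detected speech
PauseTrigger:
  Type: VAD
  SpeechMs: 300
---
# Option B: STT-based barge-in - pause after N transcribed words
PauseTrigger:
  Type: STT
  WordCount: 2
```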
| Type | Latency | Accuracy | Use Case |
|---|---|---|---|
| VAD | 300ms | Moderate | Fast-paced conversations |
| STT | 500-800ms | High | Avoid false positives from noise |
Interruption verification
After pausing, verify whether the interruption was intentional or just noise/backchanneling.

Enable LLM-based verification
UseAgentLLM:

- `true` - Use the agent’s LLM
- `false` - Use a dedicated LLM (specify in `LLMIntegration`)
Custom LLM configuration (if UseAgentLLM: false)

Verification adds ~300ms of latency but dramatically improves conversation naturalness by preventing false interruptions.
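A hypothetical sketch (UseAgentLLM and LLMIntegration appear in this guide; the remaining fields are illustrative):

```yaml
InterruptionVerification:
  Enabled: true
  UseAgentLLM: false
  LLMIntegration:       # illustrative provider settings
    Provider: openai
    Model: gpt-4o-mini  # a smaller model keeps the ~300ms overhead low
```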
Configuration strategies
Strategy 1: Fast and simple
Use case: High-volume IVR, simple transactions

- No barge-in
- Fastest response time
- Deterministic behavior
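One way this strategy might look, reusing the illustrative field names from the sections above:

```yaml
Interruptions:
  StrictTurnTaking: true  # no barge-in
  TurnEndDetection:
    Type: VAD             # fastest, deterministic
    SilenceMs: 300
```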
Strategy 2: Natural conversations
Use case: Customer service, general assistants

- Barge-in enabled
- Distinguishes backchannels from interruptions
- Balanced latency and accuracy
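An illustrative sketch of this strategy (field names as in the earlier examples, not an official schema):

```yaml
Interruptions:
  StrictTurnTaking: false   # barge-in enabled
  PauseTrigger:
    Type: VAD
    SpeechMs: 300
  TurnEndDetection:
    Type: STT               # provider endpointing for accuracy
  InterruptionVerification:
    Enabled: true           # filters backchannels from real interruptions
    UseAgentLLM: true
```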
Strategy 3: Maximum accuracy
Use case: Therapy, coaching, high-stakes consultations

- Semantic understanding at all stages
- Highest accuracy
- Higher latency and cost (justified for high-value use cases)
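An illustrative sketch (same caveats as above; provider and model names are placeholders):

```yaml
Interruptions:
  StrictTurnTaking: false
  PauseTrigger:
    Type: STT
    WordCount: 2
  TurnEndDetection:
    Type: AI                # semantic turn-end decisions
    UseAgentLLM: false
    LLMIntegration:
      Provider: openai
      Model: gpt-4o
  InterruptionVerification:
    Enabled: true
    UseAgentLLM: false
```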
Strategy 4: Noisy environments
Use case: Call centers, outdoor applications

- Higher thresholds to avoid false positives
- STT + verification reduce noise interruptions
- Slightly slower but more reliable
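An illustrative sketch (field names as above):

```yaml
Interruptions:
  StrictTurnTaking: false
  PauseTrigger:
    Type: STT
    WordCount: 3            # higher word threshold filters out noise bursts
  TurnEndDetection:
    Type: STT
  InterruptionVerification:
    Enabled: true           # second check before treating speech as an interruption
```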
Testing interruptions
Test backchannels
While agent is speaking, say short acknowledgments:
- “Okay”
- “Mm-hmm”
- “I see”

Expected: the agent acknowledges briefly (or ignores them) and keeps talking.
Test real interruptions
While agent is speaking, say:
- “Wait, stop”
- “That’s wrong”
- “I have a question”

Expected: the agent stops speaking and listens.
Best practices
Match culture and context
- Western cultures - More interruptions expected, enable barge-in
- Eastern cultures - More respectful turn-taking, consider turn-by-turn mode
- Formal contexts - Stricter turn-taking
- Casual contexts - More flexible interruptions
Tune for audience
- Young adults - Fast VAD thresholds (200ms silence)
- Elderly users - Slow VAD thresholds (500-700ms silence)
- Non-native speakers - STT or ML turn detection (better at handling pauses)
Provide feedback
When paused, give the user an audio cue so they know the agent is listening.

Monitor false positives
Track metrics:

- Interruptions per conversation
- Average interruption latency
- Backchannel vs. real interruption ratio
Use dedicated LLMs for verification
For high-traffic agents, use a faster/cheaper model for verification.

Latency breakdown
| Configuration | Pause Detection | Turn End Detection | Verification | Total |
|---|---|---|---|---|
| VAD only | ~100ms | ~100ms | - | ~200ms |
| VAD + STT | ~100ms | ~300ms | - | ~400ms |
| STT + Verification | ~300ms | ~300ms | ~300ms | ~900ms |
| ML + Verification | ~200ms | ~400ms | ~300ms | ~900ms |
| AI (full LLM) | ~400ms | ~500ms | ~300ms | ~1200ms |
Next steps
- Agent configuration - Complete agent settings reference
- Visual IDE - Build conversation scripts
- Integrations - Configure LLM and STT providers