## Overview
Gima AI Chatbot features a sophisticated voice input system with dual-mode operation: Gemini AI transcription for high accuracy and Web Speech API fallback for offline capability. The system automatically switches modes based on availability and network conditions.
Voice commands use Gemini 2.5 Flash Lite for transcription and command parsing, ensuring high accuracy with low latency.
## Gemini AI Mode

**Server-Side Transcription.** The primary mode uses Google's Gemini API for audio transcription.

**Advantages:**
- Higher accuracy than browser-based recognition
- Consistent across all browsers and devices
- Advanced prompt engineering for clean output
- Automatic timestamp and filler word removal

**Requirements:**

- Active internet connection
- Valid `GOOGLE_GENERATIVE_AI_API_KEY`
- Browser support for the MediaRecorder API
```typescript
// Transcription with Gemini
import { generateText } from 'ai';
import { google } from '@ai-sdk/google';

export async function transcribeAudio(
  audioDataUrl: string,
  mimeType: string = 'audio/webm'
): Promise<{ text: string; success: boolean; error?: string }> {
  // Strip the data-URL prefix to get the raw base64 payload
  const base64Content = audioDataUrl.split(',')[1] ?? audioDataUrl;

  const result = await generateText({
    model: google('gemini-2.5-flash-lite'),
    temperature: 0,
    messages: [
      {
        role: 'user',
        content: [
          { type: 'text', text: VOICE_PROMPT },
          { type: 'file', data: base64Content, mediaType: mimeType },
        ],
      },
    ],
  });

  // Clean timestamps and normalize spacing
  const cleanText = result.text
    .replace(/\d{1,2}:\d{2}/g, '')
    .replace(/\n+/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();

  return { text: cleanText, success: true };
}
```
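The cleanup step can be factored into a standalone helper, which makes the regexes easy to unit-test in isolation. A sketch (the helper name is illustrative):

```typescript
// Standalone version of the transcript-cleanup pipeline:
// strips mm:ss timestamps, collapses newlines, normalizes whitespace.
function cleanTranscript(raw: string): string {
  return raw
    .replace(/\d{1,2}:\d{2}/g, '') // remove timestamps like "0:01" or "12:30"
    .replace(/\n+/g, ' ')          // collapse newlines into single spaces
    .replace(/\s+/g, ' ')          // normalize runs of whitespace
    .trim();
}
```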
## Web Speech API

**Browser-Based Recognition.** Fallback mode uses the browser's native Speech Recognition API.

**Advantages:**

- Works offline
- No API costs
- Real-time interim results
- Zero server latency

**Limitations:**

- Browser and OS dependent
- Not available in all browsers (Safari, older Firefox)
- Accuracy varies by environment
```typescript
// Native speech recognition setup
const startNativeListening = useCallback(() => {
  const SpeechRecognition =
    window.SpeechRecognition || window.webkitSpeechRecognition;
  const recognition = new SpeechRecognition();

  recognition.continuous = true;
  recognition.interimResults = true;
  recognition.lang = language; // e.g. 'es-ES'

  recognition.onresult = (event) => {
    // Accumulate only finalized results into the transcript
    let fullTranscript = '';
    for (let i = 0; i < event.results.length; i++) {
      if (event.results[i].isFinal) {
        fullTranscript += event.results[i][0].transcript;
      }
    }
    setTranscript(fullTranscript);
    onTranscript?.(fullTranscript);
  };

  recognition.start();
}, [language]);
```
## Automatic Mode Switching

The system intelligently switches between modes based on availability.

### Capability Detection

On mount, the system checks for:

- MediaRecorder API support (Gemini mode)
- SpeechRecognition API support (fallback mode)
- Network connectivity status

### Mode Selection
```typescript
useEffect(() => {
  const hasMediaRecorder = !!window.MediaRecorder;
  const hasSpeechRecognition = !!getSpeechRecognition();
  setIsSupported(hasMediaRecorder || hasSpeechRecognition);

  // Prefer Gemini if available AND online
  if (hasMediaRecorder && navigator.onLine) {
    setMode('gemini');
  } else if (hasSpeechRecognition) {
    setMode('native');
  }
}, []);
```
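The same selection logic can be expressed as a pure function, which keeps the decision testable outside the browser. A sketch; the function name and the `'unsupported'` fallback value are illustrative:

```typescript
type VoiceMode = 'gemini' | 'native' | 'unsupported';

// Pure version of the mode-selection decision above
function selectVoiceMode(
  hasMediaRecorder: boolean,
  hasSpeechRecognition: boolean,
  online: boolean
): VoiceMode {
  // Prefer Gemini when recording is possible and we are online
  if (hasMediaRecorder && online) return 'gemini';
  // Otherwise fall back to native recognition when available
  if (hasSpeechRecognition) return 'native';
  return 'unsupported';
}
```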
## Error Handling & Fallback

If Gemini transcription fails (network issues, quota exceeded), the system automatically falls back to the Web Speech API:

```typescript
catch (err) {
  const hasSpeechRecognition = !!getSpeechRecognition();
  if (hasSpeechRecognition) {
    const userFriendlyError = simplifyGeminiError(err.message);
    setError(userFriendlyError);
    setMode('native');
  } else {
    setError(VOICE_MESSAGES.BROWSER_NOT_SUPPORTED);
  }
}
```
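`simplifyGeminiError` is referenced but not shown; a plausible sketch maps raw API error text onto the user-facing messages documented in the Error Messages section (the matching rules here are assumptions):

```typescript
// Hypothetical sketch: translate raw Gemini API errors into the
// user-facing messages shown in the Error Messages table.
function simplifyGeminiError(message: string): string {
  const msg = message.toLowerCase();
  if (msg.includes('quota') || msg.includes('429')) {
    return '⚠️ Límite de API alcanzado · Modo local activo';
  }
  if (msg.includes('api key')) {
    return '🔑 API Key no configurada · Modo local activo';
  }
  if (msg.includes('network') || msg.includes('fetch')) {
    return '📡 Sin conexión · Modo local activo';
  }
  // Unknown errors pass through unchanged
  return message;
}
```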
## Voice Command Parsing

Transcripts can be parsed into structured commands using AI:

```typescript
// Execute voice command
const result = await executeVoiceCommand(
  'Crear orden urgente para la UMA',
  { minConfidence: 0.7 }
);

if (result.success) {
  console.log(result.command.action);    // 'create_work_order'
  console.log(result.command.equipment); // 'UMA'
  console.log(result.command.priority);  // 'urgent'
}
```
### Command Parser Service

The `VoiceCommandParserService` uses Gemini to interpret natural language:

```typescript
export class VoiceCommandParserService extends BaseAIService {
  public async parseCommand(
    transcript: string,
    options?: VoiceParserOptions
  ): Promise<{ success: boolean; command?: VoiceCommand; error?: string }> {
    const minConfidence = options?.minConfidence ?? 0.7;

    const result = await generateText({
      model: google('gemini-2.5-flash-lite'),
      messages: [
        {
          role: 'system',
          // contextPrompt is assembled elsewhere in the service
          content: MASTER_VOICE_PROMPT + contextPrompt,
        },
        {
          role: 'user',
          content: transcript,
        },
      ],
      temperature: 0,
    });

    // Parse the model output, then validate it with the Zod schema
    const parsed = JSON.parse(result.text);
    const validation = VoiceCommandSchema.safeParse(parsed);

    if (validation.success && validation.data.confidence >= minConfidence) {
      return { success: true, command: validation.data };
    }
    return { success: false, error: 'Low confidence' };
  }
}
```
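The full command shape is not shown; the sketch below checks the minimal structure in plain TypeScript as a stand-in for `VoiceCommandSchema.safeParse` (the real service uses Zod, and any fields beyond those in the usage example are assumptions):

```typescript
// Illustrative command shape, based on the fields used in the usage example
interface VoiceCommand {
  action: string;      // e.g. 'create_work_order'
  equipment?: string;  // e.g. 'UMA'
  priority?: string;   // e.g. 'urgent'
  confidence: number;  // 0..1, compared against minConfidence
}

// Minimal structural check standing in for the Zod schema
function isVoiceCommand(value: unknown): value is VoiceCommand {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.action === 'string' &&
    typeof v.confidence === 'number' &&
    v.confidence >= 0 &&
    v.confidence <= 1
  );
}
```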
## Audio Format Detection

The system dynamically detects the best audio format for the browser:

```typescript
export function getSupportedAudioMimeType(): string {
  const types = [
    'audio/webm;codecs=opus',
    'audio/webm',
    'audio/ogg;codecs=opus',
    'audio/mp4',
  ];

  for (const type of types) {
    if (MediaRecorder.isTypeSupported(type)) {
      return type;
    }
  }

  throw new Error('No supported audio MIME type found');
}
```
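The preference order can be unit-tested by injecting the support check instead of calling `MediaRecorder.isTypeSupported` directly. A refactor sketch (names are illustrative):

```typescript
// Preferred formats, most specific first
const PREFERRED_AUDIO_TYPES = [
  'audio/webm;codecs=opus',
  'audio/webm',
  'audio/ogg;codecs=opus',
  'audio/mp4',
];

// Returns the first type the provided predicate accepts
function pickAudioMimeType(isTypeSupported: (type: string) => boolean): string {
  for (const type of PREFERRED_AUDIO_TYPES) {
    if (isTypeSupported(type)) return type;
  }
  throw new Error('No supported audio MIME type found');
}
```

In the browser this would be called as `pickAudioMimeType((t) => MediaRecorder.isTypeSupported(t))`.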
**Safari Support:** Safari on iOS/macOS typically uses `audio/mp4`, while Chrome uses `audio/webm;codecs=opus`.
## File Size Limits

Audio transcription enforces size limits to ensure fast processing:

```typescript
// From app/config/limits.ts
export const MAX_AUDIO_SIZE_MB = 5;
export const MAX_AUDIO_SIZE_BYTES = 5 * 1024 * 1024;

// Validation in the transcription action
const sizeInBytes = getBase64Size(base64Content);
const sizeInMB = bytesToMB(sizeInBytes);

if (sizeInMB > MAX_AUDIO_SIZE_MB) {
  throw new Error(
    `Audio demasiado grande (${sizeInMB.toFixed(1)} MB). Máximo: ${MAX_AUDIO_SIZE_MB} MB`
  );
}
```
## Usage

```tsx
import { useState } from 'react';
import { useVoiceInput } from '@/app/hooks/use-voice-input';

function VoiceInputComponent() {
  const [inputValue, setInputValue] = useState('');

  const {
    isListening,
    isProcessing,
    transcript,
    isSupported,
    mode,
    toggleListening,
    error,
  } = useVoiceInput({
    onTranscript: (text) => {
      console.log('Transcribed:', text);
      setInputValue(text);
    },
    onError: (error) => {
      console.error('Voice error:', error);
    },
    language: 'es-ES',
  });

  return (
    <div>
      <button
        onClick={toggleListening}
        disabled={isProcessing || !isSupported}
      >
        {isListening ? 'Stop Recording' : 'Start Recording'}
      </button>
      <div>
        Mode: {mode} | Status: {isListening ? 'Listening' : 'Idle'}
      </div>
      {error && <div className="error">{error}</div>}
      {transcript && <div className="transcript">{transcript}</div>}
    </div>
  );
}
```
## Error Messages

User-friendly error messages are provided for common issues:

| Error Type | Message | Resolution |
| --- | --- | --- |
| Quota Exceeded | "⚠️ Límite de API alcanzado · Modo local activo" | API quota limit reached; fallback activated |
| API Key Missing | "🔑 API Key no configurada · Modo local activo" | Set `GOOGLE_GENERATIVE_AI_API_KEY` |
| Permission Denied | "🚫 Permiso de micrófono denegado" | Grant microphone permissions in the browser |
| No Connection | "📡 Sin conexión · Modo local activo" | Check internet connection |
| Browser Not Supported | "❌ Tu navegador no soporta reconocimiento de voz" | Use a modern browser |
## Configuration

```bash
# .env.local
GOOGLE_GENERATIVE_AI_API_KEY=your_api_key_here

# Optional: custom voice prompt
VOICE_PROMPT="Transcribe this audio accurately..."
```
## Best Practices

- **User Feedback:** Show visual indicators during recording and processing states.
- **Graceful Degradation:** Always provide fallback options when features aren't available.
- **Privacy:** Inform users when audio is being sent to external APIs.
- **Testing:** Test voice input across different browsers and devices.
## Learn More

- **Multimodal Chat:** Learn about the full chat experience
- **Image Analysis:** Explore visual recognition capabilities
- **PDF Processing:** Discover document analysis features