Overview

The Voice Actions module provides server-side functions for processing audio input, transcribing speech to text, and parsing voice commands using Google’s Gemini AI models.

transcribeAudio

Transcribes an audio recording with the Gemini Flash Lite model, using a specialized prompt that strips timestamps and filler words from the output.
export async function transcribeAudio(
  audioDataUrl: string,
  mimeType: string = 'audio/webm'
): Promise<{ text: string; success: boolean; error?: string }>

Parameters

  • audioDataUrl (string, required): Audio as a Base64-encoded data URL (data:audio/…)
  • mimeType (string, default: "audio/webm"): MIME type of the audio file
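
Callers that hold raw audio bytes need to turn them into a data URL before calling transcribeAudio. The sketch below shows one way to do that. The helper name toAudioDataUrl is hypothetical and not part of the module, and it assumes a runtime where the global btoa function is available (browsers and recent Node.js versions).

```typescript
// Hypothetical helper, not part of the Voice Actions module.
// Encodes raw audio bytes as a Base64 data URL of the form
// "data:<mimeType>;base64,<payload>".
export function toAudioDataUrl(
  bytes: Uint8Array,
  mimeType: string = "audio/webm"
): string {
  // btoa expects a "binary string" in which each character is one byte.
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return `data:${mimeType};base64,${btoa(binary)}`;
}
```

For large recordings, building the string one character at a time can be slow; a chunked or streaming encoder may be a better fit.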

Response

  • text (string): Transcribed and cleaned text from the audio
  • success (boolean): Whether the transcription succeeded
  • error (string): Error message if transcription failed

Features

  • Size Validation: Rejects audio that exceeds the maximum size set by MAX_AUDIO_SIZE_MB
  • Automatic Cleaning: Removes timestamps (00:00, 01:23, etc.) and excessive line breaks
  • Error Handling: Returns structured error responses with logging
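
The size check and the cleaning step can be approximated as follows. This is a minimal sketch and not the module's implementation: the limit value of 10 MB is an assumed placeholder (the real value lives in config/limits), the function names are hypothetical, and the regex-based cleaning is a deterministic stand-in for cleanup that the module may perform through its prompt.

```typescript
// Assumed placeholder value; the real limit is defined in config/limits.
const MAX_AUDIO_SIZE_MB = 10;

// Estimates the decoded byte length of a Base64 data URL without decoding it.
// Every 4 Base64 characters encode 3 bytes, minus any "=" padding.
export function estimateDecodedBytes(dataUrl: string): number {
  const commaIndex = dataUrl.indexOf(",");
  const base64 = commaIndex >= 0 ? dataUrl.slice(commaIndex + 1) : dataUrl;
  const padding = base64.endsWith("==") ? 2 : base64.endsWith("=") ? 1 : 0;
  return Math.floor((base64.length * 3) / 4) - padding;
}

// Returns true when the audio fits within the limit (in megabytes).
export function isWithinSizeLimit(
  dataUrl: string,
  maxMb: number = MAX_AUDIO_SIZE_MB
): boolean {
  return estimateDecodedBytes(dataUrl) <= maxMb * 1024 * 1024;
}

// Removes timestamps such as 00:00 or 01:23 and collapses extra line breaks.
// Note: this also strips legitimate clock times (for example "10:30").
export function cleanTranscript(raw: string): string {
  return raw
    .replace(/\b\d{1,2}:\d{2}(?::\d{2})?\b/g, "") // drop mm:ss and h:mm:ss
    .replace(/^[ \t]+|[ \t]+$/gm, "") // trim each line
    .replace(/[ \t]{2,}/g, " ") // collapse runs of spaces
    .replace(/\n{3,}/g, "\n\n") // at most one blank line in a row
    .trim();
}
```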

Example

const result = await transcribeAudio("data:audio/webm;base64,GkXfo59ChoEBQveBAULygQRC...");

if (result.success) {
  console.log("Transcription:", result.text);
} else {
  console.error("Error:", result.error);
}

executeVoiceCommand

Parses a voice transcription into a structured command using AI-powered natural language understanding.
export async function executeVoiceCommand(
  transcript: string,
  options?: { minConfidence?: number; context?: string }
): Promise<
  | { success: true; command: VoiceCommand }
  | { success: false; error: string; code: string; recoverable: boolean }
>

Parameters

  • transcript (string, required): Transcribed text from voice input
  • options (object): Optional parsing configuration
  • options.minConfidence (number): Minimum confidence threshold (0-1) for accepting parsed commands
  • options.context (string): Additional context to help the AI understand the command

Response (Success)

  • success (boolean): Returns true on successful parsing
  • command (VoiceCommand): Parsed command object with action type and parameters

Response (Failure)

  • success (boolean): Returns false on parsing failure
  • error (string): Human-readable error message
  • code (string): Error code: MISSING_API_KEY, PARSING_FAILED, or EXECUTION_ERROR
  • recoverable (boolean): Whether the error is recoverable (e.g., user can retry)

Features

  • API Key Validation: Checks for GOOGLE_GENERATIVE_AI_API_KEY before processing
  • Structured Validation: Uses Zod schemas for command validation
  • Language Support: Configured for Spanish (es-ES) commands
  • Confidence Scoring: Filters low-confidence interpretations
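
The validation and confidence filtering steps can be illustrated with a dependency-free sketch. The module itself uses Zod schemas; the plain type guard below is a swapped-in stand-in so the example runs without extra packages. The ParsedCommand shape is an assumption: action and parameters appear in the documented example, while the confidence field, the acceptCommand name, and the default threshold of 0.7 are hypothetical.

```typescript
// Assumed shape. "action" and "parameters" come from the documented example;
// "confidence" is a hypothetical field used for filtering.
interface ParsedCommand {
  action: string;
  parameters: Record<string, unknown>;
  confidence: number;
}

// Structural check standing in for the Zod schema used by the module.
function isParsedCommand(value: unknown): value is ParsedCommand {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Record<string, unknown>;
  return (
    typeof candidate.action === "string" &&
    typeof candidate.parameters === "object" &&
    candidate.parameters !== null &&
    typeof candidate.confidence === "number" &&
    candidate.confidence >= 0 &&
    candidate.confidence <= 1
  );
}

// Returns the command only if it is well formed and meets the threshold.
// The default of 0.7 is an assumption; the module's default is not documented.
export function acceptCommand(
  raw: unknown,
  minConfidence: number = 0.7
): ParsedCommand | null {
  if (!isParsedCommand(raw)) return null;
  return raw.confidence >= minConfidence ? raw : null;
}
```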

Example

const result = await executeVoiceCommand(
  "Crear orden urgente para la UMA",
  { minConfidence: 0.8, context: "work orders" }
);

if (result.success) {
  console.log("Action:", result.command.action);
  console.log("Parameters:", result.command.parameters);
} else {
  console.error(`Error [${result.code}]:`, result.error);
  console.log("Recoverable:", result.recoverable);
}
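
The two functions are naturally used together: transcribe first, then parse the resulting text. The sketch below shows one possible way to chain them. The handleVoiceInput name is hypothetical, the functions are passed in as arguments so the example stays self-contained, and the TRANSCRIPTION_FAILED code is an illustrative assumption that is not among the documented error codes.

```typescript
type TranscriptionResult = { text: string; success: boolean; error?: string };

type CommandResult<C> =
  | { success: true; command: C }
  | { success: false; error: string; code: string; recoverable: boolean };

type ParseOptions = { minConfidence?: number; context?: string };

// Function types mirroring the documented signatures.
type Transcribe = (
  audioDataUrl: string,
  mimeType?: string
) => Promise<TranscriptionResult>;
type Execute<C> = (
  transcript: string,
  options?: ParseOptions
) => Promise<CommandResult<C>>;

// Hypothetical pipeline: transcribe, then parse the transcript.
export async function handleVoiceInput<C>(
  audioDataUrl: string,
  transcribe: Transcribe,
  execute: Execute<C>,
  options?: ParseOptions
): Promise<CommandResult<C>> {
  const transcription = await transcribe(audioDataUrl);
  if (!transcription.success || transcription.text.trim() === "") {
    return {
      success: false,
      error: transcription.error ?? "Transcription returned no text",
      code: "TRANSCRIPTION_FAILED", // hypothetical code, not documented
      recoverable: true, // assumption: the user can record again
    };
  }
  return execute(transcription.text, options);
}
```

In application code, transcribe and execute would be the module's transcribeAudio and executeVoiceCommand.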

Error Codes

| Code | Description | Recoverable |
| --- | --- | --- |
| MISSING_API_KEY | Google AI API key not configured | No |
| PARSING_FAILED | Could not parse command from transcript | Yes |
| EXECUTION_ERROR | Unexpected error during processing | No |
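
A caller can use the code and recoverable fields to decide what to do after a failure. The following is a minimal sketch of that decision; the nextStep and describeFailure names, the NextStep values, and the message wording are all hypothetical.

```typescript
type VoiceCommandFailure = {
  success: false;
  error: string;
  code: string;
  recoverable: boolean;
};

type VoiceCommandResult<C> =
  | { success: true; command: C }
  | VoiceCommandFailure;

// Hypothetical outcomes for the calling UI.
type NextStep = "run" | "ask-retry" | "report";

// Maps a result to the next UI action using the recoverable flag.
export function nextStep<C>(result: VoiceCommandResult<C>): NextStep {
  if (result.success === true) return "run";
  return result.recoverable ? "ask-retry" : "report";
}

// Maps documented error codes to user-facing text (wording is illustrative).
export function describeFailure(failure: VoiceCommandFailure): string {
  switch (failure.code) {
    case "MISSING_API_KEY":
      return "Voice commands are not configured on the server.";
    case "PARSING_FAILED":
      return "Could not understand the command. Please try again.";
    case "EXECUTION_ERROR":
      return "Something went wrong while processing the command.";
    default:
      // Unknown code: fall back to the message supplied with the failure.
      return failure.error;
  }
}
```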

Configuration

Environment Variables

  • GOOGLE_GENERATIVE_AI_API_KEY: Required for all voice operations
  • MAX_AUDIO_SIZE_MB: Maximum audio file size, in megabytes (defined in config/limits)

Dependencies

  • @ai-sdk/google: Google AI SDK for Gemini models
  • ai: Vercel AI SDK for text generation
  • VoiceCommandParserService: Internal service for command parsing

Models Used

  • Transcription: gemini-2.5-flash-lite (optimized for speed)
  • Command Parsing: Configured via VoiceCommandParserService
