Cluely’s audio intelligence feature allows you to record voice memos, upload audio files, and get instant AI-powered transcriptions and contextual responses.

How it works

Audio processing in Cluely follows a multi-step pipeline:
  1. Audio capture - Record audio directly in Cluely or upload an audio file (MP3, WAV)
  2. Transcription - AI automatically transcribes the audio to text using Google’s Gemini models
  3. Interpretation - The transcript is analyzed to extract intent, context, and expected outcomes
  4. Response generation - A contextual response is generated based on the interpreted request

Recording audio

Voice recording flow

When you record audio in Cluely, the audio is processed through the voice pipeline:
public async processVoiceRecording(data: string, mimeType: string) {
  const transcriptResult = await this.llmHelper.analyzeAudioFromBase64(data, mimeType)
  const interpretation = await this.llmHelper.interpretVoiceTranscript(transcriptResult.text)
  const voiceAnswer = await this.llmHelper.generateVoiceResponse(
    transcriptResult.text, 
    interpretation
  )
  
  return {
    transcript: transcriptResult.text,
    interpretation,
    answer: voiceAnswer
  }
}

Supported formats

Cluely supports multiple audio formats:
  • MP3 (audio/mpeg) - Compressed audio, smaller file size
  • WAV (audio/wav) - Uncompressed audio, higher quality
Audio files are automatically detected when added to the screenshot queue. If the last item ends with .mp3 or .wav, it’s processed as audio instead of an image.
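This detection logic can be sketched as a small helper. The function names below are hypothetical for illustration; in the shipped code the check is performed inline in electron/ProcessingHelper.ts:

```typescript
// Hypothetical helpers illustrating the extension-based detection described above.
const AUDIO_EXTENSIONS = [".mp3", ".wav"]

function isAudioFile(filePath: string): boolean {
  const lower = filePath.toLowerCase()
  return AUDIO_EXTENSIONS.some((ext) => lower.endsWith(ext))
}

function mimeTypeFor(filePath: string): string {
  // WAV maps to audio/wav; otherwise default to audio/mpeg (MP3)
  return filePath.toLowerCase().endsWith(".wav") ? "audio/wav" : "audio/mpeg"
}
```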

Transcription

How transcription works

Audio is sent to Gemini’s multimodal API for transcription:
public async analyzeAudioFromBase64(data: string, mimeType: string) {
  const client = await this.getGeminiClient()
  const audioPart = {
    inlineData: {
      data,
      mimeType
    }
  }
  
  const prompt = `${this.systemPrompt}

Describe this audio clip in a short, concise answer. 
In addition to your main answer, suggest several possible actions 
or responses the user could take next based on the audio.`
  
  const result = await this.generateContentWithRetry([
    { parts: [{ text: prompt }, audioPart] }
  ])
  
  return { text: result.candidates[0].content.parts[0].text, timestamp: Date.now() }
}
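The generateContentWithRetry helper is not shown in this document. A minimal sketch of such a wrapper, assuming a simple exponential-backoff policy (the actual retry behavior in Cluely may differ), could look like:

```typescript
// Hypothetical retry wrapper with exponential backoff.
// Retries a flaky async call up to maxAttempts times before giving up.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500
): Promise<T> {
  let lastError: unknown
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn()
    } catch (err) {
      lastError = err
      if (attempt < maxAttempts - 1) {
        // Back off before the next attempt: 500ms, 1000ms, 2000ms, ...
        await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt))
      }
    }
  }
  throw lastError
}
```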

Accuracy

Transcription quality depends on:
  • Audio quality - Clear recordings transcribe better
  • Background noise - Minimize ambient sound
  • Speaking clarity - Speak at a normal pace
  • Language - English is best supported
Audio transcription requires a Gemini API key. It’s not available with Ollama or other local models since they don’t support audio input.

Voice interpretation

Understanding intent

After transcription, Cluely analyzes the transcript to understand what you’re asking for:
public async interpretVoiceTranscript(transcript: string) {
  const prompt = `The user spoke: "${transcript}"

Interpret the request and respond with JSON:
{
  "problem_statement": "Concise restatement of the user's request",
  "context": "Relevant background or assumptions",
  "expected_outcome": "What result the user wants",
  "key_requirements": ["Must-have requirements"],
  "clarifications_needed": ["Questions if anything is ambiguous"],
  "suggested_responses": ["High-level solution directions"],
  "reasoning": "How you interpreted the request"
}`
  
  const result = await this.generateContentWithRetry([{ parts: [{ text: prompt }] }])
  return JSON.parse(this.cleanJsonResponse(result.candidates[0].content.parts[0].text))
}
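The cleanJsonResponse helper isn't shown above. Models often wrap JSON output in markdown code fences, so a minimal sketch (an assumption about the actual implementation) would strip those fences before parsing:

```typescript
// Hypothetical sketch: strip a leading/trailing markdown code fence
// so the remaining string can be passed to JSON.parse.
function cleanJsonResponse(raw: string): string {
  return raw
    .replace(/^\s*```(?:json)?\s*/i, "") // leading fence, e.g. ```json
    .replace(/\s*```\s*$/, "")           // trailing fence
    .trim()
}
```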

Interpretation output

The interpretation provides structured understanding:
{
  "problem_statement": "Explain how to implement binary search in JavaScript",
  "context": "User wants a code example with explanation",
  "expected_outcome": "Working implementation with time complexity analysis",
  "key_requirements": [
    "Must be in JavaScript",
    "Should include comments",
    "Explain time complexity"
  ],
  "clarifications_needed": [],
  "suggested_responses": [
    "Provide iterative implementation",
    "Provide recursive implementation",
    "Include example usage"
  ],
  "reasoning": "User is learning algorithms and needs a practical example"
}
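The same shape can be captured as a TypeScript interface. This is a sketch derived from the example above, not necessarily the app's own type definitions:

```typescript
// Hypothetical type for the interpretation payload, inferred from the JSON example.
interface VoiceInterpretation {
  problem_statement: string
  context: string
  expected_outcome: string
  key_requirements: string[]
  clarifications_needed: string[]
  suggested_responses: string[]
  reasoning: string
}

// The example output above, typed against the interface.
const example: VoiceInterpretation = {
  problem_statement: "Explain how to implement binary search in JavaScript",
  context: "User wants a code example with explanation",
  expected_outcome: "Working implementation with time complexity analysis",
  key_requirements: [
    "Must be in JavaScript",
    "Should include comments",
    "Explain time complexity"
  ],
  clarifications_needed: [],
  suggested_responses: [
    "Provide iterative implementation",
    "Provide recursive implementation",
    "Include example usage"
  ],
  reasoning: "User is learning algorithms and needs a practical example"
}
```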

Response generation

Contextual answers

Based on the interpretation, Cluely generates a targeted response:
public async generateVoiceResponse(transcript: string, interpretation?: any): Promise<string> {
  const reasoning = interpretation?.reasoning 
    ? `\nReasoning: ${interpretation.reasoning}` 
    : ""
  const outcome = interpretation?.expected_outcome 
    ? `\nExpected outcome: ${interpretation.expected_outcome}` 
    : ""
  const requirements = Array.isArray(interpretation?.key_requirements)
    ? `\nKey requirements: ${interpretation.key_requirements.join("; ")}` 
    : ""
  
  const prompt = `You are Wingman AI responding to: "${transcript}"${outcome}${requirements}${reasoning}

Provide a direct, helpful answer. Focus on delivering the explanation or result 
they asked for. Keep it concise, practical, and on-topic.`
  
  const result = await this.generateContentWithRetry([{ parts: [{ text: prompt }] }])
  return result.candidates[0].content.parts[0].text
}

Response quality

Responses are optimized for:
  • Relevance - Directly addresses the spoken request
  • Completeness - Includes all key requirements from interpretation
  • Clarity - Uses Markdown formatting for readability
  • Actionability - Provides concrete next steps when applicable

Audio file processing

Upload and process

You can process existing audio files:
// From file path
const fileResult = await window.electronAPI.analyzeAudioFile("/path/to/audio.mp3")

// From base64 data
const base64Result = await window.electronAPI.analyzeAudioBase64(
  base64Data,
  "audio/mpeg"
)

Processing workflow

When an audio file is in the queue (electron/ProcessingHelper.ts:69):
const allPaths = this.appState.getScreenshotHelper().getScreenshotQueue()
const lastPath = allPaths[allPaths.length - 1]

if (lastPath.endsWith('.mp3') || lastPath.endsWith('.wav')) {
  const audioBuffer = await fs.promises.readFile(lastPath)
  const extension = path.extname(lastPath).toLowerCase()
  const mimeType = extension === '.wav' ? 'audio/wav' : 'audio/mpeg'
  
  await this.processVoiceRecording(
    audioBuffer.toString('base64'), 
    mimeType
  )
}

Use cases

  • Code explanations - Record yourself describing a coding problem and get solutions
  • Quick notes - Capture ideas verbally and get structured summaries
  • Learning assistance - Ask questions out loud and receive detailed explanations
  • Debugging help - Describe an error verbally and get troubleshooting steps

Best practices

Background noise can reduce transcription accuracy:
  • Use headphones with a microphone for better isolation
  • Close windows and doors to minimize ambient sound
  • Turn off fans or noisy equipment
Optimal speaking technique:
  • Use your normal speaking voice (don’t whisper or shout)
  • Speak at a moderate pace
  • Pronounce technical terms carefully
  • Pause between sentences
For best results, organize your spoken input:
  1. State the problem clearly
  2. Mention any constraints or requirements
  3. Specify the expected format (code, explanation, steps)
Keep each recording focused:
  • Stay on topic for each recording
  • Keep recordings under 1-2 minutes when possible
  • Record separate clips for unrelated questions

Technical details

Audio encoding

Audio is encoded as base64 for transmission:
const audioData = await fs.promises.readFile(audioPath)
const audioPart = {
  inlineData: {
    data: audioData.toString("base64"),
    mimeType: "audio/mpeg" // MP3 files use the audio/mpeg MIME type
  }
}

Model requirements

Audio processing requires Gemini API access:
private async getGeminiClient(): Promise<GoogleGenAI> {
  const apiKey = this.resolveGeminiApiKey()
  if (!apiKey) {
    throw new Error("Gemini API key is required for voice features")
  }
  
  if (!this.geminiVoiceClient) {
    this.geminiVoiceClient = new GoogleGenAI({ apiKey })
  }
  return this.geminiVoiceClient
}
Even if you’re using Ollama or OpenRouter for text and vision, voice features always use Gemini, since it’s the only supported provider with audio understanding capabilities.

IPC handlers

The main process exposes these audio-related IPC handlers:
// Analyze audio from base64 data
ipcMain.handle("analyze-audio-base64", async (event, data: string, mimeType: string) => {
  return await appState.processingHelper.processAudioBase64(data, mimeType)
})

// Process voice recording with full interpretation
ipcMain.handle("process-voice-recording", async (_, data: string, mimeType: string) => {
  return await appState.processingHelper.processVoiceRecording(data, mimeType)
})

// Analyze audio file from path
ipcMain.handle("analyze-audio-file", async (event, path: string) => {
  return await appState.processingHelper.processAudioFile(path)
})
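On the renderer side, these handlers are typically exposed through an Electron preload script. The sketch below shows the channel wiring with `invoke` typed loosely so it can be demonstrated without Electron; the function and file names are assumptions, not the shipped code:

```typescript
// Hypothetical preload wiring. In the real preload script this would be:
//   contextBridge.exposeInMainWorld("electronAPI", buildElectronAPI(ipcRenderer.invoke))
type Invoke = (channel: string, ...args: unknown[]) => Promise<unknown>

function buildElectronAPI(invoke: Invoke) {
  return {
    analyzeAudioBase64: (data: string, mimeType: string) =>
      invoke("analyze-audio-base64", data, mimeType),
    processVoiceRecording: (data: string, mimeType: string) =>
      invoke("process-voice-recording", data, mimeType),
    analyzeAudioFile: (path: string) =>
      invoke("analyze-audio-file", path)
  }
}
```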

Troubleshooting

Audio analysis fails
Causes:
  • Audio file is corrupted or empty
  • Audio format not supported
  • API key issue
Solutions:
  • Verify the audio file plays correctly
  • Check that GEMINI_API_KEY is set
  • Try re-recording with better quality

Voice features unavailable
Cause: No Gemini API key configured
Solution: Add to your .env file:
GEMINI_API_KEY=your_api_key_here

Transcription quality is poor
Improvements:
  • Record in a quieter environment
  • Use an external microphone instead of built-in
  • Speak more slowly and clearly
  • Ensure audio levels aren’t too low or distorted
