How it works
Audio processing in Cluely follows a multi-step pipeline: recording, transcription, interpretation, and response generation.
Recording audio
Voice recording flow
When you record audio in Cluely, the recording is processed through the voice pipeline.
Supported formats
Cluely supports multiple audio formats:
- MP3 (audio/mpeg) - Compressed audio, smaller file size
- WAV (audio/wav) - Uncompressed audio, higher quality
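The mapping between file extension and MIME type for these two formats can be sketched as a small lookup. The helper names below (`audioMimeType`, `lastItemKind`) are hypothetical and are not taken from Cluely's source; the sketch also assumes case-sensitive extension matching and treats any non-audio path as an image.

```typescript
// Hypothetical helpers for illustration; not Cluely's actual implementation.
type AudioMimeType = "audio/mpeg" | "audio/wav";

// Returns the MIME type for a supported audio path, or null for anything else
// (for example a screenshot). Matching is case-sensitive in this sketch.
function audioMimeType(filePath: string): AudioMimeType | null {
  if (filePath.endsWith(".mp3")) return "audio/mpeg";
  if (filePath.endsWith(".wav")) return "audio/wav";
  return null;
}

// Classifies a queue by its last entry: audio if the path has a supported
// audio extension, image otherwise.
function lastItemKind(queue: string[]): "audio" | "image" | "empty" {
  const last = queue[queue.length - 1];
  if (last === undefined) return "empty";
  return audioMimeType(last) !== null ? "audio" : "image";
}
```

Only the final queue entry is inspected, so a queue of screenshots followed by one `.wav` file would be classified as audio in this sketch.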
Audio files are detected automatically when added to the screenshot queue. If the last item ends with .mp3 or .wav, it’s processed as audio instead of an image.
Transcription
How transcription works
Audio is sent to Gemini’s multimodal API for transcription.
Accuracy
Transcription quality depends on:
- Audio quality - Clear recordings transcribe better
- Background noise - Minimize ambient sound
- Speaking clarity - Speak at a normal pace
- Language - English is best supported
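To make the transcription step in this section more concrete, here is a minimal sketch of how the content of a transcription request might be assembled. The part shapes mirror the `text` and `inlineData` parts used by Google's Generative AI JavaScript SDK; the function name and the prompt wording are assumptions, not Cluely's actual code.

```typescript
// Part shapes modeled on the Google Generative AI JavaScript SDK.
interface TextPart {
  text: string;
}
interface InlineDataPart {
  inlineData: { mimeType: string; data: string };
}
type Part = TextPart | InlineDataPart;

// Hypothetical helper: pairs an instruction with base64-encoded audio.
// The prompt text is an assumption made for this sketch.
function buildTranscriptionParts(
  base64Audio: string,
  mimeType: "audio/mpeg" | "audio/wav"
): Part[] {
  if (base64Audio.length === 0) {
    throw new Error("Audio payload is empty");
  }
  return [
    { text: "Transcribe this audio clip. Return only the spoken words." },
    { inlineData: { mimeType, data: base64Audio } },
  ];
}
```

An array like this would then be handed to the SDK's content-generation call; the network request itself is left out so the sketch stays self-contained.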
Voice interpretation
Understanding intent
After transcription, Cluely analyzes the transcript to understand what you’re asking for.
Interpretation output
The interpretation provides a structured understanding of the request.
Response generation
Contextual answers
Based on the interpretation, Cluely generates a targeted response.
Response quality
Responses are optimized for:
- Relevance - Directly addresses the spoken request
- Completeness - Includes all key requirements from interpretation
- Clarity - Uses Markdown formatting for readability
- Actionability - Provides concrete next steps when applicable
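As an illustration of how an interpretation could feed response generation, the sketch below turns a structured interpretation into a prompt that reflects the four goals listed above. The `VoiceInterpretation` fields and the `buildResponsePrompt` function are hypothetical; the real interpretation structure is not shown on this page.

```typescript
// Hypothetical shape: field names are assumptions made for this sketch.
interface VoiceInterpretation {
  intent: string; // what the speaker wants, in one sentence
  requirements: string[]; // constraints or key requirements that were mentioned
  expectedFormat: "code" | "explanation" | "steps";
}

// Builds a response prompt that restates the request (relevance), lists every
// requirement (completeness), asks for Markdown (clarity), and asks for next
// steps (actionability).
function buildResponsePrompt(
  transcript: string,
  interpretation: VoiceInterpretation
): string {
  const requirementLines = interpretation.requirements
    .map((requirement) => `- ${requirement}`)
    .join("\n");
  return [
    `The user said: "${transcript}"`,
    `Their request: ${interpretation.intent}`,
    "Address every requirement below:",
    requirementLines || "- (none stated)",
    `Answer as ${interpretation.expectedFormat}, formatted in Markdown.`,
    "End with concrete next steps when they apply.",
  ].join("\n");
}
```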
Audio file processing
Upload and process
You can also process existing audio files.
Processing workflow
When an audio file is in the queue, it is processed as audio (see electron/ProcessingHelper.ts:69).
Use cases
Code explanations
Record yourself describing a coding problem and get solutions
Quick notes
Capture ideas verbally and get structured summaries
Learning assistance
Ask questions out loud and receive detailed explanations
Debugging help
Describe an error verbally and get troubleshooting steps
Best practices
Record in a quiet environment
Background noise can reduce transcription accuracy:
- Use headphones with a microphone for better isolation
- Close windows and doors to minimize ambient sound
- Turn off fans or noisy equipment
Speak clearly and naturally
Optimal speaking technique:
- Use your normal speaking voice (don’t whisper or shout)
- Speak at a moderate pace
- Pronounce technical terms carefully
- Pause between sentences
Structure your request
For best results, organize your spoken input:
- State the problem clearly
- Mention any constraints or requirements
- Specify the expected format (code, explanation, steps)
Keep recordings focused
- Stay on topic for each recording
- Keep recordings under 1-2 minutes when possible
- Record separate clips for unrelated questions
Technical details
Audio encoding
Audio is encoded as base64 for transmission.
Model requirements
Audio processing requires Gemini API access. Even if you’re using Ollama or OpenRouter for text/vision, voice features always use Gemini because it is the only supported provider with audio understanding capabilities.
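The encoding step and the provider rule above can be sketched together. This is a minimal illustration assuming a Node.js (Electron main process) environment; the `Provider` and `Task` names and all three functions are hypothetical, and only the routing rule and the base64 encoding come from the text.

```typescript
import { Buffer } from "node:buffer";

// Hypothetical names for illustration; not Cluely's actual types.
type Provider = "gemini" | "ollama" | "openrouter";
type Task = "text" | "vision" | "audio";

// Audio always goes to Gemini; other tasks use whichever provider is configured.
function providerFor(task: Task, configured: Provider): Provider {
  return task === "audio" ? "gemini" : configured;
}

// Encodes raw audio bytes as base64 text for transmission.
function encodeAudioBase64(bytes: Uint8Array): string {
  return Buffer.from(bytes).toString("base64");
}

// Audio requests cannot proceed without a Gemini key, whatever provider is
// configured for text and vision.
function requireGeminiKey(apiKey: string | undefined): string {
  if (!apiKey) {
    throw new Error("Gemini API key required");
  }
  return apiKey;
}
```

Keeping the routing rule in one function means a user configured for Ollama or OpenRouter still reaches Gemini for voice, and a missing key fails early with a clear message.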
IPC handlers
The main process exposes audio-related IPC handlers.
Troubleshooting
Transcription returns empty text
Causes:
- Audio file is corrupted or empty
- Audio format not supported
- API key issue
Solutions:
- Verify the audio file plays correctly
- Check that GEMINI_API_KEY is set
- Try re-recording with better quality
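Some of these checks can be automated before a file is sent for transcription. The sketch below is a hypothetical pre-flight helper (the name `diagnoseEmptyTranscription` and its input shape are assumptions); it can detect an empty file, an unsupported extension, or a missing key, but it cannot tell whether the audio is corrupted or plays correctly.

```typescript
// Hypothetical input shape for this sketch.
interface AudioCheckInput {
  filePath: string;
  fileSizeBytes: number;
  apiKey: string | undefined; // for example, the value of GEMINI_API_KEY
}

// Returns human-readable problems; an empty array means these basic checks
// passed. Corruption and playback quality are not checked here.
function diagnoseEmptyTranscription(input: AudioCheckInput): string[] {
  const problems: string[] = [];
  if (input.fileSizeBytes <= 0) {
    problems.push("Audio file is empty");
  }
  const supported =
    input.filePath.endsWith(".mp3") || input.filePath.endsWith(".wav");
  if (!supported) {
    problems.push("Audio format not supported");
  }
  if (!input.apiKey) {
    problems.push("GEMINI_API_KEY is not set");
  }
  return problems;
}
```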
Error: Gemini API key required
Cause: No Gemini API key is configured.
Solution: Add GEMINI_API_KEY to your .env file.
Poor transcription quality
Improvements:
- Record in a quieter environment
- Use an external microphone instead of built-in
- Speak more slowly and clearly
- Ensure audio levels aren’t too low or distorted