Overview
Audio transcription enables:
- Automatic speech-to-text - Convert audio recordings to searchable text
- Multiple implementations - Choose local or cloud-based processing
- Multi-language support - Process audio in many languages
- Indexed results - Transcriptions added to full-text search index
- Quality scoring - Word-level confidence scores (implementation dependent)
Available Implementations
IPED supports multiple transcription engines:
Local Processing
Vosk (Default)
Best for: Quick setup, CPU-only systems
- Runs entirely on CPU
- No external dependencies
- Included models: English, Portuguese (Brazil)
- Medium accuracy
- Fast processing
Wav2Vec2
Best for: High accuracy with GPU
- GPU highly recommended (10x faster)
- Better accuracy than Vosk
- HuggingFace model support
- Requires additional setup
Whisper
Best for: Best accuracy, GPU required
- Highest accuracy available
- Multiple model sizes (tiny to large-v3)
- GPU strongly recommended
- Multilingual support
- 4x slower than Wav2Vec2
Remote Service
Best for: Distributed processing
- Offload processing to remote server
- Share GPU resources across nodes
- Network-based communication
- Centralized resource management
Cloud Services
Microsoft Azure
Best for: Enterprise deployments, high volume
- Azure subscription and API key
- Microsoft Speech SDK JAR in plugins folder
- Pass subscription key: -XazureSubscriptionKey=XXXXXXXX
Google Cloud Speech
Best for: Advanced features, multiple languages
- Google Cloud account and credentials
- Google Cloud Speech JAR with dependencies
- Environment variable: GOOGLE_APPLICATION_CREDENTIALS
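Google client libraries conventionally read GOOGLE_APPLICATION_CREDENTIALS as the path to a service-account JSON key file. A small preflight check, run before starting a long processing job, might look like the sketch below. The function name check_google_credentials is hypothetical and not part of IPED.

```python
import os
from pathlib import Path


def check_google_credentials(env=None):
    """Return the credentials path if the variable points to an existing file.

    Raises RuntimeError with a descriptive message otherwise.
    """
    env = os.environ if env is None else env
    value = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not value:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    path = Path(value)
    if not path.is_file():
        raise RuntimeError(f"Credentials file not found: {path}")
    return path
```

Calling check_google_credentials() with no arguments inspects the current process environment. Passing a dict makes the check easy to exercise without touching real settings.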
Configuration
Audio transcription is configured in AudioTranscriptConfig.txt.
Implementation-Specific Options
Vosk Configuration
Wav2Vec2 Configuration
Whisper Configuration
Remote Service Configuration
Azure Configuration
Google Cloud Configuration
Supported Audio Formats
IPED transcribes common audio formats:
- 3GP/3G2 - Mobile recordings
- AAC - Advanced Audio Coding
- AIFF - Audio Interchange File Format
- AMR - Adaptive Multi-Rate codec (mobile)
- MP4 Audio - MPEG-4 audio tracks
- OGG Vorbis/Opus - Open audio formats
- WAV - Waveform Audio File Format
- WMA - Windows Media Audio
- CAF - Core Audio Format
- iLBC - Internet Low Bitrate Codec
To also transcribe the audio tracks of video files:
- Enable video processing by adding video MIME types to mimesToProcess
- Update convertCommand to extract audio from video
Audio Preprocessing
All audio is converted to a standard format (16 kHz mono WAV) before transcription:
- Speech recognition optimized for 16 kHz
- Mono sufficient for speech
- Reduces processing time
- Smaller temporary files
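The conversion step can be illustrated with a short sketch. It assumes ffmpeg is installed and on the PATH. It is not IPED's actual convertCommand, which may use a different tool or different options. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_convert_command(src, dst, sample_rate=16000):
    """Build an ffmpeg command that writes mono 16-bit PCM WAV at sample_rate."""
    return [
        "ffmpeg",
        "-y",                     # overwrite dst if it exists
        "-i", str(src),           # input file
        "-vn",                    # ignore any video stream
        "-ac", "1",               # downmix to mono
        "-ar", str(sample_rate),  # resample (16 kHz by default)
        "-c:a", "pcm_s16le",      # 16-bit little-endian PCM
        str(dst),
    ]


def convert_to_wav(src, dst):
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_convert_command(src, dst), check=True, capture_output=True)
    return Path(dst)
```

Keeping command construction separate from execution makes it possible to inspect or log the exact command when a conversion fails.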
Language Detection
Auto Mode
In auto mode, the language is taken from the locale set in LocalConfig.txt:
- Automatically matches case locale
- Consistent with UI language
- No manual configuration needed
Explicit Languages
Specify one or more languages:
- English (en, en-US, en-GB)
- Portuguese (pt, pt-BR, pt-PT)
- Spanish (es, es-ES, es-MX)
- French (fr, fr-FR, fr-CA)
- German (de, de-DE)
- Italian (it, it-IT)
- Russian (ru, ru-RU)
- Chinese (zh, zh-CN)
- And many more…
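The two modes can be summarized in a short sketch: auto falls back to the case locale, while an explicit setting yields a list of language codes. The function name resolve_languages and the ";" separator are assumptions made for illustration. Check the comments in AudioTranscriptConfig.txt for the syntax IPED actually expects.

```python
def resolve_languages(setting, case_locale, separator=";"):
    """Turn a language setting into an ordered list of language codes.

    'auto' (any case) falls back to the case locale; otherwise the setting
    is split on the separator, trimmed, and de-duplicated in order.
    """
    if setting.strip().lower() == "auto":
        return [case_locale]
    codes = []
    for code in setting.split(separator):
        code = code.strip()
        if code and code not in codes:
            codes.append(code)
    if not codes:
        raise ValueError("no language specified")
    return codes
```

For example, resolve_languages("auto", "pt-BR") yields a single-entry list with the case locale, and an explicit setting such as "en-US; pt-BR" yields both codes in the order given.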
Processing Flow
Per-Item Processing
- Filter items - Check MIME type and known status
- Convert audio - Standardize to 16kHz mono WAV
- Transcribe - Send to selected implementation
- Store results - Add to item extra attributes
- Index text - Make searchable in Lucene index
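The five steps above can be sketched as a single function. This is an illustration of the flow, not IPED's implementation. Item, process_item, and the attribute names "transcription" and "confidence" are hypothetical, and the conversion, transcription, and indexing steps are passed in as callables.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    name: str
    mime: str
    known: bool = False
    extra: dict = field(default_factory=dict)


def process_item(item, mimes_to_process, skip_known, convert, transcribe, index):
    """Run the five steps for one item; return True if it was transcribed."""
    # 1. Filter: only configured MIME types, optionally skipping known files.
    if item.mime not in mimes_to_process or (skip_known and item.known):
        return False
    # 2. Convert to the standard format (16 kHz mono WAV).
    wav = convert(item)
    # 3. Transcribe with whichever implementation was selected.
    text, confidence = transcribe(wav)
    # 4. Store results as extra attributes on the item.
    item.extra["transcription"] = text
    item.extra["confidence"] = confidence
    # 5. Hand the text to the indexer so it becomes searchable.
    index(item, text)
    return True
```

Because filtering happens first, items that are skipped never reach the expensive conversion and transcription steps.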
Transcription Results
Transcriptions are stored as item attributes and are available for:
- Full-text search
- Keyword highlighting
- Export in reports
- Timeline correlation
Performance Comparison
| Implementation | Speed (CPU) | Speed (GPU) | Accuracy | Setup |
|---|---|---|---|---|
| Vosk | Fast | N/A | Medium | Easy |
| Wav2Vec2 | Slow | Fast | High | Medium |
| Whisper | Very Slow | Medium | Highest | Medium |
| Azure | Fast | N/A | High | Easy |
| Google | Fast | N/A | High | Easy |
| Remote | Depends on server | - | Varies | Hard |
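As a rough aid for reading the table, the sketch below encodes its rows as data and picks the most accurate engine that fits two constraints. It is a simplification under stated assumptions: slow_on_cpu marks engines whose CPU speed is rated Slow or Very Slow (they still run on CPU), the accuracy ranks are just the table's Medium/High/Highest ordering, and the Remote row is left out because its values depend on the server. Engine and pick_engine are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Engine:
    name: str
    local: bool        # True if audio is processed on the local machine
    slow_on_cpu: bool  # True if the CPU column says Slow or Very Slow
    accuracy: int      # 1 = Medium, 2 = High, 3 = Highest


ENGINES = [
    Engine("Vosk", local=True, slow_on_cpu=False, accuracy=1),
    Engine("Wav2Vec2", local=True, slow_on_cpu=True, accuracy=2),
    Engine("Whisper", local=True, slow_on_cpu=True, accuracy=3),
    Engine("Azure", local=False, slow_on_cpu=False, accuracy=2),
    Engine("Google", local=False, slow_on_cpu=False, accuracy=2),
]


def pick_engine(has_gpu, allow_cloud):
    """Return the name of the most accurate engine that fits the constraints."""
    candidates = [
        e for e in ENGINES
        if (has_gpu or not e.slow_on_cpu) and (allow_cloud or e.local)
    ]
    # Vosk always qualifies, so the candidate list is never empty.
    return max(candidates, key=lambda e: e.accuracy).name
```

Ties are resolved by table order, so with no GPU and cloud services allowed the sketch returns the first of the two cloud engines.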
Use Cases
Call Recording Analysis
- Transcribe intercepted phone calls
- Search for keywords and phrases
- Identify speakers and topics
- Generate call summaries
Voice Message Processing
- WhatsApp/Telegram voice messages
- Social media audio posts
- Voicemail recordings
Interview Transcription
- Police interviews
- Witness statements
- Suspect interrogations
- Expert depositions
OSINT Audio
- Podcast monitoring
- Social media audio
- Public speeches
- News broadcasts
Quality Optimization
Improve Accuracy
- Use appropriate model - Match audio characteristics
  - phone_call for telephone recordings
  - video for video audio tracks
  - latest_long for long-form content
- Select correct language - Wrong language = poor results
- Use better implementation
  - Vosk → Wav2Vec2 → Whisper (increasing accuracy)
- Audio quality matters
  - Clear audio = better transcription
  - Reduce background noise
  - Avoid multiple speakers talking simultaneously
Improve Speed
- Use GPU - 10-20x speedup for Wav2Vec2/Whisper
- Batch processing - Increase Whisper batchSize on GPU
- Faster models - Whisper tiny/base vs. large
- Distributed processing - Remote service on multiple servers
- Filter scope - Use skipKnownFiles and mimesToProcess
Troubleshooting
No Transcription Generated
- Verify audio format in mimesToProcess
- Check audio file is not corrupted
- Review conversion command works
- Confirm implementation properly initialized
Low Accuracy
- Verify correct language selected
- Check audio quality (noise, clarity)
- Try better implementation (Whisper)
- Review speaker clarity and accent
Performance Issues
- Reduce concurrent processes
- Use GPU for Wav2Vec2/Whisper
- Try faster model (Whisper base vs. large)
- Enable skipKnownFiles
Memory Errors
- Reduce Whisper batchSize
- Use int8 precision instead of float32
- Process fewer files concurrently
- Use smaller Whisper model
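The first remedy, reducing the batch size, can also be automated in scripts that drive a transcription backend directly. The sketch below is a generic back-off pattern, not a description of how IPED applies batchSize. transcribe_batch is a hypothetical callable, and MemoryError stands in for whatever out-of-memory exception the chosen framework raises.

```python
def transcribe_with_backoff(transcribe_batch, files, batch_size, min_batch_size=1):
    """Transcribe files in batches, halving the batch size after a MemoryError."""
    results = []
    start = 0
    while start < len(files):
        batch = files[start:start + batch_size]
        try:
            results.extend(transcribe_batch(batch))
        except MemoryError:
            if batch_size <= min_batch_size:
                raise  # cannot shrink further; surface the error
            batch_size = max(min_batch_size, batch_size // 2)
            continue  # retry from the same position with a smaller batch
        start += len(batch)
    return results
```

The reduced batch size is kept for the rest of the run, on the assumption that a size that failed once is likely to fail again.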
Security Considerations
Cloud Services
- Audio uploaded to third-party servers
- Review legal/privacy requirements
- Consider data sovereignty laws
- Use encryption in transit
- Clear audit trails
Local Processing
- All data stays on premises
- No external network calls
- Suitable for classified material
- Full control over data