Introduction
Google Cloud provides comprehensive audio AI capabilities powered by state-of-the-art models for speech recognition, text-to-speech synthesis, and music generation. These services enable you to build sophisticated audio applications with natural-sounding voices, accurate transcription, and high-fidelity music generation.

Chirp (Universal Speech Model)
Chirp is Google’s Universal Speech Model (USM) that powers both speech recognition and text-to-speech capabilities on Google Cloud.

Speech-to-Text with Chirp 3
Chirp 3 is the latest speech recognition model, offering:
- Multilingual support: Transcribe audio in multiple languages with high accuracy
- Language-agnostic transcription: Automatically detect and transcribe the dominant language
- Speaker diarization: Identify different speakers in audio conversations
- Streaming recognition: Real-time transcription of audio streams
- Batch processing: Transcribe longer audio files stored in Cloud Storage
Text-to-Speech with Chirp 3 HD Voices
Chirp 3 HD Voices deliver natural-sounding speech synthesis powered by large language models:
- High-fidelity audio: Studio-quality voice output
- Natural expressiveness: Human-like intonation, pauses, and emotional nuance
- Multiple voice options: 8 distinct voices (4 male, 4 female)
- 31 languages: Broad language support for global applications
- Streaming synthesis: Generate speech in real-time
Chirp models are available in specific regions. Check the Speech-to-Text regional availability and Text-to-Speech endpoints documentation for details.
Lyria 2 Music Generation
Lyria 2 is Google’s latest music generation model available on Vertex AI, capable of creating high-fidelity audio tracks across various genres.

Key Capabilities
- Genre diversity: Generate music across classical, electronic, rock, jazz, hip hop, pop, and more
- Style control: Create cinematic, ambient, lo-fi, and other stylistic variations
- Mood and emotion: Fine-tune the emotional tone of generated music
- Tempo and instrumentation: Specify tempo, instruments, and musical characteristics
- High-quality output: 30-second WAV audio at 48kHz sample rate
Use Cases
Voice Assistants
Create conversational AI with natural speech recognition and synthesis
Audiobooks
Generate expressive narration with Chirp HD voices
Customer Service
Build IVR systems with speech-to-text and text-to-speech
Media Production
Generate background music and soundtracks with Lyria 2
Accessibility
Create audio descriptions and transcription services
Language Learning
Build pronunciation practice and transcription tools
Getting Started
Enable the APIs
Enable the Speech-to-Text API, Text-to-Speech API, and Vertex AI API in your Google Cloud project.
Try your first request
Start with speech recognition or text-to-speech synthesis. See the Speech Recognition guide for detailed examples.
API Comparison
| Feature | Speech-to-Text (Chirp 3) | Text-to-Speech (Chirp 3 HD) | Music Generation (Lyria 2) |
|---|---|---|---|
| Primary Use | Audio to text transcription | Text to speech synthesis | Music generation from prompts |
| Input Format | Audio files, streams | Text strings | Text prompts |
| Output Format | JSON with transcription | Audio (MP3, WAV, LINEAR16) | WAV audio (48kHz) |
| Real-time Support | Yes (streaming) | Yes (streaming) | No (30-second clips) |
| Language Support | 100+ languages | 31 languages | Language-agnostic |
| Key Features | Diarization, auto-language detection | Natural intonation, HD voices | Genre control, mood tuning |
Code Example: Speech Recognition
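A minimal sketch of a Speech-to-Text v2 request body for Chirp 3, using only the Python standard library. The model identifier `chirp_3`, the endpoint URL, and the JSON field names are assumptions based on the v2 REST API; verify them against the Speech-to-Text documentation before use.

```python
import base64
import json

# Assumption: Speech-to-Text v2 REST endpoint; substitute your project and region.
STT_URL = ("https://speech.googleapis.com/v2/projects/{project}"
           "/locations/{location}/recognizers/_:recognize")

def build_chirp3_request(audio_bytes: bytes, language_codes=("auto",)) -> dict:
    """Build a JSON body for a Chirp 3 recognize call (model id is an assumption)."""
    return {
        "config": {
            "model": "chirp_3",                     # assumed Chirp 3 model identifier
            "languageCodes": list(language_codes),  # ["auto"] = detect the language
            "autoDecodingConfig": {},               # let the service detect the encoding
        },
        "content": base64.b64encode(audio_bytes).decode("ascii"),
    }

body = build_chirp3_request(b"\x00\x01", language_codes=["auto"])
print(json.dumps(body["config"], indent=2))
```

Send the body with an authenticated POST (for example via the google-auth library); the response contains a `results` list with a `transcript` in each segment's `alternatives`.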
Code Example: Text-to-Speech
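A hedged sketch of a `text:synthesize` request body for a Chirp 3 HD voice, again stdlib-only. The voice name `en-US-Chirp3-HD-Aoede` is an assumption; check the Text-to-Speech voice list for the exact names of the eight Chirp 3 HD voices.

```python
import json

# Assumption: Text-to-Speech v1 REST endpoint for synthesis.
TTS_URL = "https://texttospeech.googleapis.com/v1/text:synthesize"

def build_tts_request(text: str,
                      voice_name: str = "en-US-Chirp3-HD-Aoede",
                      encoding: str = "LINEAR16") -> dict:
    """Build a JSON body for a Chirp 3 HD synthesis call (voice name is an assumption)."""
    return {
        "input": {"text": text},
        "voice": {
            # Derive the language code from the voice name prefix, e.g. "en-US".
            "languageCode": "-".join(voice_name.split("-")[:2]),
            "name": voice_name,
        },
        "audioConfig": {"audioEncoding": encoding},
    }

body = build_tts_request("Hello from Chirp 3 HD voices.")
print(json.dumps(body, indent=2))
```

The response returns base64-encoded `audioContent`; decode it and write the bytes to a file in the requested encoding.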
Code Example: Music Generation
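A hedged sketch of a Vertex AI predict request for Lyria 2. The model id `lyria-002` and the `prompt`, `negative_prompt`, and `sample_count` field names are assumptions to verify against the Vertex AI music generation documentation.

```python
import json

# Assumption: Vertex AI predict endpoint and Lyria model id; verify in the docs.
LYRIA_URL = ("https://{location}-aiplatform.googleapis.com/v1/projects/{project}"
             "/locations/{location}/publishers/google/models/lyria-002:predict")

def build_lyria_request(prompt: str,
                        negative_prompt: str = "",
                        sample_count: int = 1) -> dict:
    """Build a predict body for Lyria 2 (field names are assumptions)."""
    instance = {"prompt": prompt}
    if negative_prompt:
        instance["negative_prompt"] = negative_prompt  # characteristics to exclude
    return {"instances": [instance], "parameters": {"sample_count": sample_count}}

body = build_lyria_request(
    "Uplifting cinematic orchestral track, 90 BPM, strings and brass",
    negative_prompt="vocals, distortion",
)
print(json.dumps(body, indent=2))
```

Per the capabilities above, each prediction returns roughly 30 seconds of 48 kHz WAV audio, delivered base64-encoded in the response.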
Best Practices
Speech Recognition
- Use batch recognition for audio files longer than 1 minute
- Enable speaker diarization when you need to identify multiple speakers
- Set `language_codes=["auto"]` for automatic language detection
- Use streaming recognition for real-time applications like voice assistants
Text-to-Speech
- Select appropriate voice variants based on your use case (formal vs. conversational)
- Use SSML tags for fine-grained control over pronunciation and pacing
- Enable streaming synthesis to reduce latency in real-time applications
- Consider audio encoding formats based on bandwidth and quality requirements
Music Generation
- Be specific in prompts: Include genre, tempo, instruments, and mood
- Use negative prompts to exclude unwanted characteristics
- Generate multiple samples and select the best result
- Experiment with different style descriptors for varied outputs
Resources
- Speech-to-Text Documentation
- Text-to-Speech Documentation
- Vertex AI Music Generation
- Chirp 3 Model Details
- Chirp 3 HD Voices
- GitHub Samples Repository
Next Steps
Speech Recognition
Learn how to transcribe audio with Chirp 3
Pricing Information
View pricing for audio APIs