Overview
The Zero-Shot model (omniASR_LLM_7B_ZS) enables transcription of unseen languages through in-context learning. By providing 1-10 audio-text example pairs in the target language, the model picks up the language’s patterns on the fly, without fine-tuning or retraining.
This model is particularly valuable for low-resource languages where labeled training data is scarce or unavailable. Simply provide a few example transcriptions to enable accurate speech recognition.
Model Specifications
| Specification | Value |
|---|---|
| Model Name | omniASR_LLM_7B_ZS |
| Parameters | 7,810,900,608 |
| Download Size | 30.0 GiB (FP32) |
| Inference VRAM | ~20 GiB (BF16, batch=1, 30s audio + context) |
| Speed | ~0.5x real-time (RTF: 0.194) |
| Max Audio Length | 60 seconds (40s recommended, 30s for context examples) |
| Context Examples | 1-10 required (internally normalized to 10) |
| Vocabulary Size | 9,812 tokens |
| Tokenizer | omniASR_tokenizer_v1 |
Architecture
The zero-shot model extends the LLM architecture with context example processing.
Key Differences from Standard LLM
- Context Slots: Exactly 10 context example slots (filled via repetition if fewer than 10 provided)
- Input Grammar: Special token structure for interleaving audio/text context pairs
- Training: Exposed to diverse few-shot scenarios during training
- Validation: Enforces exactly 10 filled context slots at inference (after normalization)
In-Context Learning
How It Works
1. Context Processing: Each audio-text pair is encoded separately
2. Pattern Recognition: The model identifies phoneme-grapheme mappings from the examples
3. Application: The learned patterns are applied to the target audio
4. Generation: The transcription is generated using autoregressive decoding
From the context examples, the model picks up:
- Phonetic patterns: Sound-to-symbol mappings
- Orthographic conventions: Writing system rules
- Language structure: Basic grammatical patterns
- Domain specifics: Vocabulary and terminology from examples
Example: New Language Transcription
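As an illustration, a context set for a previously unseen language is just a handful of verified audio-text pairs. Everything below is a placeholder: the paths, the `audio`/`text` keys, and the transcriptions are invented for the sketch, not the library's actual schema.

```python
# Hypothetical context set for a low-resource language.
# Each entry pairs a short recording with its verified transcription.
context_examples = [
    {"audio": "new_lang/utt_001.wav", "text": "first verified transcription"},
    {"audio": "new_lang/utt_002.wav", "text": "second verified transcription"},
    {"audio": "new_lang/utt_003.wav", "text": "third verified transcription"},
]

# The pipeline call itself would then take the target audio plus this list,
# e.g. via transcribe_with_context() (see Usage below).
assert 1 <= len(context_examples) <= 10  # the documented count requirement
```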
Context Examples
Context Example Requirements
- Count: 1-10 examples (internally normalized to exactly 10)
- Audio Length: Up to 30 seconds per example (recommended)
- Quality: Clear audio with accurate transcriptions
- Diversity: Varied vocabulary and phonetic patterns
- Consistency: Same language/dialect across all examples
- Script: Consistent writing system (e.g., all Latin, all Cyrillic)
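The requirements above can be checked before inference. This is a minimal pre-flight sketch, not part of the library: the `audio_s`/`text` keys are assumptions, and the script detector is deliberately crude (Latin vs. Cyrillic by majority vote).

```python
def dominant_script(text):
    """Crude script detector: Latin vs Cyrillic vs other, by majority vote."""
    counts = {"latin": 0, "cyrillic": 0, "other": 0}
    for ch in text:
        if ch.isalpha():
            cp = ord(ch)
            if cp <= 0x024F:                 # Basic Latin + Latin Extended
                counts["latin"] += 1
            elif 0x0400 <= cp <= 0x04FF:     # Cyrillic block
                counts["cyrillic"] += 1
            else:
                counts["other"] += 1
    return max(counts, key=counts.get)

def validate_context_examples(examples):
    """Raise ValueError if the examples violate the documented limits."""
    if not 1 <= len(examples) <= 10:
        raise ValueError("Provide between 1 and 10 context examples")
    for i, ex in enumerate(examples):
        if ex["audio_s"] > 30:
            raise ValueError(f"Example {i}: audio exceeds the recommended 30 s")
        if not ex["text"].strip():
            raise ValueError(f"Example {i}: empty transcription")
    # Consistency check: all texts should share one dominant writing system.
    scripts = {dominant_script(ex["text"]) for ex in examples}
    if len(scripts) > 1:
        raise ValueError(f"Mixed writing systems: {sorted(scripts)}")
```

For example, mixing a Latin-script and a Cyrillic-script transcription in one context set would raise a `ValueError`.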
Example Repetition Logic
If fewer than 10 examples are provided, they are repeated sequentially to fill all 10 slots.
Best Practice: Provide 5-10 unique examples for optimal performance. More diverse examples generally lead to better transcription quality.
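The repetition logic can be sketched as cycling the provided examples until all 10 slots are filled. The real implementation may differ; this mirrors the behavior described above.

```python
from itertools import cycle, islice

NUM_CONTEXT_SLOTS = 10  # the model always uses exactly 10 slots

def fill_context_slots(examples):
    """Repeat the given examples sequentially to fill all 10 context slots."""
    if not 1 <= len(examples) <= NUM_CONTEXT_SLOTS:
        raise ValueError("Provide 1-10 context examples")
    return list(islice(cycle(examples), NUM_CONTEXT_SLOTS))
```

With 3 examples `["a", "b", "c"]`, the filled slots come out as `["a", "b", "c", "a", "b", "c", "a", "b", "c", "a"]`.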
Example Quality Guidelines
Good Context Examples
- Clear, noise-free audio
- Accurate, verified transcriptions
- Diverse vocabulary coverage
- Natural speech patterns
- Consistent speaker quality
- Representative of target domain
Avoid
- Noisy or low-quality audio
- Transcription errors
- Repetitive content
- Mixed languages/dialects
- Inconsistent writing systems
- Very short samples (under 5s)
Usage
Basic Zero-Shot Transcription
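A hedged sketch of basic usage follows. Only `transcribe_with_context()` is named in this document; the pipeline class, its constructor, and the context schema keys are assumptions, so adjust them to match your installation.

```python
# from omniasr import ASRPipeline  # hypothetical import path

def build_context(pairs):
    """Turn (audio_path, transcription) tuples into context-example dicts.

    The "audio"/"text" keys are illustrative, not the library's actual schema.
    """
    return [{"audio": a, "text": t} for a, t in pairs]

context = build_context([
    ("examples/utt01.wav", "first example transcription"),
    ("examples/utt02.wav", "second example transcription"),
    ("examples/utt03.wav", "third example transcription"),
])

# pipeline = ASRPipeline(model="omniASR_LLM_7B_ZS", dtype="bfloat16")
# hypothesis = pipeline.transcribe_with_context("target.wav", context=context)
# print(hypothesis)
```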
Batch Processing with Different Contexts
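When each target file needs its own context (e.g., different languages per file), the files can be processed one at a time, since batch_size=1 is the recommendation above. This helper is a sketch; `transcribe_fn` stands in for the pipeline's `transcribe_with_context()` and all names are illustrative.

```python
def transcribe_batch(targets, transcribe_fn):
    """Transcribe (audio_path, context_examples) pairs one file at a time.

    transcribe_fn(audio_path, context) -> hypothesis string.
    """
    results = {}
    for audio_path, context in targets:
        results[audio_path] = transcribe_fn(audio_path, context)
    return results
```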
Using Shared Context for Multiple Files
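When many target files are in the same language, one context list can be built once and reused for every file, avoiding re-reading the example audio per call. Again a sketch with illustrative names; `transcribe_fn` stands in for `transcribe_with_context()`.

```python
def transcribe_with_shared_context(audio_files, shared_context, transcribe_fn):
    """Apply one shared context list to every target file in turn."""
    return [transcribe_fn(f, shared_context) for f in audio_files]
```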
Multiple Input Formats
Input Format Specification
The zero-shot model uses a specialized batch format.
Performance Characteristics
Speed
- RTF: 0.194 (~0.5x real-time)
- Slowdown: ~2x slower than standard LLM models due to context processing
- Recommended Batch Size: 1 (large memory footprint with 10 context examples)
Memory Usage
Zero-shot model requires more VRAM than standard LLM models due to processing 10 context examples alongside target audio.
Accuracy Factors
| Factor | Impact on Accuracy |
|---|---|
| Number of examples (1 vs 10) | High - more examples generally better |
| Example quality | High - clear audio and accurate text crucial |
| Example diversity | Medium - varied content helps generalization |
| Example length | Low - 10-30s recommended, diminishing returns beyond |
| Language similarity to training | Medium - closer languages may perform better |
Limitations
Zero-Shot Model Limitations
- VRAM Requirements: ~20 GiB for BF16 inference (higher than standard models)
- Speed: ~2x slower than standard LLM models
- Audio Length: 60 seconds max (40s recommended; 30s for context examples)
- Batch Size: Limited to 1-2 due to memory constraints
- Context Requirement: Must provide context examples (cannot use without)
- Fixed Slots: Always uses exactly 10 context slots (may repeat examples)
- No Streaming: Current implementation does not support streaming
Use Cases
Low-Resource Languages
Transcribe languages with limited or no existing ASR support by providing just a few example pairs.
Domain Adaptation
Adapt to specialized vocabulary (medical, legal, technical) using domain-specific examples.
Dialectal Variation
Handle regional dialects by providing examples in the specific dialect.
New Writing Systems
Support languages with unique scripts by demonstrating phoneme-grapheme mappings.
Quick Prototyping
Rapidly test ASR capabilities for new languages without training infrastructure.
Research Applications
Investigate cross-lingual transfer and few-shot learning in speech recognition.
Best Practices
Optimal Number of Examples
Provide 5-10 unique examples; accuracy generally improves with more examples, with diminishing returns as the 10-slot limit is approached.
Example Selection Strategy
- Phonetic Coverage: Include examples with diverse phonemes
- Vocabulary Diversity: Use different words and phrases
- Natural Speech: Prefer conversational over read speech
- Clear Audio: Ensure high-quality recordings
- Accurate Transcriptions: Verify all text is correct
- Consistent Style: Maintain uniform transcription conventions
Memory Optimization
Keep batch size at 1, keep context examples short (under ~20 seconds each), and clear the CUDA cache between runs with torch.cuda.empty_cache() to stay within the ~20 GiB budget.
Comparison with Other Models
| Feature | Zero-Shot (7B ZS) | Standard LLM (7B v2) | CTC (7B v2) |
|---|---|---|---|
| Context Examples | Required (1-10) | Not supported | Not supported |
| Language Conditioning | Via examples | Optional lang ID | None |
| Unseen Languages | Yes | Limited | No |
| Speed (RTF) | 0.194 (~0.5x) | 0.092 (~1x) | 0.006 (16x) |
| VRAM | ~20 GiB | ~17 GiB | ~15 GiB |
| Max Audio Length | 60s | 40s | 40s |
| Use Case | New languages | Known languages | High throughput |
Advanced Usage
Programmatic Context Generation
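Context examples can also be gathered programmatically, for instance from a directory of paired files (`utt.wav` alongside `utt.txt`). The layout, naming convention, and dict keys here are assumptions for the sketch.

```python
from pathlib import Path

def gather_context(directory, max_examples=10):
    """Collect up to max_examples (audio, text) pairs from a directory.

    Expects each utt.wav to sit next to a utt.txt transcription; .wav files
    without a matching .txt are skipped.
    """
    examples = []
    for wav in sorted(Path(directory).glob("*.wav")):
        txt = wav.with_suffix(".txt")
        if txt.exists():
            examples.append({"audio": str(wav), "text": txt.read_text().strip()})
        if len(examples) == max_examples:
            break
    return examples
```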
Evaluation with Context
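To evaluate zero-shot output against held-out references, word error rate (WER) is the standard metric: Levenshtein edit distance over words, divided by reference length. The pipeline call is omitted here; feed in hypotheses from `transcribe_with_context()`.

```python
def wer(reference, hypothesis):
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / max(len(ref), 1)
```

For example, `wer("the cat sat", "the bat sat")` gives one substitution over three reference words, i.e. about 0.33.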
Troubleshooting
Error: Must use .transcribe_with_context()
The zero-shot model does not support .transcribe(). Always use .transcribe_with_context() with at least one context example.
Out of Memory Error
Zero-shot model requires ~20 GiB VRAM. Try:
- Reduce batch size to 1
- Use shorter context examples (under 20s each)
- Clear CUDA cache: torch.cuda.empty_cache()
- Use gradient checkpointing (if training)
Poor Transcription Quality
Improve accuracy by:
- Providing more context examples (5-10 recommended)
- Ensuring context examples have accurate transcriptions
- Using clear, high-quality audio for context
- Matching context examples to target domain/dialect
- Verifying consistent writing system across all examples
Slow Inference Speed
Zero-shot model is inherently slower due to context processing. To optimize:
- Use batch_size=1 (higher batches don’t help due to memory)
- Consider standard LLM models if language is supported
- Process context examples once, cache embeddings (advanced)
Next Steps
LLM Models
Explore standard LLM models with language conditioning
Model Specifications
Compare all model variants with detailed specs
Inference Guide
Complete guide to transcription workflows
Supported Languages
View the complete list of 1600+ languages