Overview
Omnilingual ASR uses fairseq2’s asset management system to organize and load models, tokenizers, and datasets. Assets are defined in YAML files called “asset cards” that specify how to locate and configure each component.
Asset Card Structure
Asset cards are YAML files that define how fairseq2 should load and configure models, tokenizers, and datasets. Multiple assets can be defined in a single file by separating them with ---.
Model Asset Cards
Model cards define the architecture, checkpoint location, and associated tokenizer.
Field Reference
name
Unique identifier used to load the asset via load_model("omniASR_CTC_300M_v2").
Type: String
Required: Yes
Example: omniASR_CTC_300M_v2
model_family
Maps to the model implementation class. Determines which model architecture will be instantiated.
Type: String
Required: Yes
Options:
- wav2vec2_asr: CTC-based ASR models
- wav2vec2_llama: LLM-based decoder models
- wav2vec2_ssl: Self-supervised learning models
Example: wav2vec2_asr
model_arch
Specific configuration variant for the model family. References architecture configs defined in the codebase.
Type: String
Required: Yes
Examples:
- For wav2vec2_asr: 300m_v2, 1b_v2, 3b_v2, 7b_v2
- For wav2vec2_llama: 300m_v2, 1b_v2, 3b_unlimited_v2
Defined in: /src/omnilingual_asr/models/wav2vec2_*/config.py
checkpoint
URI pointing to the model weights. Supports multiple formats.
Type: String
Required: Yes
Formats:
- HTTP URL: https://dl.fbaipublicfiles.com/mms/model.pt
- Local path: $HOME/.cache/models/model.pt
- HuggingFace: hg://username/model-name (for .safetensors format)
Example: https://dl.fbaipublicfiles.com/mms/omniASR-CTC-300M-v2.pt
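The three checkpoint formats can be told apart by their URI scheme. The following is a minimal sketch using only the Python standard library. The helper name classify_checkpoint_uri is hypothetical and not part of fairseq2, which resolves checkpoints through its own internal logic.

```python
import os
from urllib.parse import urlparse


def classify_checkpoint_uri(uri: str) -> tuple[str, str]:
    """Return (kind, resolved_uri) for a checkpoint field value.

    Hypothetical helper for illustration; not part of fairseq2.
    """
    scheme = urlparse(uri).scheme
    if scheme in ("http", "https"):
        return "http", uri
    if scheme == "hg":
        return "huggingface", uri
    # Anything without a recognized scheme is treated as a local path;
    # environment variables such as $HOME are expanded.
    return "local", os.path.expandvars(uri)


for value in (
    "https://dl.fbaipublicfiles.com/mms/omniASR-CTC-300M-v2.pt",
    "$HOME/.cache/models/model.pt",
    "hg://username/model-name",
):
    print(classify_checkpoint_uri(value))
```

Running a check like this over a card's checkpoint value before loading can help catch a mistyped scheme or an unset environment variable early.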
tokenizer_ref
References another asset card by name to load the associated tokenizer.
Type: String
Required: Yes (for ASR models)
Example: omniASR_tokenizer_written_v2
Tokenizer Asset Cards
Tokenizers are defined separately and referenced by models.
Tokenizer Fields
- name: Unique identifier referenced by tokenizer_ref in model cards
- tokenizer_family: Implementation type (e.g., char_tokenizer)
- tokenizer: URI to the tokenizer model file
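The link between a model card and its tokenizer card is a lookup by name. The sketch below shows that lookup with cards represented as plain Python dicts. The helper resolve_tokenizer_card is hypothetical, and the tokenizer URI is a placeholder, not a real download location.

```python
# Cards are shown as plain dicts; in practice they come from YAML files.
cards = [
    {
        "name": "omniASR_CTC_300M_v2",
        "model_family": "wav2vec2_asr",
        "model_arch": "300m_v2",
        "checkpoint": "https://dl.fbaipublicfiles.com/mms/omniASR-CTC-300M-v2.pt",
        "tokenizer_ref": "omniASR_tokenizer_written_v2",
    },
    {
        "name": "omniASR_tokenizer_written_v2",
        "tokenizer_family": "char_tokenizer",
        # Placeholder URI for illustration only.
        "tokenizer": "https://example.com/tokenizer.model",
    },
]


def resolve_tokenizer_card(model_name: str, cards: list[dict]) -> dict:
    """Follow a model card's tokenizer_ref to the tokenizer card it names."""
    by_name = {card["name"]: card for card in cards}
    model_card = by_name[model_name]
    ref = model_card["tokenizer_ref"]
    if ref not in by_name:
        raise KeyError(f"tokenizer_ref {ref!r} does not match any card name")
    return by_name[ref]


tokenizer_card = resolve_tokenizer_card("omniASR_CTC_300M_v2", cards)
print(tokenizer_card["tokenizer_family"])
```

Because the reference is matched by exact name, renaming a tokenizer card requires updating every model card that points to it.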
Dataset Asset Cards
Dataset cards specify the location and configuration for training/evaluation data.
Dataset Fields
- name: Unique identifier for loading the dataset
- dataset_family: Dataset implementation type
- dataset_config: Configuration parameters (varies by dataset family)
- data: Path to dataset directory containing parquet files
- tokenizer_ref: Reference to tokenizer asset
The dataset_config.data path should point to a directory containing partitioned parquet files organized by language and corpus.
Loading Assets in Code
Loading Models
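A minimal sketch of loading a model by its card name. The load_model call is the one shown in the field reference for name. The import path fairseq2.models.hub is an assumption about the installed fairseq2 version and should be verified against its documentation. The wrapper load_asr_model is hypothetical.

```python
def load_asr_model(name: str = "omniASR_CTC_300M_v2"):
    """Load a model by the name given in its asset card."""
    # Assumed import path; verify against your fairseq2 version.
    # Imported inside the function so this module can be imported
    # even when fairseq2 is not installed.
    from fairseq2.models.hub import load_model

    return load_model(name)
```

Calling load_asr_model() resolves the card named omniASR_CTC_300M_v2; for a card with an HTTP checkpoint, the first call may need network access to fetch the weights.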
Loading Tokenizers
Using in Training Configs
Reference assets by name in recipe YAML files.
Creating Custom Asset Cards
Adding a Custom Model
- Duplicate an existing model card
- Update the name field with a unique identifier
- Update the checkpoint field to point to your model weights
- Ensure model_arch matches your model’s architecture
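The steps above can be sketched in Python with a card represented as a dict. This is an illustration under stated assumptions: make_custom_card and the name my_CTC_300M_v2 are hypothetical, and the required-field list follows the field reference for ASR model cards.

```python
import copy

# Fields marked as required for ASR model cards in the field reference.
REQUIRED_MODEL_FIELDS = (
    "name", "model_family", "model_arch", "checkpoint", "tokenizer_ref",
)

base_card = {
    "name": "omniASR_CTC_300M_v2",
    "model_family": "wav2vec2_asr",
    "model_arch": "300m_v2",
    "checkpoint": "https://dl.fbaipublicfiles.com/mms/omniASR-CTC-300M-v2.pt",
    "tokenizer_ref": "omniASR_tokenizer_written_v2",
}


def make_custom_card(base: dict, name: str, checkpoint: str) -> dict:
    """Copy an existing model card and point it at new weights."""
    card = copy.deepcopy(base)
    card["name"] = name
    card["checkpoint"] = checkpoint
    missing = [field for field in REQUIRED_MODEL_FIELDS if not card.get(field)]
    if missing:
        raise ValueError(f"model card is missing fields: {missing}")
    return card


custom = make_custom_card(
    base_card,
    name="my_CTC_300M_v2",  # hypothetical name
    checkpoint="$HOME/.cache/models/model.pt",
)
print(custom["name"], custom["model_arch"])
```

The copy keeps model_arch from the base card, which is only correct when the new weights share that architecture; otherwise model_arch must be changed as well.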
Adding Multiple Assets
Separate multiple asset definitions with --- in the same YAML file.
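As an illustration of the --- separator, the sketch below holds a two-card file as a string and splits it into individual cards. The split_cards function is a simplified stand-in for a YAML parser (flat key: value pairs only) and is hypothetical. The tokenizer URI is a placeholder.

```python
# Two asset definitions in one file, separated by '---'.
# The tokenizer URI is a placeholder, not a real download location.
CARDS_TEXT = """\
name: omniASR_CTC_300M_v2
model_family: wav2vec2_asr
model_arch: 300m_v2
checkpoint: https://dl.fbaipublicfiles.com/mms/omniASR-CTC-300M-v2.pt
tokenizer_ref: omniASR_tokenizer_written_v2

---

name: omniASR_tokenizer_written_v2
tokenizer_family: char_tokenizer
tokenizer: https://example.com/tokenizer.model
"""


def split_cards(text: str) -> list[dict]:
    """Split on '---' lines and parse flat 'key: value' pairs.

    Simplified stand-in for a YAML parser: no nesting, lists, or comments.
    """
    cards, current = [], {}
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "---":
            if current:
                cards.append(current)
            current = {}
        elif stripped:
            # Split on the first colon only, so URLs stay intact.
            key, _, value = stripped.partition(":")
            current[key.strip()] = value.strip()
    if current:
        cards.append(current)
    return cards


cards = split_cards(CARDS_TEXT)
print([card["name"] for card in cards])
```

Each resulting dict corresponds to one asset, so a model and the tokenizer it references can live side by side in one file.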
Asset Card Locations
Asset cards are stored in the /src/omnilingual_asr/cards/ directory:
- Models: /src/omnilingual_asr/cards/models/
- Datasets: /src/omnilingual_asr/cards/datasets/
Best Practices
- Use descriptive names: Include model size and version in the asset name
- Version your assets: Use suffixes like _v2 to track iterations
- Organize by purpose: Keep model, tokenizer, and dataset cards in separate files
- Document checkpoints: Use comments to note training details or performance metrics
- Test locally first: Verify custom assets load correctly before deployment
Troubleshooting
Asset not found error
Ensure the asset card is in a registered directory and the name field matches exactly what you’re loading.
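Since the match on name is exact, a small typo or case slip is enough to trigger this error. The sketch below uses the standard library's difflib to suggest close names from a list you maintain yourself. The helper suggest_asset_names and the known_names list are hypothetical and not part of fairseq2.

```python
import difflib

# Names you expect to be registered; maintained by hand for this sketch.
known_names = ["omniASR_CTC_300M_v2", "omniASR_tokenizer_written_v2"]


def suggest_asset_names(requested: str, known: list[str]) -> list[str]:
    """Return known names that closely resemble the requested one."""
    if requested in known:
        return [requested]
    # Compare case-insensitively, since a case slip is a common mismatch.
    lowered = {name.lower(): name for name in known}
    matches = difflib.get_close_matches(
        requested.lower(), list(lowered), n=3, cutoff=0.8
    )
    return [lowered[match] for match in matches]


print(suggest_asset_names("omniasr_ctc_300m_v2", known_names))
```

An empty result suggests the card is not among the names you listed, which points to a directory registration problem instead of a typo.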
Checkpoint download fails
Verify the URL is accessible and check your network connection. For local paths, use absolute paths or environment variables such as $HOME.
Architecture mismatch
Ensure model_arch matches a valid configuration in /src/omnilingual_asr/models/{model_family}/config.py.
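A quick pre-check of the family and architecture pair can be sketched as follows. The table only contains the example architectures named in the field reference, so it is not exhaustive; the authoritative list is in each family's config.py. The helper check_arch is hypothetical.

```python
# Example architectures from the field reference; each family's config.py
# is authoritative and may define more entries than shown here.
EXAMPLE_ARCHS = {
    "wav2vec2_asr": {"300m_v2", "1b_v2", "3b_v2", "7b_v2"},
    "wav2vec2_llama": {"300m_v2", "1b_v2", "3b_unlimited_v2"},
}


def check_arch(card: dict) -> None:
    """Raise ValueError if model_arch is not listed for the card's family."""
    family, arch = card["model_family"], card["model_arch"]
    known = EXAMPLE_ARCHS.get(family)
    if known is None:
        raise ValueError(f"no example architectures recorded for {family!r}")
    if arch not in known:
        raise ValueError(
            f"model_arch {arch!r} is not listed for {family!r}; "
            f"expected one of {sorted(known)}"
        )


# A valid pair passes silently.
check_arch({"model_family": "wav2vec2_asr", "model_arch": "300m_v2"})

# A pair that mixes families is reported.
try:
    check_arch({"model_family": "wav2vec2_llama", "model_arch": "7b_v2"})
except ValueError as err:
    print(err)
```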