Qwen2-Audio provides AI-powered analysis of various speech properties including speaker characteristics, voice properties, speech content, delivery style, and recording environment.
Model Setup
qwen2_model_setup(
    model_tag="Qwen/Qwen2-Audio-7B-Instruct",
    start_prompt="The following is a conversation with an AI assistant. The assistant is helpful, honest, and harmless."
)
Parameters
model_tag
str
default: "Qwen/Qwen2-Audio-7B-Instruct"
Hugging Face model identifier. Use "default" for the default 7B model.
start_prompt
str
default: "The following is a conversation with an AI assistant. The assistant is helpful, honest, and harmless."
System prompt for the conversation context.
Returns
Dictionary containing:
model - Qwen2AudioForConditionalGeneration model
processor - AutoProcessor for the model
start_conversation - Initial conversation setup
Available Metrics
1. Speaker Characteristics
Speaker Count
Speaker Gender
Speaker Age
Speech Impairment
qwen2_speaker_count_metric(qwen_utils, pred_x, fs=16000, custom_prompt=None)
2. Voice Properties
Voice Pitch
Pitch Range
Voice Type
Speech Volume
qwen2_voice_pitch_metric(qwen_utils, pred_x, fs=16000, custom_prompt=None)
3. Speech Content
Language
Speech Register
Vocabulary Complexity
Speech Purpose
qwen2_language_metric(qwen_utils, pred_x, fs=16000, custom_prompt=None)
4. Speech Delivery
Speech Emotion
Speech Clarity
Speech Rate
Speaking Style
Laughter/Crying
qwen2_speech_emotion_metric(qwen_utils, pred_x, fs=16000, custom_prompt=None)
5. Interaction Patterns
qwen2_overlapping_speech_metric(qwen_utils, pred_x, fs=16000, custom_prompt=None)
6. Recording Environment
Background Environment
Recording Quality
Channel Type
qwen2_speech_background_environment_metric(qwen_utils, pred_x, fs=16000, custom_prompt=None)
7. Vocal Evaluation
qwen2_singing_technique_metric(qwen_utils, pred_x, fs=16000, custom_prompt=None)
Common Parameters
All metric functions share the same parameter structure:
qwen_utils
dict
Utility dictionary from qwen2_model_setup() containing model, processor, and conversation
pred_x
np.ndarray
Audio signal as a 1-D NumPy array
fs
int
default: 16000
Sampling rate in Hz. Audio is automatically resampled to match processor requirements
custom_prompt
str
default: None
Optional custom prompt to override the default prompt for the metric
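Because the metrics expect a 1-D array, stereo recordings need to be downmixed (and, if the sampling rate differs, resampled) before the call. A minimal NumPy-only sketch — the `prepare_audio` helper is illustrative, not part of the VERSA API, and linear interpolation is a rough stand-in for a proper resampler such as librosa's:

```python
import numpy as np

def prepare_audio(x: np.ndarray, fs: int, target_fs: int = 16000) -> np.ndarray:
    """Downmix to mono and resample to target_fs.

    Illustrative only -- use librosa.resample or soxr for production quality.
    """
    if x.ndim == 2:  # (samples, channels) -> mono average
        x = x.mean(axis=1)
    if fs != target_fs:
        n_out = int(round(len(x) * target_fs / fs))
        old_t = np.linspace(0.0, 1.0, num=len(x), endpoint=False)
        new_t = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
        x = np.interp(new_t, old_t, x)  # crude linear-interp resample
    return x.astype(np.float32)

audio = prepare_audio(np.random.random((48000, 2)), fs=48000)
```

The result can be passed directly as `pred_x` with `fs=16000`.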
Returns
Each metric returns a dictionary with a single key-value pair. The key is metric-specific (e.g., "qwen_speech_emotion" for the emotion metric) and the value is the model's textual response analyzing the specified property.
Usage Examples
Basic Usage
import numpy as np
from versa import qwen2_model_setup, qwen2_speech_emotion_metric

# Setup model (do this once)
qwen_utils = qwen2_model_setup()

# Load your audio
audio = np.random.random(16000)  # Replace with actual audio

# Analyze emotion
result = qwen2_speech_emotion_metric(
    qwen_utils=qwen_utils,
    pred_x=audio,
    fs=16000,
)
print(f"Emotion: {result['qwen_speech_emotion']}")
Multiple Analyses
import numpy as np
from versa import (
    qwen2_model_setup,
    qwen2_speaker_count_metric,
    qwen2_speaker_gender_metric,
    qwen2_speech_emotion_metric,
    qwen2_recording_quality_metric,
)

# Setup once
qwen_utils = qwen2_model_setup()
audio = np.random.random(48000)  # Replace with actual audio
fs = 48000

# Run multiple analyses
analyses = {
    "speaker_count": qwen2_speaker_count_metric(qwen_utils, audio, fs),
    "speaker_gender": qwen2_speaker_gender_metric(qwen_utils, audio, fs),
    "emotion": qwen2_speech_emotion_metric(qwen_utils, audio, fs),
    "quality": qwen2_recording_quality_metric(qwen_utils, audio, fs),
}
for metric, result in analyses.items():
    print(f"{metric}: {result}")
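Since every metric shares the same (qwen_utils, pred_x, fs) signature, the loop above generalizes into a small helper. A sketch — `run_metrics` is a hypothetical convenience wrapper, not part of VERSA, demonstrated here with stub functions in place of the real qwen2_* metrics:

```python
import numpy as np

def run_metrics(metric_fns, qwen_utils, pred_x, fs=16000):
    """Apply each named metric function to the same audio, collecting results."""
    results = {}
    for name, fn in metric_fns.items():
        results[name] = fn(qwen_utils, pred_x, fs)
    return results

# Stubs standing in for the real qwen2_* metric functions.
def fake_emotion(utils, x, fs):
    return {"qwen_speech_emotion": "Neutral"}

def fake_count(utils, x, fs):
    return {"qwen_speaker_count": "1"}

audio = np.random.random(16000)
out = run_metrics({"emotion": fake_emotion, "count": fake_count}, None, audio)
print(out)
```

With the real functions, pass the dictionary returned by qwen2_model_setup() in place of None.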
Custom Prompt
import numpy as np
from versa import qwen2_model_setup, qwen2_base_metric

qwen_utils = qwen2_model_setup()
audio = np.random.random(16000)

# Use custom prompt
custom_prompt = """Analyze this audio and describe the speaker's
emotional state in detail, including any subtle nuances."""

result = qwen2_base_metric(
    qwen_utils=qwen_utils,
    pred_x=audio,
    fs=16000,
    custom_prompt=custom_prompt,
    max_length=1000,
)
print(result)
Category Details
Speaker Count
Identifies the number of distinct speakers (1-10)
Example outputs: "1", "2", "4"
Speaker Gender
Identifies perceived gender of speaker(s)
Example outputs: "Male", "Female", "Multiple speakers with mixed genders"
Speaker Age
Classifies age group
Categories: Child, Teen, Young adult, Middle-aged adult, Senior
Voice Pitch
Analyzes pitch level
Categories: Very high, High, Medium, Low, Very low
Speech Emotion
Identifies dominant emotion
Categories: Neutral, Happy, Sad, Angry, Fearful, Surprised, Disgusted, Other
Recording Quality
Assesses technical quality
Categories: Professional, Good, Fair, Poor, Very poor
Singing Technique
Identifies singing style
Categories: Breathy, Falsetto, Mixed Voice, Pharyngeal, Glissando, Vibrato, Control
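The model replies in free text, so mapping a response onto one of the closed category sets above usually takes a light post-processing step. A sketch — `normalize_category` is a hypothetical helper, not part of VERSA; it returns the first listed category mentioned in the response, matched case-insensitively:

```python
# Category set copied from the Speech Emotion metric above.
EMOTION_CATEGORIES = [
    "Neutral", "Happy", "Sad", "Angry",
    "Fearful", "Surprised", "Disgusted", "Other",
]

def normalize_category(response: str, categories) -> str:
    """Return the first category found in the response, else 'Other'."""
    lowered = response.lower()
    for cat in categories:
        if cat.lower() in lowered:
            return cat
    return "Other"

print(normalize_category("The speaker sounds happy and excited.", EMOTION_CATEGORIES))
```

Substring matching is deliberately simple; a production pipeline might instead constrain generation or use exact-label prompting.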
Installation
pip install transformers accelerate
Requires a recent version of transformers. If you encounter a KeyError for Qwen2Audio, upgrade: pip install --upgrade transformers
Model Requirements
| Aspect | Requirement |
| --- | --- |
| Model Size | ~14 GB (7B parameters) |
| GPU Memory | 16+ GB VRAM recommended |
| Device | Auto-selected with device_map="auto" |
| Input Audio | Automatically resampled |
Inference Speed: Qwen2-Audio is a large language model; inference may take several seconds per audio sample. Consider batching for production use.
Determinism: Results may vary slightly between runs due to the generative nature of the model.
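When stable labels matter, one common mitigation for this run-to-run variability is to query the metric several times and keep the majority answer. A sketch with canned responses in place of real model calls — `majority_vote` is illustrative, not a VERSA API:

```python
from collections import Counter

def majority_vote(responses):
    """Return the most frequent response; ties resolve to the first seen."""
    return Counter(responses).most_common(1)[0][0]

# In practice, each entry would come from a repeated qwen2_* metric call
# on the same audio sample.
runs = ["Happy", "Happy", "Neutral"]
print(majority_vote(runs))  # -> Happy
```

This trades extra inference time for more reproducible labels.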