Qwen2-Audio provides AI-powered analysis of various speech properties including speaker characteristics, voice properties, speech content, delivery style, and recording environment.

Model Setup

qwen2_model_setup(
    model_tag="Qwen/Qwen2-Audio-7B-Instruct",
    start_prompt="The following is a conversation with an AI assistant. The assistant is helpful, honest, and harmless."
)

Parameters

• model_tag (str, default "Qwen/Qwen2-Audio-7B-Instruct"): Hugging Face model identifier. Pass "default" to use the default 7B model.
• start_prompt (str): system prompt for the conversation context.

Returns

qwen_utils (dict): dictionary containing:
  • model - Qwen2AudioForConditionalGeneration model
  • processor - AutoProcessor for the model
  • start_conversation - initial conversation setup

Available Metrics

1. Speaker Characteristics

qwen2_speaker_count_metric(qwen_utils, pred_x, fs=16000, custom_prompt=None)

2. Voice Properties

qwen2_voice_pitch_metric(qwen_utils, pred_x, fs=16000, custom_prompt=None)

3. Speech Content

qwen2_language_metric(qwen_utils, pred_x, fs=16000, custom_prompt=None)

4. Speech Delivery

qwen2_speech_emotion_metric(qwen_utils, pred_x, fs=16000, custom_prompt=None)

5. Interaction Patterns

qwen2_overlapping_speech_metric(qwen_utils, pred_x, fs=16000, custom_prompt=None)

6. Recording Environment

qwen2_speech_background_environment_metric(qwen_utils, pred_x, fs=16000, custom_prompt=None)

7. Vocal Evaluation

qwen2_singing_technique_metric(qwen_utils, pred_x, fs=16000, custom_prompt=None)

Common Parameters

All metric functions share the same parameter structure:
• qwen_utils (dict, required): utility dictionary from qwen2_model_setup() containing the model, processor, and conversation.
• pred_x (numpy.ndarray, required): audio signal as a 1-D numpy array.
• fs (int, default 16000): sampling rate in Hz. Audio is automatically resampled to match processor requirements.
• custom_prompt (str, optional): custom prompt that overrides the metric's default prompt.

Returns

Each metric returns a dictionary with a single key-value pair:
• qwen_{metric_name} (str): the model's textual response analyzing the specified property.
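
Because fs is passed alongside the raw array, audio recorded at another rate can be handed over as-is. If you prefer to resample yourself beforehand, here is a minimal sketch using linear interpolation with NumPy (the helper name resample_linear is ours, not part of versa; for production use, a proper polyphase resampler such as scipy.signal.resample_poly is preferable):

```python
import numpy as np

def resample_linear(x: np.ndarray, fs_in: int, fs_out: int) -> np.ndarray:
    """Resample a 1-D signal via linear interpolation (a rough stand-in
    for a proper polyphase resampler)."""
    n_out = int(round(len(x) * fs_out / fs_in))
    t_in = np.arange(len(x)) / fs_in      # original sample times (s)
    t_out = np.arange(n_out) / fs_out     # target sample times (s)
    return np.interp(t_out, t_in, x)

audio_48k = np.random.random(48000)                    # one second at 48 kHz
audio_16k = resample_linear(audio_48k, 48000, 16000)   # one second at 16 kHz
print(len(audio_16k))                                  # 16000
```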

Usage Examples

Basic Usage

import numpy as np
from versa import qwen2_model_setup, qwen2_speech_emotion_metric

# Setup model (do this once)
qwen_utils = qwen2_model_setup()

# Load your audio
audio = np.random.random(16000)  # Replace with actual audio

# Analyze emotion
result = qwen2_speech_emotion_metric(
    qwen_utils=qwen_utils,
    pred_x=audio,
    fs=16000
)

print(f"Emotion: {result['qwen_speech_emotion']}")

Multiple Analyses

import numpy as np
from versa import (
    qwen2_model_setup,
    qwen2_speaker_count_metric,
    qwen2_speaker_gender_metric,
    qwen2_speech_emotion_metric,
    qwen2_recording_quality_metric
)

# Setup once
qwen_utils = qwen2_model_setup()
audio = np.random.random(48000)  # Replace with actual audio
fs = 48000

# Run multiple analyses
analyses = {
    "speaker_count": qwen2_speaker_count_metric(qwen_utils, audio, fs),
    "speaker_gender": qwen2_speaker_gender_metric(qwen_utils, audio, fs),
    "emotion": qwen2_speech_emotion_metric(qwen_utils, audio, fs),
    "quality": qwen2_recording_quality_metric(qwen_utils, audio, fs)
}

for metric, result in analyses.items():
    print(f"{metric}: {result}")

Custom Prompt

import numpy as np
from versa import qwen2_model_setup, qwen2_base_metric

qwen_utils = qwen2_model_setup()
audio = np.random.random(16000)

# Use custom prompt
custom_prompt = """Analyze this audio and describe the speaker's 
emotional state in detail, including any subtle nuances."""

result = qwen2_base_metric(
    qwen_utils=qwen_utils,
    pred_x=audio,
    fs=16000,
    custom_prompt=custom_prompt,
    max_length=1000
)

print(result)

Category Details

Speaker Count

Identifies the number of distinct speakers (1-10). Example outputs: "1", "2", "4"

Speaker Gender

Identifies the perceived gender of the speaker(s). Example outputs: "Male", "Female", "Multiple speakers with mixed genders"

Speaker Age

Classifies the age group. Categories: Child, Teen, Young adult, Middle-aged adult, Senior

Voice Pitch

Analyzes the pitch level. Categories: Very high, High, Medium, Low, Very low

Speech Emotion

Identifies the dominant emotion. Categories: Neutral, Happy, Sad, Angry, Fearful, Surprised, Disgusted, Other

Recording Quality

Assesses the technical recording quality. Categories: Professional, Good, Fair, Poor, Very poor

Singing Technique

Identifies the singing style. Categories: Breathy, Falsetto, Mixed Voice, Pharyngeal, Glissando, Vibrato, Control
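
Because each metric returns free text, downstream code often needs to check the response against the expected label set. A small sketch of such a check (the helper and the label table below are our own, assembled from the category lists above, not part of versa):

```python
# Expected labels per metric key, taken from the category lists above.
EXPECTED_CATEGORIES = {
    "qwen_voice_pitch": {"Very high", "High", "Medium", "Low", "Very low"},
    "qwen_speech_emotion": {"Neutral", "Happy", "Sad", "Angry",
                            "Fearful", "Surprised", "Disgusted", "Other"},
    "qwen_recording_quality": {"Professional", "Good", "Fair",
                               "Poor", "Very poor"},
}

def matches_category(metric_key: str, response: str) -> bool:
    """True if the model's free-text response mentions one of the
    expected labels for this metric (case-insensitive substring match)."""
    labels = EXPECTED_CATEGORIES.get(metric_key, set())
    text = response.lower()
    return any(label.lower() in text for label in labels)

print(matches_category("qwen_speech_emotion", "The dominant emotion is Happy."))  # True
```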

Installation

pip install transformers accelerate

Requires a recent version of transformers. If you encounter a KeyError for Qwen2Audio, upgrade transformers:

pip install --upgrade transformers
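
A quick way to verify the installed transformers version exposes the Qwen2-Audio classes before downloading the ~14 GB checkpoint (this check is our suggestion, not part of versa):

```python
import importlib.util

def qwen2audio_available() -> bool:
    """Check that transformers is installed and exposes Qwen2-Audio."""
    if importlib.util.find_spec("transformers") is None:
        return False  # transformers not installed at all
    try:
        # Raises ImportError on versions that predate Qwen2-Audio support.
        from transformers import Qwen2AudioForConditionalGeneration  # noqa: F401
        return True
    except ImportError:
        return False

print(qwen2audio_available())
```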

Model Requirements

• Model Size: ~14 GB (7B parameters)
• GPU Memory: 16+ GB VRAM recommended
• Device: auto-selected with device_map="auto"
• Input Audio: automatically resampled
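
The ~14 GB figure follows from storing 7 billion parameters at 2 bytes each (fp16/bf16); activations and the KV cache come on top, hence the 16+ GB VRAM recommendation:

```python
# Back-of-the-envelope memory estimate for the weights alone.
params = 7e9          # 7B parameters
bytes_per_param = 2   # fp16 / bf16
weights_gb = params * bytes_per_param / 1e9
print(weights_gb)     # 14.0
```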

Performance Notes

Inference Speed: Qwen2-Audio is a large language model. Inference may take several seconds per audio sample. Consider batching for production use.
Determinism: Results may vary slightly between runs due to the generative nature of the model.
