Overview
A machine learning system that detects a specific “wake word” in audio streams using LSTM neural networks and MFCC (Mel-Frequency Cepstral Coefficients) feature extraction. Built with TensorFlow/Keras for real-time audio processing.
Project Name: entrenar_wake.py
Location: ~/workspace/source/proyectos/make word/entrenar_wake.py
Model Type: LSTM-based binary classifier
System Architecture
Audio Configuration
# Path Configuration
BASE_PATH = "/home/daniel-de-anda/Escritorio/proyectos/make word"
PATH_POSITIVE = os.path.join(BASE_PATH, "audios")
PATH_NEGATIVE = os.path.join(BASE_PATH, "audios bad")
MODEL_SAVE_PATH = os.path.join(BASE_PATH, "wake_word_model.h5")

# Audio Parameters
SAMPLE_RATE = 16000  # Hz
DURATION = 5         # seconds (unified duration)
CHANNELS = 1         # Mono audio
N_MFCC = 13          # Number of MFCC coefficients
The system uses Mel-Frequency Cepstral Coefficients (MFCC) to convert raw audio into machine learning features:
def extract_features(audio_data):
    """Extract MFCC coefficients from an audio array."""
    # Ensure constant length (pad or trim)
    target_len = int(SAMPLE_RATE * DURATION)
    if len(audio_data) < target_len:
        audio_data = np.pad(audio_data, (0, target_len - len(audio_data)))
    else:
        audio_data = audio_data[:target_len]
    # Extract MFCC features
    mfcc = librosa.feature.mfcc(y=audio_data, sr=SAMPLE_RATE, n_mfcc=N_MFCC)
    return mfcc.T  # Shape: (time_steps, n_mfcc)
MFCC Features: Represent the short-term power spectrum of audio and are widely used in speech recognition. The 13 coefficients capture the essential spectral characteristics of each frame.
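The resulting feature shape can be estimated without running librosa. A minimal sketch in pure arithmetic, assuming librosa's default hop length of 512 (the exact frame count may differ by one depending on edge padding):

```python
# Estimate the MFCC feature shape for one clip
SAMPLE_RATE = 16000  # Hz, as configured above
DURATION = 5         # seconds
N_MFCC = 13          # coefficients
HOP_LENGTH = 512     # librosa's default hop size (assumption)

target_len = SAMPLE_RATE * DURATION    # 80000 samples
time_steps = target_len // HOP_LENGTH  # roughly one frame per hop

print((time_steps, N_MFCC))  # → (156, 13)
```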
Dataset Preparation
Data Loading
def prepare_dataset():
    x, y = [], []
    for path, label in [(PATH_POSITIVE, 1), (PATH_NEGATIVE, 0)]:
        print(f"📦 Cargando audios {'positivos' if label == 1 else 'negativos'}...")
        if not os.path.exists(path):
            print(f"⚠️ Alerta: La carpeta {path} no existe.")
            continue
        for f in os.listdir(path):
            if f.endswith('.wav'):
                file_path = os.path.join(path, f)
                audio, _ = librosa.load(file_path, sr=SAMPLE_RATE, duration=DURATION)
                feat = extract_features(audio)
                x.append(feat)
                y.append(label)
    return np.array(x), np.array(y)
Dataset Structure
Positive Samples

audios/
├── wake_word_001.wav
├── wake_word_002.wav
├── wake_word_003.wav
└── ...

Label: 1 (Wake word detected)
Duration: 5 seconds each
Format: WAV, 16kHz mono

Negative Samples

audios bad/
├── noise_001.wav
├── speech_002.wav
├── background_003.wav
└── ...

Label: 0 (Not wake word)
Duration: 5 seconds each
Format: WAV, 16kHz mono
Neural Network Architecture
LSTM Model
The model uses Long Short-Term Memory (LSTM) layers for sequence processing:
def build_model(input_shape):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.LSTM(64, return_sequences=False),
        layers.Dropout(0.3),
        layers.Dense(32, activation='relu'),
        layers.Dense(1, activation='sigmoid')
    ])
    model.compile(
        optimizer='adam',
        loss='binary_crossentropy',
        metrics=['accuracy']
    )
    return model
Architecture Breakdown
Input Layer
Shape: (time_steps, 13)
time_steps: number of MFCC frames per clip (about 156 with librosa's default hop length; fixed, since every clip is padded or trimmed to 5 seconds)
13: number of MFCC coefficients

LSTM Layer
Units: 64
return_sequences: False (only the final state is output)
Processes temporal patterns in the MFCC features to recognize the wake word sequence.

Dropout Layer
Rate: 0.3 (30% dropout)
Prevents overfitting by randomly dropping 30% of activations during training.

Dense Layer
Units: 32
Activation: ReLU
Learns high-level features from the LSTM output.

Output Layer
Units: 1
Activation: Sigmoid
Produces the probability (0-1) that the wake word is present.
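The layer sizes above fully determine the parameter counts shown in the model summary later in this document. As a worked check in pure arithmetic (the LSTM has four gates, each with input weights, recurrent weights, and a bias):

```python
# Parameter counts for each trainable layer of the model
n_mfcc, lstm_units, dense_units = 13, 64, 32

# LSTM: 4 gates x (input weights + recurrent weights + bias)
lstm_params = 4 * (lstm_units * (n_mfcc + lstm_units) + lstm_units)  # 19968
dense_params = lstm_units * dense_units + dense_units                # 2080
output_params = dense_units * 1 + 1                                  # 33

total = lstm_params + dense_params + output_params
print(total)  # → 22081
```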
Training Configuration
model.fit(
    x_train,
    y_train,
    epochs=40,            # Number of complete passes through the training dataset
    batch_size=8,         # Number of samples processed before each weight update
    validation_split=0.2  # Fraction of the data reserved for validation (20%)
)
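One caveat worth noting: prepare_dataset returns all positive samples followed by all negatives, and Keras takes validation_split from the end of the arrays before any shuffling, so the validation set here would consist entirely of negatives. A minimal sketch of shuffling features and labels together first (toy arrays stand in for the real MFCC features):

```python
import numpy as np

# Toy stand-ins for the real feature arrays: 6 "positive" then 6 "negative" samples
x = np.arange(12).reshape(12, 1).astype(float)
y = np.array([1] * 6 + [0] * 6)

# Shuffle features and labels with the same permutation before model.fit
rng = np.random.default_rng(seed=0)
idx = rng.permutation(len(x))
x, y = x[idx], y[idx]

# Both classes can now appear in the tail slice that validation_split takes
print(y)
```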
Training & Detection Mode
Main Workflow
def main():
    respuesta = input("¿Quieres entrenar el modelo? (s/n): ").lower()
    if respuesta == "s":
        # Training mode
        x_train, y_train = prepare_dataset()
        if len(x_train) == 0:
            print("❌ No hay datos suficientes.")
            return
        print(f"\n🧠 Entrenando con {len(x_train)} muestras...")
        model = build_model(x_train[0].shape)
        model.fit(x_train, y_train, epochs=40, batch_size=8, validation_split=0.2)
        model.save(MODEL_SAVE_PATH)
        print(f"✅ Modelo guardado en: {MODEL_SAVE_PATH}")
    else:
        # Load existing model
        if os.path.exists(MODEL_SAVE_PATH):
            print("✅ Cargando modelo existente...")
            model = models.load_model(MODEL_SAVE_PATH)
        else:
            print("❌ No hay modelo para cargar. Debes entrenar primero.")
            return

    # Detection loop (continues below)
Real-Time Detection Loop
# Detection configuration
print("\n" + "=" * 40)
print("🎤 MODO DETECCIÓN (Escuchando en bloques de 1s)")
print("=" * 40)
threshold = 0.8  # Confidence threshold

try:
    while True:
        # Record 5 seconds to match the training data
        rec = sd.rec(
            int(DURATION * SAMPLE_RATE),
            samplerate=SAMPLE_RATE,
            channels=CHANNELS
        )
        sd.wait()
        audio_chunk = rec.flatten()

        # Extract features and predict
        feat = extract_features(audio_chunk)
        feat = np.expand_dims(feat, axis=0)  # Add batch dimension
        prediction = model.predict(feat, verbose=0)[0][0]

        if prediction > threshold:
            print(f"✅ ¡WAKE WORD DETECTADO! ({prediction:.2f})")
        else:
            print(f". (Prob: {prediction:.2f})", end="\r")
except KeyboardInterrupt:
    print("\nDeteniendo detector...")
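Because sd.rec blocks for a full 5 seconds per iteration, a wake word spoken across a block boundary can be missed. One common mitigation, sketched below with NumPy only (not part of the script; chunk sizes are illustrative), is a rolling buffer that records short chunks and always keeps the most recent 5 seconds for prediction:

```python
import numpy as np

SAMPLE_RATE = 16000
WINDOW = 5 * SAMPLE_RATE  # the model expects 5 s of audio
STEP = 1 * SAMPLE_RATE    # record 1 s at a time (hypothetical choice)

buffer = np.zeros(WINDOW, dtype=np.float32)

def push_chunk(buffer, chunk):
    """Shift the buffer left and append the newest chunk at the end."""
    out = np.empty_like(buffer)
    out[:-len(chunk)] = buffer[len(chunk):]
    out[-len(chunk):] = chunk
    return out

# Simulate five 1-second recordings arriving in sequence
for i in range(5):
    chunk = np.full(STEP, float(i), dtype=np.float32)
    buffer = push_chunk(buffer, chunk)

# The buffer always holds the most recent WINDOW samples
print(buffer.shape, buffer[0], buffer[-1])  # (80000,) 0.0 4.0
```

With this scheme, extract_features would run on the buffer once per short chunk, so detections can fire at 1-second granularity instead of 5.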
Installation
Dependencies
pip install numpy sounddevice tensorflow librosa
numpy: array operations and numerical computing
sounddevice: real-time audio capture from the microphone
tensorflow: deep learning framework (includes Keras)
librosa: audio feature extraction and processing
Setup Steps
Create Directory Structure
mkdir -p "make word/audios"
mkdir -p "make word/audios bad"
Collect Audio Samples
Record wake word samples (positive) and background noise/other speech (negative):
Positive samples : 20-50 recordings of the wake word
Negative samples : 50-100 recordings of other audio
Format: WAV, 16kHz, mono, 5 seconds each
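Before training, it can help to verify that every sample actually matches the expected format; a mismatched sample rate is a common silent failure. A small stdlib-only sketch (the wave module reads WAV headers without any extra dependencies; the file name is illustrative):

```python
import os
import tempfile
import wave

def check_wav(path, rate=16000, channels=1):
    """Return True if the WAV header matches the expected rate and channel count."""
    with wave.open(path, 'rb') as w:
        return w.getframerate() == rate and w.getnchannels() == channels

# Example: write one second of 16 kHz mono silence, then verify it
path = os.path.join(tempfile.gettempdir(), 'wake_check_demo.wav')
with wave.open(path, 'wb') as w:
    w.setnchannels(1)
    w.setsampwidth(2)  # 16-bit PCM
    w.setframerate(16000)
    w.writeframes(b'\x00\x00' * 16000)

print(check_wav(path))  # → True
```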
Train the Model
python entrenar_wake.py
# Choose 's' when prompted to train
Run Detection
python entrenar_wake.py
# Choose 'n' to load model and start detection
Usage Examples
Training Session
$ python entrenar_wake.py
¿Quieres entrenar el modelo? (s/n): s
📦 Cargando audios positivos...
📦 Cargando audios negativos...
🧠 Entrenando con 120 muestras...
Epoch 1/40
12/12 [==============================] - 2s 156ms/step - loss: 0.6821 - accuracy: 0.5625 - val_loss: 0.6745 - val_accuracy: 0.6250
Epoch 2/40
12/12 [==============================] - 1s 95ms/step - loss: 0.6523 - accuracy: 0.6458 - val_loss: 0.6234 - val_accuracy: 0.7083
...
Epoch 40/40
12/12 [==============================] - 1s 92ms/step - loss: 0.0823 - accuracy: 0.9792 - val_loss: 0.1234 - val_accuracy: 0.9583
✅ Modelo guardado en: /home/daniel-de-anda/Escritorio/proyectos/make word/wake_word_model.h5
Detection Session
$ python entrenar_wake.py
¿Quieres entrenar el modelo? (s/n): n
✅ Cargando modelo existente...
========================================
🎤 MODO DETECCIÓN (Escuchando en bloques de 1s)
========================================
. (Prob: 0.12)
. (Prob: 0.08)
. (Prob: 0.15)
✅ ¡WAKE WORD DETECTADO! (0.92)
. (Prob: 0.23 )
. (Prob: 0.11 )
Threshold Adjustment
High Precision

threshold = 0.9  # Fewer false positives

More confident detections
May miss some wake words
Best for noisy environments

Balanced

threshold = 0.8  # Default balance

Good balance of precision/recall
Recommended starting point

High Recall

threshold = 0.6  # More detections

Catches more wake words
More false positives
Best for quiet environments
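The trade-off becomes concrete when each threshold is applied to the same set of scores. A sketch with made-up probabilities (not real model output):

```python
# Hypothetical prediction scores from ten 5-second windows
scores = [0.12, 0.85, 0.65, 0.08, 0.92, 0.71, 0.15, 0.88, 0.62, 0.23]

for threshold in (0.9, 0.8, 0.6):
    detections = sum(score > threshold for score in scores)
    print(f"threshold={threshold}: {detections} detections")

# threshold=0.9: 1 detection  (only 0.92)
# threshold=0.8: 3 detections (0.85, 0.92, 0.88)
# threshold=0.6: 6 detections (adds 0.65, 0.71, 0.62)
```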
Model Optimization
More LSTM Units

# Increase model capacity
layers.LSTM(128, return_sequences=False)  # From 64 to 128

Deeper Network

Stack a second LSTM layer; the first layer must return the full sequence (return_sequences=True) so the second has a sequence to process.

More MFCC Features

Increase N_MFCC (e.g. to 20) to capture finer spectral detail. This changes the model's input shape, so the model must be retrained from scratch.
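Doubling the LSTM width is not free: the parameter count grows roughly quadratically with the unit count, since the recurrent weight matrix is units x units. A quick arithmetic check using the standard LSTM parameter formula, assuming the 13 MFCC inputs from this model:

```python
def lstm_params(units, input_dim=13):
    # Four gates, each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)

print(lstm_params(64))   # → 19968 (matches the model summary)
print(lstm_params(128))  # → 72704 (about 3.6x larger)
```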
Technical Details
MFCC Feature Dimensions
For a 5-second audio clip at 16kHz:
target_len = 16000 * 5 = 80000 samples

# MFCC shape after extraction: (time_steps, 13)
# time_steps depends on librosa's hop_length (default 512)
time_steps ≈ 80000 / 512 ≈ 156 frames

Final input shape: (156, 13)
Model Summary
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
lstm (LSTM) (None, 64) 19968
_________________________________________________________________
dropout (Dropout) (None, 64) 0
_________________________________________________________________
dense (Dense) (None, 32) 2080
_________________________________________________________________
dense_1 (Dense) (None, 1) 33
=================================================================
Total params: 22,081
Trainable params: 22,081
Non-trainable params: 0
_________________________________________________________________
Best Practices
Data Quality : The model’s performance heavily depends on:
Diverse positive samples (different speakers, distances, accents)
Representative negative samples (common background noises)
Balanced dataset (roughly equal positive/negative samples)
Recording Tips :
Record in the actual environment where detection will occur
Include variations: loud, quiet, fast, slow pronunciations
Add negative samples with similar-sounding words
Use consistent audio format (16kHz, mono, WAV)
Troubleshooting
Issue: High false positives
Solution: Increase the threshold (0.85-0.95) or add more negative samples

Issue: Missing detections
Solution: Lower the threshold (0.6-0.7) or add more positive samples

Issue: Poor accuracy
Solution: Collect more diverse training data (50+ samples per class)

Issue: Model not saving
Solution: Check write permissions on MODEL_SAVE_PATH
File Reference
Source: /home/daytona/workspace/source/proyectos/make word/entrenar_wake.py:1
Lines of Code: 123
Model Format: Keras HDF5 (.h5)

AI Voice Assistant: integrate the wake word detector into a full voice assistant
Intent Classification: classify user commands after the wake word fires