Overview
Grupo de Anda’s AI system includes three core model types:
Gemini API Integration - Cloud-based conversational AI using Google’s Gemini 1.5 Flash
LSTM Wake Word Detection - On-device neural network for voice activation
Intent Classification - Sklearn-based NLP pipeline for understanding user commands
Gemini API Integration
Configuration
The Gemini API requires configuration for model parameters and API authentication.
# API Authentication
API_KEY_GEMINI = "your-gemini-api-key"
GEMINI_MODEL = "gemini-1.5-flash"
# Generation Parameters
TEMP = 0.6 # Temperature (creativity)
TOP_K = 40 # Top-K sampling
TOP_P = 0.9 # Top-P (nucleus) sampling
MAX_TOKENS = 250 # Maximum output tokens
# Memory Configuration
MAX_HISTORIAL = 10 # Conversation history limit
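MAX_HISTORIAL caps how many past exchanges are re-sent with each request. A minimal sketch of that trimming (the helper name recortar_historial is illustrative, not the engine's actual method):

```python
MAX_HISTORIAL = 10  # Conversation history limit

def recortar_historial(historial, limite=MAX_HISTORIAL):
    """Keep only the most recent exchanges so each request stays within the token budget."""
    return historial[-limite:] if len(historial) > limite else historial
```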
KamutiniEngine Class
Main engine class for conversational AI with device control capabilities.
Initializes the Kamutini engine with TV device scanning and audio setup.

engine = KamutiniEngine()
Initialization Steps:
Scans local network for Roku TV devices
Sets up pygame mixer for audio output
Initializes conversation history list
procesar_gemini()
Processes user queries using Google's Gemini API.
Parameter: the user's input text to be processed by the AI.
Returns: AI-generated response text with embedded command tags.
Usage
engine = KamutiniEngine()
response = engine.procesar_gemini("¿Qué hora es?")
print(response)
API Endpoint:
https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent?key={API_KEY}
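The request body sent to generateContent is JSON; a sketch of how the parameters above map onto it (the helper name and history handling are illustrative; field names follow the public v1beta API):

```python
def build_gemini_payload(texto, historial=None):
    """Assemble a generateContent request body from the configured parameters."""
    contents = list(historial or [])
    contents.append({"role": "user", "parts": [{"text": texto}]})
    return {
        "contents": contents,
        "generationConfig": {
            "temperature": 0.6,      # TEMP
            "topK": 40,              # TOP_K
            "topP": 0.9,             # TOP_P
            "maxOutputTokens": 250,  # MAX_TOKENS
        },
    }
```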
responder()
Complete response pipeline with device control and memory management.
Returns (response_text: str, should_exit: bool)
response_text: Cleaned AI response without command tags
should_exit: Boolean indicating if conversation should end
Basic Usage
engine = KamutiniEngine()
response, should_exit = engine.responder("Abre Netflix")
print(response)
if should_exit:
    print("Ending conversation")
Command Tags:
/*app(name)*/ - Opens specified Roku application
/*search(app, query)*/ - Searches content in app
/*home*/ - Returns to home screen
/*power*/ - Powers off TV
/*resultados(query)*/ - Performs Google search
/*salir*/ - Ends conversation
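responder() strips these tags before returning the cleaned text. A regex-based sketch of how that extraction could work (extraer_comandos is a hypothetical helper, not the engine's actual code):

```python
import re

TAG_PATTERN = re.compile(r"/\*(\w+)(?:\(([^)]*)\))?\*/")

def extraer_comandos(respuesta):
    """Return (command, args) pairs plus the response text with tags removed."""
    comandos = [(m.group(1), m.group(2)) for m in TAG_PATTERN.finditer(respuesta)]
    limpio = TAG_PATTERN.sub("", respuesta).strip()
    return comandos, limpio
```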
google_search_custom()
Performs custom Google searches using the Custom Search API.
Returns: formatted string with the top 3 search result snippets, or an error message.
results = google_search_custom("clima en guadalajara")
print(results)
# Output: "Rosario, encontré esto: [snippet 1] [snippet 2] [snippet 3]"
LSTM Wake Word Detection
Model Architecture
Deep learning model for real-time wake word detection using MFCC features.
Model Configuration
# Audio Parameters
SAMPLE_RATE = 16000 # 16 kHz sampling rate
DURATION = 5 # Audio duration in seconds
CHANNELS = 1 # Mono audio
N_MFCC = 13 # Number of MFCC coefficients
# Model Parameters
LSTM_UNITS = 64 # LSTM layer size
DROPOUT = 0.3 # Dropout rate
DENSE_UNITS = 32 # Dense layer size
extract_features()
Extracts MFCC (Mel-Frequency Cepstral Coefficients) features from audio data.
Parameter: raw audio data as a numpy array.
Returns: MFCC feature matrix with shape (time_steps, n_mfcc).
Automatically pads or truncates audio to target length
Returns transposed MFCC matrix for LSTM input
import librosa
import numpy as np

# Load audio file
audio, sr = librosa.load('wake_word.wav', sr=16000, duration=5)

# Extract features
mfcc_features = extract_features(audio)
print(mfcc_features.shape)  # (time_steps, 13)
Feature Processing:
Pads audio with zeros if too short
Truncates audio if too long
Computes 13 MFCC coefficients
Transposes to (time_steps, n_mfcc) format
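The steps above can be sketched as follows, keeping the pad/truncate logic separate from the MFCC call (a sketch consistent with the documented parameters; assumes librosa is installed for the MFCC computation):

```python
import numpy as np

SAMPLE_RATE = 16000
DURATION = 5
N_MFCC = 13
TARGET_LEN = SAMPLE_RATE * DURATION

def fit_length(audio, target_len=TARGET_LEN):
    """Zero-pad or truncate raw audio to the fixed window the model expects."""
    if len(audio) < target_len:
        return np.pad(audio, (0, target_len - len(audio)))
    return audio[:target_len]

def extract_features(audio):
    """MFCC extraction as described above."""
    import librosa  # deferred so the padding helper works without librosa
    mfcc = librosa.feature.mfcc(y=fit_length(audio).astype(np.float32),
                                sr=SAMPLE_RATE, n_mfcc=N_MFCC)
    return mfcc.T  # (time_steps, n_mfcc)
```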
prepare_dataset()
Loads and prepares training data from positive and negative audio samples.
Returns (X: np.ndarray, y: np.ndarray)
X: Feature arrays with shape (n_samples, time_steps, n_mfcc)
y: Binary labels (1 for wake word, 0 for other sounds)
X_train, y_train = prepare_dataset()
print(f"Training samples: {len(X_train)}")
print(f"Feature shape: {X_train[0].shape}")
print(f"Labels: {np.unique(y_train)}")
Directory Structure:
wake word/
├── audios/ # Positive samples (wake word)
├── audios bad/ # Negative samples (other sounds)
└── wake_word_model.h5 # Saved model
build_model()
Constructs the LSTM neural network architecture.
Parameter: shape of the input features, (time_steps, n_mfcc).
Returns: compiled Keras model ready for training.
Loss: binary_crossentropy
Optimizer: adam
Metrics: accuracy
X_train, y_train = prepare_dataset()
model = build_model(X_train[0].shape)

# Train the model
model.fit(
    X_train, y_train,
    epochs=40,
    batch_size=8,
    validation_split=0.2
)

# Save the model
model.save('wake_word_model.h5')
Real-time Detection
Live wake word detection using sounddevice.
import numpy as np
import sounddevice as sd
import tensorflow as tf

SAMPLE_RATE = 16000
DURATION = 5
CHANNELS = 1

# Load trained model
model = tf.keras.models.load_model('wake_word_model.h5')
threshold = 0.8  # Confidence threshold

while True:
    # Record 5 seconds of audio
    rec = sd.rec(
        int(DURATION * SAMPLE_RATE),
        samplerate=SAMPLE_RATE,
        channels=CHANNELS
    )
    sd.wait()
    audio_chunk = rec.flatten()

    # Extract features and predict
    feat = extract_features(audio_chunk)
    feat = np.expand_dims(feat, axis=0)
    prediction = model.predict(feat, verbose=0)[0][0]

    if prediction > threshold:
        print(f"Wake word detected! (confidence: {prediction:.2f})")
        # Trigger action here
Intent Classification
Pipeline Architecture
Sklearn-based NLP pipeline using TF-IDF vectorization and SVM classification.
Pipeline Configuration
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

modelo = Pipeline([
    ('tfidf', TfidfVectorizer(
        ngram_range=(1, 2),   # Unigrams and bigrams
        lowercase=True,       # Convert to lowercase
        stop_words=None       # No stopword removal
    )),
    ('clf', SVC(
        kernel='linear',      # Linear kernel
        probability=True,     # Enable probability estimates
        C=1.0                 # Regularization parameter
    ))
])
cargar_datos()
Loads training dataset from JSON file.
Parameter: path to the JSON dataset file.
Returns: list of dictionaries with 'text' and 'intent' keys.
data = cargar_datos('/path/to/dataset.json')
print(f"Loaded {len(data)} training examples")
Error Handling:
FileNotFoundError - Dataset file not found
json.JSONDecodeError - Invalid JSON format
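A minimal sketch consistent with this description (the real function may differ; the invalid-entry filtering shown here mirrors the validation entrenar_modelo performs):

```python
import json

def cargar_datos(ruta):
    """Load a JSON list of {'text': ..., 'intent': ...} training examples."""
    with open(ruta, "r", encoding="utf-8") as f:
        data = json.load(f)
    # Keep only entries that carry both required keys
    return [d for d in data if isinstance(d, dict) and "text" in d and "intent" in d]
```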
entrenar_modelo()
Trains the intent classification model with validation.
Parameters: list of training examples with 'text' and 'intent' fields; path where the trained model will be saved (.pkl file).
Returns: modelo (sklearn.pipeline.Pipeline) - the trained pipeline with TfidfVectorizer and SVC classifier.
Training Example
dataset = cargar_datos('dataset.json')
modelo = entrenar_modelo(dataset, 'modelo_entrenado.pkl')
# Model is automatically saved to disk
print("Model trained and saved successfully")
Training Process:
Validates dataset for required fields
Filters out invalid entries
Splits data (80/20 train/test)
Trains TF-IDF + SVM pipeline
Generates classification report
Saves model using joblib
Prediction Methods
predict()
Predicts intent for given text input.
Parameter: list of text strings to classify.
Returns: array of predicted intent labels.
import joblib

modelo = joblib.load('modelo_entrenado.pkl')

# Single prediction
intencion = modelo.predict(["pon música relajante"])[0]
print(f"Intent: {intencion}")

# Batch prediction
textos = [
    "busca películas de acción",
    "apaga las luces",
    "cuál es el clima"
]
intenciones = modelo.predict(textos)
for texto, intent in zip(textos, intenciones):
    print(f"{texto} -> {intent}")
predict_proba()
Returns probability distributions for all possible intents.
Parameter: list of text strings to classify.
Returns: 2D array of shape (n_samples, n_classes) with a probability for each class.
import joblib

modelo = joblib.load('modelo_entrenado.pkl')
texto = "abre netflix por favor"

intencion = modelo.predict([texto])[0]
probabilidades = modelo.predict_proba([texto])
confianza = max(probabilidades[0])

print(f"Intent: {intencion}")
print(f"Confidence: {confianza:.2%}")
# Output:
# Intent: open_app
# Confidence: 94.32%
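In practice, the confidence from predict_proba is often compared against a threshold before acting on an intent. A sketch of that pattern (the 0.6 cutoff and the "desconocido" fallback label are illustrative choices, not values from the project):

```python
def clasificar_con_umbral(clases, probabilidades, umbral=0.6):
    """Return the top intent, or a fallback label when confidence is too low."""
    mejor = max(range(len(probabilidades)), key=probabilidades.__getitem__)
    if probabilidades[mejor] < umbral:
        return "desconocido", probabilidades[mejor]
    return clases[mejor], probabilidades[mejor]
```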
iniciar_interfaz_chat()
Starts interactive command-line interface for testing the model.
Parameter: modelo (sklearn.pipeline.Pipeline, required) - the trained intent classification model.
import joblib

modelo = joblib.load('modelo_entrenado.pkl')
iniciar_interfaz_chat(modelo)

# Interactive prompt:
# Tú: abre netflix
# Intención detectada: [open_app] (Confianza: 95.23%)
# ------------------------------
Features:
Real-time intent detection
Confidence scores for predictions
Type ‘salir’ to exit
Handles empty inputs gracefully
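The loop behind iniciar_interfaz_chat likely resembles this sketch (I/O is made injectable here for testability; the real function presumably calls input and print directly):

```python
def interfaz_chat(modelo, leer=input, escribir=print):
    """Interactive intent-testing loop: type 'salir' to exit; empty lines are skipped."""
    while True:
        texto = leer("Tú: ").strip()
        if texto.lower() == "salir":
            break
        if not texto:
            continue
        intencion = modelo.predict([texto])[0]
        confianza = max(modelo.predict_proba([texto])[0])
        escribir(f"Intención detectada: [{intencion}] (Confianza: {confianza:.2%})")
```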
Common Patterns
Multi-Model Workflow
Voice Assistant Pipeline
import numpy as np
import tensorflow as tf
import joblib

# 1. Load models
wake_model = tf.keras.models.load_model('wake_word_model.h5')
intent_model = joblib.load('modelo_entrenado.pkl')
gemini_engine = KamutiniEngine()

# 2. Wake word detection
audio = record_audio(duration=5)
features = extract_features(audio)
features = np.expand_dims(features, axis=0)  # add batch dimension
wake_prob = wake_model.predict(features, verbose=0)[0][0]

if wake_prob > 0.8:
    # 3. Speech-to-text (not shown)
    user_text = transcribe_audio(audio)

    # 4. Intent classification
    intent = intent_model.predict([user_text])[0]
    confidence = max(intent_model.predict_proba([user_text])[0])

    # 5. Generate response with Gemini
    response, should_exit = gemini_engine.responder(user_text)

    # 6. Text-to-speech
    gemini_engine.hablar_local(response)
Error Handling
Gemini API Errors
import requests

try:
    response = engine.procesar_gemini(consulta)
except requests.exceptions.Timeout:
    print("API request timed out after 15 seconds")
except KeyError:
    print("API response missing 'candidates' field")
    # Fallback response: "Rosario, hubo un tropiezo con el servicio de Google."
Common issues:
Invalid API key
Rate limiting
Network connectivity
Malformed responses
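Transient failures like rate limiting or flaky connectivity are often handled with a simple retry wrapper; a sketch (not part of the project's code):

```python
import time

def llamar_con_reintentos(fn, intentos=3, espera=1.0):
    """Call fn, retrying with exponential backoff on any exception."""
    for i in range(intentos):
        try:
            return fn()
        except Exception:
            if i == intentos - 1:
                raise
            time.sleep(espera * (2 ** i))
```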
LSTM Model Errors
try:
    model = tf.keras.models.load_model('wake_word_model.h5')
except OSError:
    print("Model file not found - train the model first")

try:
    X_train, y_train = prepare_dataset()
    if len(X_train) == 0:
        raise ValueError("No training data found")
except Exception as e:
    print(f"Dataset preparation failed: {e}")
Common issues:
Missing audio files
Incorrect audio format
Model not trained
Audio device errors
Intent Classification Errors
import json

try:
    data = cargar_datos('dataset.json')
except FileNotFoundError:
    print("Dataset file not found")
except json.JSONDecodeError:
    print("Invalid JSON format in dataset")

# Validate dataset entries
if not all('text' in item and 'intent' in item for item in data):
    print("Dataset entries missing required fields")
Common issues:
Missing dataset file
Invalid JSON format
Missing ‘text’ or ‘intent’ fields
Insufficient training data
Performance Recommendations
Gemini API
Set appropriate MAX_TOKENS (250 recommended)
Adjust TEMP for creativity vs consistency
Limit MAX_HISTORIAL to reduce token usage
Use timeout=15 to prevent hanging
LSTM Model
Use batch_size=8 for training
Set DURATION=5 seconds for consistency
Apply dropout (0.3) to prevent overfitting
Use predict(verbose=0) for faster inference
Intent Classification
Linear kernel for faster training
Bigram features for better accuracy
Use C=1.0 for balanced regularization
Enable probability=True for confidence scores
Audio Processing
Use 16kHz sample rate (standard)
Extract 13 MFCC coefficients
Pad/truncate to fixed length
Use mono audio for efficiency
PyTorch Language Model
Overview
Custom GPT-style language model trainer with device auto-detection and dataset normalization.
Source : ~/workspace/source/proyectos/ai creator/kamutini/modelo.py
CONFIG = {
    "model_path": "modelo_ia_ligero.pth",
    "dataset_dir": "kamutini/datasets",
    "batch_size": 16,
    "block_size": 256,      # Context window
    "n_embd": 512,          # Embedding dimension
    "n_head": 8,            # Attention heads
    "n_layer": 8,           # Transformer layers
    "dropout": 0.1,
    "learning_rate": 5e-4,
    "max_steps": 2000,
    "temperature": 0.7,     # Generation randomness
    "top_k": 50,            # Top-k sampling
    "max_new_tokens": 500,  # Max generation length
    "eos_token": "<|endoftext|>"
}
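The temperature and top_k entries control sampling at generation time. A small pure-Python illustration of how they reshape the next-token distribution (for illustration only; the trainer itself does this with torch tensors):

```python
import math

def top_k_probs(logits, temperature=0.7, k=50):
    """Temperature-scale logits, zero out everything outside the top-k, then softmax."""
    scaled = [l / temperature for l in logits]
    cutoff = sorted(scaled, reverse=True)[min(k, len(scaled)) - 1]
    filtered = [s if s >= cutoff else float("-inf") for s in scaled]
    m = max(filtered)
    exps = [math.exp(s - m) for s in filtered]
    total = sum(exps)
    return [e / total for e in exps]
```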
Device Auto-Detection
Automatically selects the best available device.
import torch

def get_best_device():
    """
    Returns:
        "cuda" if an NVIDIA GPU is available
        "mps" if an Apple Silicon GPU is available
        "cpu" otherwise
    """
    if torch.cuda.is_available():
        return "cuda"
    elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        return "mps"
    else:
        return "cpu"

DEVICE = get_best_device()
The model automatically optimizes the thread count for CPU training:

torch.set_num_threads(os.cpu_count())
LocalOptimizedDataset
Custom PyTorch Dataset that uses DataNormalizer to load multiple JSON/CSV files.
class LocalOptimizedDataset(Dataset):
    def __init__(self, directory_path, block_size, simple_load=False, chars_fixed=None):
        # Initialize DataNormalizer
        self.normalizer = info.DataNormalizer(eos_token=CONFIG['eos_token'])
        formatted_lines = []
        files = [f for f in os.listdir(directory_path)
                 if f.endswith(('.json', '.csv'))]
        print(f"Loading and normalizing {len(files)} files...")

        for filename in files:
            filepath = os.path.join(directory_path, filename)
            # Process JSON files
            if filename.endswith('.json'):
                with open(filepath, 'r', encoding='utf-8') as f:
                    raw_data = json.load(f)
                if not isinstance(raw_data, list):
                    raw_data = [raw_data]
                for item in raw_data:
                    norm_text = self.normalizer.normalize_entry(item)
                    if norm_text:
                        formatted_lines.append(norm_text)
            # Process CSV files
            elif filename.endswith('.csv'):
                df = pd.read_csv(filepath)
                for item in df.to_dict('records'):
                    norm_text = self.normalizer.normalize_entry(item)
                    if norm_text:
                        formatted_lines.append(norm_text)

        text_data = "\n\n".join(formatted_lines)
        # Tokenization follows...
directory_path - Path to the directory containing JSON/CSV training files
block_size - Context window size (tokens per training sample)
simple_load - Skip advanced preprocessing if True
chars_fixed - Pre-defined character vocabulary (auto-detected if None)
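DataNormalizer lives in the external info module and is not shown here. A plausible sketch of normalize_entry, assuming chat-style records and the "### Humano: ...### Asistente: ..." prompt format used in the generation example below (the field names "text"/"prompt" and "response"/"completion" are guesses):

```python
EOS_TOKEN = "<|endoftext|>"

def normalize_entry(item, eos_token=EOS_TOKEN):
    """Turn one dataset record into a single training string, or None if unusable."""
    pregunta = item.get("text") or item.get("prompt")
    respuesta = item.get("response") or item.get("completion")
    if pregunta and respuesta:
        return f"### Humano: {pregunta}### Asistente: {respuesta}{eos_token}"
    if pregunta:
        return f"{pregunta}{eos_token}"
    return None
```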
Model Architecture
Transformer-based decoder with configurable depth.
Embedding : Token and position embeddings (512-dim)
Transformer Blocks : 8 layers with multi-head attention (8 heads)
Dropout : 0.1 for regularization
Output : Linear layer to vocabulary size
Parameters : ~50M depending on vocabulary size
Training Pipeline
# Load dataset
dataset = LocalOptimizedDataset(
    directory_path="kamutini/datasets",
    block_size=CONFIG['block_size']
)

# Create data loader
loader = DataLoader(
    dataset,
    batch_size=CONFIG['batch_size'],
    shuffle=True
)

# Train for the specified number of steps
for step in range(CONFIG['max_steps']):
    x, y = next(iter(loader))
    x, y = x.to(DEVICE), y.to(DEVICE)

    logits, loss = model(x, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % 100 == 0:
        print(f"Step {step}: Loss = {loss.item():.4f}")

# Save the model
torch.save(model.state_dict(), CONFIG['model_path'])
Text Generation
def generate_text(model, prompt, max_tokens=500, temperature=0.7, top_k=50):
    """
    Generate a text continuation from a prompt.
    Args:
        prompt: Starting text
        max_tokens: Maximum tokens to generate
        temperature: Randomness (0.0 = deterministic, 1.0 = creative)
        top_k: Sample from the top-k most likely tokens
    Returns:
        Generated text string
    """
    model.eval()
    with torch.no_grad():
        # Encode prompt
        tokens = encode(prompt)
        x = torch.tensor([tokens], dtype=torch.long).to(DEVICE)

        # Generate
        for _ in range(max_tokens):
            logits, _ = model(x)  # targets omitted during inference
            logits = logits[:, -1, :] / temperature

            # Top-k filtering
            if top_k > 0:
                v, _ = torch.topk(logits, top_k)
                logits[logits < v[:, [-1]]] = -float('Inf')

            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            x = torch.cat([x, next_token], dim=1)

            # Check for EOS
            if decode([next_token.item()]) == CONFIG['eos_token']:
                break

    return decode(x[0].tolist())

# Usage
response = generate_text(
    model,
    prompt="### Humano: ¿Cómo estás?### Asistente:",
    temperature=0.7,
    top_k=50
)
print(response)
Model training requires significant RAM (4GB+) and benefits greatly from GPU acceleration. Training on CPU is possible but slow (hours vs minutes).
For faster iteration, start with max_steps=500 and n_layer=4 to validate your dataset, then scale up to full configuration.