
Overview

Grupo de Anda’s AI system includes three core model types:
  • Gemini API Integration - Cloud-based conversational AI using Google’s Gemini 1.5 Flash
  • LSTM Wake Word Detection - On-device neural network for voice activation
  • Intent Classification - Sklearn-based NLP pipeline for understanding user commands

Gemini API Integration

Configuration

The Gemini API requires configuration for model parameters and API authentication.
# API Authentication
API_KEY_GEMINI = "your-gemini-api-key"
GEMINI_MODEL = "gemini-1.5-flash"

# Generation Parameters
TEMP = 0.6            # Temperature (creativity)
TOP_K = 40            # Top-K sampling
TOP_P = 0.9           # Top-P (nucleus) sampling
MAX_TOKENS = 250      # Maximum output tokens

# Memory Configuration
MAX_HISTORIAL = 10    # Conversation history limit

KamutiniEngine Class

Main engine class for conversational AI with device control capabilities.
__init__()

Initializes the Kamutini engine with TV device scanning and audio setup.
engine = KamutiniEngine()
Initialization Steps:
  • Scans local network for Roku TV devices
  • Sets up pygame mixer for audio output
  • Initializes conversation history list

procesar_gemini()

Processes user queries using Google’s Gemini API.
Parameters:
  • consulta (string, required) - User’s input text to be processed by the AI
Returns:
  • response (string) - AI-generated response text with embedded command tags
engine = KamutiniEngine()
response = engine.procesar_gemini("¿Qué hora es?")
print(response)
API Endpoint:
https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent?key={API_KEY}
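A request to this endpoint can be sketched as follows. The request-body layout (`contents`, `generationConfig`) follows the public generateContent REST API; `build_payload` and `procesar_gemini_sketch` are illustrative names, not the actual implementation in `KamutiniEngine`:

```python
import requests

API_KEY_GEMINI = "your-gemini-api-key"
URL = ("https://generativelanguage.googleapis.com/v1beta/models/"
       "gemini-1.5-flash:generateContent?key=" + API_KEY_GEMINI)

def build_payload(consulta, historial=None):
    """Build a generateContent request body using the parameters above."""
    contents = list(historial or [])
    contents.append({"role": "user", "parts": [{"text": consulta}]})
    return {
        "contents": contents,
        "generationConfig": {
            "temperature": 0.6,      # TEMP
            "topK": 40,              # TOP_K
            "topP": 0.9,             # TOP_P
            "maxOutputTokens": 250,  # MAX_TOKENS
        },
    }

def procesar_gemini_sketch(consulta):
    """Minimal call sketch; the real method also manages history and errors."""
    resp = requests.post(URL, json=build_payload(consulta), timeout=15)
    resp.raise_for_status()
    return resp.json()["candidates"][0]["content"]["parts"][0]["text"]
```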

responder()

Complete response pipeline with device control and memory management.
Parameters:
  • consulta (string, required) - User’s query text
Returns:
  • tuple - (response_text: str, should_exit: bool)
  • response_text: Cleaned AI response without command tags
  • should_exit: Boolean indicating if conversation should end
engine = KamutiniEngine()
response, should_exit = engine.responder("Abre Netflix")
print(response)

if should_exit:
    print("Ending conversation")
Command Tags:
  • /*app(name)*/ - Opens specified Roku application
  • /*search(app, query)*/ - Searches content in app
  • /*home*/ - Returns to home screen
  • /*power*/ - Powers off TV
  • /*resultados(query)*/ - Performs Google search
  • /*salir*/ - Ends conversation
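Tags like the above can be split from the surrounding response text with a small regex helper; `extract_commands` and `TAG_PATTERN` are illustrative names, not part of the documented API:

```python
import re

# Matches /*name*/ or /*name(arg1, arg2)*/ command tags.
TAG_PATTERN = re.compile(r"/\*(\w+)(?:\(([^)]*)\))?\*/")

def extract_commands(response):
    """Return (clean_text, commands): the response stripped of tags,
    plus a list of (name, args) tuples parsed from the tags."""
    commands = []
    for match in TAG_PATTERN.finditer(response):
        name, raw_args = match.group(1), match.group(2)
        args = [a.strip() for a in raw_args.split(",")] if raw_args else []
        commands.append((name, args))
    clean = TAG_PATTERN.sub("", response).strip()
    return clean, commands
```

For example, `extract_commands("Abriendo Netflix /*app(Netflix)*/")` yields the cleaned text and a single `("app", ["Netflix"])` command.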

google_search_custom()

Performs custom Google searches using the Custom Search API.
Parameters:
  • query (string, required) - Search query text
Returns:
  • results (string) - Formatted string with the top 3 search result snippets, or an error message
Example
results = google_search_custom("clima en guadalajara")
print(results)
# Output: "Rosario, encontré esto: [snippet 1] [snippet 2] [snippet 3]"

LSTM Wake Word Detection

Model Architecture

Deep learning model for real-time wake word detection using MFCC features.
# Audio Parameters
SAMPLE_RATE = 16000   # 16 kHz sampling rate
DURATION = 5          # Audio duration in seconds
CHANNELS = 1          # Mono audio
N_MFCC = 13           # Number of MFCC coefficients

# Model Parameters
LSTM_UNITS = 64       # LSTM layer size
DROPOUT = 0.3         # Dropout rate
DENSE_UNITS = 32      # Dense layer size

extract_features()

Extracts MFCC (Mel-Frequency Cepstral Coefficients) features from audio data.
Parameters:
  • audio_data (numpy.ndarray, required) - Raw audio data as a numpy array
Returns:
  • mfcc (numpy.ndarray) - MFCC feature matrix with shape (time_steps, n_mfcc)
  • Automatically pads or truncates audio to target length
  • Returns transposed MFCC matrix for LSTM input
Example
import librosa
import numpy as np

# Load audio file
audio, sr = librosa.load('wake_word.wav', sr=16000, duration=5)

# Extract features
mfcc_features = extract_features(audio)
print(mfcc_features.shape)  # (time_steps, 13)
Feature Processing:
  1. Pads audio with zeros if too short
  2. Truncates audio if too long
  3. Computes 13 MFCC coefficients
  4. Transposes to (time_steps, n_mfcc) format

prepare_dataset()

Loads and prepares training data from positive and negative audio samples.
Returns:
  • tuple - (X: np.ndarray, y: np.ndarray)
  • X: Feature arrays with shape (n_samples, time_steps, n_mfcc)
  • y: Binary labels (1 for wake word, 0 for other sounds)
Example
X_train, y_train = prepare_dataset()
print(f"Training samples: {len(X_train)}")
print(f"Feature shape: {X_train[0].shape}")
print(f"Labels: {np.unique(y_train)}")
Directory Structure:
wake word/
├── audios/          # Positive samples (wake word)
├── audios bad/      # Negative samples (other sounds)
└── wake_word_model.h5  # Saved model

build_model()

Constructs the LSTM neural network architecture.
Parameters:
  • input_shape (tuple, required) - Shape of input features (time_steps, n_mfcc)
Returns:
  • model (tensorflow.keras.Model) - Compiled Keras model ready for training
  • Loss: binary_crossentropy
  • Optimizer: adam
  • Metrics: accuracy
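A minimal sketch of an architecture matching this description, using the parameters above; the project’s exact layer stack may differ:

```python
import tensorflow as tf

LSTM_UNITS = 64
DROPOUT = 0.3
DENSE_UNITS = 32

def build_model(input_shape):
    """Compile an LSTM binary classifier for wake word detection."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.LSTM(LSTM_UNITS),
        tf.keras.layers.Dropout(DROPOUT),
        tf.keras.layers.Dense(DENSE_UNITS, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid'),  # wake word probability
    ])
    model.compile(loss='binary_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    return model
```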
Example
X_train, y_train = prepare_dataset()
model = build_model(X_train[0].shape)

# Train the model
model.fit(
    X_train, y_train,
    epochs=40,
    batch_size=8,
    validation_split=0.2
)

# Save the model
model.save('wake_word_model.h5')

Real-time Detection

Live wake word detection using sounddevice.
Detection Loop
import numpy as np
import sounddevice as sd
import tensorflow as tf

# Load trained model
model = tf.keras.models.load_model('wake_word_model.h5')

threshold = 0.8  # Confidence threshold

while True:
    # Record 5 seconds of audio
    rec = sd.rec(
        int(DURATION * SAMPLE_RATE),
        samplerate=SAMPLE_RATE,
        channels=CHANNELS
    )
    sd.wait()
    audio_chunk = rec.flatten()
    
    # Extract features and predict
    feat = extract_features(audio_chunk)
    feat = np.expand_dims(feat, axis=0)
    
    prediction = model.predict(feat, verbose=0)[0][0]
    
    if prediction > threshold:
        print(f"Wake word detected! (confidence: {prediction:.2f})")
        # Trigger action here

Intent Classification

Pipeline Architecture

Sklearn-based NLP pipeline using TF-IDF vectorization and SVM classification.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

modelo = Pipeline([
    ('tfidf', TfidfVectorizer(
        ngram_range=(1, 2),    # Unigrams and bigrams
        lowercase=True,         # Convert to lowercase
        stop_words=None         # No stopword removal
    )),
    ('clf', SVC(
        kernel='linear',        # Linear kernel
        probability=True,       # Enable probability estimates
        C=1.0                   # Regularization parameter
    ))
])

cargar_datos()

Loads training dataset from JSON file.
Parameters:
  • ruta_archivo (string, required) - Path to the JSON dataset file
Returns:
  • data (list) - List of dictionaries with ‘text’ and ‘intent’ keys
Example
data = cargar_datos('/path/to/dataset.json')
print(f"Loaded {len(data)} training examples")
Error Handling:
  • FileNotFoundError - Dataset file not found
  • json.JSONDecodeError - Invalid JSON format
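A sketch of the loader consistent with the error behavior listed above (the documented function simply propagates both exceptions):

```python
import json

def cargar_datos(ruta_archivo):
    """Load the JSON training dataset from disk."""
    with open(ruta_archivo, 'r', encoding='utf-8') as f:  # FileNotFoundError if missing
        data = json.load(f)                               # json.JSONDecodeError if malformed
    return data
```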

entrenar_modelo()

Trains the intent classification model with validation.
Parameters:
  • dataset (list, required) - List of training examples with ‘text’ and ‘intent’ fields
  • ruta_guardado (string, required) - Path where the trained model will be saved (.pkl file)
Returns:
  • modelo (sklearn.pipeline.Pipeline) - Trained pipeline with TfidfVectorizer and SVC classifier
dataset = cargar_datos('dataset.json')
modelo = entrenar_modelo(dataset, 'modelo_entrenado.pkl')

# Model is automatically saved to disk
print("Model trained and saved successfully")
Training Process:
  1. Validates dataset for required fields
  2. Filters out invalid entries
  3. Splits data (80/20 train/test)
  4. Trains TF-IDF + SVM pipeline
  5. Generates classification report
  6. Saves model using joblib
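The six steps above can be sketched as follows; the names mirror the documented API, but the details (validation rules, report formatting) are assumptions:

```python
import joblib
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

def entrenar_modelo(dataset, ruta_guardado):
    """Train and persist the TF-IDF + SVM intent classifier."""
    # 1-2. Validate required fields and filter invalid entries
    valid = [d for d in dataset if d.get('text') and d.get('intent')]
    texts = [d['text'] for d in valid]
    intents = [d['intent'] for d in valid]
    # 3. 80/20 train/test split
    X_tr, X_te, y_tr, y_te = train_test_split(
        texts, intents, test_size=0.2, random_state=42)
    # 4. TF-IDF + SVM pipeline
    modelo = Pipeline([
        ('tfidf', TfidfVectorizer(ngram_range=(1, 2), lowercase=True)),
        ('clf', SVC(kernel='linear', probability=True, C=1.0)),
    ])
    modelo.fit(X_tr, y_tr)
    # 5. Classification report on the held-out split
    print(classification_report(y_te, modelo.predict(X_te)))
    # 6. Persist with joblib
    joblib.dump(modelo, ruta_guardado)
    return modelo
```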

Prediction Methods

predict()

Predicts intent for given text input.
Parameters:
  • text (list[string], required) - List of text strings to classify
Returns:
  • intents (numpy.ndarray) - Array of predicted intent labels
Example
modelo = joblib.load('modelo_entrenado.pkl')

# Single prediction
intencion = modelo.predict(["pon música relajante"])[0]
print(f"Intent: {intencion}")

# Batch prediction
textos = [
    "busca películas de acción",
    "apaga las luces",
    "cuál es el clima"
]
intenciones = modelo.predict(textos)
for texto, intent in zip(textos, intenciones):
    print(f"{texto} -> {intent}")

predict_proba()

Returns probability distributions for all possible intents.
Parameters:
  • text (list[string], required) - List of text strings to classify
Returns:
  • probabilities (numpy.ndarray) - 2D array of shape (n_samples, n_classes) with a probability for each class
Example with Confidence
modelo = joblib.load('modelo_entrenado.pkl')

texto = "abre netflix por favor"
intencion = modelo.predict([texto])[0]
probabilidades = modelo.predict_proba([texto])
confianza = max(probabilidades[0])

print(f"Intent: {intencion}")
print(f"Confidence: {confianza:.2%}")

# Output:
# Intent: open_app
# Confidence: 94.32%

iniciar_interfaz_chat()

Starts interactive command-line interface for testing the model.
Parameters:
  • modelo (sklearn.pipeline.Pipeline, required) - Trained intent classification model
Interactive Testing
import joblib

modelo = joblib.load('modelo_entrenado.pkl')
iniciar_interfaz_chat(modelo)

# Interactive prompt:
# Tú: abre netflix
# Intención detectada: [open_app] (Confianza: 95.23%)
# ------------------------------
Features:
  • Real-time intent detection
  • Confidence scores for predictions
  • Type ‘salir’ to exit
  • Handles empty inputs gracefully
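A minimal sketch of the interactive loop with the features listed above; the real implementation may format its output differently:

```python
def iniciar_interfaz_chat(modelo):
    """Command-line loop for testing intent predictions interactively."""
    while True:
        texto = input("Tú: ").strip()
        if not texto:                # handle empty input gracefully
            continue
        if texto.lower() == 'salir': # type 'salir' to exit
            break
        intencion = modelo.predict([texto])[0]
        confianza = max(modelo.predict_proba([texto])[0])
        print(f"Intención detectada: [{intencion}] (Confianza: {confianza:.2%})")
        print("-" * 30)
```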

Common Patterns

Multi-Model Workflow

import numpy as np
import tensorflow as tf
import joblib

# 1. Load models
wake_model = tf.keras.models.load_model('wake_word_model.h5')
intent_model = joblib.load('modelo_entrenado.pkl')
gemini_engine = KamutiniEngine()

# 2. Wake word detection
audio = record_audio(duration=5)
features = extract_features(audio)
wake_prob = wake_model.predict(features)[0][0]

if wake_prob > 0.8:
    # 3. Speech-to-text (not shown)
    user_text = transcribe_audio(audio)
    
    # 4. Intent classification
    intent = intent_model.predict([user_text])[0]
    confidence = max(intent_model.predict_proba([user_text])[0])
    
    # 5. Generate response with Gemini
    response, should_exit = gemini_engine.responder(user_text)
    
    # 6. Text-to-speech
    gemini_engine.hablar_local(response)

Error Handling

import requests

try:
    response = engine.procesar_gemini(consulta)
except requests.exceptions.Timeout:
    print("API request timed out after 15 seconds")
except KeyError:
    print("API response missing 'candidates' field")
    # Returns: "Rosario, hubo un tropiezo con el servicio de Google."
Common issues:
  • Invalid API key
  • Rate limiting
  • Network connectivity
  • Malformed responses
try:
    model = tf.keras.models.load_model('wake_word_model.h5')
except OSError:
    print("Model file not found - train the model first")

try:
    X_train, y_train = prepare_dataset()
    if len(X_train) == 0:
        raise ValueError("No training data found")
except Exception as e:
    print(f"Dataset preparation failed: {e}")
Common issues:
  • Missing audio files
  • Incorrect audio format
  • Model not trained
  • Audio device errors
try:
    data = cargar_datos('dataset.json')
except FileNotFoundError:
    print("Dataset file not found")
except json.JSONDecodeError:
    print("Invalid JSON format in dataset")

# Validate dataset entries
if not all('text' in item and 'intent' in item for item in data):
    print("Dataset entries missing required fields")
Common issues:
  • Missing dataset file
  • Invalid JSON format
  • Missing ‘text’ or ‘intent’ fields
  • Insufficient training data

Performance Optimization

Gemini API

  • Set appropriate MAX_TOKENS (250 recommended)
  • Adjust TEMP for creativity vs consistency
  • Limit MAX_HISTORIAL to reduce token usage
  • Use timeout=15 to prevent hanging

LSTM Model

  • Use batch_size=8 for training
  • Set DURATION=5 seconds for consistency
  • Apply dropout (0.3) to prevent overfitting
  • Use predict(verbose=0) for faster inference

Intent Classification

  • Linear kernel for faster training
  • Bigram features for better accuracy
  • Use C=1.0 for balanced regularization
  • Enable probability=True for confidence scores

Audio Processing

  • Use 16kHz sample rate (standard)
  • Extract 13 MFCC coefficients
  • Pad/truncate to fixed length
  • Use mono audio for efficiency

PyTorch Language Model

Overview

Custom GPT-style language model trainer with device auto-detection and dataset normalization.
Source: ~/workspace/source/proyectos/ai creator/kamutini/modelo.py
Configuration
CONFIG = {
    "model_path": "modelo_ia_ligero.pth",
    "dataset_dir": "kamutini/datasets",
    "batch_size": 16,        
    "block_size": 256,       # Context window
    "n_embd": 512,           # Embedding dimension
    "n_head": 8,             # Attention heads
    "n_layer": 8,            # Transformer layers
    "dropout": 0.1,          
    "learning_rate": 5e-4,   
    "max_steps": 2000,       
    "temperature": 0.7,      # Generation randomness
    "top_k": 50,             # Top-k sampling
    "max_new_tokens": 500,   # Max generation length
    "eos_token": "<|endoftext|>"
}

Device Auto-Detection

Automatically selects the best available device.
get_best_device()
def get_best_device():
    """
    Returns:
        "cuda" if NVIDIA GPU available
        "mps" if Apple Silicon GPU available
        "cpu" otherwise
    """
    if torch.cuda.is_available():
        return "cuda"
    elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        return "mps"
    else:
        return "cpu"

DEVICE = get_best_device()
The model automatically optimizes thread count for CPU training: torch.set_num_threads(os.cpu_count())

LocalOptimizedDataset

Custom PyTorch Dataset that uses DataNormalizer to load multiple JSON/CSV files.
Dataset Class
class LocalOptimizedDataset(Dataset):
    def __init__(self, directory_path, block_size, simple_load=False, chars_fixed=None):
        # Initialize DataNormalizer
        self.normalizer = info.DataNormalizer(eos_token=CONFIG['eos_token'])
        
        formatted_lines = []
        files = [f for f in os.listdir(directory_path) 
                if f.endswith(('.json', '.csv'))]
        
        print(f"Loading and normalizing {len(files)} files...")
        
        for filename in files:
            filepath = os.path.join(directory_path, filename)
            # Process JSON files
            if filename.endswith('.json'):
                with open(filepath, 'r', encoding='utf-8') as f:
                    raw_data = json.load(f)
                    if not isinstance(raw_data, list):
                        raw_data = [raw_data]
                    
                    for item in raw_data:
                        norm_text = self.normalizer.normalize_entry(item)
                        if norm_text:
                            formatted_lines.append(norm_text)
            
            # Process CSV files
            elif filename.endswith('.csv'):
                df = pd.read_csv(filepath)
                for item in df.to_dict('records'):
                    norm_text = self.normalizer.normalize_entry(item)
                    if norm_text:
                        formatted_lines.append(norm_text)
        
        text_data = "\n\n".join(formatted_lines)
        # Tokenization follows...
Parameters:
  • directory_path (str, required) - Path to directory containing JSON/CSV training files
  • block_size (int, required) - Context window size (tokens per training sample)
  • simple_load (bool, default: False) - Skip advanced preprocessing if True
  • chars_fixed (list, optional) - Pre-defined character vocabulary (auto-detected if None)

Model Architecture

Transformer-based decoder with configurable depth.
  • Embedding: Token and position embeddings (512-dim)
  • Transformer Blocks: 8 layers with multi-head attention (8 heads)
  • Dropout: 0.1 for regularization
  • Output: Linear layer to vocabulary size
  • Parameters: ~50M depending on vocabulary size

Training Pipeline

Training Example
# Load dataset
dataset = LocalOptimizedDataset(
    directory_path="kamutini/datasets",
    block_size=CONFIG['block_size']
)

# Create data loader
loader = DataLoader(
    dataset,
    batch_size=CONFIG['batch_size'],
    shuffle=True
)

# Train for specified steps
data_iter = iter(loader)
for step in range(CONFIG['max_steps']):
    try:
        x, y = next(data_iter)
    except StopIteration:
        data_iter = iter(loader)  # restart the loader once exhausted
        x, y = next(data_iter)
    x, y = x.to(DEVICE), y.to(DEVICE)
    
    logits, loss = model(x, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    if step % 100 == 0:
        print(f"Step {step}: Loss = {loss.item():.4f}")

# Save model
torch.save(model.state_dict(), CONFIG['model_path'])

Text Generation

Generation Example
import torch
import torch.nn.functional as F

def generate_text(model, prompt, max_tokens=500, temperature=0.7, top_k=50):
    """
    Generate text continuation from a prompt.
    
    Args:
        prompt: Starting text
        max_tokens: Maximum tokens to generate
        temperature: Randomness (0.0 = deterministic, 1.0 = creative)
        top_k: Sample from top-k most likely tokens
    
    Returns:
        Generated text string
    """
    model.eval()
    with torch.no_grad():
        # Encode prompt
        tokens = encode(prompt)
        x = torch.tensor([tokens], dtype=torch.long).to(DEVICE)
        
        # Generate
        for _ in range(max_tokens):
            logits, _ = model(x)  # loss is None when no targets are given
            logits = logits[:, -1, :] / temperature
            
            # Top-k filtering
            if top_k > 0:
                v, _ = torch.topk(logits, top_k)
                logits[logits < v[:, [-1]]] = -float('Inf')
            
            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            x = torch.cat([x, next_token], dim=1)
            
            # Check for EOS
            if decode([next_token.item()]) == CONFIG['eos_token']:
                break
        
        return decode(x[0].tolist())

# Usage
response = generate_text(
    model,
    prompt="### Humano: ¿Cómo estás?### Asistente:",
    temperature=0.7,
    top_k=50
)
print(response)
Model training requires significant RAM (4GB+) and benefits greatly from GPU acceleration. Training on CPU is possible but slow (hours vs minutes).
For faster iteration, start with max_steps=500 and n_layer=4 to validate your dataset, then scale up to full configuration.
