
Overview

Grupo de Anda’s AI system includes three core model types:
  • Gemini API Integration - Cloud-based conversational AI using Google’s Gemini 1.5 Flash
  • LSTM Wake Word Detection - On-device neural network for voice activation
  • Intent Classification - Sklearn-based NLP pipeline for understanding user commands

Gemini API Integration

Configuration

The Gemini API requires configuration for model parameters and API authentication.
# API Authentication
API_KEY_GEMINI = "your-gemini-api-key"
GEMINI_MODEL = "gemini-1.5-flash"

# Generation Parameters
TEMP = 0.6            # Temperature (creativity)
TOP_K = 40            # Top-K sampling
TOP_P = 0.9           # Top-P (nucleus) sampling
MAX_TOKENS = 250      # Maximum output tokens

# Memory Configuration
MAX_HISTORIAL = 10    # Conversation history limit

KamutiniEngine Class

Main engine class for conversational AI with device control capabilities.
__init__()

Initializes the Kamutini engine with TV device scanning and audio setup.
engine = KamutiniEngine()
Initialization Steps:
  • Scans local network for Roku TV devices
  • Sets up pygame mixer for audio output
  • Initializes conversation history list

procesar_gemini()

Processes user queries using Google’s Gemini API.
Parameters:
  • consulta (string, required) - User’s input text to be processed by the AI
Returns:
  • response (string) - AI-generated response text with embedded command tags
engine = KamutiniEngine()
response = engine.procesar_gemini("¿Qué hora es?")
print(response)
API Endpoint:
https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent?key={API_KEY}
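A request to this endpoint can be sketched as follows. The request-body layout (`contents`, `generationConfig`) follows the public generateContent REST API; `build_payload` and `procesar_gemini_sketch` are illustrative names, not the actual implementation in `KamutiniEngine`:

```python
import requests

API_KEY_GEMINI = "your-gemini-api-key"
URL = ("https://generativelanguage.googleapis.com/v1beta/models/"
       "gemini-1.5-flash:generateContent?key=" + API_KEY_GEMINI)

def build_payload(consulta, historial=None):
    """Build a generateContent request body using the parameters above."""
    contents = list(historial or [])
    contents.append({"role": "user", "parts": [{"text": consulta}]})
    return {
        "contents": contents,
        "generationConfig": {
            "temperature": 0.6,      # TEMP
            "topK": 40,              # TOP_K
            "topP": 0.9,             # TOP_P
            "maxOutputTokens": 250,  # MAX_TOKENS
        },
    }

def procesar_gemini_sketch(consulta):
    """Minimal call sketch; the real method also manages history and errors."""
    resp = requests.post(URL, json=build_payload(consulta), timeout=15)
    resp.raise_for_status()
    return resp.json()["candidates"][0]["content"]["parts"][0]["text"]
```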

responder()

Complete response pipeline with device control and memory management.
Parameters:
  • consulta (string, required) - User’s query text
Returns:
  • tuple - (response_text: str, should_exit: bool)
  • response_text: Cleaned AI response without command tags
  • should_exit: Boolean indicating if conversation should end
engine = KamutiniEngine()
response, should_exit = engine.responder("Abre Netflix")
print(response)

if should_exit:
    print("Ending conversation")
Command Tags:
  • /*app(name)*/ - Opens specified Roku application
  • /*search(app, query)*/ - Searches content in app
  • /*home*/ - Returns to home screen
  • /*power*/ - Powers off TV
  • /*resultados(query)*/ - Performs Google search
  • /*salir*/ - Ends conversation
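Tags like the above can be split from the surrounding response text with a small regex helper; `extract_commands` and `TAG_PATTERN` are illustrative names, not part of the documented API:

```python
import re

# Matches /*name*/ or /*name(arg1, arg2)*/ command tags.
TAG_PATTERN = re.compile(r"/\*(\w+)(?:\(([^)]*)\))?\*/")

def extract_commands(response):
    """Return (clean_text, commands): the response stripped of tags,
    plus a list of (name, args) tuples parsed from the tags."""
    commands = []
    for match in TAG_PATTERN.finditer(response):
        name, raw_args = match.group(1), match.group(2)
        args = [a.strip() for a in raw_args.split(",")] if raw_args else []
        commands.append((name, args))
    clean = TAG_PATTERN.sub("", response).strip()
    return clean, commands
```

For example, `extract_commands("Abriendo Netflix /*app(Netflix)*/")` yields the cleaned text and a single `("app", ["Netflix"])` command.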

google_search_custom()

Performs custom Google searches using the Custom Search API.
Parameters:
  • query (string, required) - Search query text
Returns:
  • results (string) - Formatted string with the top 3 search result snippets, or an error message
Example
results = google_search_custom("clima en guadalajara")
print(results)
# Output: "Rosario, encontré esto: [snippet 1] [snippet 2] [snippet 3]"

LSTM Wake Word Detection

Model Architecture

Deep learning model for real-time wake word detection using MFCC features.
# Audio Parameters
SAMPLE_RATE = 16000   # 16 kHz sampling rate
DURATION = 5          # Audio duration in seconds
CHANNELS = 1          # Mono audio
N_MFCC = 13           # Number of MFCC coefficients

# Model Parameters
LSTM_UNITS = 64       # LSTM layer size
DROPOUT = 0.3         # Dropout rate
DENSE_UNITS = 32      # Dense layer size

extract_features()

Extracts MFCC (Mel-Frequency Cepstral Coefficients) features from audio data.
Parameters:
  • audio_data (numpy.ndarray, required) - Raw audio data as a numpy array
Returns:
  • mfcc (numpy.ndarray) - MFCC feature matrix with shape (time_steps, n_mfcc)
  • Automatically pads or truncates audio to target length
  • Returns transposed MFCC matrix for LSTM input
Example
import librosa
import numpy as np

# Load audio file
audio, sr = librosa.load('wake_word.wav', sr=16000, duration=5)

# Extract features
mfcc_features = extract_features(audio)
print(mfcc_features.shape)  # (time_steps, 13)
Feature Processing:
  1. Pads audio with zeros if too short
  2. Truncates audio if too long
  3. Computes 13 MFCC coefficients
  4. Transposes to (time_steps, n_mfcc) format

prepare_dataset()

Loads and prepares training data from positive and negative audio samples.
Returns:
  • tuple - (X: np.ndarray, y: np.ndarray)
  • X: Feature arrays with shape (n_samples, time_steps, n_mfcc)
  • y: Binary labels (1 for wake word, 0 for other sounds)
Example
X_train, y_train = prepare_dataset()
print(f"Training samples: {len(X_train)}")
print(f"Feature shape: {X_train[0].shape}")
print(f"Labels: {np.unique(y_train)}")
Directory Structure:
wake word/
├── audios/          # Positive samples (wake word)
├── audios bad/      # Negative samples (other sounds)
└── wake_word_model.h5  # Saved model

build_model()

Constructs the LSTM neural network architecture.
Parameters:
  • input_shape (tuple, required) - Shape of input features (time_steps, n_mfcc)
Returns:
  • model (tensorflow.keras.Model) - Compiled Keras model ready for training
  • Loss: binary_crossentropy
  • Optimizer: adam
  • Metrics: accuracy
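A minimal sketch of an architecture matching this description, using the parameters above; the project’s exact layer stack may differ:

```python
import tensorflow as tf

LSTM_UNITS = 64
DROPOUT = 0.3
DENSE_UNITS = 32

def build_model(input_shape):
    """Compile an LSTM binary classifier for wake word detection."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.LSTM(LSTM_UNITS),
        tf.keras.layers.Dropout(DROPOUT),
        tf.keras.layers.Dense(DENSE_UNITS, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid'),  # wake word probability
    ])
    model.compile(loss='binary_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    return model
```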
Example
X_train, y_train = prepare_dataset()
model = build_model(X_train[0].shape)

# Train the model
model.fit(
    X_train, y_train,
    epochs=40,
    batch_size=8,
    validation_split=0.2
)

# Save the model
model.save('wake_word_model.h5')

Real-time Detection

Live wake word detection using sounddevice.
Detection Loop
import numpy as np
import sounddevice as sd
import tensorflow as tf

# Load trained model
model = tf.keras.models.load_model('wake_word_model.h5')

threshold = 0.8  # Confidence threshold

while True:
    # Record 5 seconds of audio
    rec = sd.rec(
        int(DURATION * SAMPLE_RATE),
        samplerate=SAMPLE_RATE,
        channels=CHANNELS
    )
    sd.wait()
    audio_chunk = rec.flatten()
    
    # Extract features and predict
    feat = extract_features(audio_chunk)
    feat = np.expand_dims(feat, axis=0)
    
    prediction = model.predict(feat, verbose=0)[0][0]
    
    if prediction > threshold:
        print(f"Wake word detected! (confidence: {prediction:.2f})")
        # Trigger action here

Intent Classification

Pipeline Architecture

Sklearn-based NLP pipeline using TF-IDF vectorization and SVM classification.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

modelo = Pipeline([
    ('tfidf', TfidfVectorizer(
        ngram_range=(1, 2),    # Unigrams and bigrams
        lowercase=True,         # Convert to lowercase
        stop_words=None         # No stopword removal
    )),
    ('clf', SVC(
        kernel='linear',        # Linear kernel
        probability=True,       # Enable probability estimates
        C=1.0                   # Regularization parameter
    ))
])

cargar_datos()

Loads training dataset from JSON file.
Parameters:
  • ruta_archivo (string, required) - Path to the JSON dataset file
Returns:
  • data (list) - List of dictionaries with ‘text’ and ‘intent’ keys
Example
data = cargar_datos('/path/to/dataset.json')
print(f"Loaded {len(data)} training examples")
Error Handling:
  • FileNotFoundError - Dataset file not found
  • json.JSONDecodeError - Invalid JSON format
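A sketch of the loader consistent with the error behavior listed above (the documented function simply propagates both exceptions):

```python
import json

def cargar_datos(ruta_archivo):
    """Load the JSON training dataset from disk."""
    with open(ruta_archivo, 'r', encoding='utf-8') as f:  # FileNotFoundError if missing
        data = json.load(f)                               # json.JSONDecodeError if malformed
    return data
```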

entrenar_modelo()

Trains the intent classification model with validation.
Parameters:
  • dataset (list, required) - List of training examples with ‘text’ and ‘intent’ fields
  • ruta_guardado (string, required) - Path where the trained model will be saved (.pkl file)
Returns:
  • modelo (sklearn.pipeline.Pipeline) - Trained pipeline with TfidfVectorizer and SVC classifier
dataset = cargar_datos('dataset.json')
modelo = entrenar_modelo(dataset, 'modelo_entrenado.pkl')

# Model is automatically saved to disk
print("Model trained and saved successfully")
Training Process:
  1. Validates dataset for required fields
  2. Filters out invalid entries
  3. Splits data (80/20 train/test)
  4. Trains TF-IDF + SVM pipeline
  5. Generates classification report
  6. Saves model using joblib
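The six steps above can be sketched as follows; the names mirror the documented API, but the details (validation rules, report formatting) are assumptions:

```python
import joblib
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

def entrenar_modelo(dataset, ruta_guardado):
    """Train and persist the TF-IDF + SVM intent classifier."""
    # 1-2. Validate required fields and filter invalid entries
    valid = [d for d in dataset if d.get('text') and d.get('intent')]
    texts = [d['text'] for d in valid]
    intents = [d['intent'] for d in valid]
    # 3. 80/20 train/test split
    X_tr, X_te, y_tr, y_te = train_test_split(
        texts, intents, test_size=0.2, random_state=42)
    # 4. TF-IDF + SVM pipeline
    modelo = Pipeline([
        ('tfidf', TfidfVectorizer(ngram_range=(1, 2), lowercase=True)),
        ('clf', SVC(kernel='linear', probability=True, C=1.0)),
    ])
    modelo.fit(X_tr, y_tr)
    # 5. Classification report on the held-out split
    print(classification_report(y_te, modelo.predict(X_te)))
    # 6. Persist with joblib
    joblib.dump(modelo, ruta_guardado)
    return modelo
```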

Prediction Methods

predict()

Predicts intent for given text input.
Parameters:
  • text (list[string], required) - List of text strings to classify
Returns:
  • intents (numpy.ndarray) - Array of predicted intent labels
Example
modelo = joblib.load('modelo_entrenado.pkl')

# Single prediction
intencion = modelo.predict(["pon música relajante"])[0]
print(f"Intent: {intencion}")

# Batch prediction
textos = [
    "busca películas de acción",
    "apaga las luces",
    "cuál es el clima"
]
intenciones = modelo.predict(textos)
for texto, intent in zip(textos, intenciones):
    print(f"{texto} -> {intent}")

predict_proba()

Returns probability distributions for all possible intents.
Parameters:
  • text (list[string], required) - List of text strings to classify
Returns:
  • probabilities (numpy.ndarray) - 2D array of shape (n_samples, n_classes) with a probability for each class
Example with Confidence
modelo = joblib.load('modelo_entrenado.pkl')

texto = "abre netflix por favor"
intencion = modelo.predict([texto])[0]
probabilidades = modelo.predict_proba([texto])
confianza = max(probabilidades[0])

print(f"Intent: {intencion}")
print(f"Confidence: {confianza:.2%}")

# Output:
# Intent: open_app
# Confidence: 94.32%

iniciar_interfaz_chat()

Starts interactive command-line interface for testing the model.
Parameters:
  • modelo (sklearn.pipeline.Pipeline, required) - Trained intent classification model
Interactive Testing
import joblib

modelo = joblib.load('modelo_entrenado.pkl')
iniciar_interfaz_chat(modelo)

# Interactive prompt:
# Tú: abre netflix
# Intención detectada: [open_app] (Confianza: 95.23%)
# ------------------------------
Features:
  • Real-time intent detection
  • Confidence scores for predictions
  • Type ‘salir’ to exit
  • Handles empty inputs gracefully
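A minimal sketch of the interactive loop with the features listed above; the real implementation may format its output differently:

```python
def iniciar_interfaz_chat(modelo):
    """Command-line loop for testing intent predictions interactively."""
    while True:
        texto = input("Tú: ").strip()
        if not texto:                # handle empty input gracefully
            continue
        if texto.lower() == 'salir': # type 'salir' to exit
            break
        intencion = modelo.predict([texto])[0]
        confianza = max(modelo.predict_proba([texto])[0])
        print(f"Intención detectada: [{intencion}] (Confianza: {confianza:.2%})")
        print("-" * 30)
```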

Common Patterns

Multi-Model Workflow

import numpy as np
import tensorflow as tf
import joblib

# 1. Load models
wake_model = tf.keras.models.load_model('wake_word_model.h5')
intent_model = joblib.load('modelo_entrenado.pkl')
gemini_engine = KamutiniEngine()

# 2. Wake word detection
audio = record_audio(duration=5)
features = extract_features(audio)
wake_prob = wake_model.predict(features)[0][0]

if wake_prob > 0.8:
    # 3. Speech-to-text (not shown)
    user_text = transcribe_audio(audio)
    
    # 4. Intent classification
    intent = intent_model.predict([user_text])[0]
    confidence = max(intent_model.predict_proba([user_text])[0])
    
    # 5. Generate response with Gemini
    response, should_exit = gemini_engine.responder(user_text)
    
    # 6. Text-to-speech
    gemini_engine.hablar_local(response)

Error Handling

import requests

try:
    response = engine.procesar_gemini(consulta)
except requests.exceptions.Timeout:
    print("API request timed out after 15 seconds")
except KeyError:
    print("API response missing 'candidates' field")
    # Returns: "Rosario, hubo un tropiezo con el servicio de Google."
Common issues:
  • Invalid API key
  • Rate limiting
  • Network connectivity
  • Malformed responses
try:
    model = tf.keras.models.load_model('wake_word_model.h5')
except OSError:
    print("Model file not found - train the model first")

try:
    X_train, y_train = prepare_dataset()
    if len(X_train) == 0:
        raise ValueError("No training data found")
except Exception as e:
    print(f"Dataset preparation failed: {e}")
Common issues:
  • Missing audio files
  • Incorrect audio format
  • Model not trained
  • Audio device errors
try:
    data = cargar_datos('dataset.json')
except FileNotFoundError:
    print("Dataset file not found")
except json.JSONDecodeError:
    print("Invalid JSON format in dataset")

# Validate dataset entries
if not all('text' in item and 'intent' in item for item in data):
    print("Dataset entries missing required fields")
Common issues:
  • Missing dataset file
  • Invalid JSON format
  • Missing ‘text’ or ‘intent’ fields
  • Insufficient training data

Performance Optimization

Gemini API

  • Set appropriate MAX_TOKENS (250 recommended)
  • Adjust TEMP for creativity vs consistency
  • Limit MAX_HISTORIAL to reduce token usage
  • Use timeout=15 to prevent hanging

LSTM Model

  • Use batch_size=8 for training
  • Set DURATION=5 seconds for consistency
  • Apply dropout (0.3) to prevent overfitting
  • Use predict(verbose=0) for faster inference

Intent Classification

  • Linear kernel for faster training
  • Bigram features for better accuracy
  • Use C=1.0 for balanced regularization
  • Enable probability=True for confidence scores

Audio Processing

  • Use 16kHz sample rate (standard)
  • Extract 13 MFCC coefficients
  • Pad/truncate to fixed length
  • Use mono audio for efficiency

PyTorch Language Model

Overview

Custom GPT-style language model trainer with device auto-detection and dataset normalization.
Source: ~/workspace/source/proyectos/ai creator/kamutini/modelo.py
Configuration
CONFIG = {
    "model_path": "modelo_ia_ligero.pth",
    "dataset_dir": "kamutini/datasets",
    "batch_size": 16,        
    "block_size": 256,       # Context window
    "n_embd": 512,           # Embedding dimension
    "n_head": 8,             # Attention heads
    "n_layer": 8,            # Transformer layers
    "dropout": 0.1,          
    "learning_rate": 5e-4,   
    "max_steps": 2000,       
    "temperature": 0.7,      # Generation randomness
    "top_k": 50,             # Top-k sampling
    "max_new_tokens": 500,   # Max generation length
    "eos_token": "<|endoftext|>"
}

Device Auto-Detection

Automatically selects the best available device.
get_best_device()
def get_best_device():
    """
    Returns:
        "cuda" if NVIDIA GPU available
        "mps" if Apple Silicon GPU available
        "cpu" otherwise
    """
    if torch.cuda.is_available():
        return "cuda"
    elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        return "mps"
    else:
        return "cpu"

DEVICE = get_best_device()
The model automatically optimizes thread count for CPU training: torch.set_num_threads(os.cpu_count())

LocalOptimizedDataset

Custom PyTorch Dataset that uses DataNormalizer to load multiple JSON/CSV files.
Dataset Class
class LocalOptimizedDataset(Dataset):
    def __init__(self, directory_path, block_size, simple_load=False, chars_fixed=None):
        # Initialize DataNormalizer
        self.normalizer = info.DataNormalizer(eos_token=CONFIG['eos_token'])
        
        formatted_lines = []
        files = [f for f in os.listdir(directory_path) 
                if f.endswith(('.json', '.csv'))]
        
        print(f"Loading and normalizing {len(files)} files...")
        
        for filename in files:
            filepath = os.path.join(directory_path, filename)
            # Process JSON files
            if filename.endswith('.json'):
                with open(filepath, 'r', encoding='utf-8') as f:
                    raw_data = json.load(f)
                    if not isinstance(raw_data, list):
                        raw_data = [raw_data]
                    
                    for item in raw_data:
                        norm_text = self.normalizer.normalize_entry(item)
                        if norm_text:
                            formatted_lines.append(norm_text)
            
            # Process CSV files
            elif filename.endswith('.csv'):
                df = pd.read_csv(filepath)
                for item in df.to_dict('records'):
                    norm_text = self.normalizer.normalize_entry(item)
                    if norm_text:
                        formatted_lines.append(norm_text)
        
        text_data = "\n\n".join(formatted_lines)
        # Tokenization follows...
Parameters:
  • directory_path (str, required) - Path to directory containing JSON/CSV training files
  • block_size (int, required) - Context window size (tokens per training sample)
  • simple_load (bool, default: False) - Skip advanced preprocessing if True
  • chars_fixed (list, optional) - Pre-defined character vocabulary (auto-detected if None)

Model Architecture

Transformer-based decoder with configurable depth.
  • Embedding: Token and position embeddings (512-dim)
  • Transformer Blocks: 8 layers with multi-head attention (8 heads)
  • Dropout: 0.1 for regularization
  • Output: Linear layer to vocabulary size
  • Parameters: ~50M depending on vocabulary size

Training Pipeline

Training Example
# Load dataset
dataset = LocalOptimizedDataset(
    directory_path="kamutini/datasets",
    block_size=CONFIG['block_size']
)

# Create data loader
loader = DataLoader(
    dataset,
    batch_size=CONFIG['batch_size'],
    shuffle=True
)

# Train for specified steps
data_iter = iter(loader)
for step in range(CONFIG['max_steps']):
    try:
        x, y = next(data_iter)
    except StopIteration:
        data_iter = iter(loader)  # restart the loader once exhausted
        x, y = next(data_iter)
    x, y = x.to(DEVICE), y.to(DEVICE)
    
    logits, loss = model(x, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    if step % 100 == 0:
        print(f"Step {step}: Loss = {loss.item():.4f}")

# Save model
torch.save(model.state_dict(), CONFIG['model_path'])

Text Generation

Generation Example
import torch
import torch.nn.functional as F

def generate_text(model, prompt, max_tokens=500, temperature=0.7, top_k=50):
    """
    Generate text continuation from a prompt.
    
    Args:
        prompt: Starting text
        max_tokens: Maximum tokens to generate
        temperature: Randomness (0.0 = deterministic, 1.0 = creative)
        top_k: Sample from top-k most likely tokens
    
    Returns:
        Generated text string
    """
    model.eval()
    with torch.no_grad():
        # Encode prompt
        tokens = encode(prompt)
        x = torch.tensor([tokens], dtype=torch.long).to(DEVICE)
        
        # Generate
        for _ in range(max_tokens):
            logits, _ = model(x)  # loss is None when no targets are given
            logits = logits[:, -1, :] / temperature
            
            # Top-k filtering
            if top_k > 0:
                v, _ = torch.topk(logits, top_k)
                logits[logits < v[:, [-1]]] = -float('Inf')
            
            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            x = torch.cat([x, next_token], dim=1)
            
            # Check for EOS
            if decode([next_token.item()]) == CONFIG['eos_token']:
                break
        
        return decode(x[0].tolist())

# Usage
response = generate_text(
    model,
    prompt="### Humano: ¿Cómo estás?### Asistente:",
    temperature=0.7,
    top_k=50
)
print(response)
Model training requires significant RAM (4GB+) and benefits greatly from GPU acceleration. Training on CPU is possible but slow (hours vs minutes).
For faster iteration, start with max_steps=500 and n_layer=4 to validate your dataset, then scale up to full configuration.
