## What is RAG?
Retrieval-Augmented Generation (RAG) is a technique that combines information retrieval with text generation. Instead of relying solely on an LLM’s training data, RAG retrieves relevant information from a knowledge base and provides it as context for generation.

RAG solves a critical problem: hallucination. Without RAG, LLMs might recommend products that don’t exist or make up features. With RAG, responses are grounded in real database records.
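At its core the pattern is two steps: retrieve, then generate from that context. A minimal sketch (the `search_products` and `ask_llm` callables are hypothetical stand-ins, not this project’s API):

```python
def rag_answer(query, search_products, ask_llm, top_k=5):
    """Minimal RAG loop: retrieve, then generate from that context only."""
    # 1. Retrieval: look up the most relevant records in the knowledge base.
    products = search_products(query, limit=top_k)
    # 2. Generation: the prompt constrains the LLM to the retrieved context.
    context = "\n".join(f"- {p}" for p in products)
    prompt = f"Based on this context:\n{context}\n\nAnswer the query: {query}"
    return ask_llm(prompt)
```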
## Why RAG for product search?
Traditional approaches have significant limitations.

### Pure LLM generation (no RAG)
Problem: The LLM only knows what was in its training data, so it can confidently recommend a product that doesn’t exist in your database.
- Can’t access your current product catalog
- May recommend discontinued products
- Invents product names and features
- No way to ensure accuracy
### Fine-tuned models
Problem: Expensive and quickly becomes outdated
- Requires expensive GPU time for training
- Must retrain every time products change
- High latency to reflect new products
- Risk of overfitting to training data
### Traditional keyword search
Problem: Can’t understand semantic meaning
- Only matches exact keywords
- Misses relevant products with different phrasing
- Poor handling of synonyms and related concepts
- No natural language understanding
### RAG combines the best of both worlds
RAG merges semantic search with constrained generation:

- Vector search finds semantically relevant products
- LLM generation creates natural, helpful responses
- Context constraint ensures recommendations only come from retrieved products
## Implementation in SKU Semantic Search
The RAG implementation happens in two phases: retrieval and generation.

### Phase 1: Retrieval
The retrieval phase uses vector embeddings and cosine similarity (app/services/product_service.py:26):
- Embedding model: Google Gemini `gemini-embedding-001` produces 3072-dimensional vectors
- Similarity metric: Cosine distance measures angular similarity in vector space
- Limit: Returns the top N most similar products (default: 5)
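In production the similarity computation runs inside PostgreSQL via pgvector, but the underlying math is plain cosine similarity. A self-contained sketch of the ranking step (function and variable names are illustrative, not this project’s API):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity = dot(a, b) / (|a| * |b|); 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_n(query_vec, catalog, n=5):
    """Rank (name, embedding) pairs by similarity to the query vector."""
    ranked = sorted(
        catalog,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:n]]
```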
### Phase 2: Generation
Retrieved products are formatted as structured context for the LLM (app/api/endpoints/products.py:14):
The `generate_answer` method constructs a prompt that constrains the LLM (app/services/llm_service.py:64):
- System role: Establishes the AI’s expertise (“Listo ERP analyst”)
- Context injection: Retrieved products are explicitly provided
- Instruction clarity: Requests professional, concise responses
- Constraint: “Based on this context” guides the LLM to use only provided products
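A sketch of what such a constrained prompt can look like (field names and wording are illustrative; the actual `generate_answer` in app/services/llm_service.py may differ):

```python
def build_prompt(query, products):
    # Context injection: each retrieved product becomes one context line.
    context = "\n".join(f"- {p['name']}: {p['description']}" for p in products)
    # System role: establishes the AI's expertise.
    system = "You are a Listo ERP analyst. Respond professionally and concisely."
    # Constraint: the answer must come from the injected context only.
    user = (
        "Based on this context, recommend suitable products:\n"
        f"{context}\n\n"
        f"Customer query: {query}\n"
        "Only mention products that appear in the context above."
    )
    return system, user
```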
## Example RAG workflow
Let’s trace a complete request. First, vector search retrieves products: PostgreSQL finds the embeddings closest to the query. Note that the system found “Shampoo Anticaspa” even though the query itself used different words!
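A toy illustration of why different wording still matches. Hand-assigned 2-d vectors stand in for Gemini’s real 3072-d embeddings, with the first dimension loosely meaning “hair/dandruff care” (all vectors here are made up for the demo, not real model output):

```python
import math

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Hand-assigned toy embeddings (NOT real model output).
catalog = {
    "Shampoo Anticaspa": [0.95, 0.10],
    "Jabon liquido de manos": [0.05, 0.90],
}
query_vec = [0.90, 0.15]  # pretend embedding of "algo para la caspa"

best = max(catalog, key=lambda name: cos(query_vec, catalog[name]))
print(best)  # Shampoo Anticaspa
```

No keyword overlaps between the query and the product name; the match comes purely from vector proximity.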
## Advantages of this RAG implementation

### Always current

New products are immediately searchable once their embeddings are created. No retraining required.

### Factually grounded

AI can only recommend products that exist in the database, eliminating hallucinations.

### Semantic understanding

Finds relevant products even with loose or creative phrasing (“algo para la caspa” → “Shampoo Anticaspa”).

### Natural responses

LLM generates human-like explanations instead of just returning product IDs.

### Explainable

You can see exactly which products influenced each recommendation.

### Multilingual

Works across languages with appropriate embedding models.
## RAG best practices
### Chunk size and retrieval limit
Current setting: top 5 products.

Considerations:
- Too few results: May miss relevant products
- Too many results: Context window overflow, slower generation
- Optimal: 3-10 products for most queries
### Prompt engineering
The current prompt structure is the simple context-constrained prompt described in Phase 2. Improvements to consider:
- Add few-shot examples of good recommendations
- Include structured output formatting instructions
- Specify how to handle no-match scenarios
- Add business rules (e.g., “prioritize in-stock items”)
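A sketch folding several of those improvements into one prompt. The few-shot example, rules, and wording are all illustrative, not the project’s actual prompt:

```python
def build_prompt_v2(query, products):
    context = "\n".join(f"- {p}" for p in products) or "(no products retrieved)"
    return (
        "You are a Listo ERP analyst. Respond professionally and concisely.\n\n"
        # Few-shot example of a good recommendation.
        "Example:\n"
        "Query: algo para la caspa\n"
        "Answer: Te recomiendo Shampoo Anticaspa, formulado contra la caspa.\n\n"
        # Business rules and the no-match case, stated explicitly.
        "Rules:\n"
        "- Recommend ONLY products listed in the context.\n"
        "- Prioritize in-stock items when stock information is shown.\n"
        "- If nothing in the context fits, say so and suggest refining the search.\n\n"
        f"Context:\n{context}\n\n"
        f"Query: {query}"
    )
```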
### Context formatting
Current format: simple concatenation of product fields. Enhanced options include structured formats that make individual attributes explicit.
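One such enhanced option is structured JSON instead of plain concatenation, so the LLM can reliably tell fields apart (field names here are illustrative):

```python
import json

def format_context(products):
    # Structured context: explicit fields instead of one concatenated string.
    return json.dumps(
        [
            {
                "sku": p["sku"],
                "name": p["name"],
                "price": p["price"],
                "in_stock": p["in_stock"],
            }
            for p in products
        ],
        ensure_ascii=False,
        indent=2,
    )
```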
### Hybrid search
Current approach: pure vector search. A potential enhancement is to combine vector search with keyword or attribute filters. This ensures semantic relevance while respecting business constraints.
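A minimal in-memory sketch of that idea: apply hard business filters first, then rank the survivors by vector similarity. In pgvector this would be a `WHERE` clause plus an `ORDER BY` on the distance operator; all names below are illustrative:

```python
import math

def hybrid_search(query_vec, keyword, catalog, n=5):
    """catalog: dicts with "name", "embedding", and "in_stock" keys."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (
            math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
        )

    # Keyword / business filter first...
    candidates = [
        p for p in catalog
        if p["in_stock"] and keyword.lower() in p["name"].lower()
    ]
    # ...then semantic ranking among the survivors.
    candidates.sort(key=lambda p: cos(query_vec, p["embedding"]), reverse=True)
    return [p["name"] for p in candidates[:n]]
```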
## Measuring RAG quality
Key metrics to track:

- Retrieval accuracy: Are the right products being retrieved?
- Response relevance: Do LLM recommendations make sense?
- Hallucination rate: Does the AI invent products not in context?
- User satisfaction: Do users find what they need?
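The hallucination metric, for instance, can be computed mechanically once you log which products were retrieved and which were recommended. A naive sketch that compares names directly; a real pipeline would first extract SKUs from the answer text:

```python
def hallucination_rate(recommended, retrieved):
    """Fraction of recommended items absent from the retrieved context.
    0.0 = fully grounded; 1.0 = every recommendation was invented."""
    if not recommended:
        return 0.0
    retrieved_set = set(retrieved)
    invented = [item for item in recommended if item not in retrieved_set]
    return len(invented) / len(recommended)
```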
## Next steps
- **Multi-LLM failover**: Learn how the system handles LLM provider failures
- **Architecture overview**: See how RAG fits into the complete system architecture
- **Database setup**: Optimize PostgreSQL and pgvector for RAG workloads
- **Environment setup**: Configure API keys and LLM settings