What is RAG?
Retrieval Augmented Generation (RAG) is a technique that enhances Large Language Models (LLMs) by providing them with relevant context from your documents. Instead of relying solely on the model’s training data, RAG retrieves specific information from your knowledge base before generating an answer. This approach:

- Reduces hallucinations by grounding responses in real data
- Enables answers based on private or recent information
- Allows you to update knowledge without retraining models
- Provides attribution through source documents
How Arcana’s RAG Pipeline Works
Arcana implements a complete RAG pipeline with six core steps:

Pipeline Steps
Chunk
Split documents into overlapping segments for better retrieval granularity.

Default configuration:
- Size: 450 tokens
- Overlap: 50 tokens
- Format-aware (markdown, code)
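The chunking step can be pictured as a sliding window over the text. The sketch below is illustrative only — it uses whitespace-separated words as a stand-in for tokens, whereas Arcana's real chunker is token- and format-aware:

```elixir
defmodule ChunkSketch do
  @moduledoc "Illustrative sliding-window chunker; words stand in for tokens."

  # Split `text` into chunks of `size` words, each sharing `overlap`
  # words with the previous chunk (mirroring the 450/50 defaults).
  def chunk(text, size \\ 450, overlap \\ 50) do
    step = size - overlap

    text
    |> String.split()
    # `[]` as leftover keeps a shorter final chunk instead of padding it
    |> Enum.chunk_every(size, step, [])
    |> Enum.map(&Enum.join(&1, " "))
  end
end
```

With size 4 and overlap 1, `"a b c d e f g h i j"` yields chunks starting every 3 words, so the word `d` appears in both the first and second chunk — that shared region is what keeps boundary-spanning concepts retrievable.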
Embed
Convert text chunks into vector embeddings (numerical representations).

Supported providers:
- Local Bumblebee models (default)
- OpenAI embeddings
- Custom providers
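A custom provider typically boils down to "take texts, return vectors." The behaviour and module names below are hypothetical — Arcana's actual provider contract may differ, so check the library docs. The toy implementation hashes text into a fixed-size vector purely for illustration; a real provider would call Bumblebee or an embedding API:

```elixir
# HYPOTHETICAL provider contract -- not Arcana's verified callback shape.
defmodule MyApp.Embedder do
  @callback embed(texts :: [String.t()]) :: {:ok, [[float()]]} | {:error, term()}
end

defmodule MyApp.HashEmbedder do
  @behaviour MyApp.Embedder

  @dim 8

  # Toy deterministic "embedding": hashes each text into @dim floats.
  @impl true
  def embed(texts), do: {:ok, Enum.map(texts, &vectorize/1)}

  defp vectorize(text) do
    for i <- 0..(@dim - 1) do
      :erlang.phash2({text, i}, 1000) / 1000
    end
  end
end
```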
Store
Save embeddings in a vector database for efficient similarity search.

Backends:
- pgvector (production)
- HNSWLib (in-memory testing)
Location: lib/arcana/vector_store/pgvector.ex:28-79

Search
Find relevant chunks by comparing query embeddings using cosine similarity.

Search modes:
- Semantic (vector similarity)
- Full-text (PostgreSQL text search)
- Hybrid (combines both with RRF)
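The two building blocks behind these modes can be sketched as pure functions: cosine similarity for semantic search, and reciprocal rank fusion (RRF) for merging the semantic and full-text rankings in hybrid mode. This is a conceptual sketch, not Arcana's implementation; `k = 60` is the constant commonly used in the RRF literature:

```elixir
defmodule SearchSketch do
  # Cosine similarity between two equal-length vectors.
  def cosine(a, b) do
    dot = Enum.zip(a, b) |> Enum.map(fn {x, y} -> x * y end) |> Enum.sum()
    dot / (norm(a) * norm(b))
  end

  defp norm(v), do: :math.sqrt(Enum.reduce(v, 0.0, fn x, acc -> acc + x * x end))

  # Reciprocal rank fusion: each ranked list contributes 1 / (k + rank)
  # per document id; documents are re-sorted by the summed score.
  def rrf(ranked_lists, k \\ 60) do
    ranked_lists
    |> Enum.flat_map(fn list ->
      list
      |> Enum.with_index(1)
      |> Enum.map(fn {id, rank} -> {id, 1 / (k + rank)} end)
    end)
    |> Enum.group_by(&elem(&1, 0), &elem(&1, 1))
    |> Enum.map(fn {id, scores} -> {id, Enum.sum(scores)} end)
    |> Enum.sort_by(fn {_id, score} -> -score end)
    |> Enum.map(&elem(&1, 0))
  end
end
```

RRF rewards documents that rank well in *both* lists: a chunk that is second in the vector ranking and first in the full-text ranking beats one that tops a single list but is absent from the other.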
Augment
Build a prompt with retrieved context chunks.

Default prompt structure: see lib/arcana/ask.ex:99-118

Basic RAG Example
Here’s how the complete pipeline works in practice:

- Ingestion
- Search
- Ask
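The original code tabs for these three steps are not reproduced here. As a rough sketch of the flow (ingest, then search, then ask), the snippet below uses hypothetical function names and options — they are placeholders, not Arcana's verified API, so consult the library documentation for the real calls:

```elixir
# HYPOTHETICAL API SKETCH -- function names, arities, and options are
# placeholders; check Arcana's documentation for the actual interface.

# 1. Ingestion: chunk, embed, and store a document in a collection.
{:ok, _doc} = Arcana.ingest("guides/rag.md", collection: "docs")

# 2. Search: retrieve the most similar chunks for a query.
{:ok, chunks} =
  Arcana.search("How does chunk overlap work?", collection: "docs", limit: 5)

# 3. Ask: run the full RAG loop -- search, augment, and generate an answer.
{:ok, answer} = Arcana.ask("How does chunk overlap work?", collection: "docs")
```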
Advanced: Agentic RAG
For complex questions, Arcana provides an agentic pipeline with additional steps:

Agentic Pipeline Steps
| Step | What it does | Purpose |
|---|---|---|
| `gate/2` | Skip retrieval if answerable from LLM knowledge | Prevents unnecessary searches |
| `rewrite/2` | Clean conversational input (“Hey, what is X?” → “What is X?”) | Improves search quality |
| `select/2` | Choose relevant collections based on the question | LLM picks from available collections |
| `expand/2` | Add synonyms (“ML” → “ML machine learning models”) | Broadens search coverage |
| `decompose/2` | Split complex questions into sub-questions | Handles multi-part queries |
| `search/2` | Execute vector search (skipped if gated) | Core retrieval step |
| `reason/2` | Evaluate results and search again if insufficient | Multi-hop reasoning |
| `rerank/2` | Score each chunk 0-10 and filter by threshold | Improves precision |
| `answer/2` | Generate the final answer using context or model knowledge | Final response |
Every agentic step is pluggable: you can replace any component with a custom implementation. See the Agentic RAG Guide for details.
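As an illustration of what a replacement step might look like, the sketch below swaps the LLM-based `rerank/2` for a cheap keyword-overlap filter. Everything here is hypothetical — the real step contract (state shape, return values, how the override is configured) should be taken from the Agentic RAG Guide:

```elixir
# HYPOTHETICAL step contract: a step takes the pipeline state and
# options, and returns an updated state. The %{question:, chunks:}
# shape is assumed for illustration only.
defmodule MyApp.KeywordRerank do
  # Keep only chunks that share at least one word with the question,
  # instead of scoring each chunk with an LLM.
  def rerank(%{question: q, chunks: chunks} = state, _opts) do
    words = q |> String.downcase() |> String.split() |> MapSet.new()

    kept =
      Enum.filter(chunks, fn chunk ->
        chunk
        |> String.downcase()
        |> String.split()
        |> Enum.any?(&MapSet.member?(words, &1))
      end)

    %{state | chunks: kept}
  end
end
```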
Pipeline Configuration
Configure pipeline components in your config.exs:
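The original configuration snippet is not reproduced here. As an illustrative sketch only — the option names below are guesses at the shape of the configuration, not verified Arcana keys:

```elixir
# config/config.exs -- ILLUSTRATIVE ONLY: these keys are hypothetical,
# not Arcana's verified option names.
import Config

config :arcana,
  chunk_size: 450,
  chunk_overlap: 50,
  embedding_provider: :local,   # or an OpenAI / custom provider module
  vector_store: :pgvector,      # or an in-memory backend for tests
  search_mode: :hybrid
```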
Telemetry Events
Arcana emits telemetry events for every pipeline step.

GraphRAG Enhancement
Optionally enhance retrieval with knowledge graphs:

- Entity extraction (people, orgs, technologies)
- Relationship detection between entities
- Community clustering (Leiden algorithm)
- Fusion search (combines vector + graph results)
Location: lib/arcana/graph/
See the GraphRAG Guide for details.
Best Practices
Chunk Size
Use 400-600 tokens for general content; smaller chunks (200-300) suit precise retrieval, while larger ones (800-1000) provide broader context.
Overlap
10-15% overlap ensures concepts spanning chunk boundaries aren’t lost. The default of 50 tokens works well for 450-token chunks.
Search Limit
Retrieve 3-5 chunks for simple questions, 10-15 for complex queries. More context helps but increases LLM costs.
Hybrid Search
Use hybrid mode when users search with specific terms or names. Semantic-only works well for conceptual queries.
Next Steps
Chunking Strategies
Learn how to optimize text splitting for better retrieval
Embeddings
Understand vector representations and model selection
Search Modes
Compare semantic, full-text, and hybrid search
Agentic RAG
Build sophisticated RAG pipelines with multi-hop reasoning