
What is RAG?

Retrieval Augmented Generation (RAG) is a technique that enhances Large Language Models (LLMs) by providing them with relevant context from your documents. Instead of relying solely on the model’s training data, RAG retrieves specific information from your knowledge base before generating an answer. This approach:
  • Reduces hallucinations by grounding responses in real data
  • Enables answers based on private or recent information
  • Allows you to update knowledge without retraining models
  • Provides attribution through source documents

How Arcana’s RAG Pipeline Works

Arcana implements a complete RAG pipeline with six core steps:
┌─────────────────────────────────────────────────────────┐
│                     Your Phoenix App                    │
├─────────────────────────────────────────────────────────┤
│                    Arcana.Agent                         │
│  (rewrite → select → expand → search → rerank → answer) │
├─────────────────────────────────────────────────────────┤
│  Arcana.ask/2   │  Arcana.search/2  │  Arcana.ingest/2  │
├─────────────────┴───────────────────┴───────────────────┤
│                                                         │
│  ┌─────────────┐  ┌─────────────────┐  ┌─────────────┐  │
│  │   Chunker   │  │   Embeddings    │  │   Search    │  │
│  │ (splitting) │  │   (Bumblebee)   │  │ (pgvector)  │  │
│  └─────────────┘  └─────────────────┘  └─────────────┘  │
│                                                         │
├─────────────────────────────────────────────────────────┤
│              Your Existing Ecto Repo                    │
│         PostgreSQL + pgvector extension                 │
└─────────────────────────────────────────────────────────┘

Pipeline Steps

1. Chunk

Split documents into overlapping segments for better retrieval granularity.

Default configuration:
  • Size: 450 tokens
  • Overlap: 50 tokens
  • Format-aware (markdown, code)
See Chunking Strategies for details.
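The size/overlap idea can be sketched with a naive word-based chunker (words stand in for tokens here; Arcana's real chunker is token-counting and format-aware, so treat this only as an illustration of the sliding window):

```elixir
# Naive overlapping chunker: step = size - overlap, so each chunk
# repeats the last `overlap` words of the previous one.
defmodule NaiveChunker do
  def chunk(text, size, overlap) when overlap < size do
    text
    |> String.split()
    |> Enum.chunk_every(size, size - overlap)
    |> Enum.map(&Enum.join(&1, " "))
  end
end

NaiveChunker.chunk("one two three four five six seven eight", 5, 2)
# => ["one two three four five", "four five six seven eight", "seven eight"]
```

With the 450/50 defaults, consecutive chunks share 50 tokens, so a sentence straddling a boundary appears whole in at least one chunk.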
2. Embed

Convert text chunks into vector embeddings (numerical representations).

Supported providers:
  • Local Bumblebee models (default)
  • OpenAI embeddings
  • Custom providers
See Embeddings for details.
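An embedding is just a list of floats; two chunks are "similar" when their vectors point in roughly the same direction. A minimal cosine similarity in plain Elixir (for intuition only; in production pgvector computes this inside the database):

```elixir
defmodule Cosine do
  # Cosine similarity: dot(a, b) / (|a| * |b|), in [-1.0, 1.0].
  def similarity(a, b) do
    dot = Enum.zip_with(a, b, &(&1 * &2)) |> Enum.sum()
    norm = fn v -> v |> Enum.map(&(&1 * &1)) |> Enum.sum() |> :math.sqrt() end
    dot / (norm.(a) * norm.(b))
  end
end

Cosine.similarity([1.0, 0.0], [1.0, 0.0])  # => 1.0 (identical direction)
Cosine.similarity([1.0, 0.0], [0.0, 1.0])  # => 0.0 (unrelated)
```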
3. Store

Save embeddings in a vector database for efficient similarity search.

Backends:
  • pgvector (production)
  • HNSWLib (in-memory testing)
Location: lib/arcana/vector_store/pgvector.ex:28-79
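A nearest-neighbor lookup against pgvector boils down to ordering by vector distance. The sketch below assumes a hypothetical `chunks` table with an `embedding` column; pgvector's `<=>` operator is cosine distance (lower is closer), and the query embedding must be cast to pgvector's type by your adapter:

```elixir
import Ecto.Query

# Sketch only, not Arcana's internal query.
def nearest_chunks(repo, query_embedding, limit \\ 5) do
  from(c in "chunks",
    order_by: fragment("embedding <=> ?", ^query_embedding),
    limit: ^limit,
    select: %{id: c.id, text: c.text}
  )
  |> repo.all()
end
```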
4. Search

Find relevant chunks by comparing query embeddings using cosine similarity.

Search modes:
  • Semantic (vector similarity)
  • Full-text (PostgreSQL text search)
  • Hybrid (combines both with RRF)
See Search Modes for details.
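Hybrid mode merges the semantic and full-text result lists with Reciprocal Rank Fusion (RRF). The fusion step in isolation looks like this (k = 60 is the conventional constant; Arcana's internals may differ):

```elixir
defmodule RRF do
  # Fuse ranked lists of ids: score(id) = sum over lists of 1 / (k + rank).
  # An id ranked moderately by both lists can beat one ranked high by only one.
  def fuse(ranked_lists, k \\ 60) do
    ranked_lists
    |> Enum.flat_map(fn list ->
      list
      |> Enum.with_index(1)
      |> Enum.map(fn {id, rank} -> {id, 1 / (k + rank)} end)
    end)
    |> Enum.group_by(&elem(&1, 0), &elem(&1, 1))
    |> Enum.map(fn {id, scores} -> {id, Enum.sum(scores)} end)
    |> Enum.sort_by(&elem(&1, 1), :desc)
  end
end

# :c appears in both lists, so it outscores :b, which only one list found.
RRF.fuse([[:a, :b, :c], [:a, :c, :d]])
```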
5. Augment

Build a prompt with retrieved context chunks.

Default prompt structure:
Answer the user's question based on the following context.
If the answer is not in the context, say you don't know.

Context:
[chunk 1]
---
[chunk 2]
Location: lib/arcana/ask.ex:99-118
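Conceptually, the augment step is plain string interpolation over the retrieved chunks. A sketch mirroring the default prompt above (not Arcana's exact code; the `:text` field on chunks is an assumption):

```elixir
defmodule ContextPrompt do
  # Join chunk texts with "---" separators and wrap them in the
  # default instruction template.
  def build(question, chunks) do
    context = Enum.map_join(chunks, "\n---\n", & &1.text)

    """
    Answer the user's question based on the following context.
    If the answer is not in the context, say you don't know.

    Context:
    #{context}

    Question: #{question}
    """
  end
end
```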
6. Generate

Send the augmented prompt to an LLM to generate the final answer.

Supported LLMs:
  • OpenAI (via Req.LLM)
  • Anthropic (via Req.LLM)
  • Custom implementations
Location: lib/arcana/ask.ex:80-97

Basic RAG Example

Here’s how the complete pipeline works in practice:
# Steps 1-3: Chunk → Embed → Store
{:ok, document} = Arcana.ingest(
  """
  Elixir is a dynamic, functional language designed for building 
  scalable and maintainable applications. It runs on the BEAM VM.
  """,
  repo: MyApp.Repo,
  collection: "elixir-docs"
)

# Behind the scenes:
# 1. Text split into chunks (lib/arcana/chunker/default.ex:56-68)
# 2. Each chunk embedded (lib/arcana/ingest.ex:125-156)
# 3. Stored in pgvector (lib/arcana/ingest.ex:133-145)
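Once ingested, steps 4-6 run when you ask a question. The options below mirror those of `Arcana.ingest/2` and the `{:ok, answer}` return shape is an assumption; check the API docs for the exact signature:

```elixir
# Steps 4-6: Search → Augment → Generate
{:ok, answer} = Arcana.ask("What VM does Elixir run on?",
  repo: MyApp.Repo,
  collection: "elixir-docs"
)
```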

Advanced: Agentic RAG

For complex questions, Arcana provides an agentic pipeline with additional steps:
alias Arcana.Agent

ctx =
  Agent.new("Compare Elixir and Erlang", repo: MyApp.Repo, llm: llm)
  |> Agent.gate()        # Decide if retrieval is needed
  |> Agent.rewrite()     # Clean up conversational input
  |> Agent.expand()      # Add synonyms and related terms
  |> Agent.decompose()   # Split multi-part questions
  |> Agent.search()      # Execute vector search
  |> Agent.reason()      # Multi-hop: search again if needed
  |> Agent.rerank()      # Score chunk relevance (0-10)
  |> Agent.answer()      # Generate final answer

ctx.answer

Agentic Pipeline Steps

| Step | What it does | Notes |
| --- | --- | --- |
| gate/2 | Skip retrieval if answerable from LLM knowledge | Prevents unnecessary searches |
| rewrite/2 | Clean conversational input ("Hey, what is X?" → "What is X?") | Improves search quality |
| select/2 | Choose relevant collections based on the question | LLM picks from available collections |
| expand/2 | Add synonyms ("ML" → "ML machine learning models") | Broadens search coverage |
| decompose/2 | Split complex questions into sub-questions | Handles multi-part queries |
| search/2 | Execute vector search (skipped if gated) | Core retrieval step |
| reason/2 | Evaluate results and search again if insufficient | Multi-hop reasoning |
| rerank/2 | Score each chunk 0-10 and filter by threshold | Improves precision |
| answer/2 | Generate the final answer using context or knowledge | Final response |
Every agentic step is pluggable: you can replace any component with a custom implementation. See the Agentic RAG Guide for details.
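Because each step transforms a context struct, a custom step can be an ordinary function slotted into the pipeline. In this sketch, the `ctx.chunks` field and per-chunk `:score` key are assumptions for illustration:

```elixir
# Hypothetical custom step: keep only chunks the reranker scored 8 or above.
strict_filter = fn ctx ->
  %{ctx | chunks: Enum.filter(ctx.chunks, &(&1.score >= 8))}
end

ctx =
  Agent.new("Compare Elixir and Erlang", repo: MyApp.Repo, llm: llm)
  |> Agent.search()
  |> Agent.rerank()
  |> strict_filter.()
  |> Agent.answer()
```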

Pipeline Configuration

Configure pipeline components in your config.exs:
config :arcana,
  # Chunking
  chunker: {:default, chunk_size: 512, chunk_overlap: 100},
  
  # Embeddings
  embedder: {:local, model: "BAAI/bge-small-en-v1.5"},
  # embedder: {:openai, model: "text-embedding-3-large"},
  
  # Vector store
  vector_store: :pgvector,  # or :memory for testing
  
  # LLM
  llm: "openai:gpt-4o-mini"
  # llm: "anthropic:claude-sonnet-4-20250514"

Telemetry Events

Arcana emits telemetry events for every pipeline step:
:telemetry.attach_many(
  "arcana-handler",
  [
    [:arcana, :ingest, :start],
    [:arcana, :ingest, :stop],
    [:arcana, :search, :start],
    [:arcana, :search, :stop],
    [:arcana, :embed, :start],
    [:arcana, :embed, :stop],
    [:arcana, :ask, :start],
    [:arcana, :ask, :stop]
  ],
  &MyApp.TelemetryHandler.handle_event/4,
  nil
)
See the Telemetry Guide for monitoring and debugging.
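A matching handler might log step durations. The `:duration` measurement in native time units is the common telemetry convention; the exact keys Arcana emits are an assumption here:

```elixir
defmodule MyApp.TelemetryHandler do
  require Logger

  # Log how long each completed pipeline step took.
  def handle_event([:arcana, step, :stop], measurements, _metadata, _config) do
    ms = System.convert_time_unit(measurements.duration, :native, :millisecond)
    Logger.info("arcana #{step} finished in #{ms}ms")
  end

  # Ignore :start events and anything else.
  def handle_event(_event, _measurements, _metadata, _config), do: :ok
end
```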

GraphRAG Enhancement

Optionally enhance retrieval with knowledge graphs:
# Ingest with entity extraction
{:ok, document} = Arcana.ingest(content, 
  repo: MyApp.Repo, 
  graph: true
)

# Search combines vector + graph traversal with RRF
{:ok, results} = Arcana.search("Who leads OpenAI?", 
  repo: MyApp.Repo,
  graph: true
)
GraphRAG adds:
  1. Entity extraction (people, orgs, technologies)
  2. Relationship detection between entities
  3. Community clustering (Leiden algorithm)
  4. Fusion search (combines vector + graph results)
Location: lib/arcana/graph/
See the GraphRAG Guide for details.

Best Practices

Chunk Size

Use 400-600 tokens for general content. Smaller chunks (200-300) for precise retrieval, larger (800-1000) for broader context.

Overlap

10-15% overlap ensures concepts spanning chunk boundaries aren’t lost. Default 50 tokens works well for 450-token chunks.

Search Limit

Retrieve 3-5 chunks for simple questions, 10-15 for complex queries. More context helps but increases LLM costs.

Hybrid Search

Use hybrid mode when users search with specific terms or names. Semantic-only works well for conceptual queries.

Context Window Limits: Ensure total context size fits your LLM's window. GPT-4o has 128K tokens, but costs scale with context size.

Next Steps

Chunking Strategies

Learn how to optimize text splitting for better retrieval

Embeddings

Understand vector representations and model selection

Search Modes

Compare semantic, full-text, and hybrid search

Agentic RAG

Build sophisticated RAG pipelines with multi-hop reasoning
