RAG Pipeline

What is RAG?

Retrieval Augmented Generation (RAG) is a technique that enhances Large Language Models (LLMs) by providing them with relevant context from your documents. Instead of relying solely on the model’s training data, RAG retrieves specific information from your knowledge base before generating an answer. This approach:

Reduces hallucinations by grounding responses in real data
Enables answers based on private or recent information
Allows you to update knowledge without retraining models
Provides attribution through source documents

How Arcana’s RAG Pipeline Works

Arcana implements a complete RAG pipeline with six core steps:

┌─────────────────────────────────────────────────────────┐
│                     Your Phoenix App                    │
├─────────────────────────────────────────────────────────┤
│                    Arcana.Agent                         │
│  (rewrite → select → expand → search → rerank → answer) │
├─────────────────────────────────────────────────────────┤
│  Arcana.ask/2   │  Arcana.search/2  │  Arcana.ingest/2  │
├─────────────────┴───────────────────┴───────────────────┤
│                                                         │
│  ┌─────────────┐  ┌─────────────────┐  ┌─────────────┐  │
│  │   Chunker   │  │   Embeddings    │  │   Search    │  │
│  │ (splitting) │  │   (Bumblebee)   │  │ (pgvector)  │  │
│  └─────────────┘  └─────────────────┘  └─────────────┘  │
│                                                         │
├─────────────────────────────────────────────────────────┤
│              Your Existing Ecto Repo                    │
│         PostgreSQL + pgvector extension                 │
└─────────────────────────────────────────────────────────┘

Pipeline Steps

Chunk

Split documents into overlapping segments for better retrieval granularity.Default configuration:

Size: 450 tokens
Overlap: 50 tokens
Format-aware (markdown, code)

See Chunking Strategies for details.

Embed

Convert text chunks into vector embeddings (numerical representations).Supported providers:

Local Bumblebee models (default)
OpenAI embeddings
Custom providers

See Embeddings for details.

Store

Save embeddings in a vector database for efficient similarity search.Backends:

pgvector (production)
HNSWLib (in-memory testing)

Location: lib/arcana/vector_store/pgvector.ex:28-79

Find relevant chunks by comparing query embeddings using cosine similarity.Search modes:

Semantic (vector similarity)
Full-text (PostgreSQL text search)
Hybrid (combines both with RRF)

See Search Modes for details.

Augment

Build a prompt with retrieved context chunks.Default prompt structure:

Answer the user's question based on the following context.
If the answer is not in the context, say you don't know.

Context:
[chunk 1]
---
[chunk 2]

Location: lib/arcana/ask.ex:99-118

Generate

Send the augmented prompt to an LLM to generate the final answer.Supported LLMs:

OpenAI (via Req.LLM)
Anthropic (via Req.LLM)
Custom implementations

Location: lib/arcana/ask.ex:80-97

Basic RAG Example

Here’s how the complete pipeline works in practice:

Ingestion
Search
Ask

# Step 1-3: Chunk → Embed → Store
{:ok, document} = Arcana.ingest(
  """
  Elixir is a dynamic, functional language designed for building 
  scalable and maintainable applications. It runs on the BEAM VM.
  """,
  repo: MyApp.Repo,
  collection: "elixir-docs"
)

# Behind the scenes:
# 1. Text split into chunks (lib/arcana/chunker/default.ex:56-68)
# 2. Each chunk embedded (lib/arcana/ingest.ex:125-156)
# 3. Stored in pgvector (lib/arcana/ingest.ex:133-145)

# Step 4: Search for relevant chunks
{:ok, results} = Arcana.search(
  "What is Elixir?",
  repo: MyApp.Repo,
  collection: "elixir-docs",
  limit: 5
)

# Returns:
# [
#   %{
#     id: "...",
#     text: "Elixir is a dynamic, functional language...",
#     score: 0.89,
#     document_id: "...",
#     chunk_index: 0
#   }
# ]

# Step 4-6: Search → Augment → Generate
{:ok, answer, context} = Arcana.ask(
  "What is Elixir?",
  repo: MyApp.Repo,
  llm: "openai:gpt-4o-mini",
  collection: "elixir-docs"
)

# Behind the scenes:
# 1. Query embedded and searched (lib/arcana/search.ex:228-246)
# 2. Context chunks retrieved
# 3. Prompt built with context (lib/arcana/ask.ex:99-118)
# 4. LLM generates answer (lib/arcana/ask.ex:84-88)

IO.puts(answer)
# => "Elixir is a dynamic, functional programming language..."

Advanced: Agentic RAG

For complex questions, Arcana provides an agentic pipeline with additional steps:

alias Arcana.Agent

ctx =
  Agent.new("Compare Elixir and Erlang", repo: MyApp.Repo, llm: llm)
  |> Agent.gate()        # Decide if retrieval is needed
  |> Agent.rewrite()     # Clean up conversational input
  |> Agent.expand()      # Add synonyms and related terms
  |> Agent.decompose()   # Split multi-part questions
  |> Agent.search()      # Execute vector search
  |> Agent.reason()      # Multi-hop: search again if needed
  |> Agent.rerank()      # Score chunk relevance (0-10)
  |> Agent.answer()      # Generate final answer

ctx.answer

Agentic Pipeline Steps

Step	What it does	Implementation
`gate/2`	Skip retrieval if answerable from LLM knowledge	Prevents unnecessary searches
`rewrite/2`	Clean conversational input (“Hey, what is X?” → “What is X?”)	Improves search quality
`select/2`	Choose relevant collections based on question	LLM picks from available collections
`expand/2`	Add synonyms (“ML” → “ML machine learning models”)	Broadens search coverage
`decompose/2`	Split complex questions into sub-questions	Handles multi-part queries
`search/2`	Execute vector search (skipped if gated)	Core retrieval step
`reason/2`	Evaluate results and search again if insufficient	Multi-hop reasoning
`rerank/2`	Score each chunk 0-10 and filter by threshold	Improves precision
`answer/2`	Generate final answer using context or knowledge	Final response

Every agentic step is pluggable - you can replace any component with a custom implementation. See the Agentic RAG Guide for details.

Pipeline Configuration

Configure pipeline components in your config.exs:

config :arcana,
  # Chunking
  chunker: {:default, chunk_size: 512, chunk_overlap: 100},
  
  # Embeddings
  embedder: {:local, model: "BAAI/bge-small-en-v1.5"},
  # embedder: {:openai, model: "text-embedding-3-large"},
  
  # Vector store
  vector_store: :pgvector,  # or :memory for testing
  
  # LLM
  llm: "openai:gpt-4o-mini"
  # llm: "anthropic:claude-sonnet-4-20250514"

Telemetry Events

Arcana emits telemetry events for every pipeline step:

:telemetry.attach_many(
  "arcana-handler",
  [
    [:arcana, :ingest, :start],
    [:arcana, :ingest, :stop],
    [:arcana, :search, :start],
    [:arcana, :search, :stop],
    [:arcana, :embed, :start],
    [:arcana, :embed, :stop],
    [:arcana, :ask, :start],
    [:arcana, :ask, :stop]
  ],
  &MyApp.TelemetryHandler.handle_event/4,
  nil
)

See the Telemetry Guide for monitoring and debugging.

GraphRAG Enhancement

Optionally enhance retrieval with knowledge graphs:

# Ingest with entity extraction
{:ok, document} = Arcana.ingest(content, 
  repo: MyApp.Repo, 
  graph: true
)

# Search combines vector + graph traversal with RRF
{:ok, results} = Arcana.search("Who leads OpenAI?", 
  repo: MyApp.Repo,
  graph: true
)

GraphRAG adds:

Entity extraction (people, orgs, technologies)
Relationship detection between entities
Community clustering (Leiden algorithm)
Fusion search (combines vector + graph results)

Location: lib/arcana/graph/ See the GraphRAG Guide for details.

Best Practices

Chunk Size

Use 400-600 tokens for general content. Smaller chunks (200-300) for precise retrieval, larger (800-1000) for broader context.

Overlap

10-15% overlap ensures concepts spanning chunk boundaries aren’t lost. Default 50 tokens works well for 450-token chunks.

Search Limit

Retrieve 3-5 chunks for simple questions, 10-15 for complex queries. More context helps but increases LLM costs.

Hybrid Search

Use hybrid mode when users search with specific terms or names. Semantic-only works well for conceptual queries.

Context Window Limits: Ensure total context size fits your LLM’s window. GPT-4o has 128K tokens, but costs scale with context size.

Next Steps

Chunking Strategies

Learn how to optimize text splitting for better retrieval

Embeddings

Understand vector representations and model selection

Search Modes

Compare semantic, full-text, and hybrid search

Agentic RAG

Build sophisticated RAG pipelines with multi-hop reasoning

Getting Started

Core Concepts

Guides

Configuration

What is RAG?

How Arcana’s RAG Pipeline Works

Pipeline Steps

Basic RAG Example

Advanced: Agentic RAG

Agentic Pipeline Steps

Pipeline Configuration

Telemetry Events

GraphRAG Enhancement

Best Practices

Chunk Size

Overlap

Search Limit

Hybrid Search

Next Steps

Chunking Strategies

Embeddings

Search Modes

Agentic RAG

Build docs developers (and LLMs) love

Getting Started

Core Concepts

Guides

Configuration

​What is RAG?

​How Arcana’s RAG Pipeline Works

​Pipeline Steps

​Basic RAG Example

​Advanced: Agentic RAG

​Agentic Pipeline Steps

​Pipeline Configuration

​Telemetry Events

​GraphRAG Enhancement

​Best Practices

Chunk Size

Overlap

Search Limit

Hybrid Search

​Next Steps

Chunking Strategies

Embeddings

Search Modes

Agentic RAG

Build docs developers (and LLMs) love

What is RAG?

How Arcana’s RAG Pipeline Works

Pipeline Steps

Basic RAG Example

Advanced: Agentic RAG

Agentic Pipeline Steps

Pipeline Configuration

Telemetry Events

GraphRAG Enhancement

Best Practices

Next Steps