Retrieval-Augmented Generation (RAG) combines large language models with external knowledge bases to provide accurate, contextual responses grounded in your data.

What is RAG?

A typical RAG workflow (sketched in code after this list):
  1. Ingest documents into a vector database
  2. Retrieve relevant chunks based on user queries
  3. Augment LLM prompts with retrieved context
  4. Generate responses grounded in your data
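
Framework code handles each of these steps for you, but the loop itself is small enough to sketch by hand. The snippet below is purely illustrative: embed() is a toy bag-of-words stand-in for a real embedding model, and the "vector database" is a plain Python list:
# Minimal, framework-free sketch of the four steps above.
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy embedder: bag-of-words counts (a real system would call an embedding model)
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# 1. Ingest: chunk documents and store their embeddings
chunks = [
    "MCP servers expose resources and tools.",
    "RAG grounds LLM answers in your data.",
]
store = [(chunk, embed(chunk)) for chunk in chunks]

# 2. Retrieve: rank stored chunks against the query embedding
query = "What does RAG do?"
ranked = sorted(store, key=lambda item: cosine(embed(query), item[1]), reverse=True)
context = ranked[0][0]

# 3. Augment: prepend the retrieved context to the prompt
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# 4. Generate: send `prompt` to any LLM of your choice
print(prompt)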

Basic RAG Pattern

Simple RAG with Agno

From rag_apps/agentic_rag/main.py:
from agno.agent import Agent
from agno.knowledge.url import UrlKnowledge
from agno.vectordb.lancedb import LanceDb, SearchType
from agno.embedder.openai import OpenAIEmbedder
from agno.models.openai import OpenAIChat

def load_knowledge_base(urls: list[str] | None = None):
    """
    Create knowledge base from URLs.
    """
    knowledge_base = UrlKnowledge(
        urls=urls or [],
        vector_db=LanceDb(
            table_name="mcp-docs-knowledge-base",
            uri="tmp/lancedb",
            search_type=SearchType.vector,
            embedder=OpenAIEmbedder(id="text-embedding-3-small"),
        ),
    )
    knowledge_base.load()  # Ingest and vectorize documents
    return knowledge_base

def agentic_rag_response(urls: list[str], query: str):
    # Load knowledge base
    knowledge_base = load_knowledge_base(urls)
    
    # Create agent with knowledge
    agent = Agent(
        model=OpenAIChat(id="gpt-5-2025-08-07"),
        knowledge=knowledge_base,
        search_knowledge=True,  # Automatically search before responding
        markdown=True,
    )
    
    # Query with RAG
    response = agent.run(query, stream=True)
    return response

# Usage
urls = [
    "https://modelcontextprotocol.io/docs/learn/architecture.md",
    "https://modelcontextprotocol.io/docs/concepts/resources.md"
]

response = agentic_rag_response(
    urls=urls,
    query="Tell me about MCP primitives that clients can expose."
)

for chunk in response:
    if hasattr(chunk, 'event') and chunk.event == "RunResponseContent":
        print(chunk.content, end="")

How It Works

  1. Ingestion: knowledge_base.load() downloads URLs, chunks content, generates embeddings
  2. Storage: Embeddings stored in LanceDB vector database
  3. Retrieval: When user queries, relevant chunks retrieved via vector search
  4. Generation: Agent uses retrieved context to generate accurate response

Contextual AI RAG

Contextual AI provides advanced RAG with document understanding and contextual embeddings. From rag_apps/contextual_ai_rag/main.py:
import os
from contextual import ContextualAI
from llama_index.llms.nebius import NebiusLLM

# Initialize Contextual AI client
client = ContextualAI(api_key=os.getenv("CONTEXTUAL_API_KEY"))

# Step 1: Create datastore
datastore = client.datastores.create(name="my-docs")

# Step 2: Upload documents
with open("document.pdf", "rb") as f:
    client.datastores.documents.ingest(
        datastore_id=datastore.id,
        file=("document.pdf", f, "application/pdf")
    )

# Step 3: Create RAG agent
agent = client.agents.create(
    name="my-agent",
    description="RAG agent for document Q&A",
    datastore_ids=[datastore.id]
)

# Step 4: Query with RAG
response = client.agents.query.create(
    agent_id=agent.id,
    messages=[{"role": "user", "content": "Summarize the key findings"}]
)

print(response.message.content)

# Step 5: Get source references
if response.retrieval_contents:
    for content in response.retrieval_contents:
        ret_info = client.agents.query.retrieval_info(
            message_id=response.message_id,
            agent_id=agent.id,
            content_ids=[content.content_id]
        )
        # Access source metadata and page images
        print(ret_info.content_metadatas[0])

Enhanced RAG with Multiple Models

def enhance_with_nebius(original_response: str, query: str) -> str:
    """
    Use Nebius to enhance Contextual AI response.
    """
    nebius_llm = NebiusLLM(
        model="Qwen/Qwen3-235B-A22B",
        api_key=os.getenv("NEBIUS_API_KEY")
    )
    
    enhancement_prompt = f"""
    Based on the original query and AI response below,
    provide key insights and suggest follow-up questions.
    
    Original Query: {query}
    AI Response: {original_response}
    
    Enhancement:
    """
    
    enhanced = nebius_llm.complete(enhancement_prompt)
    return f"{original_response}\n\n**Enhanced Insights:**\n{enhanced}"

# Use in RAG workflow
query = "Summarize the key findings"
response = client.agents.query.create(
    agent_id=agent.id,
    messages=[{"role": "user", "content": query}]
)
enhanced = enhance_with_nebius(response.message.content, query)

Agentic RAG with Web Search

Combine RAG with web search for up-to-date information. From rag_apps/agentic_rag_with_web_search/:
from crewai import Agent, Task, Crew
from crewai_tools import SerperDevTool
from qdrant_tool import QdrantSearchTool  # Custom RAG tool

# Web search tool
web_search = SerperDevTool()

# RAG tool for local knowledge
rag_tool = QdrantSearchTool(
    collection_name="documentation",
    url="http://localhost:6333"
)

# Researcher agent with both tools
researcher = Agent(
    role="Research Specialist",
    goal="Find accurate information from docs and web",
    tools=[rag_tool, web_search],
    backstory="Expert at combining internal docs with latest web info"
)

# Analyst agent
analyst = Agent(
    role="Data Analyst",
    goal="Synthesize information into insights",
    backstory="Expert at analyzing and summarizing research"
)

# Tasks
research_task = Task(
    description="Research {topic} using both documentation and web",
    agent=researcher,
    expected_output="Comprehensive research findings"
)

analysis_task = Task(
    description="Analyze research and create summary",
    agent=analyst,
    expected_output="Executive summary with key insights"
)

# Execute
crew = Crew(
    agents=[researcher, analyst],
    tasks=[research_task, analysis_task]
)

result = crew.kickoff(inputs={"topic": "Latest AI agent frameworks"})
print(result)
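
The QdrantSearchTool imported above lives in the example's qdrant_tool.py. The repo's implementation may differ, but a minimal sketch looks roughly like this (it assumes an OpenAI embedding model, a running Qdrant instance, and qdrant-client ≥ 1.10; in some crewai versions BaseTool is imported from crewai_tools instead):
# Illustrative sketch of a custom Qdrant RAG tool for CrewAI (names are assumptions)
import os
from crewai.tools import BaseTool
from openai import OpenAI
from qdrant_client import QdrantClient

class QdrantSearchTool(BaseTool):
    name: str = "Documentation Search"
    description: str = "Searches the local documentation collection in Qdrant."
    collection_name: str = "documentation"
    url: str = "http://localhost:6333"

    def _run(self, query: str) -> str:
        # Embed the query, then run a vector search against Qdrant
        openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
        vector = openai_client.embeddings.create(
            model="text-embedding-3-small", input=query
        ).data[0].embedding

        client = QdrantClient(url=self.url)
        hits = client.query_points(
            collection_name=self.collection_name, query=vector, limit=5
        ).points

        # Return matched chunks as plain text so the agent can quote them directly
        return "\n\n".join(hit.payload.get("text", "") for hit in hits)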

LlamaIndex RAG

LlamaIndex provides powerful document processing and indexing. From rag_apps/llamaIndex_starter/:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

# Load documents
documents = SimpleDirectoryReader("./data").load_data()

# Configure chunking
node_parser = SentenceSplitter(
    chunk_size=512,
    chunk_overlap=50
)

# Create embeddings
embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# Build index
index = VectorStoreIndex.from_documents(
    documents,
    transformations=[node_parser],
    embed_model=embed_model
)

# Create query engine
query_engine = index.as_query_engine(
    llm=OpenAI(model="gpt-4-turbo"),
    similarity_top_k=5  # Retrieve top 5 chunks
)

# Query
response = query_engine.query("What are the main features?")
print(response)

# Access source nodes
for node in response.source_nodes:
    print(f"Score: {node.score}")
    print(f"Text: {node.text}")
    print(f"Metadata: {node.metadata}")

Advanced LlamaIndex Features

from llama_index.core import StorageContext
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor
from llama_index.llms.openai import OpenAI
import qdrant_client

# Connect to Qdrant
client = qdrant_client.QdrantClient(url="http://localhost:6333")
vector_store = QdrantVectorStore(
    client=client,
    collection_name="documents"
)

storage_context = StorageContext.from_defaults(
    vector_store=vector_store
)

# Build index with Qdrant backend
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    embed_model=embed_model
)

# Custom retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=10,  # Retrieve 10 candidates
)

# Post-processor to filter by score
postprocessor = SimilarityPostprocessor(similarity_cutoff=0.7)

# Query engine with custom retrieval
query_engine = RetrieverQueryEngine.from_args(
    retriever=retriever,
    node_postprocessors=[postprocessor],
    llm=OpenAI(model="gpt-4-turbo"),
)

response = query_engine.query("Explain the architecture")

PDF RAG with OCR

Process PDFs with images, charts, and complex layouts.

Gemma OCR RAG

From rag_apps/gemma_ocr/:
from llama_index.core import SimpleDirectoryReader
from llama_index.llms.gemini import Gemini
from llama_index.embeddings.gemini import GeminiEmbedding
from PIL import Image
import pytesseract

def extract_pdf_with_ocr(pdf_path: str) -> list[dict]:
    """
    Extract text from PDF including OCR for images.
    """
    from pdf2image import convert_from_path
    
    # Convert PDF pages to images
    images = convert_from_path(pdf_path)
    
    documents = []
    for i, image in enumerate(images):
        # OCR extraction
        text = pytesseract.image_to_string(image)
        
        documents.append({
            "text": text,
            "metadata": {
                "page": i + 1,
                "source": pdf_path
            }
        })
    
    return documents

# Use with LlamaIndex
from llama_index.core import Document, VectorStoreIndex

pdf_docs = extract_pdf_with_ocr("document.pdf")
llama_docs = [
    Document(text=d["text"], metadata=d["metadata"])
    for d in pdf_docs
]

# Build index with Gemini
index = VectorStoreIndex.from_documents(
    llama_docs,
    embed_model=GeminiEmbedding(),
)

query_engine = index.as_query_engine(
    llm=Gemini(model="gemini-2.0-flash-exp")
)

response = query_engine.query("Extract data from the charts")

Chat with Code

RAG optimized for code repositories. From rag_apps/chat_with_code/:
import os

from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import CodeSplitter
from llama_index.llms.openai import OpenAI
from llama_index.readers.github import GithubRepositoryReader, GithubClient

# Load code from GitHub
reader = GithubRepositoryReader(
    github_client=GithubClient(github_token=os.getenv("GITHUB_TOKEN")),
    owner="username",
    repo="repository",
    filter_file_extensions=(
        [".py", ".js", ".ts"],
        GithubRepositoryReader.FilterType.INCLUDE,
    ),
)

documents = reader.load_data(branch="main")

# Split code with language-aware chunking
code_splitter = CodeSplitter(
    language="python",
    chunk_lines=40,  # Lines per chunk
    chunk_lines_overlap=10,
    max_chars=1500,
)

nodes = code_splitter.get_nodes_from_documents(documents)

# Build index
index = VectorStoreIndex(nodes)

# Query with code context
query_engine = index.as_query_engine(
    similarity_top_k=5,
    llm=OpenAI(model="gpt-4-turbo")
)

response = query_engine.query(
    "How is authentication implemented in this codebase?"
)
print(response)

# Get code snippets
for node in response.source_nodes:
    print(f"File: {node.metadata['file_path']}")
    print(f"Lines: {node.metadata['start_line']}-{node.metadata['end_line']}")
    print(f"Code:\n{node.text}\n")

Streamlit RAG UI

Build interactive RAG applications with Streamlit.
import streamlit as st
from agno.agent import Agent
from agno.knowledge.url import UrlKnowledge
from agno.models.openai import OpenAIChat
from agno.vectordb.lancedb import LanceDb

st.set_page_config(page_title="RAG Chat", layout="wide")

# Initialize session state
if 'knowledge_base' not in st.session_state:
    st.session_state.knowledge_base = None
if 'chat_history' not in st.session_state:
    st.session_state.chat_history = []

# Sidebar: Upload documents
with st.sidebar:
    st.header("Knowledge Base")
    
    # URL input
    urls = st.text_area(
        "Enter URLs (one per line)",
        height=100
    ).split("\n")
    
    urls = [url.strip() for url in urls if url.strip()]
    
    if st.button("Load Knowledge Base"):
        if urls:
            with st.spinner("Loading documents..."):
                knowledge_base = UrlKnowledge(
                    urls=urls,
                    vector_db=LanceDb(
                        table_name="docs",
                        uri="tmp/lancedb"
                    )
                )
                knowledge_base.load()
                st.session_state.knowledge_base = knowledge_base
                st.success(f"Loaded {len(urls)} URLs")
        else:
            st.warning("Please enter at least one URL")
    
    if st.session_state.knowledge_base:
        st.success("✅ Knowledge base ready")
        
        if st.button("Reset"):
            st.session_state.knowledge_base = None
            st.session_state.chat_history = []
            st.rerun()

# Main chat interface
st.title("RAG Chat Assistant")

if not st.session_state.knowledge_base:
    st.info("Load a knowledge base from the sidebar to start chatting")
else:
    # Display chat history
    for msg in st.session_state.chat_history:
        with st.chat_message(msg["role"]):
            st.markdown(msg["content"])
    
    # Chat input
    if prompt := st.chat_input("Ask a question"):
        # Add user message
        st.session_state.chat_history.append({
            "role": "user",
            "content": prompt
        })
        
        with st.chat_message("user"):
            st.markdown(prompt)
        
        # Generate response
        with st.chat_message("assistant"):
            with st.spinner("Thinking..."):
                agent = Agent(
                    model=OpenAIChat(id="gpt-5-2025-08-07"),
                    knowledge=st.session_state.knowledge_base,
                    search_knowledge=True,
                    markdown=True
                )
                
                response = agent.run(prompt, stream=True)
                
                answer = ""
                placeholder = st.empty()
                
                for chunk in response:
                    if hasattr(chunk, 'content') and chunk.content:
                        answer += chunk.content
                        placeholder.markdown(answer)
                
                st.session_state.chat_history.append({
                    "role": "assistant",
                    "content": answer
                })
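
Save the script (for example as app.py; the filename is up to you) and launch it with streamlit run app.py: the sidebar loads the knowledge base and the main pane hosts the chat.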

Best Practices

1. Chunking Strategy

# ✅ Good: Language-appropriate chunking
from llama_index.core.node_parser import SentenceSplitter

# For prose/documentation
text_splitter = SentenceSplitter(
    chunk_size=512,                # Tokens per chunk
    chunk_overlap=50,              # Overlap for context
    paragraph_separator="\n\n"     # Prefer splitting on paragraph breaks
)

# For code
from llama_index.core.node_parser import CodeSplitter
code_splitter = CodeSplitter(
    language="python",
    chunk_lines=40,
    chunk_lines_overlap=10
)

# ❌ Bad: Fixed character splitting
chunks = [text[i:i+1000] for i in range(0, len(text), 1000)]
# Breaks mid-sentence, loses context
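
As a rough back-of-envelope check on these numbers (ignoring the sentence-boundary snapping SentenceSplitter actually performs): each chunk after the first adds chunk_size - chunk_overlap new tokens, so a hypothetical 2,000-token document yields about 5 chunks:
import math

doc_tokens = 2000                        # hypothetical document length
chunk_size, chunk_overlap = 512, 50
stride = chunk_size - chunk_overlap      # 462 new tokens per additional chunk
num_chunks = math.ceil((doc_tokens - chunk_overlap) / stride)
print(num_chunks)                        # ~5 chunks before boundary adjustments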

2. Metadata Enrichment

# ✅ Good: Rich metadata for filtering
from llama_index.core import Document

docs = [
    Document(
        text=content,
        metadata={
            "source": "user_manual.pdf",
            "page": 5,
            "section": "Installation",
            "date": "2024-01-15",
            "author": "Engineering Team",
            "version": "2.0"
        }
    )
]

# Filter by metadata during retrieval
from llama_index.core.vector_stores import MetadataFilters, ExactMatchFilter

filters = MetadataFilters(
    filters=[
        ExactMatchFilter(key="section", value="Installation"),
        ExactMatchFilter(key="version", value="2.0")
    ]
)

retriever = index.as_retriever(filters=filters)

# ❌ Bad: No metadata
docs = [Document(text=content)]  # Missing context

3. Hybrid Search

# ✅ Good: Combine vector + keyword search
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.retrievers.bm25 import BM25Retriever

# Vector retriever
vector_retriever = index.as_retriever(similarity_top_k=10)

# BM25 keyword retriever
bm25_retriever = BM25Retriever.from_defaults(
    nodes=nodes,
    similarity_top_k=10
)

# Fusion retriever (combines both)
retriever = QueryFusionRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    similarity_top_k=5,
    num_queries=1,
)

# ❌ Bad: Only vector search
# Misses exact keyword matches
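
To answer questions with the fused retriever, wrap it in a query engine; a minimal sketch, assuming the default LLM configured in Settings:
from llama_index.core.query_engine import RetrieverQueryEngine

hybrid_engine = RetrieverQueryEngine.from_args(retriever=retriever)
response = hybrid_engine.query("How do I configure authentication?")
print(response)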

4. Re-ranking

# ✅ Good: Re-rank retrieved chunks
from llama_index.core.postprocessor import (
    SimilarityPostprocessor,
    LLMRerank
)

postprocessors = [
    # Filter by similarity score
    SimilarityPostprocessor(similarity_cutoff=0.7),
    
    # Re-rank with LLM
    LLMRerank(
        top_n=3,
        llm=OpenAI(model="gpt-4-turbo")
    )
]

query_engine = index.as_query_engine(
    node_postprocessors=postprocessors
)

# ❌ Bad: Use all retrieved chunks
# Low-quality chunks dilute context

Real-World Examples

MCP Documentation RAG

Location: rag_apps/agentic_rag/
RAG over MCP documentation with Arize Phoenix observability.

Resume Optimizer

Location: rag_apps/resume_optimizer/
RAG for resume optimization against job descriptions.

Conference CFP Generator

Location: advance_ai_agents/conference_agnositc_cfp_generator/
RAG with vector search over conference data for CFP generation.

Vector Database Comparison

| Database | Best For | Pros | Cons |
| --- | --- | --- | --- |
| LanceDB | Local dev, prototypes | Embedded, no server, fast | Limited scale |
| Qdrant | Production, scale | Fast, scalable, open source | Requires hosting |
| Pinecone | Managed cloud | Fully managed, easy | Cost, vendor lock-in |
| Weaviate | ML features | Rich features, GraphQL | Complex setup |
| ChromaDB | Simplicity | Easy API, local/cloud | Less mature |

Next Steps

Memory Systems

Add persistent memory to RAG agents

MCP Integration

Connect RAG to external data sources

Multi-Agent Patterns

Combine RAG with multi-agent workflows

Best Practices

Production RAG patterns and optimization
