
Why Local RAG?

Local RAG implementations offer several advantages:
  • Privacy: All data stays on your infrastructure
  • Cost: No API fees for inference or embeddings
  • Control: Full control over models and data
  • Compliance: Meet regulatory requirements for data locality
  • Offline: Work without internet connectivity

Local RAG Architecture

All components run locally - no external API calls.
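The flow can be sketched end to end with stand-in components; the real stack swaps in OllamaEmbedder, Qdrant, and an Ollama-served LLM. The embedder, store, and "LLM" below are deliberately toy placeholders for illustration:

```python
# Toy sketch of the local RAG flow: ingest -> embed -> store -> retrieve -> generate.
# Every component here is a stand-in; a real setup uses OllamaEmbedder, Qdrant, and Ollama.
from dataclasses import dataclass


def embed(text: str) -> list[float]:
    # Stand-in embedder: bag-of-letters vector (real: a 768-dim neural embedding).
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec


@dataclass
class Doc:
    text: str
    vector: list[float]


store: list[Doc] = []  # stand-in vector DB (real: Qdrant collection)


def ingest(text: str) -> None:
    store.append(Doc(text, embed(text)))


def retrieve(query: str, k: int = 2) -> list[str]:
    # Rank stored documents by dot product against the query vector.
    qv = embed(query)
    ranked = sorted(
        store,
        key=lambda d: sum(x * y for x, y in zip(d.vector, qv)),
        reverse=True,
    )
    return [d.text for d in ranked[:k]]


def answer(query: str) -> str:
    # Stand-in "LLM": just echoes the retrieved context (real: Ollama generates from it).
    context = " | ".join(retrieve(query))
    return f"Context used: {context}"


ingest("Qdrant stores vectors locally.")
ingest("Ollama serves local language models.")
print(answer("local models"))
```

The shape is the same regardless of which embedder, vector store, or model you plug in; only the component implementations change.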

Basic Local RAG with Ollama

Complete implementation using Ollama, Qdrant, and Agno:
from agno.agent import Agent
from agno.models.ollama import Ollama
from agno.knowledge.knowledge import Knowledge
from agno.vectordb.qdrant import Qdrant
from agno.knowledge.embedder.ollama import OllamaEmbedder
from agno.os import AgentOS

# Vector database with local embeddings
vector_db = Qdrant(
    collection="local-knowledge",
    url="http://localhost:6333/",
    embedder=OllamaEmbedder(
        model="nomic-embed-text",  # Local embedding model
        dimensions=768
    )
)

# Knowledge base
knowledge_base = Knowledge(
    vector_db=vector_db
)

# Add documents (only need to run once)
knowledge_base.add_content(
    url="https://example.com/document.pdf"
)

# Create local RAG agent
agent = Agent(
    name="Local RAG Agent",
    model=Ollama(id="llama3.2"),  # Local LLM
    knowledge=knowledge_base,
    instructions=[
        "Answer questions based on the knowledge base",
        "Be concise and accurate",
        "Cite sources when possible"
    ]
)

# UI for agent
agent_os = AgentOS(agents=[agent])
app = agent_os.get_app()

# Run the agent
if __name__ == "__main__":
    agent_os.serve(app="local_rag_agent:app", reload=True)

Setup Instructions

1. Install Ollama

Download and install Ollama:
# macOS/Linux
curl -fsSL https://ollama.com/install.sh | sh

# Or download from https://ollama.com/download

2. Pull Local Models

Download models for LLM and embeddings:
# LLM options (choose one or more)
ollama pull llama3.2         # Latest Llama model (small, fast)
ollama pull llama3.1         # Llama 3.1 (larger, more capable)
ollama pull mistral          # Mistral 7B
ollama pull qwen2.5          # Qwen 2.5
ollama pull deepseek-r1:8b   # DeepSeek R1 8B

# Embedding models
ollama pull nomic-embed-text # Best for embeddings
ollama pull openhermes       # Alternative embedder

3. Start Qdrant

Run Qdrant vector database:
docker run -d \
  --name qdrant \
  -p 6333:6333 \
  -v $(pwd)/qdrant_storage:/qdrant/storage \
  qdrant/qdrant

4. Run Application

Start your local RAG agent:
python local_rag_agent.py
Open http://localhost:7777 in your browser.

Local RAG with LangChain

Alternative implementation using LangChain:
llama_local_rag.py
import streamlit as st
from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings
from langchain_chroma import Chroma
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

st.title("🦙 Local Llama RAG")

# Initialize local models
@st.cache_resource
def init_models():
    # Local embeddings
    embeddings = OllamaEmbeddings(
        model="nomic-embed-text",
        base_url="http://localhost:11434"
    )
    
    # Local LLM
    llm = Ollama(
        model="llama3.2",
        base_url="http://localhost:11434",
        temperature=0.7
    )
    
    return embeddings, llm

embeddings, llm = init_models()

# Initialize vector store
@st.cache_resource
def init_vectorstore(_embeddings):
    return Chroma(
        collection_name="local_docs",
        embedding_function=_embeddings,
        persist_directory="./local_db"
    )

vectorstore = init_vectorstore(embeddings)

# Sidebar for document loading
with st.sidebar:
    st.header("Load Documents")
    url = st.text_input(
        "Enter webpage URL:",
        placeholder="https://example.com/article"
    )
    
    if st.button("Load URL"):
        if url:
            with st.spinner("Loading and processing..."):
                # Load webpage
                loader = WebBaseLoader(url)
                documents = loader.load()
                
                # Split into chunks
                text_splitter = RecursiveCharacterTextSplitter(
                    chunk_size=500,
                    chunk_overlap=100
                )
                chunks = text_splitter.split_documents(documents)
                
                # Add to vector store
                vectorstore.add_documents(chunks)
                st.success(f"Loaded {len(chunks)} chunks from {url}")

# Query interface
query = st.text_area(
    "Ask a question:",
    placeholder="What would you like to know about the documents?"
)

if st.button("Submit"):
    if query:
        # Create retriever
        retriever = vectorstore.as_retriever(
            search_type="similarity",
            search_kwargs={'k': 5}
        )
        
        # Format documents
        def format_docs(docs):
            return "\n\n".join(doc.page_content for doc in docs)
        
        # Create prompt
        prompt = ChatPromptTemplate.from_template("""
        Answer the question based on the context below.
        
        Context: {context}
        
        Question: {question}
        
        Provide a detailed answer based on the context.
        If the answer is not in the context, say so.
        """)
        
        # Create RAG chain
        rag_chain = (
            {"context": retriever | format_docs, "question": RunnablePassthrough()}
            | prompt
            | llm
            | StrOutputParser()
        )
        
        # Execute and stream response
        with st.spinner("Thinking..."):
            response_placeholder = st.empty()
            full_response = ""
            
            for chunk in rag_chain.stream(query):
                full_response += chunk
                response_placeholder.markdown(full_response)
    else:
        st.warning("Please enter a question")
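The splitter above produces 500-character chunks with 100 characters of overlap so context isn't lost at chunk boundaries. A minimal sketch of that sliding-window idea (the real RecursiveCharacterTextSplitter additionally prefers splitting at paragraph and sentence boundaries):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    # Slide a window of chunk_size characters, stepping by (chunk_size - overlap)
    # so each chunk shares its last `overlap` characters with the next one.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]


chunks = chunk_text("x" * 1200)
print([len(c) for c in chunks])  # [500, 500, 400]
```

Smaller chunks give more precise retrieval hits; larger chunks give the LLM more surrounding context per hit.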

Local Embedding Models

from agno.knowledge.embedder.ollama import OllamaEmbedder

embedder = OllamaEmbedder(
    model="nomic-embed-text",
    dimensions=768
)
Specifications:
  • Dimensions: 768
  • Context length: 8,192 tokens
  • Size: ~274 MB
  • Best for: General text embedding
Pull command:
ollama pull nomic-embed-text
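Each chunk is stored as one of these 768-dimensional vectors, and queries are matched by vector similarity; cosine similarity is the usual metric. A minimal sketch of the comparison the vector database performs:

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product divided by the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(round(cosine([1, 0, 1], [1, 0, 1]), 3))  # 1.0 (same direction)
print(round(cosine([1, 0, 0], [0, 1, 0]), 3))  # 0.0 (orthogonal)
```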

Local LLM Options

Llama 3.1
ollama pull llama3.1:8b  # 8B parameter version
Why choose Llama 3.1:
  • More capable than 3.2
  • Better reasoning
  • 8B params: ~5GB RAM
  • Larger context window (128K)
Usage:
llm = Ollama(id="llama3.1:8b")

Mistral
ollama pull mistral
Why choose Mistral:
  • Excellent quality
  • Good instruction following
  • 7B params: ~4GB RAM
  • Fast inference
Usage:
llm = Ollama(id="mistral")

Qwen 2.5
ollama pull qwen2.5:7b
Why choose Qwen:
  • Strong multilingual support
  • Excellent code generation
  • 7B params: ~4GB RAM
  • Long context (32K)
Usage:
llm = Ollama(id="qwen2.5:7b")

DeepSeek R1
ollama pull deepseek-r1:8b
Why choose DeepSeek:
  • Reasoning capabilities
  • Math and logic focused
  • 8B params: ~5GB RAM
  • Good for complex queries
Usage:
llm = Ollama(id="deepseek-r1:8b")
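The RAM figures above are consistent with a back-of-envelope estimate: Ollama's default model tags are 4-bit quantized, roughly 0.5 bytes per parameter, plus runtime and KV-cache overhead. A rough sketch, where the 1.2x overhead factor is an illustrative assumption rather than a measured constant:

```python
def approx_ram_gb(params_billion: float,
                  bytes_per_param: float = 0.5,
                  overhead: float = 1.2) -> float:
    # 4-bit quantization ~= 0.5 bytes/parameter; overhead covers KV cache and runtime.
    return params_billion * bytes_per_param * overhead


for name, billions in [("7B (mistral, qwen2.5)", 7), ("8B (llama3.1, deepseek-r1)", 8)]:
    print(f"{name}: ~{approx_ram_gb(billions):.1f} GB")  # ~4.2 GB and ~4.8 GB
```

The same arithmetic explains why unquantized weights (2 bytes/parameter at fp16) roughly quadruple the requirement.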

Local Vector Database Options

from agno.vectordb.qdrant import Qdrant
from agno.knowledge.embedder.ollama import OllamaEmbedder

vector_db = Qdrant(
    collection="my_docs",
    url="http://localhost:6333/",
    embedder=OllamaEmbedder()
)
Setup:
docker run -p 6333:6333 \
  -v $(pwd)/qdrant_storage:/qdrant/storage \
  qdrant/qdrant

Local Hybrid Search RAG

Combine local vector and keyword search:
local_hybrid_rag.py
import streamlit as st
from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings
from langchain_chroma import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain_community.document_loaders import PyPDFLoader

st.title("🔍 Local Hybrid Search RAG")

# Local models
embeddings = OllamaEmbeddings(model="nomic-embed-text")
llm = Ollama(model="llama3.2")

# Upload PDF
uploaded_file = st.file_uploader("Upload PDF", type="pdf")

if uploaded_file:
    # Save and load PDF
    with open("temp.pdf", "wb") as f:
        f.write(uploaded_file.getbuffer())
    
    loader = PyPDFLoader("temp.pdf")
    documents = loader.load()
    
    # Split documents
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=100
    )
    chunks = text_splitter.split_documents(documents)
    
    # Create vector store (semantic search)
    vectorstore = Chroma.from_documents(
        chunks,
        embeddings,
        collection_name="hybrid_search"
    )
    vector_retriever = vectorstore.as_retriever(search_kwargs={'k': 5})
    
    # Create BM25 retriever (keyword search)
    bm25_retriever = BM25Retriever.from_documents(chunks)
    bm25_retriever.k = 5
    
    # Ensemble retriever (combines both)
    ensemble_retriever = EnsembleRetriever(
        retrievers=[vector_retriever, bm25_retriever],
        weights=[0.5, 0.5]  # Equal weight to semantic and keyword
    )
    
    st.success(f"Processed {len(chunks)} chunks")
    
    # Query
    query = st.text_input("Ask a question:")
    
    if st.button("Search") and query:
        # Retrieve using hybrid search
        docs = ensemble_retriever.invoke(query)
        
        # Show retrieved chunks
        with st.expander("Retrieved Chunks"):
            for i, doc in enumerate(docs[:5], 1):
                st.markdown(f"**Chunk {i}**")
                st.write(doc.page_content)
                st.divider()
        
        # Generate answer
        context = "\n\n".join([doc.page_content for doc in docs])
        
        prompt = f"""Answer the question based on the context.
        
Context:
{context}

Question: {query}

Answer:"""
        
        with st.spinner("Generating answer..."):
            response = llm.invoke(prompt)
            st.subheader("Answer:")
            st.write(response)
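The EnsembleRetriever merges the two ranked lists using weighted Reciprocal Rank Fusion: each list contributes `weight / (c + rank)` per document, so documents ranked highly by both retrievers rise to the top. A simplified sketch of that scoring (the constant `c = 60` is the common RRF default; the real implementation differs in details):

```python
def fuse(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    # Weighted Reciprocal Rank Fusion: sum weight / (c + rank) across lists.
    scores: dict[str, float] = {}
    for ranked, w in zip(rankings, weights):
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + w / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)


vector_hits = ["chunk_a", "chunk_b", "chunk_c"]  # semantic ranking
bm25_hits = ["chunk_b", "chunk_d", "chunk_a"]    # keyword ranking
print(fuse([vector_hits, bm25_hits], weights=[0.5, 0.5]))
# ['chunk_b', 'chunk_a', 'chunk_d', 'chunk_c']
```

`chunk_b` wins because both retrievers rank it highly, which is exactly the behavior hybrid search is after.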

Performance Optimization

Choose based on hardware:
import psutil

# Check available RAM
ram_gb = psutil.virtual_memory().total / (1024**3)

if ram_gb < 8:
    # Use smallest models
    llm_model = "llama3.2:1b"
    embed_model = "all-minilm"       # Ollama's tag for the MiniLM embedder
elif ram_gb < 16:
    # Use medium models
    llm_model = "llama3.2:3b"
    embed_model = "nomic-embed-text"
else:
    # Use larger models
    llm_model = "llama3.1:8b"
    embed_model = "nomic-embed-text"

print(f"Selected: {llm_model}, {embed_model}")

Cache models and the vector store so Streamlit does not reload them on every rerun:
import streamlit as st

# Cache model initialization
@st.cache_resource
def load_models():
    embeddings = OllamaEmbeddings(model="nomic-embed-text")
    llm = Ollama(model="llama3.2")
    return embeddings, llm

# Cache vector store
@st.cache_resource
def load_vectorstore(_embeddings):
    return Chroma(
        embedding_function=_embeddings,
        persist_directory="./db"
    )

embeddings, llm = load_models()
vectorstore = load_vectorstore(embeddings)

Ingest documents in batches to keep memory usage bounded:
# Process documents in batches
batch_size = 100

for i in range(0, len(chunks), batch_size):
    batch = chunks[i:i + batch_size]
    vectorstore.add_documents(batch)
    print(f"Processed batch {i//batch_size + 1}")

Run Ollama in Docker with GPU support:
# NVIDIA GPU
docker run -d --gpus=all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

# Then pull models
docker exec -it ollama ollama pull llama3.2
Check GPU usage:
nvidia-smi

Troubleshooting

Issue: "Connection refused" to Ollama
Solution:
# Check if Ollama is running
ps aux | grep ollama

# Restart Ollama
systemctl restart ollama

# Or start manually
ollama serve
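A programmatic check is also possible: Ollama's HTTP server answers plain GET requests on port 11434 when it is up. A small sketch using only the standard library:

```python
import urllib.request
import urllib.error


def ollama_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    # Ollama responds to GET / with HTTP 200 ("Ollama is running") when the server is up.
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


print(ollama_up())  # False until `ollama serve` is running
```

Calling this at application startup gives a clearer error message than letting the first embedding request fail.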
Issue: Out of memory errors
Solution:
  • Use smaller models (3B instead of 7B)
  • Reduce chunk size and batch size
  • Close other applications
  • Consider GPU acceleration
Issue: Slow response times
Solution:
  • Use GPU if available
  • Reduce number of retrieved chunks (k=3 instead of k=5)
  • Use smaller embedding model
  • Enable model quantization

Production Deployment

1. Containerize

Dockerfile
FROM python:3.11-slim

# Install system dependencies
RUN apt-get update && apt-get install -y curl

# Install Ollama
RUN curl -fsSL https://ollama.com/install.sh | sh

# Copy application
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .

# Pull models on build
RUN ollama serve & \
    sleep 5 && \
    ollama pull llama3.2 && \
    ollama pull nomic-embed-text

EXPOSE 8501
CMD ["streamlit", "run", "app.py"]

2. Docker Compose

docker-compose.yml
version: '3.8'

services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
  
  qdrant:
    image: qdrant/qdrant
    ports:
      - "6333:6333"
    volumes:
      - qdrant_data:/qdrant/storage
  
  app:
    build: .
    ports:
      - "8501:8501"
    environment:
      - OLLAMA_HOST=http://ollama:11434
      - QDRANT_URL=http://qdrant:6333
    depends_on:
      - ollama
      - qdrant

volumes:
  ollama_data:
  qdrant_data:

3. Deploy

# Build and start services
docker-compose up -d

# View logs
docker-compose logs -f app

# Access at http://localhost:8501

Cost Comparison

Cloud RAG

Monthly costs (1000 queries/day):
  • OpenAI embeddings: $20-50
  • OpenAI GPT-4: $200-500
  • Vector DB hosting: $50-200
  • Total: $270-750/month

Local RAG

One-time costs:
  • Server/hardware: $500-2000
  • Setup time: 4-8 hours
  • Monthly: $0 (electricity only)
  • ROI: 1-3 months
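The payback estimate is simple division of the one-time hardware cost by the avoided monthly cloud spend; using mid-range figures from the comparison above:

```python
def payback_months(hardware_cost: float, cloud_monthly: float) -> float:
    # Months until the one-time hardware cost equals cumulative cloud savings.
    return hardware_cost / cloud_monthly


# Mid-range: $1250 hardware vs $510/month cloud spend
print(round(payback_months(1250, 510), 1))  # 2.5 months
```

The endpoints of the cost ranges stretch this in either direction, which is why the estimate is given as a range rather than a single number.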

Hardware Recommendations

  • Minimum: 8GB RAM, CPU only - Use 1B-3B models
  • Recommended: 16GB RAM, CPU only - Use 3B-7B models
  • Optimal: 16GB+ RAM, NVIDIA GPU (8GB+ VRAM) - Use 7B-13B models
