
Overview

“Chat with X” applications use Retrieval-Augmented Generation (RAG) to enable conversations with various data sources. These tutorials show you how to build interactive chat interfaces for documents, codebases, emails, and multimedia content.

Chat with PDF

Extract and query PDF documents

Chat with GitHub

Search and analyze codebases

Chat with Gmail

Query your email inbox

Chat with YouTube

Analyze video transcripts

Chat with Research

Search academic papers

Chat with Substack

Query newsletter archives

Core RAG Architecture

All “Chat with X” applications follow a common pattern:
  1. Data Ingestion: Load and preprocess content from the target source (PDF, GitHub, Gmail, etc.)
  2. Chunking & Embedding: Split content into chunks and generate vector embeddings
  3. Vector Storage: Store embeddings in a vector database (Chroma, Qdrant, etc.)
  4. Retrieval: Find relevant chunks using semantic similarity search
  5. Generation: Pass retrieved context to the LLM for answer generation
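The five steps above can be sketched end to end in plain Python using toy bag-of-words "embeddings" (real applications use learned embedding models; this only illustrates the retrieve-then-generate flow, and the example chunks are invented):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stands in for step 2)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors (step 4's metric)."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Steps 1-3: ingest chunks and keep their vectors in a simple in-memory store
chunks = [
    "The invoice total was 4200 dollars.",
    "The meeting is scheduled for Tuesday.",
]
store = [(chunk, embed(chunk)) for chunk in chunks]

# Step 4: retrieve the chunk most similar to the question
question = "What was the invoice total?"
best_chunk, _ = max(store, key=lambda item: cosine(embed(question), item[1]))

# Step 5: the retrieved chunk becomes the context passed to the LLM
prompt = f"Context: {best_chunk}\n\nQuestion: {question}"
```

A production pipeline swaps the toy vectors for model embeddings and the list for a vector database, but the control flow is the same.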

Chat with PDF

Build a RAG application to query PDF documents in just 30 lines of Python

Implementation

import os
import tempfile
import streamlit as st
from embedchain import App

def embedchain_bot(db_path, api_key):
    return App.from_config(
        config={
            "llm": {"provider": "openai", "config": {"api_key": api_key}},
            "vectordb": {"provider": "chroma", "config": {"dir": db_path}},
            "embedder": {"provider": "openai", "config": {"api_key": api_key}},
        }
    )

st.title("Chat with PDF")

openai_access_token = st.text_input("OpenAI API Key", type="password")

if openai_access_token:
    db_path = tempfile.mkdtemp()
    app = embedchain_bot(db_path, openai_access_token)

    pdf_file = st.file_uploader("Upload a PDF file", type="pdf")

    if pdf_file:
        with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as f:
            f.write(pdf_file.getvalue())
            app.add(f.name, data_type="pdf_file")
        os.remove(f.name)
        st.success(f"Added {pdf_file.name} to knowledge base!")

    prompt = st.text_input("Ask a question about the PDF")

    if prompt:
        answer = app.chat(prompt)
        st.write(answer)

Key Features

  • Extracts text from multi-page PDFs
  • Handles embedded images and tables
  • Preserves document structure
  • Supports scanned PDFs with OCR (optional)

# Optimal chunking for PDFs
chunk_size = 1000  # tokens
chunk_overlap = 200  # tokens for context preservation

# Embedchain handles this automatically
app.add(pdf_path, data_type="pdf_file")

Effective prompt patterns:
  • “Summarize the key findings in section 3”
  • “What methodology was used in the research?”
  • “Compare the results from pages 5 and 10”
  • “Extract all statistics about X”
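The chunk_size and chunk_overlap values above can be made concrete with a character-based splitter (a simplified sketch; Embedchain's own chunker works on tokens and varies by data type):

```python
def split_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character chunks with overlapping windows."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far each window advances
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# 2500 characters with size 1000 / overlap 200 yields 4 windows
chunks = split_text("a" * 2500, chunk_size=1000, overlap=200)
```

The overlap means each chunk repeats the tail of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk.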

Setup

pip install streamlit embedchain openai chromadb

Chat with GitHub Repos

Query codebases, understand architecture, and find implementations using natural language

Implementation

from embedchain.pipeline import Pipeline as App
from embedchain.loaders.github import GithubLoader
import streamlit as st
import os

loader = GithubLoader(
    config={
        "token": "your_github_token",
    }
)

st.title("Chat with GitHub Repository 💬")
st.caption("Query codebases using natural language")

openai_access_token = st.text_input("OpenAI API Key", type="password")

if openai_access_token:
    os.environ["OPENAI_API_KEY"] = openai_access_token
    app = App()
    
    git_repo = st.text_input("Enter GitHub Repo (e.g., username/repo)")
    
    if git_repo:
        # Add repo to knowledge base
        app.add(
            f"repo:{git_repo} type:repo",
            data_type="github",
            loader=loader
        )
        st.success(f"Added {git_repo} to knowledge base!")
        
        # Ask questions
        prompt = st.text_input("Ask about the repository")
        
        if prompt:
            answer = app.chat(prompt)
            st.write(answer)

Example Queries

Architecture

“How is the authentication system structured?”

Implementation

“Show me how error handling is implemented”

Dependencies

“What external libraries does this project use?”

Best Practices

“How does the codebase handle configuration?”

GitHub Access Configuration

1. Generate Personal Access Token: Go to GitHub Settings → Developer settings → Personal access tokens
2. Set Permissions: Enable the repo scope for accessing repository contents
3. Configure Loader:

loader = GithubLoader(config={"token": "ghp_your_token_here"})

Never commit GitHub tokens to version control. Use environment variables or secrets management.
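One way to follow that advice is to read the token from the environment at startup. This is a small sketch; the GITHUB_TOKEN variable name is our convention, not something Embedchain requires:

```python
import os

def require_env(name: str) -> str:
    """Read a required secret from the environment, failing loudly if missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Set {name} before running the app")
    return value

# The token then never appears in source control:
# loader = GithubLoader(config={"token": require_env("GITHUB_TOKEN")})
```

Failing at startup with a clear message beats a cryptic API error later, and the same helper works for the OpenAI key.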

Chat with Gmail

Search and analyze your email inbox using natural language queries

Implementation

import tempfile
import streamlit as st
from embedchain import App

def embedchain_bot(db_path, api_key):
    return App.from_config(
        config={
            "llm": {"provider": "openai", "config": {"api_key": api_key}},
            "vectordb": {"provider": "chroma", "config": {"dir": db_path}},
            "embedder": {"provider": "openai", "config": {"api_key": api_key}},
        }
    )

st.title("Chat with your Gmail Inbox 📧")

openai_access_token = st.text_input("OpenAI API Key", type="password")

# Gmail filter syntax
gmail_filter = "to: me label:inbox"

if openai_access_token:
    db_path = tempfile.mkdtemp()
    app = embedchain_bot(db_path, openai_access_token)
    
    # Add Gmail data
    app.add(gmail_filter, data_type="gmail")
    st.success("Added emails from Inbox to knowledge base!")

    prompt = st.text_input("Ask about your emails")

    if prompt:
        answer = app.query(prompt)
        st.write(answer)

Gmail API Setup

1. Create Google Cloud Project: Go to the Google Cloud Console and create a new project
2. Enable Gmail API: Navigate to APIs & Services → Library → search for “Gmail API” → Enable
3. Configure OAuth Consent:
  • Go to APIs & Services → OAuth consent screen
  • Select “External” user type
  • Fill in app information
  • Add test users (your email)
  • Publish the consent screen
4. Create OAuth Credentials:
  • APIs & Services → Credentials → Create Credentials
  • Select “OAuth client ID”
  • Application type: “Desktop app”
  • Download credentials as credentials.json
5. Place Credentials: Save credentials.json in your project directory

Gmail Query Filters

gmail_filter (string): Gmail search operators for filtering emails
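The filter string accepts standard Gmail search operators. A few illustrative values (the addresses and labels here are placeholders):

```python
# Illustrative gmail_filter values built from Gmail search operators
filters = [
    "to: me label:inbox",                         # everything in your inbox
    "from:updates@example.com",                   # a specific sender
    "subject:(quarterly report) newer_than:30d",  # subject match, last 30 days
    "has:attachment label:work",                  # attachments under a label
]
```

Narrower filters keep ingestion fast and the knowledge base focused; start specific and widen only if answers come back empty.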

Example Queries

"Summarize emails from my manager this week"
"Find all emails about the Q4 project"
"What action items were mentioned in recent emails?"

Chat with YouTube Videos

Analyze video content through transcripts without watching the entire video

Implementation

import streamlit as st
from embedchain import App
import tempfile

st.title("Chat with YouTube Videos 📽️")

openai_key = st.text_input("OpenAI API Key", type="password")

if openai_key:
    db_path = tempfile.mkdtemp()
    app = App.from_config(
        config={
            "llm": {"provider": "openai", "config": {"api_key": openai_key}},
            "vectordb": {"provider": "chroma", "config": {"dir": db_path}},
            "embedder": {"provider": "openai", "config": {"api_key": openai_key}},
        }
    )
    
    youtube_url = st.text_input("YouTube Video URL")
    
    if youtube_url:
        # Add video transcript
        app.add(youtube_url, data_type="youtube_video")
        st.success("Video transcript added!")
        
        query = st.text_input("Ask about the video")
        
        if query:
            answer = app.chat(query)
            st.write(answer)
            
            # Display video
            st.video(youtube_url)

Transcript Processing

  1. Extract Transcript: Uses youtube-transcript-api to fetch captions
  2. Chunk Text: Splits transcript into semantic chunks with timestamps
  3. Generate Embeddings: Creates vector representations
  4. Query: Retrieves relevant segments based on question
  5. Context: Includes timestamp information in responses
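Steps 1-2 can be sketched against youtube-transcript-api's entry format, a list of dicts with text, start, and duration keys (the caption text below is invented):

```python
def chunk_transcript(entries: list[dict], max_chars: int = 300) -> list[dict]:
    """Merge caption entries into chunks, keeping each chunk's start timestamp."""
    chunks, buffer, start = [], [], None
    for entry in entries:
        if start is None:
            start = entry["start"]  # timestamp of the chunk's first caption
        buffer.append(entry["text"])
        if sum(len(t) + 1 for t in buffer) >= max_chars:
            chunks.append({"start": start, "text": " ".join(buffer)})
            buffer, start = [], None
    if buffer:  # flush any trailing captions
        chunks.append({"start": start, "text": " ".join(buffer)})
    return chunks

entries = [
    {"text": "welcome to the video", "start": 0.0, "duration": 2.0},
    {"text": "today we cover RAG", "start": 2.0, "duration": 2.5},
]
segments = chunk_transcript(entries, max_chars=20)
```

Keeping the start timestamp on each chunk is what lets answers point back to a moment in the video.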

Use Cases

Tutorial Videos

“What tools were used in this tutorial?”

Lectures

“Summarize the key concepts explained”

Podcasts

“What did they say about AI regulation?”

Product Reviews

“List all pros and cons mentioned”

Chat with Research Papers

Search and query arXiv papers using conversational AI

Implementation

import streamlit as st
from embedchain import App
import os

st.title("Chat with Arxiv Research Papers 🔎")

openai_key = st.text_input("OpenAI API Key", type="password")

if openai_key:
    os.environ["OPENAI_API_KEY"] = openai_key
    app = App()
    
    # arXiv search topic
    topic = st.text_input("Research topic (e.g., 'transformers in NLP')")
    
    if topic:
        # Search and add papers
        app.add(f"arxiv:{topic}", data_type="arxiv")
        st.success(f"Added papers about '{topic}'")
        
        query = st.text_input("Ask about the research")
        
        if query:
            answer = app.chat(query)
            st.write(answer)

Research Queries

  • “What datasets were used in these papers?”
  • “How do the approaches differ?”
  • “What evaluation metrics are common?”
  • “Compare the results across different papers”
  • “Which method achieved the best performance?”
  • “What are the main limitations discussed?”
  • “What architectures are used?”
  • “List the hyperparameters mentioned”
  • “What preprocessing steps are described?”

Chat with Substack

Query newsletter archives and extract insights from blog posts

Implementation

import streamlit as st
from embedchain import App
import os

st.title("Chat with Substack Newsletter 📝")

openai_key = st.text_input("OpenAI API Key", type="password")

if openai_key:
    os.environ["OPENAI_API_KEY"] = openai_key
    app = App()
    
    substack_url = st.text_input("Substack Blog URL")
    
    if substack_url:
        # Add Substack content
        app.add(substack_url, data_type="web_page")
        st.success("Substack newsletter added!")
        
        query = st.text_input("Ask about the content")
        
        if query:
            answer = app.chat(query)
            st.write(answer)

Common Patterns & Best Practices

Embedchain Configuration

config (object): Complete configuration object for Embedchain

Optimization Tips

# Optimal chunk sizes by content type
chunk_configs = {
    "pdf": {"chunk_size": 1000, "overlap": 200},
    "code": {"chunk_size": 1500, "overlap": 300},
    "email": {"chunk_size": 800, "overlap": 100},
    "transcript": {"chunk_size": 1200, "overlap": 200},
}

  • Use a smaller, cheaper embedding model (e.g., text-embedding-3-small)
  • Cache vector databases between sessions
  • Limit retrieval to top 3-5 chunks
  • Implement query optimization
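Embedchain also exposes chunking through its config. Assuming the top-level "chunker" key supported by recent Embedchain releases (verify against the version you use), the per-content sizes above could be applied with a helper like this sketch:

```python
# Sketch: merge per-content chunk settings into an Embedchain app config.
# The "chunker" key is assumed from recent Embedchain releases; check yours.
chunk_configs = {
    "pdf": {"chunk_size": 1000, "chunk_overlap": 200},
    "code": {"chunk_size": 1500, "chunk_overlap": 300},
}

def build_config(content_type: str, api_key: str) -> dict:
    """Build an app config carrying the chunker settings for a content type."""
    return {
        "llm": {"provider": "openai", "config": {"api_key": api_key}},
        "chunker": {**chunk_configs[content_type], "length_function": "len"},
    }

# app = App.from_config(config=build_config("pdf", api_key))
```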
# Improve responses with better prompts
system_prompt = """
You are a helpful assistant analyzing {data_type}.
Always cite specific sections when answering.
If information is not in the context, say so clearly.
"""

app = App.from_config({
    "llm": {
        "provider": "openai",
        "config": {
            "system_prompt": system_prompt,
            "temperature": 0.3  # Lower for factual accuracy
        }
    }
})

Multi-Source Chat

import streamlit as st
from embedchain import App
import tempfile

st.title("Multi-Source Chat")

api_key = st.text_input("OpenAI API Key", type="password")

if api_key:
    db_path = tempfile.mkdtemp()
    app = App.from_config({
        "llm": {"provider": "openai", "config": {"api_key": api_key}},
        "vectordb": {"provider": "chroma", "config": {"dir": db_path}},
        "embedder": {"provider": "openai", "config": {"api_key": api_key}},
    })
    
    # Add multiple sources
    pdf = st.file_uploader("Upload PDF", type="pdf")
    youtube = st.text_input("YouTube URL")
    github = st.text_input("GitHub Repo")
    
    if pdf:
        # Write the upload to a temp file; the pdf_file loader expects a path
        with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as f:
            f.write(pdf.getvalue())
        app.add(f.name, data_type="pdf_file")
    if youtube:
        app.add(youtube, data_type="youtube_video")
    if github:
        # The github data type needs an authenticated loader (see the GitHub section)
        import os
        from embedchain.loaders.github import GithubLoader
        gh_loader = GithubLoader(config={"token": os.environ["GITHUB_TOKEN"]})
        app.add(f"repo:{github} type:repo", data_type="github", loader=gh_loader)
    
    # Query across all sources
    query = st.text_input("Ask anything")
    if query:
        answer = app.chat(query)
        st.write(answer)

Resources

Embedchain Docs

Complete framework documentation

Example Repository

All Chat with X implementations

RAG Tutorial

Step-by-step RAG guide

Gmail Tutorial

Complete Gmail RAG tutorial
