Overview
Chat with X applications use Retrieval Augmented Generation (RAG) to enable conversations with various data sources. These tutorials show you how to build interactive chat interfaces for documents, codebases, emails, and multimedia content.
Chat with PDF: extract and query PDF documents
Chat with GitHub: search and analyze codebases
Chat with Gmail: query your email inbox
Chat with YouTube: analyze video transcripts
Chat with Research: search academic papers
Chat with Substack: query newsletter archives
Core RAG Architecture
All “Chat with X” applications follow a common pattern:
Data Ingestion
Load and preprocess content from the target source (PDF, GitHub, Gmail, etc.)
Chunking & Embedding
Split content into chunks and generate vector embeddings
Vector Storage
Store embeddings in a vector database (Chroma, Qdrant, etc.)
Retrieval
Find relevant chunks using semantic similarity search
Generation
Pass retrieved context to LLM for answer generation
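The five steps above can be sketched end to end in plain Python. This toy uses a bag-of-words count vector in place of a learned embedding model and an in-memory list in place of a vector database, purely to illustrate the data flow; none of these names are Embedchain APIs.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words count vector. Real apps use a
    # learned embedding model; this only illustrates the pipeline shape.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Rank stored chunks by similarity to the query and keep the top k
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

# "Ingested" chunks standing in for a vector store
chunks = [
    "Chroma stores vector embeddings on disk.",
    "Streamlit renders the chat interface.",
    "PDF text is split into overlapping chunks.",
]

print(retrieve("where are embeddings stored?", chunks, k=1))
# → ['Chroma stores vector embeddings on disk.']
```

In a real application the top-k chunks would then be passed to the LLM as context for answer generation.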
Chat with PDF
Build a RAG application to query PDF documents in just 30 lines of Python
Implementation
OpenAI + Embedchain
Local with Llama 3.2
```python
import os
import tempfile

import streamlit as st
from embedchain import App

def embedchain_bot(db_path, api_key):
    return App.from_config(
        config={
            "llm": {"provider": "openai", "config": {"api_key": api_key}},
            "vectordb": {"provider": "chroma", "config": {"dir": db_path}},
            "embedder": {"provider": "openai", "config": {"api_key": api_key}},
        }
    )

st.title("Chat with PDF")

openai_access_token = st.text_input("OpenAI API Key", type="password")
if openai_access_token:
    db_path = tempfile.mkdtemp()
    app = embedchain_bot(db_path, openai_access_token)

    pdf_file = st.file_uploader("Upload a PDF file", type="pdf")
    if pdf_file:
        # Write the upload to disk so Embedchain can read it by path
        with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as f:
            f.write(pdf_file.getvalue())
        app.add(f.name, data_type="pdf_file")
        os.remove(f.name)
        st.success(f"Added {pdf_file.name} to knowledge base!")

    prompt = st.text_input("Ask a question about the PDF")
    if prompt:
        answer = app.chat(prompt)
        st.write(answer)
```
Key Features
Extracts text from multi-page PDFs
Handles embedded images and tables
Preserves document structure
Supports scanned PDFs with OCR (optional)
```python
# Typical chunking parameters for PDFs
chunk_size = 1000     # tokens per chunk
chunk_overlap = 200   # overlapping tokens preserve context across chunk boundaries

# Embedchain handles chunking automatically:
app.add(pdf_path, data_type="pdf_file")
```
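Under the hood, overlap chunking is a sliding window over the text. A minimal character-based sketch (Embedchain's own chunkers work on tokens; `chunk_text` is a hypothetical helper, shown only to make the overlap idea concrete):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    # Sliding window: each chunk starts (chunk_size - overlap) characters
    # after the previous one, so consecutive chunks share `overlap` characters.
    # Requires overlap < chunk_size.
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]

print(chunk_text("abcdefgh", chunk_size=4, overlap=2))
# → ['abcd', 'cdef', 'efgh']
```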
Effective prompt patterns:
“Summarize the key findings in section 3”
“What methodology was used in the research?”
“Compare the results from pages 5 and 10”
“Extract all statistics about X”
Setup
```shell
pip install streamlit embedchain openai chromadb
streamlit run chat_pdf.py
```
Chat with GitHub Repos
Query codebases, understand architecture, and find implementations using natural language
Implementation
```python
import os

import streamlit as st
from embedchain.pipeline import Pipeline as App
from embedchain.loaders.github import GithubLoader

loader = GithubLoader(
    config={
        "token": "your_github_token",
    }
)

st.title("Chat with GitHub Repository 💬")
st.caption("Query codebases using natural language")

openai_access_token = st.text_input("OpenAI API Key", type="password")
if openai_access_token:
    os.environ["OPENAI_API_KEY"] = openai_access_token
    app = App()

    git_repo = st.text_input("Enter GitHub Repo (e.g., username/repo)")
    if git_repo:
        # Add the repository to the knowledge base
        app.add(f"repo:{git_repo} type:repo", data_type="github", loader=loader)
        st.success(f"Added {git_repo} to knowledge base!")

        # Ask questions
        prompt = st.text_input("Ask about the repository")
        if prompt:
            answer = app.chat(prompt)
            st.write(answer)
```
Example Queries
Architecture: “How is the authentication system structured?”
Implementation: “Show me how error handling is implemented”
Dependencies: “What external libraries does this project use?”
Best Practices: “How does the codebase handle configuration?”
GitHub Access Configuration
Generate Personal Access Token: go to GitHub Settings → Developer settings → Personal access tokens
Set Permissions: enable the repo scope for accessing repository contents
Configure Loader:

```python
loader = GithubLoader(config={"token": "ghp_your_token_here"})
```
Never commit GitHub tokens to version control. Use environment variables or secrets management.
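A small sketch of the environment-variable approach (`github_token_from_env` is a hypothetical helper, not part of Embedchain):

```python
import os

def github_token_from_env(var: str = "GITHUB_TOKEN") -> str:
    # Read the token from the environment, failing loudly if it is missing
    # rather than silently building an unauthenticated loader.
    token = os.environ.get(var)
    if not token:
        raise RuntimeError(f"Set the {var} environment variable first")
    return token

# Usage: loader = GithubLoader(config={"token": github_token_from_env()})
```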
Chat with Gmail
Search and analyze your email inbox using natural language queries
Implementation
```python
import tempfile

import streamlit as st
from embedchain import App

def embedchain_bot(db_path, api_key):
    return App.from_config(
        config={
            "llm": {"provider": "openai", "config": {"api_key": api_key}},
            "vectordb": {"provider": "chroma", "config": {"dir": db_path}},
            "embedder": {"provider": "openai", "config": {"api_key": api_key}},
        }
    )

st.title("Chat with your Gmail Inbox 📧")

openai_access_token = st.text_input("OpenAI API Key", type="password")

# Gmail filter syntax
gmail_filter = "to:me label:inbox"

if openai_access_token:
    db_path = tempfile.mkdtemp()
    app = embedchain_bot(db_path, openai_access_token)

    # Add Gmail data
    app.add(gmail_filter, data_type="gmail")
    st.success("Added emails from Inbox to knowledge base!")

    prompt = st.text_input("Ask about your emails")
    if prompt:
        answer = app.query(prompt)
        st.write(answer)
```
Gmail API Setup
Complete OAuth configuration:

Create Google Cloud Project
Enable Gmail API: navigate to APIs & Services → Library → search for “Gmail API” → Enable
Configure OAuth Consent: go to APIs & Services → OAuth consent screen, select the “External” user type, fill in the app information, add test users (your email), and publish the consent screen
Create OAuth Credentials: go to APIs & Services → Credentials → Create Credentials, select “OAuth client ID”, set the application type to “Desktop app”, and download the credentials as credentials.json
Place Credentials: save credentials.json in your project directory
Gmail Query Filters
Gmail search operators for filtering emails:

```python
# All inbox emails
"label:inbox"

# Unread emails
"is:unread"

# From a specific sender
"from:[email protected]"

# Date range
"after:2024/01/01 before:2024/12/31"

# Has attachment
"has:attachment"

# Combine filters (implicit AND)
"from:[email protected] is:unread has:attachment"
```
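Because filters combine by simple space separation (implicit AND), a tiny helper can build them programmatically. `build_gmail_filter` is a hypothetical convenience function, not part of Embedchain or the Gmail API:

```python
def build_gmail_filter(criteria: dict[str, str]) -> str:
    # Turn {"operator": "value"} pairs into Gmail's "operator:value"
    # search syntax, joined by spaces (Gmail treats that as AND).
    return " ".join(f"{op}:{val}" for op, val in criteria.items())

# Hypothetical example address; substitute your own
query = build_gmail_filter({"from": "[email protected]", "has": "attachment"})
print(query)
# → from:[email protected] has:attachment
```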
Example Queries
Business queries:
“Summarize emails from my manager this week”
“Find all emails about the Q4 project”
“What action items were mentioned in recent emails?”
Chat with YouTube Videos
Analyze video content through transcripts without watching the entire video
Implementation
```python
import tempfile

import streamlit as st
from embedchain import App

st.title("Chat with YouTube Videos 📽️")

openai_key = st.text_input("OpenAI API Key", type="password")
if openai_key:
    db_path = tempfile.mkdtemp()
    app = App.from_config(
        config={
            "llm": {"provider": "openai", "config": {"api_key": openai_key}},
            "vectordb": {"provider": "chroma", "config": {"dir": db_path}},
            "embedder": {"provider": "openai", "config": {"api_key": openai_key}},
        }
    )

    youtube_url = st.text_input("YouTube Video URL")
    if youtube_url:
        # Add the video transcript
        app.add(youtube_url, data_type="youtube_video")
        st.success("Video transcript added!")

        query = st.text_input("Ask about the video")
        if query:
            answer = app.chat(query)
            st.write(answer)

        # Display the video
        st.video(youtube_url)
```
Transcript Processing
Extract Transcript : Uses youtube-transcript-api to fetch captions
Chunk Text : Splits transcript into semantic chunks with timestamps
Generate Embeddings : Creates vector representations
Query : Retrieves relevant segments based on question
Context : Includes timestamp information in responses
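Assuming caption entries shaped like those returned by youtube-transcript-api (dicts with "text" and "start" keys), timestamp-preserving chunking can be sketched as follows; `chunk_captions` is an illustrative helper, not Embedchain's internal implementation:

```python
def chunk_captions(captions: list[dict], max_chars: int = 300) -> list[dict]:
    # Group short caption snippets into larger chunks, keeping the start
    # timestamp of each chunk so answers can cite a position in the video.
    chunks, buf, start = [], [], None
    for cap in captions:
        if start is None:
            start = cap["start"]
        buf.append(cap["text"])
        if sum(len(t) for t in buf) >= max_chars:
            chunks.append({"start": start, "text": " ".join(buf)})
            buf, start = [], None
    if buf:  # flush any trailing partial chunk
        chunks.append({"start": start, "text": " ".join(buf)})
    return chunks
```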
Use Cases
Tutorial Videos: “What tools were used in this tutorial?”
Lectures: “Summarize the key concepts explained”
Podcasts: “What did they say about AI regulation?”
Product Reviews: “List all pros and cons mentioned”
Chat with Research Papers
Search and query arXiv papers using conversational AI
Implementation
```python
import os

import streamlit as st
from embedchain import App

st.title("Chat with Arxiv Research Papers 🔎")

openai_key = st.text_input("OpenAI API Key", type="password")
if openai_key:
    os.environ["OPENAI_API_KEY"] = openai_key
    app = App()

    # arXiv search topic
    topic = st.text_input("Research topic (e.g., 'transformers in NLP')")
    if topic:
        # Search for and add matching papers
        app.add(f"arxiv:{topic}", data_type="arxiv")
        st.success(f"Added papers about '{topic}'")

        query = st.text_input("Ask about the research")
        if query:
            answer = app.chat(query)
            st.write(answer)
```
Research Queries
“What datasets were used in these papers?”
“How do the approaches differ?”
“What evaluation metrics are common?”
“Compare the results across different papers”
“Which method achieved the best performance?”
“What are the main limitations discussed?”
“What architectures are used?”
“List the hyperparameters mentioned”
“What preprocessing steps are described?”
Chat with Substack
Query newsletter archives and extract insights from blog posts
Implementation
```python
import os

import streamlit as st
from embedchain import App

st.title("Chat with Substack Newsletter 📝")

openai_key = st.text_input("OpenAI API Key", type="password")
if openai_key:
    os.environ["OPENAI_API_KEY"] = openai_key
    app = App()

    substack_url = st.text_input("Substack Blog URL")
    if substack_url:
        # Add the Substack content
        app.add(substack_url, data_type="web_page")
        st.success("Substack newsletter added!")

        query = st.text_input("Ask about the content")
        if query:
            answer = app.chat(query)
            st.write(answer)
```
Common Patterns & Best Practices
Embedchain Configuration
The main configuration options for an Embedchain app:
LLM provider: openai, anthropic, cohere, ollama
Model name (e.g., gpt-4o, claude-3-5-sonnet-20241022)
Vector database: chroma, qdrant, pinecone, weaviate
Embedding provider: openai, cohere, huggingface
Optimization Tips
```python
# Suggested chunk sizes by content type
chunk_configs = {
    "pdf": {"chunk_size": 1000, "overlap": 200},
    "code": {"chunk_size": 1500, "overlap": 300},
    "email": {"chunk_size": 800, "overlap": 100},
    "transcript": {"chunk_size": 1200, "overlap": 200},
}
```
Use a compact embedding model (e.g., OpenAI's text-embedding-3-small) to reduce cost
Cache vector databases between sessions
Limit retrieval to top 3-5 chunks
Implement query optimization
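One way to cache the vector database between sessions is to replace `tempfile.mkdtemp()` (which creates a fresh, empty directory on every run, forcing re-embedding) with a stable path, assuming Chroma persists to the configured "dir". `persistent_db_path` is a hypothetical helper:

```python
import os
import tempfile

def persistent_db_path(app_name: str = "chat-with-x") -> str:
    # Reuse one on-disk directory across runs so previously computed
    # embeddings survive the session instead of being rebuilt each time.
    path = os.path.join(tempfile.gettempdir(), app_name + "-chroma")
    os.makedirs(path, exist_ok=True)
    return path

# Usage: db_path = persistent_db_path() instead of tempfile.mkdtemp()
```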
```python
# Improve responses with better prompts
system_prompt = """
You are a helpful assistant analyzing {data_type}.
Always cite specific sections when answering.
If information is not in the context, say so clearly.
"""

app = App.from_config({
    "llm": {
        "provider": "openai",
        "config": {
            "system_prompt": system_prompt,
            "temperature": 0.3,  # lower temperature for factual accuracy
        }
    }
})
```
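The `{data_type}` placeholder in a prompt like the one above is meant to be filled once, when the app knows which source it is serving, e.g. with `str.format`:

```python
system_prompt_template = """
You are a helpful assistant analyzing {data_type}.
Always cite specific sections when answering.
If information is not in the context, say so clearly.
"""

# Fill the placeholder at startup for the concrete data source
system_prompt = system_prompt_template.format(data_type="PDF documents")
print(system_prompt)
```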
Multi-Source Chat
Combine Multiple Data Sources
```python
import tempfile

import streamlit as st
from embedchain import App

st.title("Multi-Source Chat")

api_key = st.text_input("OpenAI API Key", type="password")
if api_key:
    db_path = tempfile.mkdtemp()
    app = App.from_config({
        "llm": {"provider": "openai", "config": {"api_key": api_key}},
        "vectordb": {"provider": "chroma", "config": {"dir": db_path}},
        "embedder": {"provider": "openai", "config": {"api_key": api_key}},
    })

    # Add multiple sources
    pdf = st.file_uploader("Upload PDF", type="pdf")
    youtube = st.text_input("YouTube URL")
    github = st.text_input("GitHub Repo")

    if pdf:
        # Write the upload to disk first; app.add expects a file path
        with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as f:
            f.write(pdf.getvalue())
        app.add(f.name, data_type="pdf_file")
    if youtube:
        app.add(youtube, data_type="youtube_video")
    if github:
        app.add(f"repo:{github} type:repo", data_type="github")

    # Query across all sources
    query = st.text_input("Ask anything")
    if query:
        answer = app.chat(query)
        st.write(answer)
```
Resources
Embedchain Docs: complete framework documentation
Example Repository: all Chat with X implementations
RAG Tutorial: step-by-step RAG guide
Gmail Tutorial: complete Gmail RAG tutorial