Uploading Documents

Overview

RAG Chat lets you upload PDF documents, which are then processed, split into chunks, and stored in a vector database so the chat can answer questions about their content. This guide walks through the upload process and best practices.

Upload Interface

The file upload interface is located in the sidebar of the RAG Chat application:

app.py

with st.sidebar:
    st.header('Upload de arquivos')  # Portuguese UI label: "File upload"
    uploaded_files = st.file_uploader(
        label='Faça aqui o upload dos seus arquivos: ',  # "Upload your files here: "
        accept_multiple_files=True,
        type='pdf',
    )

You can upload multiple PDF files at once. All files will be processed and added to the same vector store.

How Document Processing Works

File Upload

Select one or more PDF files using the file uploader in the sidebar.

PDF Loading

Each file is processed using PyPDFLoader, which extracts text from all pages:

app.py

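from tempfile import NamedTemporaryFile
from langchain_community.document_loaders import PyPDFLoader  # assumes the langchain-community package

def process_file(file):
    # PyPDFLoader reads from a path, so write the upload to a temporary file first
    with NamedTemporaryFile(delete=False, suffix='.pdf') as temp_file:
        temp_file.write(file.read())
        temp_file_path = temp_file.name
    # Load after the with block, once the file has been flushed and closed
    loader = PyPDFLoader(temp_file_path)
    docs = loader.load()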
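
The snippet above creates the temporary file with delete=False, so the file stays on disk until something removes it. Below is a minimal sketch of one way to clean up afterwards. The helper name with_temp_pdf and the handler callback (which stands in for the PyPDFLoader call) are assumptions for illustration and are not part of app.py.

```python
import io
import os
from tempfile import NamedTemporaryFile


def with_temp_pdf(file, handler):
    """Write an uploaded file to a temporary path, run handler(path), then delete it.

    Hypothetical helper, not part of app.py. `file` is any object with a
    read() method that returns bytes, such as a Streamlit uploaded file.
    """
    with NamedTemporaryFile(delete=False, suffix='.pdf') as temp_file:
        temp_file.write(file.read())
        temp_file_path = temp_file.name
    # The with block has closed the file, so its contents are fully on disk.
    try:
        return handler(temp_file_path)
    finally:
        # Remove the temporary file even if the handler raises.
        os.remove(temp_file_path)


# Stand-in for an upload: an in-memory file holding a few bytes.
payload = b'%PDF-1.4 example bytes'
fake_upload = io.BytesIO(payload)
seen_paths = []


def record_size(path):
    """Stand-in for the loader: remember the path and report the file size."""
    seen_paths.append(path)
    return os.path.getsize(path)


size = with_temp_pdf(fake_upload, record_size)
```

In the real app, the handler would be a function that builds a PyPDFLoader from the path and returns loader.load().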

Text Chunking

The extracted text is split into manageable chunks for better retrieval:

app.py

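from langchain_text_splitters import RecursiveCharacterTextSplitter  # assumes the langchain-text-splitters package

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=400,
)
# docs is the list of page documents returned by loader.load()
chunks = text_splitter.split_documents(docs)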

chunk_size: each chunk holds at most 1000 characters
chunk_overlap: up to 400 characters are shared between consecutive chunks, so context is preserved across chunk boundaries
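
To see how the two settings interact, here is a simplified fixed-window splitter. It is not RecursiveCharacterTextSplitter, which additionally prefers to break on separators such as paragraph and line breaks; the helper name sliding_chunks is made up for illustration. With a chunk size of 1000 and an overlap of 400, each new chunk starts 600 characters after the previous one.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=400):
    """Split text into fixed-size windows that overlap by chunk_overlap characters.

    Illustrative only: a plain sliding window, not the recursive,
    separator-aware strategy the real splitter uses.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError('chunk_overlap must be smaller than chunk_size')
    step = chunk_size - chunk_overlap  # distance between chunk starts
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# 2500 characters of sample text made of repeating digits.
text = ''.join(str(i % 10) for i in range(2500))
chunks = sliding_chunks(text)

print([len(c) for c in chunks])  # [1000, 1000, 1000, 700]
print(chunks[0][-400:] == chunks[1][:400])  # True
```

In this simplified model, a larger overlap means more chunks for the same text, and therefore more embeddings to compute and store.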

Vector Store Creation

Chunks are embedded and stored in a Chroma vector database:

app.py

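def add_to_vector_store(documents, vector_store=None):
    if vector_store:
        # A store already exists: append the new chunks to it
        vector_store.add_documents(documents)
    else:
        # First upload: create a store and persist it to disk
        # (persistant_directory is defined elsewhere in app.py)
        vector_store = Chroma.from_documents(
            documents=documents,
            embedding=OpenAIEmbeddings(),
            persist_directory=persistant_directory
        )
    return vector_store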
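
What "embedded and stored" means can be illustrated with a toy in-memory store. This is not Chroma or OpenAIEmbeddings: the ToyVectorStore class, the add_to_toy_store function, and the word-count "embedding" are assumptions made purely for illustration. The create-or-extend shape mirrors the function above.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts. Real embedding models return dense float vectors."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Hypothetical in-memory stand-in for a vector database."""

    def __init__(self):
        self.entries = []  # list of (vector, original text) pairs

    def add_documents(self, texts):
        for text in texts:
            self.entries.append((embed(text), text))

    def similarity_search(self, query, k=1):
        query_vec = embed(query)
        ranked = sorted(
            self.entries,
            key=lambda entry: cosine(query_vec, entry[0]),
            reverse=True,
        )
        return [text for _, text in ranked[:k]]


def add_to_toy_store(texts, store=None):
    """Create the store on first use, otherwise extend the existing one."""
    if store is None:
        store = ToyVectorStore()
    store.add_documents(texts)
    return store


store = add_to_toy_store(['chroma stores embedding vectors on disk'])
store = add_to_toy_store(['streamlit renders the sidebar uploader'], store)
best = store.similarity_search('where are vectors stored')
```

Here best contains the first text, because it is the only stored entry that shares a word ("vectors") with the query.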

Processing Flow

When you upload files, the system processes them automatically:

app.py

with st.spinner('Carregando arquivos...'):  # Portuguese: "Loading files..."
    all_chunks = []
    for uploaded_file in uploaded_files:
        # process_file returns the chunks produced for one PDF
        chunks = process_file(uploaded_file)
        all_chunks.extend(chunks)

    if all_chunks:
        vector_store = add_to_vector_store(
            vector_store=vector_store,
            documents=all_chunks
        )

Best Practices

Optimal Document Types:

Technical documentation
Research papers
Reports and white papers
Books and manuals
Any text-heavy PDF content

Avoid:

Image-only PDFs (without OCR)
Heavily formatted documents with complex layouts
Password-protected PDFs
Scanned documents without text layer
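
One way to catch image-only or scanned PDFs before they reach the vector store is to check how much text was actually extracted. The sketch below assumes you pass in the text of each page (for example, the page_content of each document returned by loader.load()). The function name looks_image_only and both thresholds are illustrative choices, not part of app.py.

```python
def looks_image_only(page_texts, min_chars_per_page=20, empty_ratio_threshold=0.8):
    """Guess whether a PDF has no usable text layer.

    page_texts: list of strings, one per page.
    A page counts as empty when it has fewer than min_chars_per_page
    non-whitespace-trimmed characters. The PDF is flagged when the share of
    empty pages reaches empty_ratio_threshold. Both values are arbitrary
    starting points, not tuned numbers.
    """
    if not page_texts:
        return True  # nothing was extracted at all
    empty_pages = sum(
        1 for text in page_texts if len(text.strip()) < min_chars_per_page
    )
    return empty_pages / len(page_texts) >= empty_ratio_threshold


text_pdf_pages = [
    'This page has a full paragraph of extractable text.',
    'So does this one, with more than enough characters.',
]
scanned_pdf_pages = ['', '  \n', '3']

text_result = looks_image_only(text_pdf_pages)
scanned_result = looks_image_only(scanned_pdf_pages)
```

A check like this could be used to warn the user instead of silently adding empty chunks.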

Tips for Best Results

Multiple Related Documents: Upload documents on related topics together for better context understanding.
Clean PDFs: Use PDFs with a real text layer (not scanned images) so text can be extracted cleanly.
Reasonable Size: The app sets no limit of its own, but Streamlit caps each upload at 200 MB by default (the server.maxUploadSize setting), and very large documents take longer to process.
Incremental Uploads: You can upload additional documents at any time. They'll be added to the existing vector store.
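
Because new uploads are appended to the existing store, and the code shown in this guide has no duplicate check, uploading the same file twice would likely add its chunks twice. Below is a sketch of one way to guard against that with a content hash. The filter_new_files helper and the seen_hashes set are hypothetical; in a real app the set would have to live somewhere that survives reruns and restarts.

```python
import hashlib
import io


def filter_new_files(files, seen_hashes):
    """Return only the files whose content has not been seen before.

    Hypothetical helper, not part of app.py. `files` are file-like objects
    with read() and seek(); `seen_hashes` is a set that is updated in place.
    """
    new_files = []
    for file in files:
        data = file.read()
        file.seek(0)  # rewind so later code can read the file again
        digest = hashlib.sha256(data).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            new_files.append(file)
    return new_files


seen = set()
first_batch = filter_new_files(
    [io.BytesIO(b'report A'), io.BytesIO(b'report B')], seen
)
second_batch = filter_new_files(
    [io.BytesIO(b'report A'), io.BytesIO(b'report C')], seen
)
```

In this run, first_batch keeps both files, while second_batch keeps only the file containing "report C".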

Persistent Storage

The embedded chunks are stored persistently in the db directory:

app.py

persistant_directory = 'db'

def load_existing_vector_store():
    if os.path.exists(persistant_directory):
        vector_store = Chroma(
            persist_directory=persistant_directory,
            embedding_function=OpenAIEmbeddings()
        )
        return vector_store
    return None

Once uploaded, documents persist across sessions. You don’t need to re-upload them every time you restart the application.
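
Note that os.path.exists is also true for an empty directory. If you want to tell "a db folder exists" apart from "something has been persisted", a slightly stricter check could look like the sketch below. The has_persisted_data helper is hypothetical, and it only checks that the directory contains at least one entry, not that the contents are valid Chroma data. The file name index.bin is a placeholder.

```python
import os
import tempfile


def has_persisted_data(directory):
    """Return True if directory exists and contains at least one entry."""
    if not os.path.isdir(directory):
        return False
    with os.scandir(directory) as entries:
        return any(entries)


with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, 'db')

    missing = has_persisted_data(db_path)    # directory does not exist yet

    os.mkdir(db_path)
    empty = has_persisted_data(db_path)      # exists, but nothing inside

    # Placeholder file standing in for whatever the store writes to disk.
    with open(os.path.join(db_path, 'index.bin'), 'wb') as f:
        f.write(b'\x00')
    populated = has_persisted_data(db_path)  # exists and has content
```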
