Supported Formats

Overview

RAG Chat currently accepts only PDF documents as input for its RAG (Retrieval-Augmented Generation) pipeline. Knowing what the loader can and cannot extract helps you choose source documents that produce accurate answers.

Supported File Formats

PDF Documents

PDF (.pdf, required): Portable Document Format, the only format currently supported for document uploads.

Supported Features:

Text-based PDF files
Multi-page documents
Multiple file uploads simultaneously
Persistent storage across sessions

Configuration in Code: The application explicitly restricts uploads to PDF files only:

uploaded_files = st.file_uploader(
    label='Faça aqui o upload dos seus arquivos: ',  # Portuguese: "Upload your files here: "
    accept_multiple_files=True,
    type='pdf',  # Only PDF files accepted
)

Source Reference: app.py:97-102

File Processing Details

Text Extraction

RAG Chat uses PyPDFLoader from LangChain to extract text:

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader(temp_file_path)
docs = loader.load()

How It Works:

Each upload is saved to a temporary file with a .pdf suffix
PyPDFLoader extracts text from every page
Text is returned as LangChain Document objects, one per page
The temporary file is deleted after processing
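The temporary-file lifecycle described above can be sketched as follows. This is a minimal illustration, not the code from app.py: process_upload and load_fn are hypothetical names, and the loader is passed in as a plain function so the sketch runs without LangChain installed.

```python
import os
import tempfile
from typing import Callable, List


def process_upload(data: bytes, load_fn: Callable[[str], List[str]]) -> List[str]:
    """Write uploaded bytes to a temporary .pdf file, load it, then delete it."""
    # delete=False lets us close the file before the loader reopens it by path
    # (reopening a still-open NamedTemporaryFile fails on Windows).
    with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
        tmp.write(data)
        temp_file_path = tmp.name
    try:
        return load_fn(temp_file_path)
    finally:
        # Runs even if the loader raises, so no temp files are left behind.
        os.remove(temp_file_path)
```

In the application, load_fn would presumably be something like `lambda path: PyPDFLoader(path).load()`. Cleaning up in a `finally` block is one way to make sure the file is removed even when extraction fails.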

Document Chunking

Extracted text is split into manageable chunks for embedding:

chunk_size (int, default: 1000)

Maximum number of characters per chunk.

chunk_overlap (int, default: 400)

Number of characters shared between consecutive chunks (40% of chunk_size).

Why Chunking Matters:

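Embedding Limits: OpenAI embedding models cap input length, and short, focused passages tend to embed more distinctly than whole documents
Context Preservation: The 40% overlap makes it likely that text straddling a chunk boundary appears whole in at least one chunk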
Retrieval Accuracy: Smaller chunks improve semantic search precision

Configuration:

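# Import path assumed; older LangChain releases expose it as langchain.text_splitter
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=400
)
chunks = text_splitter.split_documents(docs)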

Source Reference: app.py:33-37
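To see what the two settings mean numerically, here is a simplified sketch using fixed-size character windows. It is not how RecursiveCharacterTextSplitter works internally (that splitter prefers to break on separators such as paragraphs and sentences, so real chunks are usually shorter and uneven); sliding_window_chunks is a hypothetical helper used only to show the arithmetic.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 400) -> List[str]:
    """Fixed-size character windows; a simplification, not the recursive splitter."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    stride = chunk_size - chunk_overlap  # 600 with the documented settings
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += stride
    return chunks


text = "x" * 2500
chunks = sliding_window_chunks(text)
print(len(chunks), [len(c) for c in chunks])  # 4 [1000, 1000, 1000, 700]
```

With a 600-character stride, each new chunk repeats the last 400 characters of the previous one, so a document produces roughly one chunk per 600 characters rather than one per 1000. That extra volume is the price of the overlap in embedding calls.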

File Limitations

Format Restrictions

Only PDF files are currently supported. Other formats will be rejected by the file uploader.

Unsupported Formats:

Word documents (.doc, .docx)
Plain text files (.txt)
Markdown files (.md)
HTML files (.html)
Images (.jpg, .png)
Spreadsheets (.xls, .xlsx, .csv)

PDF Type Limitations

Scanned PDFs / Image-Based PDFs

Problem: PDFs that consist of scanned images without an embedded text layer will not work properly.

Why: PyPDFLoader reads the text layer, not the page images, so scanned pages yield little or no text.

Solution: Run OCR (Optical Character Recognition) to produce a text-searchable PDF first, or use a different document loader that includes OCR capabilities.
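One way to catch this case early is to check how much text each page actually produced. The sketch below is an assumption-laden illustration: find_textless_pages is a hypothetical helper, and the 20-character threshold is an arbitrary choice, not a value taken from the application.

```python
from typing import List


def find_textless_pages(page_texts: List[str], min_chars: int = 20) -> List[int]:
    """Return 1-based page numbers whose extracted text is (nearly) empty."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < min_chars
    ]


pages = ["Chapter 1. Retrieval-augmented generation combines search with an LLM.", "", "  \n "]
print(find_textless_pages(pages))  # [2, 3]
```

With LangChain documents, the input could be built as `[doc.page_content for doc in docs]`. If most pages of a file are flagged, the PDF is probably image-based and needs OCR.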

Password-Protected PDFs

Problem: Encrypted or password-protected PDFs cannot be processed.

Why: PyPDFLoader cannot access encrypted content without decryption.

Solution: Remove password protection before uploading, or implement password handling in the application.

PDFs with Complex Layouts

Problem: PDFs with tables, multi-column layouts, or heavy formatting may extract poorly.

Why: PyPDFLoader emits text in the order it is stored in the file, which may not match the visual reading order.

Solution:

Pre-process PDFs to simplify layout
Use specialized loaders for structured data
Verify extraction quality with test uploads

Very Large PDFs

Problem: Extremely large files (100+ pages, 50+ MB) may cause performance issues.

Why:

They take longer to process
They generate many chunks, increasing embedding costs
They may exceed memory limits in constrained environments

Solution:

Split large documents into smaller sections
Process documents in batches
Monitor OpenAI API costs for large uploads
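Batch processing can be as simple as slicing the chunk list before it is sent for embedding. The following is a minimal sketch, assuming the chunks are already in memory; batched is a hypothetical helper and the batch size of 3 is only for demonstration.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


chunks = [f"chunk-{i}" for i in range(7)]
for batch in batched(chunks, batch_size=3):
    # In the app, each batch would be passed to the vector store here.
    print(len(batch), batch[0])
```

Smaller batches keep each embedding request bounded and make it easier to retry or report progress when one request fails.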

Best Practices for Source Documents

Ideal Document Characteristics

Text-Based PDFs

Documents with selectable text, not scanned images

Clear Structure

Well-organized content with headings and paragraphs

Reasonable Size

5-50 pages per document for optimal performance

Relevant Content

Documents focused on topics you want to query

Document Quality Tips

Verify Text Extraction

Test document quality by trying to copy/paste text from your PDF. If you can select and copy text, it will likely work well.

Remove Unnecessary Pages

Delete cover pages, tables of contents, and appendices that don’t contain relevant information. This reduces processing time and costs.

Use Native PDFs When Possible

PDFs exported directly from Word, Google Docs, or LaTeX typically work better than scanned documents.

Check File Size

Keep individual files under 10 MB when possible. Larger files can be split into sections.

Multiple File Uploads

RAG Chat supports uploading multiple PDF files simultaneously:

accept_multiple_files=True  # app.py:99

How It Works:

Upload Multiple Files

Select multiple PDF files in the file uploader interface.

Sequential Processing

Each file is processed one at a time:

all_chunks = []  # collects chunks from every uploaded file
for uploaded_file in uploaded_files:
    chunks = process_file(uploaded_file)
    all_chunks.extend(chunks)

Combined Vector Store

All chunks from all files are added to a single vector store:

vector_store = add_to_vector_store(
    vector_store=vector_store,
    documents=all_chunks
)

Unified Search

When asking questions, the system searches across all uploaded documents.

Benefits:

Build a comprehensive knowledge base from multiple sources
Compare information across different documents
Ask questions that span multiple files
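The effect of pooling can be illustrated with a toy example. Note the substitution: this sketch ranks chunks by shared words, not by embedding similarity as the real vector store does, and the file names and texts are invented for the demonstration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical stand-in for the chunks produced from two uploaded PDFs.
chunks_by_file: Dict[str, List[str]] = {
    "handbook.pdf": ["Employees accrue vacation days monthly.", "Expenses need manager approval."],
    "policy.pdf": ["Remote work requires a signed agreement.", "Vacation requests go through the HR portal."],
}

# Pool every chunk into one list, keeping the source file alongside the text.
pooled: List[Tuple[str, str]] = [
    (source, text) for source, texts in chunks_by_file.items() for text in texts
]


def tokenize(text: str) -> Set[str]:
    """Lowercase words with surrounding punctuation removed."""
    return {word.strip(".,?!").lower() for word in text.split()}


def search(query: str, k: int = 2) -> List[Tuple[str, str]]:
    """Rank pooled chunks by shared words with the query (not embeddings)."""
    query_words = tokenize(query)
    ranked = sorted(pooled, key=lambda item: len(query_words & tokenize(item[1])), reverse=True)
    return ranked[:k]


for source, text in search("How do I request vacation days?"):
    print(source, "->", text)
```

Because all chunks live in one pool, a single question can return passages from different files, which is what makes cross-document questions possible.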

Troubleshooting

Upload Rejected - Wrong File Type

Symptoms: File uploader doesn’t accept your file

Solutions:

Verify file has .pdf extension
Check that file is actually a PDF (not renamed from another format)
Try re-exporting as PDF from the original application
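A renamed file can be spotted by looking at its first bytes, since PDF files start with a `%PDF-` header. The helper below is a hypothetical sketch, not part of the application; the 1024-byte search window is an assumption that mirrors the leniency of many PDF readers toward a few leading junk bytes.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes contain the PDF header marker near the start."""
    # PDF files begin with "%PDF-" followed by a version such as 1.7.
    # Some files carry a few junk bytes first, so search a short prefix.
    return b"%PDF-" in data[:1024]


print(looks_like_pdf(b"%PDF-1.7\n..."))         # True
print(looks_like_pdf(b"PK\x03\x04 docx data"))  # False
```

With Streamlit, the bytes could come from `uploaded_file.getvalue()`. A file that fails this check was most likely renamed from another format and should be re-exported as PDF.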

PDF Uploaded but No Text Extracted

Symptoms: File processes successfully but responses say “no information available”

Solutions:

Check if PDF is scanned/image-based (try selecting text in PDF viewer)
Use OCR software to convert scanned PDF to searchable PDF
Verify PDF isn’t corrupted by opening in multiple PDF readers

Processing Very Slow

Symptoms: File upload takes minutes to complete

Solutions:

Split large PDFs into smaller files (10-20 pages each)
Upload fewer files at once
Check file size - consider compressing large PDFs
Ensure stable internet connection for OpenAI API calls

Gibberish or Garbled Text in Responses

Symptoms: Extracted text appears scrambled or nonsensical

Solutions:

PDF may have encoding issues - try re-exporting from source
Check if PDF uses custom fonts that don’t embed properly
For non-English text, verify UTF-8 encoding support
Try opening PDF in different viewer to confirm text quality

Future Format Support

While only PDFs are currently supported, the architecture could be extended to support:

Word Documents

Using UnstructuredWordDocumentLoader from LangChain

Markdown Files

Using UnstructuredMarkdownLoader for documentation

Web Pages

Using WebBaseLoader for online content

Text Files

Using TextLoader for plain text documents
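A possible shape for such an extension is a lookup from file extension to loader. This is a hypothetical sketch, not existing application code: the loader class names are kept as strings so it runs without LangChain installed, and WebBaseLoader is left out because it takes a URL rather than an uploaded file.

```python
from pathlib import Path

# Loader class names from langchain_community.document_loaders, kept as strings
# so this sketch runs without LangChain installed.
LOADER_BY_SUFFIX = {
    ".pdf": "PyPDFLoader",
    ".docx": "UnstructuredWordDocumentLoader",
    ".md": "UnstructuredMarkdownLoader",
    ".txt": "TextLoader",
}


def loader_name_for(filename: str) -> str:
    """Pick a loader by file extension, case-insensitively."""
    suffix = Path(filename).suffix.lower()
    try:
        return LOADER_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or filename!r}") from None


print(loader_name_for("Report.PDF"))  # PyPDFLoader
print(loader_name_for("notes.md"))    # UnstructuredMarkdownLoader
```

The uploader's `type` argument would also need to list the new extensions, and the temporary-file suffix would have to follow the uploaded file instead of being fixed to .pdf.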

To request additional format support, contribute to the project or submit a feature request on the GitHub repository.
