Overview
RAG Chat currently supports PDF documents as input for the RAG (Retrieval-Augmented Generation) system. Understanding file format support and limitations ensures optimal performance and accuracy.
Supported File Formats
PDF Documents
Portable Document Format - the only currently supported format for document uploads.
- Text-based PDF files
- Multi-page documents
- Multiple file uploads simultaneously
- Persistent storage across sessions
app.py:97-102
File Processing Details
Text Extraction
RAG Chat uses PyPDFLoader from LangChain to extract text:
- Uploads are saved to temporary files with a .pdf suffix
- PyPDFLoader extracts text from all pages
- Text is returned as LangChain Document objects
- Temporary files are deleted after processing
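The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in app.py: the helper names `save_upload_to_temp` and `extract_documents` are hypothetical, and the `langchain_community` import path is an assumption that depends on the installed LangChain version.

```python
import os
import tempfile


def save_upload_to_temp(data: bytes) -> str:
    """Write uploaded bytes to a temporary file with a .pdf suffix; return its path."""
    with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
        tmp.write(data)
        return tmp.name


def extract_documents(data: bytes) -> list:
    """Save the upload, load every page with PyPDFLoader, then delete the temp file."""
    # Imported lazily so the helper above works without LangChain installed.
    # Assumed import path; older LangChain versions expose the loader elsewhere.
    from langchain_community.document_loaders import PyPDFLoader

    path = save_upload_to_temp(data)
    try:
        # load() returns a list of LangChain Document objects, one per page.
        return PyPDFLoader(path).load()
    finally:
        # Remove the temporary file even if extraction fails.
        os.remove(path)
```

Wrapping the load in `try`/`finally` keeps temporary files from accumulating when a malformed PDF raises an error during extraction.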
Document Chunking
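Extracted text is split into manageable chunks for embedding. Two settings control the split: the chunk size (the maximum number of characters per chunk) and the chunk overlap (the number of characters shared between consecutive chunks, set to 40% of the chunk size). The reasoning behind these settings: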
- Embedding Limits: OpenAI embeddings work best with smaller text segments
- Context Preservation: 40% overlap ensures continuity across chunks
- Retrieval Accuracy: Smaller chunks improve semantic search precision
app.py:33-37
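To make the overlap concrete, here is a minimal character-window splitter. It is a sketch only: the function `split_text` and the default values of 1000 and 400 characters are assumptions chosen to match the 40% ratio described above, and the splitter actually used in app.py may behave differently (for example by preferring paragraph boundaries).

```python
def split_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 400) -> list[str]:
    """Split text into fixed-size character windows that overlap their neighbours."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks
```

With a 40% overlap the window advances by only 60% of the chunk size, so a sentence that straddles a chunk boundary usually appears whole in at least one chunk.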
File Limitations
Format Restrictions
Unsupported Formats:
- Word documents (.doc, .docx)
- Plain text files (.txt)
- Markdown files (.md)
- HTML files (.html)
- Images (.jpg, .png)
- Spreadsheets (.xls, .xlsx, .csv)
PDF Type Limitations
Scanned PDFs / Image-Based PDFs
Problem: PDFs that are scanned images without embedded text will not work properly.
Why: PyPDFLoader extracts text layers, not images. Scanned documents appear as blank pages.
Solution: Use OCR (Optical Character Recognition) to convert scanned PDFs to text-searchable PDFs first, or use a different document loader that includes OCR capabilities.
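One way to catch this case early is to check how many pages came back with essentially no text. The sketch below is a heuristic under stated assumptions: `blank_page_ratio` and `looks_scanned` are hypothetical helpers, and the 20-character and 80% thresholds are illustrative values that would need tuning on real documents.

```python
def blank_page_ratio(page_texts: list[str], min_chars: int = 20) -> float:
    """Return the fraction of pages whose extracted text is (almost) empty."""
    if not page_texts:
        return 1.0  # no pages extracted at all counts as fully blank
    blank = sum(1 for text in page_texts if len(text.strip()) < min_chars)
    return blank / len(page_texts)


def looks_scanned(page_texts: list[str], threshold: float = 0.8) -> bool:
    """Guess that a PDF is image-based when most pages yield no text."""
    return blank_page_ratio(page_texts) >= threshold
```

The page texts could come from the `page_content` attribute of each extracted Document; a positive result is a reason to suggest OCR rather than proof that the file is scanned.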
Password-Protected PDFs
Problem: Encrypted or password-protected PDFs cannot be processed.
Why: PyPDFLoader cannot access encrypted content without decryption.
Solution: Remove password protection before uploading, or implement password handling in the application.
PDFs with Complex Layouts
Problem: PDFs with tables, multi-column layouts, or heavy formatting may extract poorly.
Why: PyPDFLoader extracts text in the order it is stored in the file, which may not match the visual layout.
Solution:
- Pre-process PDFs to simplify layout
- Use specialized loaders for structured data
- Verify extraction quality with test uploads
Very Large PDFs
Problem: Extremely large files (100+ pages, 50+ MB) may cause performance issues.
Why:
- Large files take longer to process
- They generate many chunks, increasing embedding costs
- They may exceed memory limits in constrained environments
Solution:
- Split large documents into smaller sections
- Process documents in batches
- Monitor OpenAI API costs for large uploads
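A rough chunk count helps when judging whether a document is worth splitting before upload. The sketch below assumes fixed-size character windows; `estimate_chunk_count` is a hypothetical helper, and the 1000/400 defaults are illustrative values that only mirror the 40% overlap ratio.

```python
import math


def estimate_chunk_count(total_chars: int, chunk_size: int = 1000, chunk_overlap: int = 400) -> int:
    """Estimate how many overlapping chunks a document of total_chars will produce."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    if total_chars <= 0:
        return 0
    if total_chars <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap  # new characters covered by each extra chunk
    return 1 + math.ceil((total_chars - chunk_size) / step)
```

Under these assumptions, a 100-page document averaging 3,000 characters per page (300,000 characters) yields about 500 chunks, compared with 300 if there were no overlap. Each chunk is embedded separately, so overlap raises embedding volume by roughly two thirds.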
Best Practices for Source Documents
Ideal Document Characteristics
Text-Based PDFs
Documents with selectable text, not scanned images
Clear Structure
Well-organized content with headings and paragraphs
Reasonable Size
5-50 pages per document for optimal performance
Relevant Content
Documents focused on topics you want to query
Document Quality Tips
Verify Text Extraction
Test document quality by trying to copy/paste text from your PDF. If you can select and copy text, it will likely work well.
Remove Unnecessary Pages
Delete cover pages, tables of contents, and appendices that don’t contain relevant information. This reduces processing time and costs.
Use Native PDFs When Possible
PDFs exported directly from Word, Google Docs, or LaTeX typically work better than scanned documents.
Multiple File Uploads
RAG Chat supports uploading multiple PDF files simultaneously.
Benefits:
- Build a comprehensive knowledge base from multiple sources
- Compare information across different documents
- Ask questions that span multiple files
Troubleshooting
Upload Rejected - Wrong File Type
Symptoms: File uploader doesn’t accept your file
Solutions:
- Verify file has a .pdf extension
- Check that file is actually a PDF (not renamed from another format)
- Try re-exporting as PDF from the original application
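A renamed file can be detected by looking at its first bytes, since genuine PDFs carry a `%PDF-` header. The following is a small sketch; `has_pdf_signature` is a hypothetical helper, not part of RAG Chat.

```python
def has_pdf_signature(data: bytes) -> bool:
    """Return True if the data carries a PDF header near the start of the file."""
    # The header normally sits at byte 0, but many readers tolerate a little
    # leading junk, so search the first 1024 bytes rather than only the prefix.
    return b"%PDF-" in data[:1024]
```

For example, a .docx file renamed to .pdf starts with the ZIP signature `PK` and fails this check, even though its extension passes the uploader's filter.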
PDF Uploaded but No Text Extracted
Symptoms: File processes successfully but responses say “no information available”
Solutions:
- Check if PDF is scanned/image-based (try selecting text in PDF viewer)
- Use OCR software to convert scanned PDF to searchable PDF
- Verify PDF isn’t corrupted by opening in multiple PDF readers
Processing Very Slow
Symptoms: File upload takes minutes to complete
Solutions:
- Split large PDFs into smaller files (10-20 pages each)
- Upload fewer files at once
- Check file size - consider compressing large PDFs
- Ensure stable internet connection for OpenAI API calls
Gibberish or Garbled Text in Responses
Symptoms: Extracted text appears scrambled or nonsensical
Solutions:
- PDF may have encoding issues - try re-exporting from source
- Check if PDF uses custom fonts that don’t embed properly
- For non-English text, verify UTF-8 encoding support
- Try opening PDF in different viewer to confirm text quality
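Encoding and font problems often show up as replacement characters, private-use code points, or stray control characters in the extracted text. The sketch below measures how common those are. `garbled_ratio` is a hypothetical helper and only a rough heuristic: it will not catch text that is scrambled but made of ordinary letters.

```python
import unicodedata

# Unicode categories that rarely appear in cleanly extracted prose:
# Cc = control characters, Co = private use, Cn = unassigned.
SUSPICIOUS_CATEGORIES = {"Cc", "Co", "Cn"}


def garbled_ratio(text: str) -> float:
    """Return the fraction of characters that suggest a broken extraction."""
    if not text:
        return 0.0
    bad = 0
    for ch in text:
        if ch in "\n\r\t":
            continue  # ordinary whitespace is expected
        if ch == "\ufffd" or unicodedata.category(ch) in SUSPICIOUS_CATEGORIES:
            bad += 1
    return bad / len(text)
```

A ratio well above zero on a sample of chunks is a hint to re-export the PDF from its source before uploading again.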
Future Format Support
While only PDFs are currently supported, the architecture could be extended to support:
Word Documents
Using UnstructuredWordDocumentLoader from LangChain
Markdown Files
Using UnstructuredMarkdownLoader for documentation
Web Pages
Using WebBaseLoader for online content
Text Files
Using TextLoader for plain text documents
To request additional format support, contribute to the project or submit a feature request on the GitHub repository.
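One possible shape for such an extension is a lookup from file extension to loader class. This is a sketch under assumptions rather than a planned design: `LOADER_BY_EXTENSION`, `loader_name_for`, and `make_loader` are hypothetical names, and the `langchain_community.document_loaders` module path depends on the installed LangChain version. WebBaseLoader is left out because it takes URLs rather than file paths.

```python
import importlib
from pathlib import Path

# Hypothetical mapping from file extension to a LangChain loader class name.
LOADER_BY_EXTENSION = {
    ".pdf": "PyPDFLoader",
    ".docx": "UnstructuredWordDocumentLoader",
    ".md": "UnstructuredMarkdownLoader",
    ".txt": "TextLoader",
}


def loader_name_for(filename: str) -> str:
    """Return the loader class name for a file, based on its extension."""
    suffix = Path(filename).suffix.lower()
    try:
        return LOADER_BY_EXTENSION[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or filename!r}") from None


def make_loader(path: str):
    """Instantiate the matching loader; requires LangChain to be installed."""
    # Assumed module path; resolved lazily so the lookup works without LangChain.
    module = importlib.import_module("langchain_community.document_loaders")
    return getattr(module, loader_name_for(path))(path)
```

Keeping the mapping in one table means adding a format is a one-line change, while unsupported extensions fail with a clear error instead of reaching the loader.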