Khoj supports indexing a wide variety of file formats, making it easy to search and chat with your documents. You can upload files directly through the web interface, use the desktop app for automatic syncing, or integrate with your favorite note-taking tools.

Supported File Formats

Khoj can process and index the following file types:

PDF Documents

Portable Document Format files (.pdf)

Word Documents

Microsoft Word files (.docx)

Markdown

Markdown files (.md)

Org Mode

Emacs Org mode files (.org)

Plain Text

Text files (.txt) and code files

HTML/XML

Web pages (.html, .htm, .xml)
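
As a rough illustration of the list above, the following sketch checks a file's extension before upload. The mapping and the `classify_file` helper are hypothetical and not part of Khoj; the category labels are made up for this example.

```python
from pathlib import Path
from typing import Optional

# Extensions come from the list above; the labels are illustrative only.
LISTED_EXTENSIONS = {
    ".pdf": "pdf",
    ".docx": "word",
    ".md": "markdown",
    ".org": "org",
    ".txt": "plaintext",
    ".html": "html",
    ".htm": "html",
    ".xml": "xml",
}


def classify_file(path: str) -> Optional[str]:
    """Return an illustrative label for a listed extension, else None.

    None only means the extension is not in the table above. Code files
    are also accepted as plain text, so None is not a definitive
    "unsupported" verdict.
    """
    return LISTED_EXTENSIONS.get(Path(path).suffix.lower())


if __name__ == "__main__":
    for name in ["notes/Report.PDF", "index.htm", "archive.zip"]:
        print(name, "->", classify_file(name))
```

The lookup lowercases the suffix, so `Report.PDF` and `report.pdf` are treated the same way.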

Uploading Files

There are several ways to share your documents with Khoj:

Web Interface

The easiest way to get started is to upload files directly through the web UI.
  1. Navigate to the Search Page
  2. Click 'Add Documents': Look for the 'Add Documents' button in the interface
  3. Drag and Drop or Select Files: Drag files directly into the upload area, or click to browse your files
  4. Wait for Processing: Khoj automatically processes and indexes your documents, making them searchable
The web interface is perfect for one-off documents you need to interact with quickly.

Desktop App

For continuous syncing of local documents, the desktop app provides the most seamless experience.

Desktop App

Set up automatic document syncing with the Khoj desktop application
The desktop app is ideal if you have many documents on your computer or need them to stay in sync automatically.

Editor Integrations

If you use Obsidian or Emacs for note-taking, you can configure automatic syncing:

Obsidian Plugin

Sync your Obsidian vault with Khoj

Emacs Integration

Integrate Khoj with your Emacs workflow

How Files Are Processed

When you upload files to Khoj, they go through several processing steps:
  1. Text Extraction: Text is extracted from the source file (e.g., from PDFs or Word documents)
  2. Chunking: Large documents are split into smaller chunks (typically around 256 tokens) for better search results
  3. Embedding: Each chunk is converted into a vector embedding by an embedding model
  4. Indexing: The embeddings are stored in the search index, making your content searchable
Khoj automatically handles documents with special characters and multiple languages. The chunking process preserves context by maintaining headings and document structure.
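
The chunking step can be sketched as follows. This is a minimal illustration, not Khoj's implementation: it assumes whitespace-separated words as a stand-in for model tokens, and it does not model the heading and structure preservation described above. The `chunk_text` function is hypothetical.

```python
from typing import List


def chunk_text(text: str, max_tokens: int = 256) -> List[str]:
    """Split text into consecutive chunks of at most max_tokens words.

    Assumption: one whitespace-separated word counts as one token.
    A real tokenizer would usually produce different boundaries.
    """
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    words = text.split()
    return [
        " ".join(words[start:start + max_tokens])
        for start in range(0, len(words), max_tokens)
    ]


if __name__ == "__main__":
    document = "word " * 600
    chunks = chunk_text(document)
    # Report how many words ended up in each chunk.
    print([len(chunk.split()) for chunk in chunks])
```

Each chunk would then be passed to an embedding model and stored in the index; those steps depend on the model and storage backend and are left out here.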

File Limits and Performance

Very long words (over 500 characters) are automatically removed during processing to maintain quality. This typically only affects corrupted or malformed files.
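
A filter of this kind might look like the sketch below. It assumes a "word" is any run of non-whitespace characters and uses the 500-character threshold stated above; the `drop_long_words` helper is hypothetical and Khoj's own logic may differ.

```python
MAX_WORD_LENGTH = 500  # threshold taken from the text above


def drop_long_words(text: str, max_length: int = MAX_WORD_LENGTH) -> str:
    """Remove words longer than max_length characters.

    Assumption: words are whitespace-separated. As a side effect of
    splitting and re-joining, runs of whitespace collapse to one space.
    """
    return " ".join(word for word in text.split() if len(word) <= max_length)


if __name__ == "__main__":
    sample = "readable " + "x" * 501 + " text"
    print(drop_long_words(sample))
```

A word of exactly 500 characters is kept, since the limit applies only to words over 500 characters.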

Best Practices

  • Organize your documents: Use clear filenames and folder structures for easier reference
  • Keep files updated: Re-upload or sync files when you make significant changes
  • Use appropriate formats: PDFs and Word documents work well for formatted content, while Markdown and plain text are great for notes
  • Avoid duplicate uploads: The desktop app and editor integrations handle this automatically
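
When uploading manually through the web interface, one way to spot duplicates beforehand is to compare file contents by hash. The sketch below is an illustration under that assumption; `find_duplicates` is a hypothetical helper and says nothing about how the desktop app or editor integrations detect duplicates.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, List, Union


def find_duplicates(folder: Union[str, Path]) -> List[List[Path]]:
    """Group files under folder that have byte-identical contents."""
    by_digest: Dict[str, List[Path]] = defaultdict(list)
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            # Reads each file fully into memory; fine for typical notes.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    return [paths for paths in by_digest.values() if len(paths) > 1]


if __name__ == "__main__":
    for group in find_duplicates("."):
        print("Identical files:", ", ".join(str(p) for p in group))
```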

Searching Your Files

Once indexed, you can:
  • Search: Find specific information across all your documents
  • Chat: Ask questions about your documents and get contextual answers
  • Get references: Khoj provides file paths and line numbers for source attribution

Search Documentation

Learn more about searching your indexed content

Privacy and Security

Your documents are processed and stored securely. When using Khoj Cloud, your data is encrypted in transit and at rest. For maximum privacy, you can also self-host Khoj.

Self-Hosting

Host Khoj on your own infrastructure for complete data control

Troubleshooting

Files not appearing in search

  • Wait a few minutes for processing to complete
  • Check that the file format is supported
  • Verify the file isn’t empty or corrupted
  • Try re-uploading the file

Slow processing for large files

  • Large PDFs with many pages may take several minutes to process
  • Consider splitting very large documents into smaller files
  • The first sync with many files will take longer; subsequent syncs are faster
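
For Markdown notes, splitting a very large document could be done along top-level headings, as in the sketch below. It assumes sections start with lines beginning with `# ` and does not account for `#` lines inside code fences; `split_markdown_by_heading` is a hypothetical helper, and PDFs or Word files would need a dedicated tool instead.

```python
from typing import List


def split_markdown_by_heading(text: str) -> List[str]:
    """Split Markdown text into sections at each top-level '# ' heading.

    Any text before the first heading becomes its own section.
    """
    sections: List[str] = []
    current: List[str] = []
    for line in text.splitlines(keepends=True):
        # Start a new section when a heading follows existing content.
        if line.startswith("# ") and current:
            sections.append("".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("".join(current))
    return sections


if __name__ == "__main__":
    notes = "# Alpha\nfirst part\n# Beta\nsecond part\n"
    for number, section in enumerate(split_markdown_by_heading(notes), start=1):
        print(f"--- section {number} ---")
        print(section, end="")
```

Each returned section can be written to its own file before uploading.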

Character encoding issues

  • Khoj handles UTF-8 encoded files by default
  • If you see garbled text, ensure your files are saved with UTF-8 encoding
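
To check a file and re-save it as UTF-8 before uploading, something like the following sketch could be used. Both helpers are hypothetical, and the source encoding must be known or guessed by the user (for example `latin-1` or `cp1252`); the code does not detect it.

```python
from pathlib import Path
from typing import Union


def is_valid_utf8(path: Union[str, Path]) -> bool:
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def convert_to_utf8(path: Union[str, Path], source_encoding: str) -> None:
    """Rewrite the file in place as UTF-8.

    Assumption: the caller knows the file's current encoding. Passing
    the wrong encoding can produce garbled text rather than an error.
    """
    file_path = Path(path)
    text = file_path.read_bytes().decode(source_encoding)
    # Write bytes directly so line endings are left unchanged.
    file_path.write_bytes(text.encode("utf-8"))
```

A typical use is to call `is_valid_utf8` first and convert only the files that fail the check, keeping a backup of the originals.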
