Supported file types
Native document types
These formats are ingested directly, without optional dependencies: .pdf, .txt, .csv, .toml, .py, .rst, .rtf, .md, .html, .mhtml, .htm, .docx, .doc, .xlsx, .xls, .enex, .eml, .epub, .odt, .pptx, .ppt, .xml
Image types (optional)
When vision/OCR dependencies are installed, h2oGPT can extract text from images: .jpg, .jpeg, .png, .gif, .bmp, .webp, .tiff, .tif, .svg, .psd, and many more (over 50 image formats via Pillow).
Audio and video (optional)
Audio and video files are transcribed using Whisper before being stored in the database: .mp4, .mpeg, .mpg, .mp3, .ogg, .flac, .aac, .au
Meta types
| Type | Description |
|---|---|
| .zip | Archive containing any native data type. |
| .urls | Plain text file with one URL per line; each URL is fetched and ingested. |
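A .urls file is just plain text with one URL per line. This sketch creates one (the filename example.urls is arbitrary); h2oGPT would fetch and ingest each listed page:

```shell
# Two URLs, one per line; h2oGPT fetches each and ingests the parsed text.
cat > example.urls <<'EOF'
https://arxiv.org/abs/1706.03762
https://en.wikipedia.org/wiki/Attention_Is_All_You_Need
EOF
```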
UI input sources
Beyond file uploads, the UI also accepts:
- URL — Any http:// or https:// address. h2oGPT fetches and parses the page.
- ArXiv — Enter an ArXiv identifier such as arXiv:1706.03762.
- Text — Paste raw text directly into the UI.
If you upload a zip file that contains images or PDFs to be processed by DocTR or Florence-2, upload the zip separately. Uploading it alongside other files triggers a CUDA multiprocessing error in forked subprocesses.
Vector databases
h2oGPT supports several vector database backends. Choose one by passing --db_type to generate.py or make_db.py.
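For example, to start the server against a non-default backend (a sketch; any other generate.py flags you normally pass stay the same):

```shell
# Use FAISS instead of the default Chroma; "faiss" could equally be
# "weaviate" or "qdrant".
python generate.py --db_type=faiss
```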
- Chroma (default)
- FAISS
- Weaviate
- Qdrant
Chroma is the default; no extra flags are required. It stores its files in db_dir_UserData/ by default and supports filtering by document, making it the most feature-complete option for the h2oGPT UI.

Uploading documents via the UI
Select a collection
In the left sidebar under Resources → Collections, choose where uploaded documents will be stored. For a shared persistent collection use UserData; for a private temporary collection use MyData.
Upload files
Click the upload area on the main chat panel, or navigate to the Document Selection tab and drag-and-drop files there. Ingestion runs in parallel, with progress reported on stdout.
Wait for ingestion
Embedding runs on GPU if available. The Doc Counts field updates when ingestion is complete.
Building a collection from the CLI
Use src/make_db.py to build or update a vector database outside of the running chatbot.
Build a new database
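A minimal sketch, assuming make_db.py accepts --user_path and --collection_name options (check python src/make_db.py --help for your version):

```shell
# Scan ./user_path for documents and build a fresh Chroma collection
# named UserData (stored under db_dir_UserData/ by default).
python src/make_db.py --user_path=user_path --collection_name=UserData
```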
Add documents to an existing database
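A sketch of updating in place; the --add_if_exists flag is an assumption here, so verify the exact option name against python src/make_db.py --help:

```shell
# Re-run over the same directory; --add_if_exists (assumed flag name)
# appends new documents rather than refusing to touch the existing database.
python src/make_db.py --user_path=user_path --collection_name=UserData --add_if_exists=True
```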
Download and use example databases
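A hypothetical sketch of the workflow (substitute a real URL for a prebuilt database archive; the archive layout is an assumption):

```shell
# Fetch a prebuilt Chroma directory, unpack it where generate.py expects
# the UserData collection, then start the server against it.
wget <url-of-prebuilt-db>.zip
unzip <url-of-prebuilt-db>.zip     # assumed to yield db_dir_UserData/
python generate.py --langchain_mode=UserData
```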
Build multiple collections with different embeddings
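A sketch of two runs producing two separately indexed collections; --collection_name and --hf_embedding_model as make_db.py options are assumptions based on the flags documented below, and hkunlp/instructor-large is the Hugging Face id for instructor-large:

```shell
# One collection embedded with instructor-large (the GPU default)...
python src/make_db.py --user_path=user_path --collection_name=UserData \
    --hf_embedding_model=hkunlp/instructor-large

# ...and a second, separate collection embedded with a BGE model.
python src/make_db.py --user_path=user_path --collection_name=UserDataBGE \
    --hf_embedding_model=BAAI/bge-large-en-v1.5
```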
You can maintain separate collections indexed with different embedding models by passing a different embedding model and collection name on each run.

Limit ingestion to specific file types

Pass --selected_file_types to restrict which file extensions are ingested.
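For instance (a sketch; the exact value format for --selected_file_types is an assumption, so check python src/make_db.py --help):

```shell
# Ingest only PDFs and Word documents from user_path; other file types
# in the directory are skipped.
python src/make_db.py --user_path=user_path --selected_file_types="['pdf', 'docx']"
```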
HYDE (Hypothetical Document Embeddings)
HYDE improves retrieval for vague or underspecified questions. Instead of embedding the raw user query, h2oGPT first asks the LLM to generate a hypothetical answer, then embeds that answer to find better-matching document chunks.
- HYDE Level 0 — Normal retrieval: embed the user’s query directly.
- HYDE Level 1+ — Perform one or more HYDE iterations before the final retrieval step.
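Assuming generate.py exposes a --hyde_level flag controlling the number of iterations (an assumption; confirm with python generate.py --help), a sketch:

```shell
# One HYDE iteration: the LLM drafts a hypothetical answer, which is
# embedded in place of the raw query for the final retrieval step.
python generate.py --hyde_level=1
```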
Semantic chunking
When a GPU is available, h2oGPT can use semantic chunking to split documents at meaningful boundaries rather than at fixed token counts. This generally improves retrieval accuracy for long documents. Semantic chunking is enabled automatically when GPU resources are present; disable or tune it in the Expert tab under Document Control.

Embedding models
The embedding model converts text chunks into vectors for similarity search. h2oGPT uses a single embedding model per collection — you cannot mix embeddings within one database.

| Model | Recommended for | Notes |
|---|---|---|
| instructor-large | GPU (default) | Most accurate. Produces a flat score distribution, so reference scores appear high. |
| all-MiniLM-L6-v2 | CPU (default) | Faster, lower memory. Sharper score distribution; references are more intuitive. |
| BAAI/bge-large-en-v1.5 | PDF-heavy workloads | Good accuracy/speed tradeoff. |
| BAAI/bge-small-en-v1.5 | Speed-optimized ingest | Useful for large PDF ingest pipelines. |
| BAAI/bge-m3 | Multilingual documents | Higher memory usage; set CHROMA_MAX_BATCH_SIZE=1 if the GPU runs out of memory. |
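Selecting a model at startup might look like this (a sketch using the --hf_embedding_model flag documented below; other flags omitted):

```shell
# Override the per-hardware default with a BGE model.
python generate.py --hf_embedding_model=BAAI/bge-large-en-v1.5

# For bge-m3, also cap Chroma's batch size to limit GPU memory use.
CHROMA_MAX_BATCH_SIZE=1 python generate.py --hf_embedding_model=BAAI/bge-m3
```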
Key generate.py flags
| Flag | Description |
|---|---|
--langchain_mode | Default collection to use on startup (e.g. UserData, MyData, LLM). |
--langchain_modes | List of collections to expose in the UI. |
--user_path | Directory to scan for documents when building or updating UserData. |
--hf_embedding_model | HuggingFace model ID or TEI endpoint for embeddings. |
--db_type | Vector database backend: chroma (default), faiss, weaviate, qdrant. |
--allow_upload_to_user_data | Allow UI uploads to the shared UserData collection. |
--allow_upload_to_my_data | Allow UI uploads to the private MyData collection. |
--pre_load_embedding_model | Load the embedding model at startup for faster first ingestion. |
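Putting several of these flags together, a typical startup might look like the following sketch (flag names are from the table above; the directory name user_path is arbitrary):

```shell
# Shared UserData collection backed by ./user_path, plus private MyData
# and plain-LLM modes; the embedding model is preloaded so the first
# ingestion does not pay model-load latency.
python generate.py \
    --langchain_mode=UserData \
    --langchain_modes="['UserData', 'MyData', 'LLM']" \
    --user_path=user_path \
    --allow_upload_to_user_data=True \
    --pre_load_embedding_model=True
```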