Installation & setup
How do I run h2oGPT with a LLaMa-3 or chat-template-based model?
How do I control where model files are stored?
- `HUGGINGFACE_HUB_CACHE`: Hugging Face model hub cache (default: `~/.cache/huggingface/hub`)
- `TRANSFORMERS_CACHE`: Hugging Face transformers cache (default: `~/.cache/huggingface/transformers`)
- `HF_HOME`: broad location for any HF objects
- `XDG_CACHE_HOME`: broadly any `~/.cache` items
- `--llamacpp_path=<location>`: CLI flag for llama.cpp / GGUF model files
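As a sketch of how these combine (the directory paths below are arbitrary examples, not defaults):

```shell
# Keep all Hugging Face downloads on a large data disk (example path)
export HF_HOME=/data/hf-cache

# Store llama.cpp / GGUF files in a custom folder via the CLI flag
python generate.py --base_model=llama --llamacpp_path=/data/gguf-models
```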
How do I set a Hugging Face access token?
Set the `HUGGING_FACE_HUB_TOKEN` environment variable, or pass `--use_auth_token=True` when starting h2oGPT.
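For example (the token value and model name here are placeholders; any gated model works the same way):

```shell
# Placeholder token; substitute your own HF access token
export HUGGING_FACE_HUB_TOKEN=hf_...
python generate.py --base_model=meta-llama/Meta-Llama-3-8B-Instruct --use_auth_token=True
```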
What environment variables can I use to configure h2oGPT?
| Variable | Description |
|---|---|
| `SAVE_DIR` | Local directory to save logs |
| `ADMIN_PASS` | Password for system info and log access |
| `HUGGING_FACE_HUB_TOKEN` | HF token for private models |
| `LANGCHAIN_MODE` | LangChain mode override |
| `CUDA_VISIBLE_DEVICES` | Comma-separated list of CUDA device IDs to expose |
| `CONCURRENCY_COUNT` | Number of concurrent Gradio users (1 is fastest for a single GPU) |
| `ALLOW_API` | Whether API access is permitted |
| `H2OGPT_BASE_PATH` | Base folder for all non-personal files |
| `LLAMACPP_PATH` | Directory for llama.cpp URL downloads |
Any `generate.py` CLI argument `--foo_bar=value` can also be set via the environment variable `H2OGPT_FOO_BAR=value`.
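To illustrate the naming convention, these two launches should be equivalent (`max_seq_len` is just an example flag):

```shell
# CLI flag form
python generate.py --max_seq_len=4096

# Equivalent environment variable form
H2OGPT_MAX_SEQ_LEN=4096 python generate.py
```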
How do I use h2oGPT purely as an LLM controller without embeddings or GPU?
Alternatively, pass `--gpus none` instead of the env var.

Models & hardware
How do I load a model with 4-bit or 8-bit quantization?
Pass `--load_4bit=True` or `--load_8bit=True`. GGUF models provide the most control over GPU offloading and work on both CPU and GPU.
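For instance, a quantized launch might look like this (the model name is only an example):

```shell
# bitsandbytes 4-bit quantization
python generate.py --base_model=HuggingFaceH4/zephyr-7b-beta --load_4bit=True

# or 8-bit quantization
python generate.py --base_model=HuggingFaceH4/zephyr-7b-beta --load_8bit=True
```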
How do I use multiple GPUs?
To shard a model across GPUs, pass `--use_gpu_id=False`. Select specific GPU IDs via `CUDA_VISIBLE_DEVICES`. Note that `--use_gpu_id=False` is disabled by default because, in rare cases, torch can hit a `cuda:x` vs. `cuda:y` device-mismatch bug. Embedding, captioning, ASR, and other auxiliary models can also be given their own GPU assignments.
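A minimal multi-GPU sketch, assuming two usable devices (the GPU IDs and model name are examples):

```shell
# Expose GPUs 0 and 2 only, then shard the model across them
export CUDA_VISIBLE_DEVICES=0,2
python generate.py --base_model=h2oai/h2ogpt-4096-llama2-13b-chat --use_gpu_id=False
```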
What are the low-memory mode options?
- Use quantized models: GGUF, AWQ, GPTQ, or bitsandbytes 4-bit
- Run embedding on CPU: `--pre_load_embedding_model=True --embedding_gpu_id=cpu`
- Use a smaller embedding model: `--hf_embedding_model=BAAI/bge-base-en-v1.5 --cut_distance=10000`
- Disable scoring: `--score_model=None`
- Disable speech features: `--enable_tts=False --enable_stt=False --enable_transcriptions=False`
- Limit sequence length: `--max_seq_len=4096`
- For GGUF, limit GPU layers: `--n_gpu_layers=10`
- Reduce document chunks in context: `--top_k_docs=3`
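Combining the options above, a low-memory launch could look like this sketch (the GGUF model name is an example choice):

```shell
python generate.py \
  --base_model=TheBloke/zephyr-7B-beta-GGUF \
  --max_seq_len=4096 \
  --n_gpu_layers=10 \
  --score_model=None \
  --enable_tts=False --enable_stt=False --enable_transcriptions=False \
  --pre_load_embedding_model=True --embedding_gpu_id=cpu \
  --top_k_docs=3
```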
How do I run T5 or sequence-to-sequence models?
Pass `--force_seq2seq_type=True` or `--force_t5_type=True`. Note that `CohereForAI/aya-101` is already auto-detected as a T5 conditional model.
How do I add a custom prompt template for a new model?
- Add a new key/value to `prompt_type_to_model_name` in `prompter.py`
- Add a new enum entry in `enums.py`
- Add a new block in `get_prompt()` in `prompter.py`
Document Q&A
How do I control the quality and speed of document parsing?
- `--use_unstructured_pdf='auto'`: set `'off'` to disable, `'on'` to force
- `--use_pypdf='auto'`
- `--enable_pdf_ocr='auto'`
- `--enable_pdf_doctr='auto'`
- `--try_pdf_as_html='auto'`
How does top_k_docs affect context quality?
`--top_k_docs` controls how many document chunks fill the LLM context.

- Default: `3` (fast, lower quality)
- `--top_k_docs=10`: balanced for most use cases
- `--top_k_docs=-1`: autofill context (best quality, slower)

With `--top_k_docs=-1`, you can bound token usage:

- `--max_input_tokens=3000`: cap per-call tokens
- `--max_total_input_tokens=16000`: cap total tokens across all calls
These limits are applied in `get_limited_prompt()`.
How do I set up the Text Embedding Inference (TEI) server?
A common issue is `413 Payload Too Large` errors when sending large requests to the TEI server.
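A minimal sketch of launching TEI with Docker, assuming the official image and a small embedding model (the image tag, port, and model are example choices):

```shell
# Serve BAAI/bge-base-en-v1.5 over HTTP on port 8080
docker run --gpus all -p 8080:80 \
  ghcr.io/huggingface/text-embeddings-inference:1.5 \
  --model-id BAAI/bge-base-en-v1.5
```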
How do I migrate a Chroma database from version < 0.4 to >= 0.4?
Run the `chroma-migrate` tool, point it at the existing database directory (e.g. `db_dir_UserData`), and choose a new name like `db_dir_UserData_mig`. After migration completes, swap the migrated directory in for the original.
How do I use h2oGPT with non-English languages?
API & integration
How do I configure API key access?
`h2ogpt_api_keys.json` is a JSON array of allowed key strings.
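For instance, a minimal key file can be created like this (the key strings are placeholders):

```shell
# Write a JSON array of allowed API keys
cat > h2ogpt_api_keys.json <<'EOF'
["example-key-1", "example-key-2"]
EOF
```

The file is then passed at launch via `--h2ogpt_api_keys=h2ogpt_api_keys.json`.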
How do I connect to inference servers like vLLM, Ollama, or Anthropic?
Set `--inference_server`. Examples:

| Provider | Example value |
|---|---|
| Ollama | `vllm_chat:http://localhost:11434/v1/` |
| vLLM | `vllm:111.111.111.111:5005` |
| Anthropic Claude | `anthropic` (set `ANTHROPIC_API_KEY`) |
| OpenAI | `openai_chat` (set `OPENAI_API_KEY`) |
| Google Gemini | `google` (set `GOOGLE_API_KEY`) |
| Groq | `groq` (set `GROQ_API_KEY`) |
| MistralAI | `mistralai` |
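As a sketch, connecting to a local Ollama server might look like this (the served model name `llama3` is an example and must match what Ollama is running):

```shell
python generate.py \
  --inference_server=vllm_chat:http://localhost:11434/v1/ \
  --base_model=llama3
```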
How do I get JSON output or use guided generation?
Use `response_format` or guided generation parameters:

- `response_format=json_object`: best-effort JSON for any model
- `response_format=json_code`: JSON via code block extraction, works on most models
- `guided_json=<schema>`: strict schema enforcement (requires vLLM >= 0.4.0 or Anthropic Claude 3)
- `guided_regex`, `guided_choice`, `guided_grammar`: other constraint types (vLLM >= 0.4.0 only)
For example, `guided_json` can enforce that responses match a given JSON schema.
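A hedged sketch of such a request against an OpenAI-compatible endpoint backed by vLLM; the port, model name, and schema are all assumptions for illustration:

```shell
curl http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mymodel",
    "messages": [{"role": "user", "content": "Describe a person as JSON."}],
    "guided_json": {
      "type": "object",
      "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
      "required": ["name", "age"]
    }
  }'
```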
How do I launch parallel OpenAI proxy or ingestion servers?
How do I enable HTTPS for both server and client?
Performance & memory
How do I speed up ASR (automatic speech recognition)?
Use `distil-whisper/distil-large-v3` for approximately 10× faster inference at similar accuracy, or `faster_whisper` for approximately 4× faster inference with 2× lower memory usage.
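A sketch of launching with the distilled Whisper model (flag combination shown is an example; enable only the speech features you need):

```shell
python generate.py \
  --asr_model=distil-whisper/distil-large-v3 \
  --enable_stt=True --enable_transcriptions=True
```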
How do I use attention sinks for very long outputs?
`window_length` must be larger than any single prompt input. Set `--max_input_tokens` to the same value to enforce this.
How do I throttle GPU power to prevent resets?
Use `nvidia-smi` to set a lower power limit.
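For example (the 250 W value is an arbitrary illustration; pick a limit within your card's supported range, shown by the query command):

```shell
# Inspect current and supported power limits
nvidia-smi -q -d POWER

# Cap GPU 0 at 250 W (requires root)
sudo nvidia-smi -i 0 -pl 250
```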
What does the model name number (e.g. '512') mean?
`h2oai/h2ogpt-oasst1-512-20b` was trained with a cutoff of 512 tokens. Shorter cutoffs result in faster training and more focus on the tail of the input. For fine-tuning your own data, a cutoff of 512 is reasonable for most instruction datasets.
Is h2oGPT multilingual?
- Use a language-specific LLM (e.g. `LeoLM` for German, `JAIS` for Arabic)
- Use a matching embedding model (e.g. `BAAI/bge-base-zh-v1.5` for Chinese)
- Translate the query and summary prompts (`--pre_prompt_query`, `--prompt_summary`, etc.)
- Set a native-language system prompt via `--system_prompt`