Model families
LLaMa 2 / LLaMa 3
Meta’s open-weight models, from 7B to 70B parameters. Supported in full precision, 8-bit, 4-bit, AutoGPTQ, AutoAWQ, and GGUF formats.
Mistral / Mixtral
Mistral 7B and Mixtral 8×7B instruction-tuned variants. Supported via HuggingFace, vLLM, and the MistralAI inference API.
Falcon
TII Falcon 7B, 40B, and instruct variants. Supported via HuggingFace Transformers and HF TGI server.
Zephyr
HuggingFace H4 Zephyr models (e.g. zephyr-7b-beta). Recommended for production 7B usage with 8-bit loading.
Vicuna / WizardLM
Instruction-tuned variants of LLaMa. Use --prompt_type=instruct_vicuna or --prompt_type=wizard2 respectively.
Nous Hermes / OpenChat
Community fine-tunes with strong instruction following. Use --prompt_type=instruct or the auto-detected chat template.
GPT4All / GGUF (CPU)
Quantized GGUF models via llama.cpp and GPT4All for CPU-only or low-memory environments.
AutoGPTQ / AutoAWQ
4-bit quantized models for reduced GPU memory. AutoGPTQ uses GPTQ kernels; AutoAWQ uses activation-aware quantization.
--base_model and --prompt_type
The two most important CLI flags for model loading are --base_model and --prompt_type.
--base_model accepts:
- A HuggingFace Hub model name, e.g. HuggingFaceH4/zephyr-7b-beta
- A local filesystem path, e.g. /path/to/my-model
- The special token llama to load a llama.cpp GGUF file
- The special token gptj or gpt4all_llama for GPT4All models
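As a sketch of the first form, assuming the standard generate.py entry point, loading a Hub model by name might look like this (the Zephyr model name comes from the examples above):

```shell
# Load a HuggingFace Hub model by name; weights are downloaded and cached automatically.
python generate.py --base_model=HuggingFaceH4/zephyr-7b-beta
```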
--prompt_type tells h2oGPT how to format prompts for the model. It must match the model’s training format. Common values are:
| Value | Used for |
|---|---|
| llama2 | LLaMa-2 chat models |
| instruct | Generic instruct-tuned models |
| instruct_vicuna | Vicuna-style models |
| human_bot | h2oGPT-trained models |
| zephyr | Zephyr models |
| openai_chat | OpenAI chat API or Ollama |
| plain | No special formatting |
If the model ships a chat template in its tokenizer_config.json, h2oGPT auto-detects the template and you can omit --prompt_type.
When pointing --base_model to a local path, you must specify --prompt_type explicitly:
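A minimal sketch, assuming the standard generate.py entry point (the path and prompt type are illustrative placeholders):

```shell
# Local checkpoint: no Hub metadata to infer the template from, so pass --prompt_type explicitly.
python generate.py --base_model=/path/to/my-model --prompt_type=instruct
```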
GGUF / llama.cpp models
GGUF (quantized) models run via llama.cpp. Use --base_model=llama and supply the model path or URL via --model_path_llama:
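For example, assuming the standard generate.py entry point (the GGUF URL below is an illustrative community quantization, not prescribed by this document):

```shell
# Run a quantized GGUF file through llama.cpp; --model_path_llama accepts a path or URL.
python generate.py --base_model=llama \
    --model_path_llama=https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/resolve/main/zephyr-7b-beta.Q5_K_M.gguf \
    --prompt_type=zephyr
```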
AutoGPTQ models
AutoGPTQ provides 4-bit quantized models with optional RoPE scaling for extended context. Pass --load_gptq with the model’s weight file name:
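A sketch, assuming the standard generate.py entry point (the GPTQ repo name and the weight file name `model` are illustrative assumptions):

```shell
# 4-bit GPTQ checkpoint; --load_gptq names the weight file inside the repo.
python generate.py --base_model=TheBloke/zephyr-7B-beta-GPTQ \
    --load_gptq=model --use_safetensors=True --prompt_type=zephyr
```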
AutoAWQ models
AutoAWQ (activation-aware weight quantization) offers efficient 4-bit quantization with good quality. Use --load_awq:
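Analogously to the GPTQ case, a sketch assuming the standard generate.py entry point (repo and weight file names are illustrative):

```shell
# 4-bit AWQ checkpoint; --load_awq names the weight file inside the repo.
python generate.py --base_model=TheBloke/zephyr-7B-beta-AWQ \
    --load_awq=model --use_safetensors=True --prompt_type=zephyr
```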
GPT4All models
GPT4All models are downloaded automatically at runtime into .cache:
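Using the special tokens listed earlier, a minimal sketch (assuming the standard generate.py entry point):

```shell
# CPU-only GPT4All model; the weight file is fetched into .cache on first run.
python generate.py --base_model=gpt4all_llama
```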
The gptj GPT4All model sometimes produces no output. The gpt4all_llama variant is generally more reliable.
Model download and management via UI
You do not have to specify models on the CLI. When h2oGPT is running, open the Models tab in the UI:
- Enter the HuggingFace model name (same as --base_model) in the base model field.
- Optionally enter a server URL for remote inference.
- Click Add new Model, Lora, Server url:port.
- Select the model in the dropdown and click Load-Unload to activate it.