h2oGPT supports CPU-only inference via llama.cpp (GGUF models) and GPT4All. CPU inference is slower than GPU but requires no NVIDIA hardware and works on any machine.
llama.cpp / GGUF models
GGUF is the quantized model format used by llama.cpp. It is the recommended approach for CPU inference. Models are available in multiple quantization levels (e.g. Q4_K_M, Q5_K_M, Q6_K) — higher Q values mean better quality but larger file size and slower inference.
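The quantization tag itself encodes the nominal bits per weight, which is why higher Q values mean larger files. A small illustrative helper (the function name is ours, not part of h2oGPT or llama.cpp):

```python
import re

def quant_bits(tag: str) -> int:
    """Extract the nominal bits-per-weight from a GGUF quantization tag
    such as 'Q4_K_M', 'Q5_K_M', or 'Q6_K'."""
    m = re.match(r"Q(\d+)", tag.upper())
    if not m:
        raise ValueError(f"unrecognized quantization tag: {tag}")
    return int(m.group(1))

# Higher nominal bit counts trade larger files and slower inference
# for better output quality.
levels = ["Q6_K", "Q4_K_M", "Q5_K_M"]
print(sorted(levels, key=quant_bits))  # ['Q4_K_M', 'Q5_K_M', 'Q6_K']
```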
Quick start with LLaMa-2 7B
Download the GGUF model
wget https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q6_K.gguf
Run with h2oGPT
With documents in a user_path folder:
python generate.py \
--base_model=llama \
--prompt_type=llama2 \
--score_model=None \
--langchain_mode=UserData \
--user_path=user_path
Loading a GGUF model from a URL
You can pass a HuggingFace download URL directly via --model_path_llama:
python generate.py \
--base_model=llama \
--model_path_llama=https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q6_K.gguf \
--score_model=None \
--langchain_mode=UserData \
--user_path=user_path
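When given a URL, h2oGPT downloads the GGUF file and caches it locally. A minimal sketch of mapping a download URL to a cache filename (illustrative only; the function name and cache layout here are assumptions, not h2oGPT's actual code):

```python
from pathlib import Path
from urllib.parse import urlparse

def local_gguf_name(url: str, cache_dir: str = "llamacpp_path") -> Path:
    """Map a GGUF download URL to a local cache path by taking the
    last path component of the URL (illustrative sketch)."""
    filename = Path(urlparse(url).path).name
    return Path(cache_dir) / filename

url = ("https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/"
       "resolve/main/llama-2-7b-chat.Q6_K.gguf")
print(local_gguf_name(url))
```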
Low-memory and slow CPU settings
For machines with limited RAM or slow CPUs, reduce memory locking and batch size, and limit the context window:
python generate.py \
--base_model=llama \
--model_path_llama=https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q6_K.gguf \
--llamacpp_dict="{'use_mlock':False,'n_batch':256}" \
--max_seq_len=512 \
--score_model=None \
--langchain_mode=UserData \
--user_path=user_path
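Note that --llamacpp_dict takes a Python dict literal as a string. A sketch of how such a value can be parsed safely (h2oGPT's own argument handling may differ):

```python
import ast

def parse_llamacpp_dict(value: str) -> dict:
    """Safely parse a Python dict literal passed on the command line,
    e.g. "{'use_mlock':False,'n_batch':256}". ast.literal_eval accepts
    only literals, so arbitrary code in the string cannot execute."""
    parsed = ast.literal_eval(value)
    if not isinstance(parsed, dict):
        raise ValueError("expected a dict literal")
    return parsed

opts = parse_llamacpp_dict("{'use_mlock':False,'n_batch':256}")
print(opts)  # {'use_mlock': False, 'n_batch': 256}
```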
Key options via --llamacpp_dict:
| Option | Description |
|---|---|
| use_mlock | Lock model in RAM (set False to allow paging) |
| n_batch | Prompt processing batch size (lower = less RAM pressure) |
| n_gpu_layers | Number of layers to offload to GPU (0 = CPU only) |
Controlling context length
Pass --max_seq_len to set a smaller context window for faster CPU processing:
python generate.py --base_model=llama --max_seq_len=2048
If --max_seq_len is not set, h2oGPT uses a large default value that will be slow on CPU.
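The context window matters for memory as well as speed, because the KV cache grows linearly with it. A rough back-of-the-envelope estimate, assuming Llama-2-7B-like dimensions (32 layers, 4096 hidden size) and 16-bit cache entries; these numbers are assumptions for illustration:

```python
def kv_cache_bytes(n_ctx: int, n_layers: int = 32, n_embd: int = 4096,
                   bytes_per_elt: int = 2) -> int:
    """Approximate KV-cache size: two tensors (K and V) per layer,
    each n_ctx x n_embd elements."""
    return 2 * n_layers * n_ctx * n_embd * bytes_per_elt

for n_ctx in (512, 2048, 4096):
    print(f"n_ctx={n_ctx}: ~{kv_cache_bytes(n_ctx) / 1e6:.0f} MB")
```

This is why --max_seq_len=512 noticeably lowers RAM pressure on constrained machines.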
LLaMa-3 and models with chat templates
For models that use a HuggingFace chat template (LLaMa-3, Phi-3, etc.), pass the HF tokenizer alongside the GGUF file to ensure accurate prompting:
python generate.py \
--base_model=llama \
--model_path_llama=https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct.Q5_K_M.gguf \
--tokenizer_base_model=meta-llama/Meta-Llama-3-8B-Instruct \
--max_seq_len=8192
For Phi-3:
python generate.py \
--base_model=llama \
--model_path_llama=https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-q4.gguf \
--tokenizer_base_model=microsoft/Phi-3-mini-4k-instruct \
--max_seq_len=4096
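Passing --tokenizer_base_model lets the Hugging Face tokenizer's chat template build the prompt instead of h2oGPT's built-in prompt types. As a toy illustration of what a chat template produces, here is a hand-written approximation of Llama-3's prompt layout (in practice, transformers' tokenizer.apply_chat_template() should be used; the function below is ours):

```python
def llama3_style_prompt(messages):
    """Build a Llama-3-style chat prompt from role/content messages.
    Special-token strings are written out by hand for illustration."""
    parts = ["<|begin_of_text|>"]
    for m in messages:
        parts.append(f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
                     f"{m['content']}<|eot_id|>")
    # End with an open assistant header so the model continues from here.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

print(llama3_style_prompt([{"role": "user", "content": "Hello"}]))
```

Getting these token boundaries wrong is exactly the kind of silent prompt error that passing the real tokenizer avoids.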
GPT4All models
GPT4All provides a simple interface for running quantized models without managing GGUF files manually. Models download automatically at runtime into .cache.
Browse available models at the GPT4All model explorer.
GPT-J model
python generate.py \
--base_model=gptj \
--model_path_gptj=ggml-gpt4all-j-v1.3-groovy.bin \
--score_model=None \
--langchain_mode=UserData \
--user_path=user_path
LLaMa-based GPT4All
python generate.py \
--base_model=gpt4all_llama \
--model_name_gpt4all_llama=ggml-wizardLM-7B.q4_2.bin \
--score_model=None \
--langchain_mode=UserData \
--user_path=user_path
The gptj model often produces no output even outside h2oGPT. The gpt4all_llama variant is generally more reliable for CPU inference.
Apple Silicon (Metal / MPS)
llama.cpp supports Metal acceleration on Apple Silicon. h2oGPT runs GGUF models with MPS (Metal Performance Shaders) on M1 and M2 chips, offering significant speedups over pure CPU execution.
Install h2oGPT following the macOS installation guide. When llama-cpp-python is compiled with Metal support, MPS acceleration is used automatically.
On Apple Silicon, GGUF models at Q4_K_M or Q5_K_M quantization provide a good balance of speed and quality for Metal inference.
Building llama.cpp with CPU optimizations
The llama.cpp build can be tuned for your CPU architecture using CMAKE_ARGS. The following example enables AVX2, F16C, and FMA for modern x86_64 CPUs:
CMAKE_ARGS="-DLLAMA_AVX2=ON -DLLAMA_F16C=ON -DLLAMA_FMA=ON" pip install llama-cpp-python
For macOS Metal:
CMAKE_ARGS="-DLLAMA_METAL=ON" pip install llama-cpp-python
For CPU-only (no CUDA, no Metal):
CMAKE_ARGS="-DLLAMA_CUBLAS=OFF" pip install llama-cpp-python
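Before enabling AVX2/FMA, you can check whether your CPU advertises those features. A minimal Linux-only sketch reading /proc/cpuinfo-style text (the helper is ours; macOS would need sysctl instead):

```python
def cpu_has_flag(flag: str, cpuinfo_text: str) -> bool:
    """Check for a CPU feature flag in /proc/cpuinfo-style text,
    where a 'flags' line lists space-separated feature names."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return flag in line.split(":", 1)[1].split()
    return False

# On Linux, pass open("/proc/cpuinfo").read(); a sample line for testing:
sample = "flags\t\t: fpu vme avx avx2 fma f16c"
print(cpu_has_flag("avx2", sample))  # True
```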
Once llama-cpp-python is compiled with CUDA support, it will not work in CPU-only mode. If you switch between GPU and CPU environments, maintain separate Python environments or reinstall the package with the appropriate CMAKE_ARGS.
Memory requirements
Approximate RAM required for GGUF models at common quantization levels:
| Model size | Q4_K_M | Q5_K_M | Q6_K |
|---|---|---|---|
| 7B | ~4.1 GB | ~4.8 GB | ~5.5 GB |
| 13B | ~7.5 GB | ~8.7 GB | ~10.1 GB |
| 70B | ~38 GB | ~44 GB | ~51 GB |
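These figures roughly follow parameter count × bits per weight / 8. A sketch using approximate effective bits-per-weight values for each K-quant (the BPW numbers are assumptions drawn from typical llama.cpp file sizes, and real usage adds KV-cache and runtime overhead on top of the file size):

```python
# Approximate effective bits per weight for common K-quants (assumed values;
# K-quants mix precisions, so the effective rate exceeds the nominal Q number).
BPW = {"Q4_K_M": 4.85, "Q5_K_M": 5.69, "Q6_K": 6.56}

def model_file_gb(n_params: float, quant: str) -> float:
    """Rough GGUF file size in decimal GB for a given parameter count."""
    return n_params * BPW[quant] / 8 / 1e9

# Llama-2-7B has ~6.74e9 parameters.
for q in BPW:
    print(f"7B {q}: ~{model_file_gb(6.74e9, q):.1f} GB")
```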
For additional low-memory strategies, see the Low Memory section in the FAQ.