## Common errors
### UI loads forever / blank screen in Chrome
### TypeError: Chroma.init() got an unexpected keyword argument 'anonymized_telemetry'
This error usually means an incompatible version of `langchain` or `chromadb` is installed.

Fix: Check `requirements.txt` for the correct pinned version and reinstall:
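A sketch of the reinstall, assuming the pins live in h2oGPT's `requirements.txt` (the `grep` just surfaces them for inspection):

```shell
# Show which langchain/chromadb versions the repo pins, then reinstall them.
grep -E 'langchain|chromadb' requirements.txt
pip uninstall -y langchain chromadb
pip install -r requirements.txt
```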
### ValueError: offload — weights offloaded to disk
Fix: Lower `--max_seq_len` so the weights fit in GPU memory instead of being offloaded to disk.
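For example, assuming `generate.py` is the launcher (the model name is illustrative), a lower context length reduces memory pressure:

```shell
# Halving the context length shrinks the KV cache and activation memory.
python generate.py --base_model=h2oai/h2ogpt-4096-llama2-7b-chat --max_seq_len=2048
```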
### CUDA error 704 — peer access already enabled (multi-GPU GGUF)
This is a known failure mode of llama.cpp on some multi-GPU systems.

Fix: Restrict to a single GPU:
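A minimal sketch: pin the process to one device before launching (device index 0 is an example):

```shell
# Only GPU 0 will be visible, so llama.cpp never attempts peer access.
export CUDA_VISIBLE_DEVICES=0
```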
### GGML version mismatch — 'Is this really a GGML file?'
The model file uses an older GGML format version; current llama-cpp-python only supports version 3.

Fix option 1: Downgrade llama-cpp-python to the version that supports the old format:
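A sketch of the downgrade; the version bound is illustrative and should be replaced with the release matching your file's GGML version:

```shell
# Older 0.1.x releases predate the GGML-v3-only loader.
pip uninstall -y llama-cpp-python
pip install 'llama-cpp-python<0.2' --no-cache-dir
```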
### Assertion failed: srcIndex < srcSelectDimSize (distilgpt2 and similar)
This assertion fires when the input is longer than the model's maximum sequence length, so token indices overflow the embedding table; distilgpt2 is a known example.

Fix: Enable truncation:
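The exact h2oGPT flag is not preserved above; conceptually, truncation just clips token ids to the model's limit (1024 positions for distilgpt2), e.g.:

```python
def truncate_ids(token_ids, max_length=1024):
    """Clip a token-id sequence to the model's maximum sequence length
    (1024 for distilgpt2) so embedding/position lookups stay in range."""
    return token_ids[:max_length]

safe_ids = truncate_ids(list(range(2000)))  # only the first 1024 ids survive
```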
### CPU memory error — 'not enough memory: you tried to allocate N bytes'
## GPU/CUDA issues
### vLLM fails with 'cuda>=12.4' driver requirement error
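When debugging this, it helps to confirm what driver the host actually exposes (requires `nvidia-smi` on PATH):

```shell
# Prints the installed NVIDIA driver version; compare it against
# the CUDA >= 12.4 requirement that vLLM reports.
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```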
### bitsandbytes CUDA Setup failed despite GPU being available
The installed CUDA toolkit does not match the bitsandbytes version.

Fix: Check and select the correct CUDA alternative on Ubuntu, then reinstall bitsandbytes to match your CUDA version, or uninstall it to disable 4-bit/8-bit support:
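A sketch, assuming CUDA was installed through Ubuntu's alternatives system:

```shell
# Pick the CUDA toolkit version that matches your bitsandbytes build...
sudo update-alternatives --config cuda
# ...then reinstall bitsandbytes against it:
pip install --upgrade --force-reinstall bitsandbytes
# Or drop 4-bit/8-bit support entirely:
pip uninstall -y bitsandbytes
```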
### llama.cpp + XTTS audio streaming: 'CUDA error: an illegal memory access was encountered'
With llama_cpp_python >= 0.2.76, thread safety is degraded: concurrent XTTS audio streaming and GGUF token streaming can cause this crash. h2oGPT has a built-in workaround that serializes these operations, but it adds latency.

Fix option 1: Use an inference server (vLLM, Ollama, etc.) instead of llama.cpp directly.

Fix option 2: Downgrade llama_cpp_python to 0.2.26:
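The downgrade itself (the version pin comes from the text above; if your install used a custom wheel index for CUDA builds, reuse it here):

```shell
pip install 'llama_cpp_python==0.2.26' --force-reinstall --no-cache-dir
```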
### Heterogeneous GPU peer-to-peer errors
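A common mitigation for peer-to-peer failures between dissimilar GPUs (an assumption here, not taken from h2oGPT's own guidance) is to disable P2P transfers:

```shell
# NCCL falls back to staging copies through host memory when P2P is disabled.
export NCCL_P2P_DISABLE=1
```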
## Memory issues
### Pinned memory error: 'failed to allocate N MB of pinned memory: out of memory'
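For llama.cpp-backed models, one likely mitigation (an assumption, since the original fix text is not preserved) is to disable pinned host buffers:

```shell
# llama.cpp honors GGML_CUDA_NO_PINNED and falls back to pageable host memory.
export GGML_CUDA_NO_PINNED=1
```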
### TEI returns '413 Payload Too Large'
The request exceeds the server's `--max-batch-tokens` limit. This is common on smaller GPUs like the Tesla T4.

Fix: Reduce the batch size via the TEI_MAX_BATCH_SIZE environment variable. Note that client_batch_size × 512 must be less than or equal to `--max-batch-tokens`:
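The sizing rule above can be sketched as a small helper (the function name is mine, not a TEI API):

```python
def max_client_batch_size(max_batch_tokens: int, tokens_per_item: int = 512) -> int:
    """Largest TEI_MAX_BATCH_SIZE such that batch_size * 512 <= --max-batch-tokens."""
    return max_batch_tokens // tokens_per_item

# e.g. a server started with --max-batch-tokens 16384 tolerates TEI_MAX_BATCH_SIZE=32
limit = max_client_batch_size(16384)
```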
### Ollama running slowly — 'cudaMalloc failed: out of memory'
Check ollama.log for the cudaMalloc failure, then use nvidia-smi to inspect GPU memory usage:
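A sketch of the inspection (standard nvidia-smi flags; the log path depends on how Ollama was installed):

```shell
# Which processes hold GPU memory, and how much headroom remains?
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
grep -i 'cudaMalloc' ollama.log   # log location varies by install; path illustrative
```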
### 'The model OptimizedModule is not supported' warning
## Docker issues
### Gradio fails across pods in Kubernetes (nginx / multi-pod)
### vLLM Docker container fails to start due to driver version
Make sure the host's NVIDIA driver meets vLLM's CUDA requirement, and that `--shm-size` is large enough for your model (10+ GB is typical for large models).

Example working Docker run for a 70B model on 4 GPUs:
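A sketch of such a run; the image tag, model name, and port are illustrative rather than a tested recipe:

```shell
docker run --gpus all --shm-size=10g -p 8000:8000 \
    vllm/vllm-openai:latest \
    --model meta-llama/Llama-2-70b-chat-hf \
    --tensor-parallel-size 4
```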
### llama-cpp-python fails to install on CentOS
The default GCC on CentOS is too old to build llama-cpp-python.

Fix: Install devtoolset-11 and use GCC 11:
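A sketch assuming CentOS 7 with the Software Collections repo available:

```shell
sudo yum install -y centos-release-scl
sudo yum install -y devtoolset-11
# Opens a shell with GCC 11 first on PATH; build the wheel from there.
scl enable devtoolset-11 bash
pip install llama-cpp-python --no-cache-dir
```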
### Security warning: key files accessible via Gradio's allowed_paths
Fix: Set GPT_H2O_AI=1 to enable public-instance mode, which restricts file access.
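For example (the variable name comes from the text above):

```shell
# Public-instance mode locks down access to sensitive local files via the UI.
export GPT_H2O_AI=1
```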