Overview
LiteLLM supports HuggingFace models through several deployment options: the serverless Inference API, dedicated inference endpoints, and provider-specific routing via the HuggingFace Router.
Quick Start
Set API Key
export HUGGINGFACE_API_KEY="hf_..."
Make Your First Call
from litellm import completion

response = completion(
    model="huggingface/meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
Deployment Options
Inference API
Dedicated Endpoint
Provider Routing
Use HuggingFace’s serverless Inference API.

from litellm import completion

response = completion(
    model="huggingface/meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Explain AI"}]
)
Use your own HuggingFace endpoint URL.

from litellm import completion

response = completion(
    model="huggingface/https://your-endpoint.aws.endpoints.huggingface.cloud",
    messages=[{"role": "user", "content": "Hello!"}],
    api_key="hf_..."
)
Route through specific providers via the HuggingFace Router.

from litellm import completion

# Route through Fireworks AI
response = completion(
    model="huggingface/fireworks-ai/meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)

# Route through Novita
response = completion(
    model="huggingface/novita/meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)
Authentication
Environment Variable
Direct Parameter
export HUGGINGFACE_API_KEY="hf_..."
# Or
export HF_API_BASE="https://your-endpoint.huggingface.cloud"
from litellm import completion

response = completion(
    model="huggingface/meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)
from litellm import completion

response = completion(
    model="huggingface/meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    api_key="hf_...",
    api_base="https://api-inference.huggingface.co/models"
)
Chat Completions
from litellm import completion

response = completion(
    model="huggingface/meta-llama/Llama-3.3-70B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing"}
    ],
    temperature=0.7,
    max_tokens=500
)
print(response.choices[0].message.content)
Streaming
from litellm import completion

response = completion(
    model="huggingface/meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Write a story"}],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
Embeddings
HuggingFace supports a variety of embedding models, which LiteLLM exposes through its embedding API.
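A minimal sketch of an embeddings call, using the sentence-transformers model listed under Common Models below; the exact response shape may vary slightly by LiteLLM version.

from litellm import embedding

# Embed a couple of sentences with a model hosted on the HuggingFace Hub.
response = embedding(
    model="huggingface/sentence-transformers/all-MiniLM-L6-v2",
    input=[
        "Machine learning is a subset of AI.",
        "Python is a programming language."
    ]
)
print(len(response.data))                  # number of input texts embedded
print(len(response.data[0]["embedding"]))  # dimensionality of the first vector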
Reranking
Use HuggingFace reranking models to improve search relevance.
from litellm import rerank

response = rerank(
    model="huggingface/BAAI/bge-reranker-v2-m3",
    query="What is machine learning?",
    documents=[
        "Machine learning is a subset of AI.",
        "Deep learning uses neural networks.",
        "Python is a programming language."
    ],
    top_n=2
)

for result in response.results:
    print(f"Score: {result.relevance_score}")
    print(f"Document: {result.document}")
Provider-Specific Routing
Route requests through different inference providers.
from litellm import completion

# Fireworks AI provider
response = completion(
    model="huggingface/fireworks-ai/meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)

# Novita provider
response = completion(
    model="huggingface/novita/meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)

# HF Inference provider
response = completion(
    model="huggingface/hf-inference/meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)
Provider availability varies by model. LiteLLM validates provider support automatically.
Configuration
from litellm import completion

response = completion(
    model="huggingface/meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.8,
    max_tokens=1000,
    top_p=0.95,
    stop=["\n\n"]
)
Supported Parameters
| Parameter | Type | Description |
|---|---|---|
| temperature | float | Randomness (0-1) |
| max_tokens | int | Max output tokens |
| top_p | float | Nucleus sampling |
| frequency_penalty | float | Reduce repetition |
| presence_penalty | float | Encourage diversity |
| stop | list | Stop sequences |
| stream | bool | Enable streaming |
Not all parameters are supported by every HuggingFace model. Check the model's documentation, or query LiteLLM directly as shown below.
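As a sketch, litellm.get_supported_openai_params lists which OpenAI-style parameters LiteLLM maps for the huggingface provider; the helper's exact output depends on your LiteLLM version.

import litellm

# Which OpenAI-style parameters does LiteLLM map for this provider?
# Sketch only: output depends on the installed LiteLLM version.
params = litellm.get_supported_openai_params(
    model="meta-llama/Llama-3.3-70B-Instruct",
    custom_llm_provider="huggingface"
)
print(params)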
Error Handling
from litellm import completion
from litellm.exceptions import APIError, RateLimitError

try:
    response = completion(
        model="huggingface/meta-llama/Llama-3.3-70B-Instruct",
        messages=[{"role": "user", "content": "Hello!"}]
    )
except RateLimitError as e:
    print(f"Rate limit: {e}")
except APIError as e:
    print(f"API error: {e.status_code} - {e.message}")
LiteLLM Proxy
Add HuggingFace models to your proxy config.yaml:

model_list:
  - model_name: llama-3.3-70b
    litellm_params:
      model: huggingface/meta-llama/Llama-3.3-70B-Instruct
      api_key: os.environ/HUGGINGFACE_API_KEY
  - model_name: custom-endpoint
    litellm_params:
      model: huggingface/https://your-endpoint.cloud
      api_key: os.environ/HF_TOKEN
Start the proxy (for example, litellm --config config.yaml) and call it with any OpenAI-compatible client:

import openai

client = openai.OpenAI(
    api_key="sk-1234",
    base_url="http://0.0.0.0:4000"
)

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Hello!"}]
)
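Streaming works through the proxy as well; a short sketch reusing the client above with the same model alias from the config (the prompt is illustrative).

# Stream tokens through the proxy using the standard OpenAI streaming interface.
stream = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Write a haiku"}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")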
Best Practices
Use Inference API for testing and prototyping
Use dedicated endpoints for production workloads
Check model availability on HuggingFace Hub
The Inference API has a free tier available
Dedicated endpoints are billed separately
Compare provider pricing when routing
Common Models
| Model | Use Case |
|---|---|
| meta-llama/Llama-3.3-70B-Instruct | General chat |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | Advanced reasoning |
| sentence-transformers/all-MiniLM-L6-v2 | Embeddings |
| BAAI/bge-large-en-v1.5 | Search embeddings |
| BAAI/bge-reranker-v2-m3 | Reranking |