
Overview

LiteLLM supports HuggingFace models through multiple deployment options: the serverless Inference API, dedicated Inference Endpoints, and provider-specific routing.

Quick Start

1. Install LiteLLM

pip install litellm

2. Set API Key

export HUGGINGFACE_API_KEY="hf_..."

3. Make Your First Call

from litellm import completion

response = completion(
    model="huggingface/meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

Deployment Options

Use HuggingFace’s serverless Inference API.
from litellm import completion

response = completion(
    model="huggingface/meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Explain AI"}]
)
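The overview also mentions dedicated Inference Endpoints. LiteLLM's `completion()` accepts an `api_base` keyword for pointing at a custom endpoint; the helper below is an illustrative sketch (the endpoint URL is a placeholder) that assembles those keyword arguments without making a live call.

```python
def endpoint_call_kwargs(model: str, api_base: str, messages: list) -> dict:
    """Assemble completion() kwargs for a dedicated HuggingFace endpoint."""
    return {
        "model": f"huggingface/{model}",
        "messages": messages,
        "api_base": api_base,  # your dedicated endpoint URL
    }

kwargs = endpoint_call_kwargs(
    "meta-llama/Llama-3.3-70B-Instruct",
    "https://your-endpoint.huggingface.cloud",  # placeholder
    [{"role": "user", "content": "Explain AI"}],
)
# Then: from litellm import completion; completion(**kwargs)
print(kwargs["model"])
```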

Authentication

export HUGGINGFACE_API_KEY="hf_..."
# For dedicated endpoints, also point LiteLLM at your endpoint URL
export HF_API_BASE="https://your-endpoint.huggingface.cloud"
from litellm import completion

response = completion(
    model="huggingface/meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)

Chat Completions

from litellm import completion

response = completion(
    model="huggingface/meta-llama/Llama-3.3-70B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing"}
    ],
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)

Streaming

from litellm import completion

response = completion(
    model="huggingface/meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Write a story"}],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
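The loop above prints each delta as it arrives. To keep the full text as well, you can accumulate the deltas; the sketch below demonstrates this with stand-in chunk objects shaped like LiteLLM's streaming chunks (illustrative only, no live call).

```python
from types import SimpleNamespace

def collect_stream(chunks) -> str:
    """Accumulate delta content from streamed chunks into one string."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:  # final chunks may carry no content
            parts.append(delta)
    return "".join(parts)

def fake_chunk(text):
    """Stand-in for a streaming chunk (illustrative only)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])

print(collect_stream([fake_chunk("Once "), fake_chunk("upon"), fake_chunk(None)]))
# prints: Once upon
```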

Embeddings

HuggingFace supports various embedding models.
from litellm import embedding

response = embedding(
    model="huggingface/sentence-transformers/all-MiniLM-L6-v2",
    input=["Text to embed", "Another text"]
)

embeddings = [data.embedding for data in response.data]
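A common next step with embeddings is comparing them. The sketch below computes cosine similarity in plain Python; the toy two-dimensional vectors are illustrative stand-ins (real `all-MiniLM-L6-v2` embeddings are 384-dimensional).

```python
import math

def cosine_similarity(a, b) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy vectors for illustration:
print(round(cosine_similarity([1.0, 0.0], [1.0, 0.0]), 3))  # 1.0 (identical)
print(round(cosine_similarity([1.0, 0.0], [0.0, 1.0]), 3))  # 0.0 (orthogonal)
```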

Reranking

Use HuggingFace reranking models for improved search.
from litellm import rerank

response = rerank(
    model="huggingface/BAAI/bge-reranker-v2-m3",
    query="What is machine learning?",
    documents=[
        "Machine learning is a subset of AI.",
        "Deep learning uses neural networks.",
        "Python is a programming language."
    ],
    top_n=2
)

for result in response.results:
    print(f"Score: {result.relevance_score}")
    print(f"Document: {result.document}")
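Conceptually, `top_n=2` asks the reranker to score every document against the query and return only the two highest-scoring ones. The sketch below mirrors that selection step locally with hypothetical scores, without calling the API.

```python
def top_n_by_score(scored_docs, top_n):
    """Sort (document, relevance_score) pairs descending and keep top_n,
    mirroring the selection the rerank endpoint performs."""
    return sorted(scored_docs, key=lambda pair: pair[1], reverse=True)[:top_n]

# Hypothetical scores for the documents used above:
scored = [
    ("Machine learning is a subset of AI.", 0.97),
    ("Python is a programming language.", 0.12),
    ("Deep learning uses neural networks.", 0.55),
]
for doc, score in top_n_by_score(scored, 2):
    print(f"{score:.2f}  {doc}")
```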

Provider-Specific Routing

Route requests through different inference providers.
from litellm import completion

# Fireworks AI provider
response = completion(
    model="huggingface/fireworks-ai/meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)

# Novita provider
response = completion(
    model="huggingface/novita/meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)

# HF Inference provider
response = completion(
    model="huggingface/hf-inference/meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)
Provider availability varies by model. LiteLLM validates provider support automatically.

Configuration

from litellm import completion

response = completion(
    model="huggingface/meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.8,
    max_tokens=1000,
    top_p=0.95,
    stop=["\n\n"]
)

Supported Parameters

Parameter            Type    Description
temperature          float   Randomness (0-1)
max_tokens           int     Max output tokens
top_p                float   Nucleus sampling
frequency_penalty    float   Reduce repetition
presence_penalty     float   Encourage diversity
stop                 list    Stop sequences
stream               bool    Enable streaming
Not all parameters are supported by all HuggingFace models. Check model documentation.
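One way to handle an unsupported parameter is to strip it before the call; the sketch below does this against a hypothetical supported set (LiteLLM can also do this for you globally via `litellm.drop_params = True`).

```python
# Hypothetical supported set; consult the target model's documentation.
SUPPORTED = {"temperature", "max_tokens", "top_p",
             "frequency_penalty", "presence_penalty", "stop", "stream"}

def filter_params(params: dict, supported=SUPPORTED) -> dict:
    """Keep only the parameters the target model supports."""
    return {k: v for k, v in params.items() if k in supported}

print(filter_params({"temperature": 0.7, "logit_bias": {}, "max_tokens": 100}))
# {'temperature': 0.7, 'max_tokens': 100}
```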

Error Handling

from litellm import completion
from litellm.exceptions import APIError, RateLimitError

try:
    response = completion(
        model="huggingface/meta-llama/Llama-3.3-70B-Instruct",
        messages=[{"role": "user", "content": "Hello!"}]
    )
except RateLimitError as e:
    print(f"Rate limit: {e}")
except APIError as e:
    print(f"API error: {e.status_code} - {e.message}")
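Rate-limit errors are usually transient, so a common pattern is to wrap the call in exponential-backoff retries. The sketch below uses a flaky stand-in function instead of a live `completion()` call; in practice you would pass `lambda: completion(...)`.

```python
import time

def with_retries(call, max_attempts=3, base_delay=1.0):
    """Retry a callable with exponential backoff; re-raise on final failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Demo with a stand-in that fails twice, then succeeds.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # ok
```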

LiteLLM Proxy

model_list:
  - model_name: llama-3.3-70b
    litellm_params:
      model: huggingface/meta-llama/Llama-3.3-70B-Instruct
      api_key: os.environ/HUGGINGFACE_API_KEY
  
  - model_name: custom-endpoint
    litellm_params:
      model: huggingface/https://your-endpoint.cloud
      api_key: os.environ/HF_TOKEN
import openai

client = openai.OpenAI(
    api_key="sk-1234",
    base_url="http://0.0.0.0:4000"
)

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Hello!"}]
)

Best Practices

  • Use Inference API for testing and prototyping
  • Use dedicated endpoints for production workloads
  • Check model availability on HuggingFace Hub
  • Dedicated endpoints provide better latency
  • Provider routing offers alternative inference options
  • Monitor staging vs production provider status
  • The Inference API has a free tier available
  • Dedicated endpoints are billed separately
  • Compare provider pricing when routing

Common Models

Model                                      Use Case
meta-llama/Llama-3.3-70B-Instruct          General chat
mistralai/Mixtral-8x7B-Instruct-v0.1       Advanced reasoning
sentence-transformers/all-MiniLM-L6-v2     Embeddings
BAAI/bge-large-en-v1.5                     Search embeddings
BAAI/bge-reranker-v2-m3                    Reranking
