
Overview

LiteLLM supports HuggingFace models through multiple deployment options: the serverless Inference API, dedicated Inference Endpoints, and provider-specific routing.

Quick Start

1. Install LiteLLM

pip install litellm

2. Set API Key

export HUGGINGFACE_API_KEY="hf_..."

3. Make Your First Call

from litellm import completion

response = completion(
    model="huggingface/meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

Deployment Options

Use HuggingFace’s serverless Inference API.
from litellm import completion

response = completion(
    model="huggingface/meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Explain AI"}]
)
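The overview also mentions dedicated Inference Endpoints. LiteLLM's `completion()` accepts an `api_base` keyword for pointing at a custom endpoint; the helper below is an illustrative sketch (the endpoint URL is a placeholder) that assembles those keyword arguments without making a live call.

```python
def endpoint_call_kwargs(model: str, api_base: str, messages: list) -> dict:
    """Assemble completion() kwargs for a dedicated HuggingFace endpoint."""
    return {
        "model": f"huggingface/{model}",
        "messages": messages,
        "api_base": api_base,  # your dedicated endpoint URL
    }

kwargs = endpoint_call_kwargs(
    "meta-llama/Llama-3.3-70B-Instruct",
    "https://your-endpoint.huggingface.cloud",  # placeholder
    [{"role": "user", "content": "Explain AI"}],
)
# Then: from litellm import completion; completion(**kwargs)
print(kwargs["model"])
```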

Authentication

export HUGGINGFACE_API_KEY="hf_..."
# For dedicated endpoints, also point LiteLLM at your endpoint URL
export HF_API_BASE="https://your-endpoint.huggingface.cloud"
from litellm import completion

response = completion(
    model="huggingface/meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)

Chat Completions

from litellm import completion

response = completion(
    model="huggingface/meta-llama/Llama-3.3-70B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing"}
    ],
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)

Streaming

from litellm import completion

response = completion(
    model="huggingface/meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Write a story"}],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
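The loop above prints each delta as it arrives. To keep the full text as well, you can accumulate the deltas; the sketch below demonstrates this with stand-in chunk objects shaped like LiteLLM's streaming chunks (illustrative only, no live call).

```python
from types import SimpleNamespace

def collect_stream(chunks) -> str:
    """Accumulate delta content from streamed chunks into one string."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:  # final chunks may carry no content
            parts.append(delta)
    return "".join(parts)

def fake_chunk(text):
    """Stand-in for a streaming chunk (illustrative only)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])

print(collect_stream([fake_chunk("Once "), fake_chunk("upon"), fake_chunk(None)]))
# prints: Once upon
```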

Embeddings

HuggingFace supports various embedding models.
from litellm import embedding

response = embedding(
    model="huggingface/sentence-transformers/all-MiniLM-L6-v2",
    input=["Text to embed", "Another text"]
)

embeddings = [data.embedding for data in response.data]
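A common next step with embeddings is comparing them. The sketch below computes cosine similarity in plain Python; the toy two-dimensional vectors are illustrative stand-ins (real `all-MiniLM-L6-v2` embeddings are 384-dimensional).

```python
import math

def cosine_similarity(a, b) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy vectors for illustration:
print(round(cosine_similarity([1.0, 0.0], [1.0, 0.0]), 3))  # 1.0 (identical)
print(round(cosine_similarity([1.0, 0.0], [0.0, 1.0]), 3))  # 0.0 (orthogonal)
```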

Reranking

Use HuggingFace reranking models for improved search.
from litellm import rerank

response = rerank(
    model="huggingface/BAAI/bge-reranker-v2-m3",
    query="What is machine learning?",
    documents=[
        "Machine learning is a subset of AI.",
        "Deep learning uses neural networks.",
        "Python is a programming language."
    ],
    top_n=2
)

for result in response.results:
    print(f"Score: {result.relevance_score}")
    print(f"Document: {result.document}")
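Conceptually, `top_n=2` asks the reranker to score every document against the query and return only the two highest-scoring ones. The sketch below mirrors that selection step locally with hypothetical scores, without calling the API.

```python
def top_n_by_score(scored_docs, top_n):
    """Sort (document, relevance_score) pairs descending and keep top_n,
    mirroring the selection the rerank endpoint performs."""
    return sorted(scored_docs, key=lambda pair: pair[1], reverse=True)[:top_n]

# Hypothetical scores for the documents used above:
scored = [
    ("Machine learning is a subset of AI.", 0.97),
    ("Python is a programming language.", 0.12),
    ("Deep learning uses neural networks.", 0.55),
]
for doc, score in top_n_by_score(scored, 2):
    print(f"{score:.2f}  {doc}")
```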

Provider-Specific Routing

Route requests through different inference providers.
from litellm import completion

# Fireworks AI provider
response = completion(
    model="huggingface/fireworks-ai/meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)

# Novita provider
response = completion(
    model="huggingface/novita/meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)

# HF Inference provider
response = completion(
    model="huggingface/hf-inference/meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)
Provider availability varies by model. LiteLLM validates provider support automatically.

Configuration

from litellm import completion

response = completion(
    model="huggingface/meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.8,
    max_tokens=1000,
    top_p=0.95,
    stop=["\n\n"]
)

Supported Parameters

Parameter            Type    Description
temperature          float   Randomness (0-1)
max_tokens           int     Max output tokens
top_p                float   Nucleus sampling
frequency_penalty    float   Reduce repetition
presence_penalty     float   Encourage diversity
stop                 list    Stop sequences
stream               bool    Enable streaming
Not all parameters are supported by all HuggingFace models. Check model documentation.
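One way to handle an unsupported parameter is to strip it before the call; the sketch below does this against a hypothetical supported set (LiteLLM can also do this for you globally via `litellm.drop_params = True`).

```python
# Hypothetical supported set; consult the target model's documentation.
SUPPORTED = {"temperature", "max_tokens", "top_p",
             "frequency_penalty", "presence_penalty", "stop", "stream"}

def filter_params(params: dict, supported=SUPPORTED) -> dict:
    """Keep only the parameters the target model supports."""
    return {k: v for k, v in params.items() if k in supported}

print(filter_params({"temperature": 0.7, "logit_bias": {}, "max_tokens": 100}))
# {'temperature': 0.7, 'max_tokens': 100}
```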

Error Handling

from litellm import completion
from litellm.exceptions import APIError, RateLimitError

try:
    response = completion(
        model="huggingface/meta-llama/Llama-3.3-70B-Instruct",
        messages=[{"role": "user", "content": "Hello!"}]
    )
except RateLimitError as e:
    print(f"Rate limit: {e}")
except APIError as e:
    print(f"API error: {e.status_code} - {e.message}")
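Rate-limit errors are usually transient, so a common pattern is to wrap the call in exponential-backoff retries. The sketch below uses a flaky stand-in function instead of a live `completion()` call; in practice you would pass `lambda: completion(...)`.

```python
import time

def with_retries(call, max_attempts=3, base_delay=1.0):
    """Retry a callable with exponential backoff; re-raise on final failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Demo with a stand-in that fails twice, then succeeds.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # ok
```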

LiteLLM Proxy

model_list:
  - model_name: llama-3.3-70b
    litellm_params:
      model: huggingface/meta-llama/Llama-3.3-70B-Instruct
      api_key: os.environ/HUGGINGFACE_API_KEY
  
  - model_name: custom-endpoint
    litellm_params:
      model: huggingface/https://your-endpoint.cloud
      api_key: os.environ/HF_TOKEN
import openai

client = openai.OpenAI(
    api_key="sk-1234",
    base_url="http://0.0.0.0:4000"
)

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Hello!"}]
)

Best Practices

  • Use Inference API for testing and prototyping
  • Use dedicated endpoints for production workloads
  • Check model availability on HuggingFace Hub
  • Dedicated endpoints provide better latency
  • Provider routing offers alternative inference options
  • Monitor staging vs production provider status
  • The Inference API has a free tier available
  • Dedicated endpoints are billed separately
  • Compare provider pricing when routing

Common Models

Model                                      Use Case
meta-llama/Llama-3.3-70B-Instruct          General chat
mistralai/Mixtral-8x7B-Instruct-v0.1       Advanced reasoning
sentence-transformers/all-MiniLM-L6-v2     Embeddings
BAAI/bge-large-en-v1.5                     Search embeddings
BAAI/bge-reranker-v2-m3                    Reranking
