
Overview

Groq provides blazing-fast LLM inference with support for popular open-source models. LiteLLM provides seamless integration with Groq’s API, supporting all major features including streaming, function calling, and reasoning models.

Quick Start

1. Install LiteLLM

pip install litellm
2. Set API Key

export GROQ_API_KEY="gsk_..."
3. Make Your First Call

from litellm import completion

response = completion(
    model="groq/llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

Supported Models

Meta’s Llama family on Groq’s infrastructure.
from litellm import completion

# Llama 3.3 70B - Best overall
response = completion(
    model="groq/llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)

# Llama 3.1 8B - Fast and efficient
response = completion(
    model="groq/llama-3.1-8b-instant",
    messages=[{"role": "user", "content": "Quick summary"}]
)

# Llama 4 Maverick - most capable (verify current model IDs in Groq's model list)
response = completion(
    model="groq/meta-llama/llama-4-maverick-17b-128e-instruct",
    messages=[{"role": "user", "content": "Complex analysis"}]
)

Authentication

export GROQ_API_KEY="gsk_..."
from litellm import completion

response = completion(
    model="groq/llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello!"}]
)

Streaming

Groq excels at fast streaming responses.
from litellm import completion

response = completion(
    model="groq/llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Write a story about AI"}],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
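When you need the full text after streaming finishes (to log or cache it, for example), accumulate the deltas as they arrive. A minimal sketch of that pattern, using simulated chunk objects in place of a live response:

```python
from types import SimpleNamespace

def collect_stream(response):
    """Accumulate streamed delta chunks into the full response text."""
    parts = []
    for chunk in response:
        delta = chunk.choices[0].delta.content
        if delta:  # some chunks (e.g. the final one) carry None content
            parts.append(delta)
    return "".join(parts)

# Simulated chunks standing in for a live streaming response
def fake_chunk(text):
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])

chunks = [fake_chunk("Once "), fake_chunk("upon "), fake_chunk(None), fake_chunk("a time.")]
print(collect_stream(chunks))  # Once upon a time.
```

The same `collect_stream` works on a real streaming `completion()` response, since it only touches `chunk.choices[0].delta.content`.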

Function Calling

Groq supports OpenAI-compatible function calling.
from litellm import completion

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_stock_price",
            "description": "Get the current stock price",
            "parameters": {
                "type": "object",
                "properties": {
                    "symbol": {
                        "type": "string",
                        "description": "Stock symbol, e.g. AAPL"
                    }
                },
                "required": ["symbol"]
            }
        }
    }
]

response = completion(
    model="groq/llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "What's AAPL stock price?"}],
    tools=tools
)

if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {tool_call.function.arguments}")
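To complete the loop, you execute the requested function yourself and send the result back as a `"tool"` message before calling `completion()` again. A sketch of that dispatch step, with a stubbed `get_stock_price` and a dict-shaped tool call for illustration (real `tool_calls` entries expose the same fields as attributes):

```python
import json

# Stubbed tool implementation - swap in a real data source
def get_stock_price(symbol):
    return {"symbol": symbol, "price": 123.45}

AVAILABLE_TOOLS = {"get_stock_price": get_stock_price}

def run_tool_call(tool_call):
    """Execute a requested tool call and build the follow-up 'tool' message."""
    fn = AVAILABLE_TOOLS[tool_call["function"]["name"]]
    args = json.loads(tool_call["function"]["arguments"])
    result = fn(**args)
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": json.dumps(result),
    }

# Shaped like the tool_calls entry returned by the model above
tool_call = {
    "id": "call_1",
    "function": {"name": "get_stock_price", "arguments": '{"symbol": "AAPL"}'},
}
tool_message = run_tool_call(tool_call)
# Append tool_message to the conversation and call completion() again
# so the model can produce its final answer from the tool result.
```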

JSON Mode

from litellm import completion

response = completion(
    model="groq/llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "List 3 colors in JSON"}],
    response_format={"type": "json_object"}
)

import json
data = json.loads(response.choices[0].message.content)
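Even with `response_format` set, it is worth parsing defensively: models occasionally wrap their JSON in markdown code fences. A small stdlib-only helper for that case (an assumption about failure modes, not part of LiteLLM):

```python
import json
import re

def parse_json_response(text):
    """Parse model output as JSON, tolerating a markdown code fence."""
    # Strip a ```json ... ``` (or bare ```) fence if present
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    if match:
        text = match.group(1)
    return json.loads(text)

print(parse_json_response('```json\n{"colors": ["red", "green", "blue"]}\n```'))
```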

Reasoning Models

Groq exposes reasoning output for compatible models (e.g. DeepSeek R1 distills). Support for reasoning_effort varies by model, so check Groq's model docs before relying on it.
from litellm import completion

response = completion(
    model="groq/deepseek-r1-distill-llama-70b",  # a reasoning-capable model
    messages=[{"role": "user", "content": "Solve this complex problem..."}],
    reasoning_effort="high"  # low, medium, high; support varies by model
)

# Access reasoning content (only present for reasoning models)
if getattr(response.choices[0].message, "reasoning_content", None):
    print("Reasoning:", response.choices[0].message.reasoning_content)
    print("Answer:", response.choices[0].message.content)

Audio Transcription

Groq supports Whisper for audio transcription.
from litellm import transcription

with open("audio.mp3", "rb") as audio_file:
    response = transcription(
        model="groq/whisper-large-v3",
        file=audio_file,
        language="en"
    )
    
print(response.text)

Configuration

from litellm import completion

response = completion(
    model="groq/llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    max_tokens=1000,
    top_p=0.9,
    frequency_penalty=0.5,
    presence_penalty=0.5,
    stop=["STOP"]
)

Supported Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| temperature | float | Randomness (0-2) |
| max_tokens | int | Max output tokens |
| max_completion_tokens | int | Alternative to max_tokens |
| top_p | float | Nucleus sampling |
| frequency_penalty | float | Reduce repetition (-2 to 2) |
| presence_penalty | float | Encourage diversity (-2 to 2) |
| stop | list/str | Stop sequences |
| n | int | Number of completions |
| response_format | dict | JSON mode settings |
| reasoning_effort | str | Reasoning level (low/medium/high) |

Error Handling

from litellm import completion
from litellm.exceptions import APIError, RateLimitError, Timeout

try:
    response = completion(
        model="groq/llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": "Hello!"}],
        timeout=30
    )
except RateLimitError as e:
    print(f"Rate limit: {e}")
except Timeout as e:
    print(f"Request timeout: {e}")
except APIError as e:
    print(f"API error: {e.status_code} - {e.message}")

LiteLLM Proxy

Define the model in your proxy config.yaml:
model_list:
  - model_name: llama-3.3-70b
    litellm_params:
      model: groq/llama-3.3-70b-versatile
      api_key: os.environ/GROQ_API_KEY
Start the proxy with litellm --config config.yaml, then call it through any OpenAI-compatible client:

import openai

client = openai.OpenAI(
    api_key="sk-1234",
    base_url="http://0.0.0.0:4000"
)

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Hello!"}]
)

Best Practices

  • Groq is optimized for speed - use streaming for best UX
  • Use smaller models (8B) for simple tasks
  • Use larger models (70B+) for complex reasoning
  • llama-3.3-70b-versatile for best overall performance
  • llama-3.1-8b-instant for fast, simple tasks
  • Verify older model IDs before use (e.g. mixtral-8x7b-32768 has been deprecated by Groq)
  • Groq has generous rate limits but monitor usage
  • Implement exponential backoff for retries
  • Use LiteLLM’s built-in retry logic
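LiteLLM's num_retries parameter handles retries for you; if you roll your own loop, the exponential-backoff-with-jitter schedule mentioned above looks roughly like this (a stdlib-only sketch, not LiteLLM code):

```python
import random
import time

def backoff_delays(retries=5, base=0.5, cap=30.0):
    """Yield full-jitter exponential backoff delays in seconds."""
    for attempt in range(retries):
        yield random.uniform(0, min(cap, base * 2 ** attempt))

def call_with_retries(fn, retries=5):
    """Call fn, sleeping on the backoff schedule after each failure."""
    last_error = None
    for delay in backoff_delays(retries):
        try:
            return fn()
        except Exception as e:  # in practice, catch RateLimitError / Timeout
            last_error = e
            time.sleep(delay)
    raise last_error

# Usage sketch: wrap the completion call from Error Handling above, e.g.
# call_with_retries(lambda: completion(model="groq/llama-3.3-70b-versatile", messages=msgs))
```

Full jitter (a random delay between 0 and the exponential bound) avoids synchronized retry bursts when many clients hit a rate limit at once.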
