Overview
Groq provides blazing-fast LLM inference for popular open-source models. LiteLLM integrates seamlessly with Groq's API, supporting all major features including streaming, function calling, and reasoning models.

Quick Start
Supported Models
- Llama Models: Meta's Llama family on Groq's infrastructure (e.g. `llama-3.3-70b-versatile`, `llama-3.1-8b-instant`)
- Mixtral Models: Mistral AI's mixture-of-experts models (e.g. `mixtral-8x7b-32768`)
- Gemma Models: Google's lightweight Gemma family
Authentication
- Environment Variable: set `GROQ_API_KEY`; LiteLLM picks it up automatically
- Direct Parameter: pass `api_key` on each call
- Custom Base URL: pass `api_base` to route through a proxy or alternate endpoint
Streaming
Groq excels at fast streaming responses.

Function Calling
Groq supports OpenAI-compatible function calling.

JSON Mode
- JSON Object
- JSON Schema
Reasoning Models
Groq supports the `reasoning_effort` parameter (`low`/`medium`/`high`) for compatible models.

Audio Transcription
Groq supports Whisper models for audio transcription.

Configuration
Supported Parameters
| Parameter | Type | Description |
|---|---|---|
| `temperature` | float | Randomness (0-2) |
| `max_tokens` | int | Max output tokens |
| `max_completion_tokens` | int | Alternative to `max_tokens` |
| `top_p` | float | Nucleus sampling |
| `frequency_penalty` | float | Reduce repetition (-2 to 2) |
| `presence_penalty` | float | Encourage diversity (-2 to 2) |
| `stop` | list/str | Stop sequences |
| `n` | int | Number of completions |
| `response_format` | dict | JSON mode settings |
| `reasoning_effort` | str | Reasoning level (`low`/`medium`/`high`) |
Error Handling
LiteLLM Proxy
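A minimal proxy config sketch exposing a Groq model through the LiteLLM proxy (the alias `groq-llama` is arbitrary; `os.environ/GROQ_API_KEY` is LiteLLM's syntax for reading the key from the environment):

```yaml
model_list:
  - model_name: groq-llama
    litellm_params:
      model: groq/llama-3.3-70b-versatile
      api_key: os.environ/GROQ_API_KEY
```

Start the proxy with `litellm --config config.yaml`, then point any OpenAI-compatible client at it using `model: groq-llama`.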
Best Practices
Speed Optimization
- Groq is optimized for speed - use streaming for best UX
- Use smaller models (8B) for simple tasks
- Use larger models (70B+) for complex reasoning
Model Selection
- `llama-3.3-70b-versatile` for best overall performance
- `llama-3.1-8b-instant` for fast, simple tasks
- `mixtral-8x7b-32768` for large context windows
Rate Limits
- Groq has generous rate limits but monitor usage
- Implement exponential backoff for retries
- Use LiteLLM’s built-in retry logic