Overview
Groq provides lightning-fast LLM inference using its custom Language Processing Unit (LPU) technology, delivering speeds of 500+ tokens per second. It is well suited to applications that require ultra-low-latency responses with popular open-source models.
Base URL: https://api.groq.com/openai/v1
Supported Features
- ✅ Chat Completions
- ✅ Streaming (extremely fast)
- ✅ Function Calling
- ✅ Vision (select models)
- ✅ JSON Mode
- ❌ Embeddings
- ❌ Image Generation
- ❌ Fine-tuning
Quick Start
Chat Completions
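Groq's API is OpenAI-compatible, so any OpenAI-style client works against the base URL above. Below is a minimal sketch using only the Python standard library; `build_chat_request` and `send_chat` are illustrative helpers (not part of any SDK), and a `GROQ_API_KEY` environment variable is assumed:

```python
import json
import os
import urllib.request

GROQ_BASE_URL = "https://api.groq.com/openai/v1"  # OpenAI-compatible endpoint

def build_chat_request(model, messages, **params):
    """Assemble an OpenAI-compatible chat completion payload."""
    return {"model": model, "messages": messages, **params}

def send_chat(payload):
    """POST the payload to Groq; requires GROQ_API_KEY in the environment."""
    req = urllib.request.Request(
        f"{GROQ_BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['GROQ_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

payload = build_chat_request(
    "llama-3.1-8b-instant",
    [{"role": "user", "content": "Explain LPUs in one sentence."}],
)
# reply = send_chat(payload)["choices"][0]["message"]["content"]
```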
Ultra-Fast Streaming
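Streaming is enabled by adding `"stream": true` to the chat payload; chunks then arrive as server-sent events. A sketch of parsing the `data:` lines into content deltas (the sample chunks below are illustrative, not real API output):

```python
import json

def iter_stream_chunks(lines):
    """Parse server-sent-event lines ('data: {...}') from a streaming
    chat completion and yield content deltas as they arrive."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        delta = json.loads(data)["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]

sample = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_chunks(sample)))  # → Hello
```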
Available Models
Meta Llama
| Model | Context | Speed | Description |
|---|---|---|---|
| llama-3.3-70b-versatile | 128K | Ultra-fast | Latest Llama 3.3 |
| llama-3.1-70b-versatile | 128K | Ultra-fast | Llama 3.1 70B |
| llama-3.1-8b-instant | 128K | Instant | Fastest Llama |
| llama-3.2-90b-vision-preview | 128K | Fast | Vision-enabled |
| llama-3.2-11b-vision-preview | 128K | Very fast | Smaller vision |
Mixtral
| Model | Context | Speed | Description |
|---|---|---|---|
| mixtral-8x7b-32768 | 32K | Ultra-fast | Efficient MoE |
Google Gemma
| Model | Context | Speed | Description |
|---|---|---|---|
| gemma2-9b-it | 8K | Very fast | Gemma 2 9B |
| gemma-7b-it | 8K | Very fast | Gemma 7B |
Other Models
| Model | Context | Description |
|---|---|---|
| llama-guard-3-8b | 8K | Content moderation |
| llama3-groq-70b-8192-tool-use-preview | 8K | Tool use optimized |
Groq excels at:
- Ultra-low latency - 500+ tokens/second
- Streaming speed - Nearly instant response start
- Consistent performance - Predictable latency
- Real-time applications - Chat, assistants, games
- High throughput - Handle many concurrent requests
Configuration Options
| Header | Description | Required |
|---|---|---|
| Authorization | Groq API key | Yes |
Advanced Features
Function Calling
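Function calling uses the OpenAI-style `tools` schema. A sketch with a hypothetical `get_weather` tool (the tool name and parameters are illustrative):

```python
def build_tool_call_request(model, messages, tools):
    """Chat payload with OpenAI-style tool definitions attached."""
    return {
        "model": model,
        "messages": messages,
        "tools": tools,
        "tool_choice": "auto",  # let the model decide when to call a tool
    }

weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool name
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

payload = build_tool_call_request(
    "llama-3.3-70b-versatile",
    [{"role": "user", "content": "Weather in Oslo?"}],
    [weather_tool],
)
```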
Vision (Multimodal)
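The vision-preview models accept OpenAI-style content parts that mix text and image URLs. A sketch (the image URL is a placeholder):

```python
def build_vision_message(text, image_url):
    """User message mixing text and an image, OpenAI content-parts style."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

payload = {
    "model": "llama-3.2-11b-vision-preview",
    "messages": [
        build_vision_message("Describe this image.",
                             "https://example.com/photo.jpg")  # placeholder URL
    ],
}
```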
JSON Mode
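JSON mode is requested via `response_format`. As with other OpenAI-compatible APIs, the prompt itself should typically mention JSON explicitly. A sketch:

```python
payload = {
    "model": "llama-3.1-8b-instant",
    "messages": [
        # JSON mode generally requires the word "JSON" in the prompt
        {"role": "system",
         "content": "Reply in JSON with keys 'name' and 'year'."},
        {"role": "user", "content": "Who founded Groq and when?"},
    ],
    "response_format": {"type": "json_object"},
}
```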
Temperature Control
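A sketch of how temperature trades determinism for variety (the values are illustrative defaults, not recommendations from Groq):

```python
# Lower temperature → more deterministic; higher → more varied output.
factual = {
    "model": "llama-3.1-8b-instant",
    "messages": [{"role": "user", "content": "Capital of Norway?"}],
    "temperature": 0.2,
}
creative = {
    "model": "llama-3.1-8b-instant",
    "messages": [{"role": "user", "content": "Write a haiku about speed."}],
    "temperature": 0.9,
}
```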
Max Tokens Control
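`max_tokens` caps the completion length, which bounds both latency and cost. A sketch:

```python
payload = {
    "model": "llama-3.1-8b-instant",
    "messages": [{"role": "user", "content": "Summarize LPUs briefly."}],
    "max_tokens": 150,  # cap the completion length (and therefore cost)
}
```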
Speed Comparison
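No official benchmark numbers are reproduced here, but throughput is easy to measure yourself from the response's `usage` block and wall-clock time. A sketch with illustrative numbers:

```python
def tokens_per_second(completion_tokens, elapsed_seconds):
    """Throughput computed from usage.completion_tokens and wall-clock time."""
    return completion_tokens / elapsed_seconds

# e.g. a 400-token completion returned in 0.8 s (illustrative numbers):
print(tokens_per_second(400, 0.8))  # → 500.0
```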
Fallback Configuration
Use Groq first for speed, then fall back to other providers if Groq is unavailable.
Load Balancing
Balance requests across multiple Groq models.
Error Handling
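The fallback, load-balancing, and error-handling patterns above can be sketched together; `send(provider)` is a hypothetical callable that performs the actual request and raises on failure (e.g. HTTP 429):

```python
import itertools
import time

def call_with_fallback(providers, send, max_retries=3, backoff=0.5):
    """Try providers in order; retry transient failures with exponential
    backoff before falling back to the next provider in the list."""
    for provider in providers:
        for attempt in range(max_retries):
            try:
                return send(provider)
            except Exception:
                time.sleep(backoff * 2 ** attempt)  # back off, then retry
    raise RuntimeError("all providers failed")

# Simple round-robin load balancing across Groq models:
models = itertools.cycle(["llama-3.1-8b-instant", "llama-3.1-70b-versatile"])
# use next(models) to pick a model for each request
```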
Best Practices
- Leverage speed - Build real-time features
- Use streaming - Take advantage of instant response start
- Enable function calling - Fast tool use
- Use 8B for simple tasks - Instant responses
- Use 70B for complex tasks - Still very fast
- Implement rate limit handling - Free tier has limits
- Monitor latency - Groq provides latency metrics
- Cache when possible - Even faster responses
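The caching practice above can be sketched as a memoization layer over identical requests; this is only safe with deterministic settings (e.g. temperature 0), and `send` is a hypothetical function that performs the real API call:

```python
import json

_cache = {}

def cached_chat(payload, send):
    """Return a cached response for byte-identical requests;
    call `send(payload)` only on a cache miss."""
    key = json.dumps(payload, sort_keys=True)  # stable key for the request
    if key not in _cache:
        _cache[key] = send(payload)
    return _cache[key]
```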
Use Cases
Real-time Chat
Code Completion
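A code-completion request typically pairs a small, fast model with low temperature and stop sequences. An illustrative payload (the prompt and stop strings are examples, not a prescribed format):

```python
payload = {
    "model": "llama-3.1-8b-instant",  # smallest model: lowest latency
    "messages": [{"role": "user",
                  "content": "Complete this Python function:\ndef fib(n):"}],
    "temperature": 0.2,               # keep completions near-deterministic
    "max_tokens": 64,
    "stop": ["\ndef ", "\nclass "],   # stop at the next definition
}
```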
Gaming NPCs
Rate Limits
Free Tier:
- 30 requests per minute
- 14,400 requests per day
- Generous for development
Paid Tiers:
- Higher rate limits
- Priority access
- Contact Groq for details
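A client-side limiter can keep an application under the free tier's 30 requests/minute. A sliding-window sketch (the limits are taken from the table above; adjust for your tier):

```python
import collections
import time

class RateLimiter:
    """Sliding-window request limiter (default: 30 requests per 60 s)."""

    def __init__(self, max_requests=30, window=60.0):
        self.max_requests = max_requests
        self.window = window
        self.sent = collections.deque()  # timestamps of recent requests

    def wait(self, now=None):
        """Return seconds to sleep before the next request is allowed;
        0.0 means the request may go out immediately."""
        now = time.monotonic() if now is None else now
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()  # drop timestamps outside the window
        if len(self.sent) < self.max_requests:
            self.sent.append(now)
            return 0.0
        return self.window - (now - self.sent[0])
```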
LPU Technology
Groq’s Language Processing Unit (LPU) provides:
- Deterministic performance - Consistent latency
- Low latency - Less than 1 second for most requests
- High throughput - 500+ tokens/second
- Energy efficient - Lower power consumption
- Scalable - Handle large workloads
Pricing
Groq offers very competitive pricing.
Groq Pricing
View detailed pricing for all Groq models
Getting Started
- Sign up at Groq Console
- Get your API key
- Start with free tier
- Experience the speed!
Related Resources
Together AI
Alternative open models
Anyscale
Another fast inference option
Streaming
Optimize streaming responses
Real-time Apps
Build real-time applications