
Overview

Groq provides lightning-fast LLM inference using its custom Language Processing Unit (LPU) technology, delivering speeds of 500+ tokens per second on supported models. It is well suited to applications that require ultra-low-latency responses from popular open-source models.

Base URL: https://api.groq.com/openai/v1

Supported Features

  • ✅ Chat Completions
  • ✅ Streaming (extremely fast)
  • ✅ Function Calling
  • ✅ Vision (select models)
  • ✅ JSON Mode
  • ❌ Embeddings
  • ❌ Image Generation
  • ❌ Fine-tuning

Quick Start

Chat Completions

from portkey_ai import Portkey

client = Portkey(
    provider="groq",
    Authorization="***"  # Your Groq API key
)

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "user", "content": "Explain Groq's LPU technology"}
    ]
)

print(response.choices[0].message.content)

Ultra-Fast Streaming

import time

start = time.time()
stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Count from 1 to 100"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

end = time.time()
print(f"\n\nCompleted in {end - start:.2f} seconds")
# Often completes in under 2 seconds!

Available Models

Meta Llama

| Model | Context | Speed | Description |
|---|---|---|---|
| llama-3.3-70b-versatile | 128K | Ultra-fast | Latest Llama 3.3 |
| llama-3.1-70b-versatile | 128K | Ultra-fast | Llama 3.1 70B |
| llama-3.1-8b-instant | 128K | Instant | Fastest Llama |
| llama-3.2-90b-vision-preview | 128K | Fast | Vision-enabled |
| llama-3.2-11b-vision-preview | 128K | Very fast | Smaller vision model |

Mixtral

| Model | Context | Speed | Description |
|---|---|---|---|
| mixtral-8x7b-32768 | 32K | Ultra-fast | Efficient MoE |

Google Gemma

| Model | Context | Speed | Description |
|---|---|---|---|
| gemma2-9b-it | 8K | Very fast | Gemma 2 9B |
| gemma-7b-it | 8K | Very fast | Gemma 7B |

Other Models

| Model | Context | Description |
|---|---|---|
| llama-guard-3-8b | 8K | Content moderation |
| llama3-groq-70b-8192-tool-use-preview | 8K | Tool use optimized |

Groq excels at:
  • Ultra-low latency - 500+ tokens/second
  • Streaming speed - Nearly instant response start
  • Consistent performance - Predictable latency
  • Real-time applications - Chat, assistants, games
  • High throughput - Handle many concurrent requests
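Throughput figures like these are easy to check yourself. A minimal sketch of a helper that computes tokens per second from a timed run; the function names are hypothetical, and counting non-empty stream deltas is only a rough proxy for the exact token count reported in the response's usage field:

```python
import time

def tokens_per_second(token_count: int, elapsed_seconds: float) -> float:
    """Compute throughput, guarding against a zero-length interval."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return token_count / elapsed_seconds

def measure_stream_throughput(stream) -> float:
    """Consume a streamed response and return approximate tokens/second.

    Treats each non-empty content delta as one token -- a rough proxy.
    """
    start = time.time()
    count = 0
    for chunk in stream:
        if chunk.choices[0].delta.content:
            count += 1
    return tokens_per_second(count, time.time() - start)

# e.g. 1000 tokens generated in 2 seconds -> 500 tokens/second
print(tokens_per_second(1000, 2.0))
```

Pass the stream returned by a `stream=True` chat completion call to `measure_stream_throughput` to benchmark a real request.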

Configuration Options

client = Portkey(
    provider="groq",
    Authorization="***"  # Bearer token
)

| Header | Description | Required |
|---|---|---|
| Authorization | Groq API key | Yes |

Advanced Features

Function Calling

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_time",
            "description": "Get the current time",
            "parameters": {
                "type": "object",
                "properties": {
                    "timezone": {
                        "type": "string",
                        "description": "Timezone name"
                    }
                },
                "required": ["timezone"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "What time is it in Tokyo?"}],
    tools=tools
)

Vision (Multimodal)

response = client.chat.completions.create(
    model="llama-3.2-90b-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://example.com/image.jpg"
                }
            }
        ]
    }]
)

JSON Mode

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{
        "role": "user",
        # JSON mode generally requires the word "JSON" to appear in the prompt
        "content": "List 5 programming languages with their release years as JSON"
    }],
    response_format={"type": "json_object"}
)

import json
result = json.loads(response.choices[0].message.content)
print(result)
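Even in JSON mode, models occasionally wrap their output in markdown code fences, which breaks a bare `json.loads`. A defensive parsing helper (hypothetical, not part of any SDK) that strips fences before decoding:

```python
import json

def parse_json_response(text: str):
    """Parse a model response as JSON, tolerating markdown code fences."""
    cleaned = text.strip()
    if cleaned.startswith("```"):
        lines = cleaned.splitlines()
        # Drop the opening fence (with optional language tag)
        lines = lines[1:]
        # Drop the closing fence if present
        if lines and lines[-1].strip() == "```":
            lines = lines[:-1]
        cleaned = "\n".join(lines)
    return json.loads(cleaned)

print(parse_json_response('```json\n{"languages": ["Python"]}\n```'))
```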

Temperature Control

# More deterministic (good for factual tasks)
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    temperature=0.0
)

# More creative
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Write a creative story"}],
    temperature=1.0
)

Max Tokens Control

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Explain quantum physics"}],
    max_tokens=500  # Limit response length
)

Speed Comparison

import time

def benchmark_provider(provider, model, prompt):
    client = Portkey(provider=provider, Authorization="***")
    
    start = time.time()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    end = time.time()
    
    return end - start

# Groq is often several times faster than comparable hosted inference
groq_time = benchmark_provider("groq", "llama-3.3-70b-versatile", "Write a haiku")
print(f"Groq: {groq_time:.2f}s")

Fallback Configuration

Use Groq first for speed, falling back to other providers:
config = {
    "strategy": {"mode": "fallback"},
    "targets": [
        {
            "provider": "groq",
            "api_key": "***",
            "override_params": {"model": "llama-3.3-70b-versatile"}
        },
        {
            "provider": "together-ai",
            "api_key": "***",
            "override_params": {"model": "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo"}
        }
    ]
}

client = Portkey().with_options(config=config)

Load Balancing

Balance across Groq models:
config = {
    "strategy": {"mode": "loadbalance"},
    "targets": [
        {
            "provider": "groq",
            "api_key": "***",
            "override_params": {"model": "llama-3.3-70b-versatile"},
            "weight": 0.7
        },
        {
            "provider": "groq",
            "api_key": "***",
            "override_params": {"model": "llama-3.1-8b-instant"},
            "weight": 0.3
        }
    ]
}

client = Portkey().with_options(config=config)

Error Handling

from portkey_ai.exceptions import (
    RateLimitError,
    APIError,
    AuthenticationError
)

try:
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": "Hello"}]
    )
except RateLimitError as e:
    print(f"Rate limit: {e}")
    # Groq has generous rate limits but they exist
except AuthenticationError as e:
    print(f"Invalid API key: {e}")
except APIError as e:
    print(f"API error: {e}")

Best Practices

  1. Leverage speed - Build real-time features
  2. Use streaming - Take advantage of instant response start
  3. Enable function calling - Fast tool use
  4. Use 8B for simple tasks - Instant responses
  5. Use 70B for complex tasks - Still very fast
  6. Implement rate limit handling - Free tier has limits
  7. Monitor latency - Groq provides latency metrics
  8. Cache when possible - Even faster responses
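Point 6 above (rate limit handling) can be sketched as a generic retry-with-backoff wrapper. This is an illustration, not an SDK feature; the exception types and delays are assumptions to tune for your workload:

```python
import time

def with_backoff(fn, retries=3, base_delay=1.0, retry_on=(Exception,)):
    """Call fn, retrying on the given exceptions with exponential backoff."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == retries:
                raise  # out of retries; surface the last error
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Usage (hypothetical):
# result = with_backoff(
#     lambda: client.chat.completions.create(
#         model="llama-3.3-70b-versatile",
#         messages=[{"role": "user", "content": "Hello"}],
#     ),
#     retry_on=(RateLimitError,),
# )
```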

Use Cases

Real-time Chat

# Ultra-responsive chat experience
stream = client.chat.completions.create(
    model="llama-3.1-8b-instant",
    messages=conversation_history,
    stream=True
)

Code Completion

# Near-instant code suggestions
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": f"Complete this code: {code_snippet}"}],
    max_tokens=200
)

Gaming NPCs

# Real-time NPC responses
response = client.chat.completions.create(
    model="llama-3.1-8b-instant",
    messages=[{"role": "user", "content": f"NPC reaction to: {player_action}"}],
    temperature=0.8
)

Rate Limits

Free tier (limits vary by model; representative figures):
  • 30 requests per minute
  • 14,400 requests per day
  • Generous for development
Paid Tiers:
  • Higher rate limits
  • Priority access
  • Contact Groq for details
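Staying under a 30 requests/minute limit can also be handled client-side with a sliding-window limiter. A minimal sketch (not an SDK feature) that tracks recent call timestamps and sleeps when the window is full:

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Allow at most max_calls per window_seconds; sleep when the window is full."""

    def __init__(self, max_calls=30, window_seconds=60.0):
        self.max_calls = max_calls
        self.window = window_seconds
        self.calls = deque()  # timestamps of recent calls

    def acquire(self, now=None):
        now = time.monotonic() if now is None else now
        # Drop timestamps that have aged out of the window
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            # Wait until the oldest call leaves the window, then retry
            time.sleep(self.window - (now - self.calls[0]))
            return self.acquire()
        self.calls.append(now)

limiter = SlidingWindowLimiter(max_calls=30, window_seconds=60)
# limiter.acquire()  # call before each API request
```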

LPU Technology

Groq’s Language Processing Unit (LPU) provides:
  • Deterministic performance - Consistent latency
  • Low latency - Less than 1 second for most requests
  • High throughput - 500+ tokens/second
  • Energy efficient - Lower power consumption
  • Scalable - Handle large workloads

Pricing

Groq offers very competitive pricing. See the Groq pricing page for detailed, per-model rates.

Getting Started

  1. Sign up at Groq Console
  2. Get your API key
  3. Start with free tier
  4. Experience the speed!

Related Pages

  • Together AI - alternative open models
  • Anyscale - another fast inference option
  • Streaming - optimize streaming responses
  • Real-time Apps - build real-time applications
