
Overview

Groq provides ultra-fast inference for open-source LLMs such as Llama, Mixtral, and Gemma, at speeds of 500+ tokens/second.

Installation

npm install @llamaindex/groq

Basic Usage

import { Groq } from "@llamaindex/groq";

const llm = new Groq({
  model: "llama-3.1-70b-versatile",
  apiKey: process.env.GROQ_API_KEY
});

const response = await llm.chat({
  messages: [
    { role: "user", content: "Explain quantum computing" }
  ]
});

console.log(response.message.content);

Constructor Options

  • model (string, required): Groq model name
  • apiKey (string): Groq API key (defaults to the GROQ_API_KEY environment variable)
  • temperature (number): Sampling temperature
  • maxTokens (number): Maximum number of tokens in the response
  • topP (number, default: 1): Nucleus sampling parameter
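
Taken together, the options above form a single configuration object. The sketch below mirrors the documented fields using a local interface (GroqOptions here is illustrative, not a type exported by @llamaindex/groq):

```typescript
// Illustrative shape of the documented constructor options.
// GroqOptions is a local interface for this sketch, not the
// package's exported type.
interface GroqOptions {
  model: string;        // required: Groq model name
  apiKey?: string;      // defaults to the GROQ_API_KEY env variable
  temperature?: number; // sampling temperature
  maxTokens?: number;   // maximum tokens in the response
  topP?: number;        // nucleus sampling parameter, default 1
}

const options: GroqOptions = {
  model: "llama-3.1-70b-versatile",
  apiKey: process.env.GROQ_API_KEY,
  temperature: 0.2,
  maxTokens: 1024,
  topP: 1,
};

console.log(options.model); // llama-3.1-70b-versatile
```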

Supported Models

Llama 3.1

  • llama-3.1-405b-reasoning: Most capable
  • llama-3.1-70b-versatile: Balanced performance
  • llama-3.1-8b-instant: Fastest

Llama 3

  • llama3-70b-8192: 70B parameter model
  • llama3-8b-8192: 8B parameter model

Mixtral

  • mixtral-8x7b-32768: Mixtral MoE model

Gemma

  • gemma-7b-it: Google Gemma 7B
  • gemma2-9b-it: Gemma 2 9B

Streaming

const stream = await llm.chat({
  messages: [{ role: "user", content: "Write a story" }],
  stream: true
});

for await (const chunk of stream) {
  process.stdout.write(chunk.delta);
}

Function Calling

import { tool } from "@llamaindex/core/tools";
import { z } from "zod";

const weatherTool = tool({
  name: "get_weather",
  description: "Get weather for a location",
  parameters: z.object({
    location: z.string()
  }),
  execute: async ({ location }) => {
    return `Weather in ${location}: 72°F`;
  }
});

const response = await llm.chat({
  messages: [{ role: "user", content: "Weather in NYC?" }],
  tools: [weatherTool]
});

Structured Output

import { z } from "zod";

const schema = z.object({
  summary: z.string(),
  sentiment: z.enum(["positive", "negative", "neutral"]),
  topics: z.array(z.string())
});

const result = await llm.exec({
  messages: [{ role: "user", content: "Analyze: Great product, fast shipping!" }],
  responseFormat: schema
});

Configuration

Environment Variables

GROQ_API_KEY=gsk_...

Global Settings

import { Settings } from "llamaindex";
import { Groq } from "@llamaindex/groq";

Settings.llm = new Groq({
  model: "llama-3.1-70b-versatile"
});

Performance

Groq’s LPU (Language Processing Unit) delivers exceptional speed. You can measure throughput directly:

const startTime = Date.now();

const response = await llm.chat({
  messages: [{ role: "user", content: "Explain AI" }]
});

const duration = Date.now() - startTime;
console.log(`Response time: ${duration}ms`);
console.log(`Tokens/sec: ${response.raw.usage.completion_tokens / (duration / 1000)}`);

Typical speeds: 300-500 tokens/second.

With LlamaIndex

import { Settings, VectorStoreIndex } from "llamaindex";
import { Groq } from "@llamaindex/groq";

Settings.llm = new Groq({ model: "llama-3.1-70b-versatile" });

const index = await VectorStoreIndex.fromDocuments(documents);
const queryEngine = index.asQueryEngine();

const response = await queryEngine.query({
  query: "What is the main topic?"
});

Model Selection Guide

  • Complex reasoning: llama-3.1-405b-reasoning (best quality)
  • General purpose: llama-3.1-70b-versatile (balanced)
  • Speed critical: llama-3.1-8b-instant (fastest)
  • Long context: mixtral-8x7b-32768 (32K context)
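
The guide above can be encoded as a small helper when model choice needs to vary at runtime. The function name and use-case keys below are illustrative, not part of any package:

```typescript
// Illustrative mapping of the use cases above to Groq model names.
type UseCase = "complex-reasoning" | "general" | "speed" | "long-context";

const MODEL_FOR_USE_CASE: Record<UseCase, string> = {
  "complex-reasoning": "llama-3.1-405b-reasoning", // best quality
  "general": "llama-3.1-70b-versatile",            // balanced
  "speed": "llama-3.1-8b-instant",                 // fastest
  "long-context": "mixtral-8x7b-32768",            // 32K context
};

function pickModel(useCase: UseCase): string {
  return MODEL_FOR_USE_CASE[useCase];
}

console.log(pickModel("general")); // llama-3.1-70b-versatile
```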

Rate Limits

Groq has generous free tier limits:
  • Free: 30 requests/minute
  • Paid: Higher limits based on plan
Handle rate limits:

try {
  const response = await llm.chat({ messages });
} catch (error) {
  if (error.status === 429) {
    console.log("Rate limit hit, waiting...");
    await new Promise(resolve => setTimeout(resolve, 2000));
    // Retry
  }
}
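
The retry above can be generalized into a small helper with exponential backoff. withRetry is a sketch, not part of @llamaindex/groq; it retries only on 429-style errors:

```typescript
// Sketch of a generic retry helper with exponential backoff.
// Retries only on errors carrying status 429; everything else
// is rethrown immediately.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 1000,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error: any) {
      if (error?.status !== 429 || attempt >= maxRetries) throw error;
      const delay = baseDelayMs * 2 ** attempt; // 1s, 2s, 4s, ...
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Usage: `const response = await withRetry(() => llm.chat({ messages }));`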

Best Practices

  1. Use for production: Groq’s speed is excellent for real-time applications
  2. Choose the right model: Balance speed against capability
  3. Monitor usage: Track API calls and costs
  4. Stream responses: Streaming makes Groq’s speed even more apparent in the UX
  5. Handle rate limits: Implement retry logic
