The GroqLLM class provides an interface to Groq’s high-performance LLM API with built-in RAG (Retrieval-Augmented Generation) functionality. It retrieves relevant context from a vector store and generates informed responses.
Class definition
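The class definition itself did not survive extraction. Below is a hedged reconstruction of what the constructor likely looks like, based on the parameters and behavior described on this page; the `max_tokens` default and the lazy import are assumptions (the real module presumably imports `ChatGroq` at the top).

```python
import os

class GroqLLM:
    """Hypothetical reconstruction of the class described on this page."""

    def __init__(self, model_name="llama-3.3-70b-versatile",
                 temperature=0.1, max_tokens=1024):
        # Documented behavior: the API key is validated during initialization
        if not os.environ.get("GROQ_API_KEY"):
            raise ValueError("GROQ_API_KEY environment variable is not set")
        # Imported lazily only so this sketch stays self-contained
        from langchain_groq import ChatGroq
        self.llm = ChatGroq(model=model_name, temperature=temperature,
                            max_tokens=max_tokens)
```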
Constructor parameters
Name of the Groq model to use. Examples:
- "llama-3.3-70b-versatile" (default in main.py)
- "mixtral-8x7b-32768"
- "gemma-7b-it"
Controls randomness in generation (0.0 to 2.0).
- 0.0-0.3: More deterministic, factual responses (recommended for code questions)
- 0.4-0.7: Balanced creativity and consistency
- 0.8-2.0: More creative, varied responses
Maximum number of tokens in the generated response. Limits response length.
Requires the GROQ_API_KEY environment variable to be set. The API key is automatically loaded from a .env file or can be set via os.environ.
Methods
rag()
Performs Retrieval-Augmented Generation: retrieves relevant context and generates an answer.
The user’s question or prompt to answer.
An initialized RAGRetriever instance for retrieving relevant documents from the vector store.
Number of relevant documents to retrieve and include as context.
The LLM’s generated response based on the retrieved context. Returns a fallback message if no relevant context is found.
Usage example
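The original usage snippet was lost in extraction. As a stand-in, the sketch below uses minimal stubs for GroqLLM and RAGRetriever to make the documented call pattern, `llm.rag(query, retriever, top_k)`, concrete; it does not call the Groq API, and the stub field names are guesses.

```python
class RAGRetriever:
    """Stub: the real class retrieves chunks from a vector store."""
    def retrieve(self, query, top_k=3):
        # Field names here are illustrative, not the real schema
        return [{"path": "src/main.py", "text": "def main(): ..."}][:top_k]

class GroqLLM:
    """Stub: the real class sends the prompt to Groq and returns the answer."""
    def rag(self, query, retriever, top_k=3):
        docs = retriever.retrieve(query, top_k=top_k)
        if not docs:
            return "No relevant context was found for this question."
        context = "\n\n".join(f"File: {d['path']}\n{d['text']}" for d in docs)
        return f"(model answer grounded in {len(docs)} retrieved chunk(s))"

llm = GroqLLM()
retriever = RAGRetriever()
answer = llm.rag("What does main() do?", retriever, top_k=3)
```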
Integration example
From main.py showing the complete RAG setup:
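The main.py excerpt itself was lost; the outline below reconstructs the documented setup order only. The commented helper names are placeholders, not the project’s real API.

```python
import os

def build_rag_app():
    """Hedged wiring order for the setup described in main.py."""
    # 1. Ensure the API key is present (loaded from .env or the environment)
    if not os.environ.get("GROQ_API_KEY"):
        raise ValueError("GROQ_API_KEY environment variable is not set")
    # 2. Build the vector store and wrap it in a RAGRetriever
    #    retriever = RAGRetriever(vector_store)            # placeholder
    # 3. Construct the LLM with the documented defaults
    #    llm = GroqLLM(model_name="llama-3.3-70b-versatile", temperature=0.1)
    # 4. Answer questions: llm.rag(question, retriever, top_k=3)
```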
Prompt structure
The rag() method uses a prompt template that combines the retrieved context with the user’s question; the context is formatted as described under “Context formatting” below.
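The template text itself did not survive extraction; the following is a plausible reconstruction of its shape (a context block followed by the question), not the verbatim string:

```python
# Assumed shape only; the real template's wording was not recoverable.
PROMPT_TEMPLATE = """Use the following context from the codebase to answer the question.
If the context does not contain the answer, say so.

Context:
{context}

Question: {question}

Answer:"""

def build_prompt(context: str, question: str) -> str:
    return PROMPT_TEMPLATE.format(context=context, question=question)
```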
Customizing generation parameters
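The original snippet for this section is missing; the stub below mirrors the constructor parameters described above to show how generation behavior might be tuned. The stub only stores parameters and does not call the API.

```python
class GroqLLM:  # minimal stub mirroring the documented constructor parameters
    def __init__(self, model_name="llama-3.3-70b-versatile",
                 temperature=0.1, max_tokens=1024):
        self.model_name = model_name
        self.temperature = temperature
        self.max_tokens = max_tokens

# Deterministic, short answers for code questions:
qa_llm = GroqLLM(temperature=0.0, max_tokens=512)
# More varied, longer answers:
creative_llm = GroqLLM(temperature=0.8, max_tokens=2048)
```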
Context formatting
The retrieved documents are formatted with file paths for clarity.
Handling no results
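The formatting and fallback snippets were lost in extraction; this sketch covers both behaviors described here: prefixing each retrieved chunk with its source file path, and returning a fallback message when retrieval comes back empty. The field names and fallback wording are guesses.

```python
FALLBACK_MESSAGE = "No relevant context was found to answer this question."

def format_context(docs):
    """Prefix each retrieved chunk with its source file path."""
    return "\n\n".join(f"File: {d['path']}\n{d['text']}" for d in docs)

def answer_or_fallback(docs):
    """Documented behavior: if retrieval returns nothing, skip generation
    and return a fallback message instead."""
    if not docs:
        return FALLBACK_MESSAGE
    return format_context(docs)  # in the real rag(), this feeds the prompt
```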
Error handling
Supported Groq models
Popular Groq models
- llama-3.3-70b-versatile: Balanced performance and quality (recommended)
- llama-3.1-70b-versatile: Previous generation Llama 3.1
- mixtral-8x7b-32768: Mixture of Experts, large context window
- gemma-7b-it: Efficient smaller model
- llama-3.1-8b-instant: Very fast, smaller model
API key management
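A sketch of the documented key-loading behavior: read from a .env file (via python-dotenv) when available, otherwise fall back to the plain environment, and fail loudly if the key is still missing. The function name is illustrative.

```python
import os

def load_groq_api_key():
    """Load GROQ_API_KEY from a .env file if python-dotenv is available,
    otherwise rely on the existing environment."""
    try:
        from dotenv import load_dotenv
        load_dotenv()  # populates os.environ from a local .env file
    except ImportError:
        pass
    key = os.environ.get("GROQ_API_KEY")
    if not key:
        raise ValueError("GROQ_API_KEY environment variable is not set")
    return key
```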
Response processing
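A sketch of the documented extraction step: the chat completion returned by ChatGroq is a message object whose text lives on the .content attribute. The SimpleNamespace below stands in for that object.

```python
from types import SimpleNamespace

def extract_answer(completion):
    """Documented behavior: the generated text is read from .content."""
    return completion.content

# Stand-in for the message object the real LLM call returns:
fake_completion = SimpleNamespace(content="The retriever chunks files by size.")
```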
Performance considerations
- Groq provides very fast inference (often < 1 second for responses)
- Larger top_k values increase context size and may slow generation
- Context length is limited by the model’s max tokens (varies by model)
- Consider the max_tokens parameter to control response length and cost
Implementation notes
- Uses LangChain’s ChatGroq wrapper for API interactions
- API key is validated during initialization (raises ValueError if missing)
- Temperature defaults to 0.1 for more deterministic, factual code responses
- The rag() method handles the full RAG pipeline: retrieval → formatting → generation
- Responses are extracted from the LLM’s completion via the .content attribute
- No conversation history is maintained (each query is independent)