The Groq Microservice does not implement rate limiting at the application level. Instead, it inherits rate limits directly from the underlying Groq API. Understanding these limits is crucial for production deployments.
Without application-level rate limiting, the microservice forwards every request to Groq's API until Groq's own rate limits are hit, at which point end users start receiving 502 errors.
The Groq Microservice uses the Groq API (https://api.groq.com/openai/v1/chat/completions) which has its own rate limiting policies:
Current Model: The service uses llama-3.1-8b-instant by default (see server.js:37). Rate limits vary by model and account tier. Check your specific limits in the Groq Console.
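Groq includes rate limit information in its response headers, which you can log to see how close the service is running to the upstream limits. The sketch below assumes Groq's documented `x-ratelimit-*` header names; verify them against your own responses before relying on this.

```javascript
// Sketch: read the rate limit headers Groq returns with each response.
// Accepts either a fetch Headers object or a plain header map.
function parseGroqRateLimitHeaders(headers) {
  const get = (name) => (headers.get ? headers.get(name) : headers[name]);
  return {
    requestLimit: Number(get("x-ratelimit-limit-requests")),
    requestsRemaining: Number(get("x-ratelimit-remaining-requests")),
    tokenLimit: Number(get("x-ratelimit-limit-tokens")),
    tokensRemaining: Number(get("x-ratelimit-remaining-tokens")),
  };
}

// Example with a plain object standing in for fetch's Headers:
const info = parseGroqRateLimitHeaders({
  "x-ratelimit-limit-requests": "14400",
  "x-ratelimit-remaining-requests": "14370",
  "x-ratelimit-limit-tokens": "18000",
  "x-ratelimit-remaining-tokens": "17997",
});
console.log(info.requestsRemaining); // 14370
```

Logging these values after each Groq call is an easy way to decide what application-level limits to set below.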
```javascript
import express from "express";
import rateLimit from "express-rate-limit";
import RedisStore from "rate-limit-redis";
import Redis from "ioredis";
import "dotenv/config";

const app = express();
app.use(express.json({ limit: "1mb" }));

// Create Redis client
const redisClient = new Redis({
  host: process.env.REDIS_HOST || 'localhost',
  port: process.env.REDIS_PORT || 6379,
  password: process.env.REDIS_PASSWORD,
});

// Configure rate limiter with Redis store
const limiter = rateLimit({
  windowMs: 60 * 1000, // 1 minute
  max: 10,
  store: new RedisStore({
    // rate-limit-redis v4 expects a sendCommand function rather than a client
    sendCommand: (...args) => redisClient.call(...args),
    prefix: 'groq_rl:', // Key prefix in Redis
  }),
  message: {
    error: "Límite de solicitudes excedido",
    details: "Por favor, espera antes de hacer más solicitudes"
  },
});

app.use('/api/', limiter);
```
Benefits:
Rate limits are shared across all service instances
Scales horizontally
Persistent rate limit data
Per-User Rate Limiting
Implement different rate limits for different users or API keys.

Implementation:
```javascript
import rateLimit from "express-rate-limit";

// User tier configuration
const RATE_LIMITS = {
  free: { windowMs: 60 * 1000, max: 5 },
  premium: { windowMs: 60 * 1000, max: 50 },
  enterprise: { windowMs: 60 * 1000, max: 500 },
};

// Middleware helper to identify the user tier
function getUserTier(req) {
  const apiKey = req.headers['x-api-key'];
  // Look up the user tier from your database or config.
  // This is a simplified example.
  if (!apiKey) return 'free';
  // Query your database or config to get the actual tier
  return 'premium';
}

// Dynamic rate limiter
const dynamicLimiter = rateLimit({
  windowMs: 60 * 1000,
  max: (req) => {
    const tier = getUserTier(req);
    return RATE_LIMITS[tier].max;
  },
  keyGenerator: (req) => {
    // Use the API key or IP address as the rate limit key
    return req.headers['x-api-key'] || req.ip;
  },
  message: (req) => {
    const tier = getUserTier(req);
    return {
      error: "Límite de solicitudes excedido",
      details: `Límite para plan ${tier}: ${RATE_LIMITS[tier].max} solicitudes por minuto`,
      tier: tier,
      upgradeUrl: "https://your-service.com/upgrade"
    };
  },
});

app.use('/api/', dynamicLimiter);
```
Response Caching
Reduce API calls by caching responses for identical requests.

Installation:
npm install node-cache
Implementation:
```javascript
import NodeCache from "node-cache";
import crypto from "crypto";

// Create cache with 1 hour TTL
const cache = new NodeCache({ stdTTL: 3600, checkperiod: 600 });

// Generate cache key from request
function getCacheKey(prompt, area) {
  const content = JSON.stringify({ prompt, area });
  return crypto.createHash('md5').update(content).digest('hex');
}

app.post("/api/generate-template", async (req, res) => {
  try {
    const { prompt, area } = req.body || {};
    if (!prompt || typeof prompt !== "string") {
      return res.status(400).json({ error: "prompt es requerido" });
    }

    // Check cache
    const cacheKey = getCacheKey(prompt, area);
    const cachedResponse = cache.get(cacheKey);
    if (cachedResponse) {
      return res.json({
        text: cachedResponse,
        cached: true // Optional: indicate the response was cached
      });
    }

    // ... existing Groq API call code ...
    const r = await fetch("https://api.groq.com/openai/v1/chat/completions", {
      // ... existing fetch config
    });

    if (!r.ok) {
      const details = await r.text();
      return res.status(502).json({ error: "Groq error", details });
    }

    const data = await r.json();
    const text = data?.choices?.[0]?.message?.content?.trim() || "";

    // Store in cache
    cache.set(cacheKey, text);

    return res.json({ text });
  } catch (e) {
    return res.status(500).json({ error: "server error", details: String(e) });
  }
});
```
Be cautious with caching LLM responses as they may vary slightly between calls even with the same prompt. Consider if this is appropriate for your use case.
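If caching does fit your use case, you can make it more effective by normalizing the prompt before hashing, so requests that differ only in whitespace or letter case share a cache entry. This is a sketch, not part of the service; the `normalizePrompt` helper is hypothetical.

```javascript
import crypto from "crypto";

// Hypothetical helper: collapse whitespace and lowercase the prompt so
// near-identical requests map to the same cache key.
function normalizePrompt(prompt) {
  return prompt.trim().replace(/\s+/g, " ").toLowerCase();
}

function getNormalizedCacheKey(prompt, area) {
  const content = JSON.stringify({ prompt: normalizePrompt(prompt), area });
  return crypto.createHash("md5").update(content).digest("hex");
}

// These two requests now hit the same cache entry:
const a = getNormalizedCacheKey("Plantilla  de ventas", "marketing");
const b = getNormalizedCacheKey("plantilla de ventas ", "marketing");
console.log(a === b); // true
```

Whether case-insensitive matching is acceptable depends on your prompts; drop the `toLowerCase()` call if casing is meaningful for your templates.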
Start with Conservative Limits
Start with conservative rate limits and increase them based on actual usage:
```javascript
// Start with low limits
const limiter = rateLimit({
  windowMs: 60 * 1000,
  max: 5, // Only 5 requests per minute initially
});

// Monitor and adjust based on:
// - Groq API limits for your account
// - Actual user demand
// - Error rates
// - Response times
```
Provide Clear Error Messages
Help users understand rate limits:
```javascript
const limiter = rateLimit({
  windowMs: 60 * 1000,
  max: 10,
  message: {
    error: "Límite de solicitudes excedido",
    details: "Has alcanzado el límite de 10 solicitudes por minuto.",
    retryAfter: "Espera 60 segundos antes de intentar nuevamente",
    limit: 10,
    window: "1 minuto"
  },
  standardHeaders: true, // Include rate limit headers in responses
});
```
Implement Graceful Degradation
Handle rate limit errors gracefully:
```javascript
// Client-side handling
async function generateTemplate(prompt, area) {
  try {
    const response = await fetch('/api/generate-template', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ prompt, area })
    });

    if (response.status === 429) {
      const retryAfter = response.headers.get('Retry-After') || 60;
      // Queue the request for later
      return {
        queued: true,
        message: `Solicitud en cola. Se procesará en ${retryAfter} segundos.`,
        retryAfter
      };
    }

    if (!response.ok) {
      throw new Error('API error');
    }

    return await response.json();
  } catch (error) {
    console.error('Request failed:', error);
    throw error;
  }
}
```
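Instead of only queueing, the client can also retry automatically. The sketch below (hypothetical helpers, not part of the service) honors a `Retry-After` value in seconds when the server sends one and otherwise falls back to capped exponential backoff:

```javascript
// Hypothetical helper: compute how long to wait before retry `attempt`
// (0-based). Uses Retry-After (in seconds) when present, otherwise
// capped exponential backoff.
function backoffMs(attempt, retryAfterSeconds) {
  if (retryAfterSeconds) return retryAfterSeconds * 1000;
  const base = 1000 * Math.pow(2, attempt); // 1s, 2s, 4s, ...
  return Math.min(base, 30000); // cap at 30 seconds
}

// Hypothetical wrapper around the endpoint that retries on 429
async function generateTemplateWithRetry(prompt, area, maxRetries = 3) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const response = await fetch('/api/generate-template', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ prompt, area })
    });
    if (response.status !== 429) return response.json();
    const retryAfter = Number(response.headers.get('Retry-After')) || 0;
    await new Promise((r) => setTimeout(r, backoffMs(attempt, retryAfter)));
  }
  throw new Error('Rate limited after all retries');
}

console.log(backoffMs(0, 0)); // 1000
console.log(backoffMs(3, 0)); // 8000
console.log(backoffMs(1, 10)); // 10000
```

Automatic retries make sense for background work; for interactive requests the queue-and-notify approach above usually gives users better feedback.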
For enterprise deployments with high volume requirements, consider reaching out to Groq about increasing your rate limits or implementing a dedicated infrastructure setup.