Overview
CacheOpenAI is an OpenAI-compatible LLM client that implements response caching using SQLite. It supports both the standard OpenAI API and Azure OpenAI, and automatically deduplicates identical requests to reduce API costs.
Class Definition
src/remem/llm/openai_gpt.py:120
Initialization
From Experiment Config
Global configuration containing LLM settings
src/remem/llm/openai_gpt.py:123
Direct Initialization
- Directory where SQLite cache files will be stored
- Custom cache filename. If None, defaults to `{llm_name}_cache.sqlite`
- Name of the OpenAI model to use (e.g., “gpt-4o”, “gpt-4o-mini”, “gpt-3.5-turbo”)
- OpenAI API key. If None, read from the `OPENAI_API_KEY` environment variable
- Base URL for the OpenAI API endpoint
- Additional configuration options:
  - `num_gen_choices` (int): Number of completions to generate (default: 1)
  - `seed` (int): Random seed for reproducibility (default: 0)
  - `temperature` (float): Sampling temperature (default: 0.0)
  - `use_azure` (bool): Use Azure OpenAI instead of the standard API (default: False)
src/remem/llm/openai_gpt.py:129
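The configuration options above can be sketched as a defaults-plus-overrides merge. This is illustrative only (the function and attribute names here are hypothetical, not taken from the source):

```python
# Illustrative sketch: default generation settings merged with
# caller-supplied overrides, mirroring the options listed above.
DEFAULT_CONFIG = {
    "num_gen_choices": 1,   # number of completions to generate
    "seed": 0,              # random seed for reproducibility
    "temperature": 0.0,     # sampling temperature
    "use_azure": False,     # use Azure OpenAI instead of the standard API
}

def build_llm_config(**overrides):
    """Return the defaults with any caller-supplied overrides applied."""
    unknown = set(overrides) - set(DEFAULT_CONFIG)
    if unknown:
        raise ValueError(f"unknown config options: {sorted(unknown)}")
    return {**DEFAULT_CONFIG, **overrides}

config = build_llm_config(temperature=0.7, seed=42)
```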
Core Methods
infer
List of chat messages. Each message is a dictionary with `role` and `content` keys.
Optional generation parameters that override defaults:
- `model` (str): Override the model name
- `temperature` (float): Override sampling temperature
- `seed` (int): Override random seed
- `response_format` (dict): Specify JSON output format
- `enable_thinking` (bool): Enable thinking mode for Qwen3 models
A tuple containing:
- `response_message` (str): The LLM’s generated response text
- `metadata` (dict): Contains:
  - `prompt`: Original input messages
  - `response`: Generated text
  - `prompt_tokens`: Number of tokens in the prompt
  - `completion_tokens`: Number of tokens in the completion
  - `finish_reason`: Why generation stopped
- `cache_hit` (bool): Whether the response was retrieved from the cache
src/remem/llm/openai_gpt.py:210
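The return contract can be illustrated with a stub (the stub below is hypothetical; the real `infer()` calls the OpenAI API and consults the SQLite cache):

```python
# Hypothetical stub illustrating the (response, metadata, cache_hit)
# return shape described above -- not the real implementation.
def fake_infer(messages, **gen_params):
    response_text = "Hello!"
    metadata = {
        "prompt": messages,            # original input messages
        "response": response_text,     # generated text
        "prompt_tokens": 5,            # tokens in the prompt
        "completion_tokens": 2,        # tokens in the completion
        "finish_reason": "stop",       # why generation stopped
    }
    cache_hit = False                  # whether the cache was used
    return response_text, metadata, cache_hit

messages = [{"role": "user", "content": "Say hello"}]
response, meta, cache_hit = fake_infer(messages, temperature=0.2)
```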
batch_infer
- A list of message sequences to send
- Number of threads to use for parallel processing
- Additional parameters passed through to `infer()`

Returns a list of `(response, metadata, cache_hit)` tuples in the same order as the input.
src/remem/llm/openai_gpt.py:247
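The order-preserving fan-out can be sketched with a thread pool; `executor.map` yields results in input order regardless of completion order. The stub `fake_infer` below is a stand-in for the real `infer()`:

```python
from concurrent.futures import ThreadPoolExecutor

def fake_infer(messages, **kwargs):
    # Stand-in for CacheOpenAI.infer(); returns (response, metadata, cache_hit).
    return f"echo: {messages[-1]['content']}", {"prompt": messages}, False

def batch_infer_sketch(batch, num_threads=4, **kwargs):
    """Fan requests out over a thread pool; map() preserves input order."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(lambda msgs: fake_infer(msgs, **kwargs), batch))

batch = [[{"role": "user", "content": str(i)}] for i in range(3)]
results = batch_infer_sketch(batch, num_threads=2)
```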
Caching Mechanism
The `@cache_response` decorator automatically caches responses based on:
- Input messages
- Model name
- Seed value
- Temperature
- Response format
src/remem/llm/openai_gpt.py:23
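One plausible way to derive a deterministic cache key from the five inputs listed above is to hash their canonical JSON serialization (this is a sketch, not the decorator's actual key scheme):

```python
import hashlib
import json

def cache_key(messages, model, seed, temperature, response_format=None):
    """Derive a deterministic key from the inputs the cache is based on."""
    # sort_keys makes the serialization canonical so identical requests
    # always hash to the same key.
    payload = json.dumps(
        {
            "messages": messages,
            "model": model,
            "seed": seed,
            "temperature": temperature,
            "response_format": response_format,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

msgs = [{"role": "user", "content": "hi"}]
k1 = cache_key(msgs, "gpt-4o-mini", 0, 0.0)
k2 = cache_key(msgs, "gpt-4o-mini", 0, 0.0)
k3 = cache_key(msgs, "gpt-4o-mini", 0, 0.7)  # different temperature
```

Identical requests map to the same key, while any change to the model, seed, temperature, or response format produces a different one.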
Features:
- SQLite database for persistent caching
- File-based locking for concurrent access
- Automatic stale lock cleanup
- Cache hit tracking in response metadata
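A minimal sketch of the SQLite-backed persistence (omitting the file locking and stale-lock cleanup that the real class layers on top):

```python
import json
import os
import sqlite3
import tempfile

class SqliteCacheSketch:
    """Minimal persistent key/value cache. The real implementation also
    uses file-based locking (with stale-lock cleanup) for concurrent access."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT)"
        )

    def get(self, key):
        row = self.conn.execute(
            "SELECT value FROM cache WHERE key = ?", (key,)
        ).fetchone()
        return json.loads(row[0]) if row else None  # None => cache miss

    def set(self, key, value):
        self.conn.execute(
            "INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)",
            (key, json.dumps(value)),
        )
        self.conn.commit()

path = os.path.join(tempfile.mkdtemp(), "demo_cache.sqlite")
cache = SqliteCacheSketch(path)
cache.set("some-key", {"response": "hi", "cache_hit": True})
```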
Azure OpenAI Support
To use Azure OpenAI, set the following environment variables and pass `use_azure=True`:
src/remem/llm/openai_gpt.py:178
Special Model Support
Qwen3 Thinking Mode
For Qwen3 models, you can enable thinking mode:
src/remem/llm/openai_gpt.py:218
JSON Output Mode
Request JSON-formatted responses:
src/remem/llm/openai_gpt.py:238
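In the standard OpenAI API, JSON mode is requested with `response_format={"type": "json_object"}`; whether `infer()` forwards this dictionary verbatim is an assumption based on the parameter list above:

```python
import json

# Standard OpenAI JSON-mode flag; assumed to be passed through via the
# response_format generation parameter documented above.
gen_params = {"response_format": {"type": "json_object"}}

# response, metadata, cache_hit = client.infer(messages, **gen_params)
# With JSON mode enabled, the response text should parse cleanly:
sample_response = '{"answer": 42}'
parsed = json.loads(sample_response)
```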
Configuration Details
The `_init_llm_config()` method sets up default generation parameters:
src/remem/llm/openai_gpt.py:196
Error Handling
The client includes automatic retry logic:
src/remem/llm/openai_gpt.py:194
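The general shape of such retry logic is exponential backoff around the API call; this is a generic sketch, not the client's actual policy:

```python
import time

def with_retries(fn, max_attempts=3, base_delay=0.01):
    """Retry fn() with exponential backoff, re-raising the last error."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))  # 0.01s, 0.02s, ...

# Demo: a call that fails twice before succeeding.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient API error")
    return "ok"

result = with_retries(flaky)
```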
See Also
- BaseLLM Interface - Base class documentation
- vLLM Offline Client - Offline inference alternative
- Configuration - BaseConfig documentation