
Overview

CacheOpenAI is an OpenAI-compatible LLM client that implements response caching using SQLite. It supports both the standard OpenAI API and Azure OpenAI, and automatically deduplicates identical requests to reduce API costs.

Class Definition

from remem.llm.openai_gpt import CacheOpenAI
Location: src/remem/llm/openai_gpt.py:120

Initialization

From Experiment Config

@classmethod
def from_experiment_config(cls, global_config: BaseConfig)
Create an instance from a global configuration object.
Parameters:
  • global_config (BaseConfig, required): Global configuration containing LLM settings
Example:
from remem.utils.config_utils import BaseConfig
from remem.llm.openai_gpt import CacheOpenAI

config = BaseConfig()
llm = CacheOpenAI.from_experiment_config(config)
Location: src/remem/llm/openai_gpt.py:123

Direct Initialization

def __init__(
    self,
    cache_dir,
    cache_filename: str = None,
    llm_name: str = "gpt-4o-mini",
    api_key: str = None,
    llm_base_url: str = None,
    **kwargs,
) -> None
Parameters:
  • cache_dir (str, required): Directory where SQLite cache files will be stored
  • cache_filename (str, default: None): Custom cache filename. If None, defaults to {llm_name}_cache.sqlite
  • llm_name (str, default: "gpt-4o-mini"): Name of the OpenAI model to use (e.g., "gpt-4o", "gpt-4o-mini", "gpt-3.5-turbo")
  • api_key (str, default: None): OpenAI API key. If None, reads from the OPENAI_API_KEY environment variable
  • llm_base_url (str, default: None): Base URL for the OpenAI API endpoint
  • **kwargs (dict): Additional configuration options:
    • num_gen_choices (int): Number of completions to generate (default: 1)
    • seed (int): Random seed for reproducibility (default: 0)
    • temperature (float): Sampling temperature (default: 0.0)
    • use_azure (bool): Use Azure OpenAI instead of the standard API (default: False)
Example:
llm = CacheOpenAI(
    cache_dir="./llm_cache",
    llm_name="gpt-4o-mini",
    api_key="sk-...",
    llm_base_url="https://api.openai.com/v1",
    temperature=0.7,
    seed=42
)
Location: src/remem/llm/openai_gpt.py:129

Core Methods

infer

@cache_response
def infer(self, messages: List[TextChatMessage], **kwargs) -> Tuple[List[TextChatMessage], dict]
Perform synchronous inference with automatic caching.
Parameters:
  • messages (List[TextChatMessage], required): List of chat messages. Each message is a dictionary with role and content keys
  • **kwargs (dict): Optional generation parameters that override defaults:
    • model (str): Override the model name
    • temperature (float): Override sampling temperature
    • seed (int): Override random seed
    • response_format (dict): Specify JSON output format
    • enable_thinking (bool): Enable thinking mode for Qwen3 models
Returns:
  • response (Tuple[str, dict, bool]): A tuple containing:
    • response_message (str): The LLM's generated response text
    • metadata (dict): Contains prompt (original input messages), response (generated text), prompt_tokens (number of tokens in the prompt), completion_tokens (number of tokens in the completion), and finish_reason (why generation stopped)
    • cache_hit (bool): Whether the response was retrieved from cache
Example:
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
]

response, metadata, cache_hit = llm.infer(messages)
print(f"Response: {response}")
print(f"Cache hit: {cache_hit}")
print(f"Tokens used: {metadata['prompt_tokens']} + {metadata['completion_tokens']}")
Location: src/remem/llm/openai_gpt.py:210

batch_infer

def batch_infer(
    self, messages_list: List[List[TextChatMessage]], max_workers: int = 10, **kwargs
) -> List[Tuple[List[TextChatMessage], dict, bool]]
Run inference on multiple inputs in parallel while preserving cache integrity.
Parameters:
  • messages_list (List[List[TextChatMessage]], required): A list of message sequences to send
  • max_workers (int, default: 10): Number of threads to use for parallel processing
  • **kwargs (dict): Additional parameters passed through to infer()
Returns:
  • results (List[Tuple[str, dict, bool]]): A list of (response, metadata, cache_hit) tuples in the same order as input
Example:
messages_batch = [
    [{"role": "user", "content": "Translate 'hello' to French"}],
    [{"role": "user", "content": "Translate 'goodbye' to Spanish"}],
    [{"role": "user", "content": "Translate 'thank you' to German"}]
]

results = llm.batch_infer(messages_batch, max_workers=3)

for i, (response, metadata, cache_hit) in enumerate(results):
    print(f"Request {i}: {response} (cached: {cache_hit})")
Location: src/remem/llm/openai_gpt.py:247

Caching Mechanism

The @cache_response decorator automatically caches responses based on:
  • Input messages
  • Model name
  • Seed value
  • Temperature
  • Response format
Cache Key Generation:
key_data = {
    "messages": messages,
    "model": model,
    "seed": seed,
    "temperature": temperature,
    "response_format": response_format,
}
key_hash = hashlib.sha256(json.dumps(key_data, sort_keys=True).encode()).hexdigest()
Location: src/remem/llm/openai_gpt.py:23
Features:
  • SQLite database for persistent caching
  • File-based locking for concurrent access
  • Automatic stale lock cleanup
  • Cache hit tracking in response metadata
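The deduplication mechanism can be sketched in a few lines. This is a simplified illustration, not the library's actual implementation: an in-memory SQLite table keyed by the SHA-256 hash of the request parameters, with a stand-in string in place of the real API call.

```python
import hashlib
import json
import sqlite3

def make_cache_key(messages, model, seed, temperature, response_format=None):
    # Same key scheme as above: hash the request parameters deterministically.
    key_data = {
        "messages": messages,
        "model": model,
        "seed": seed,
        "temperature": temperature,
        "response_format": response_format,
    }
    return hashlib.sha256(json.dumps(key_data, sort_keys=True).encode()).hexdigest()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cache (key TEXT PRIMARY KEY, response TEXT)")

def cached_call(messages, model="gpt-4o-mini", seed=0, temperature=0.0):
    key = make_cache_key(messages, model, seed, temperature)
    row = conn.execute("SELECT response FROM cache WHERE key = ?", (key,)).fetchone()
    if row is not None:
        return row[0], True                     # cache hit: skip the API call
    response = f"<LLM response for {key[:8]}>"  # stand-in for the real API call
    conn.execute("INSERT INTO cache VALUES (?, ?)", (key, response))
    return response, False                      # cache miss: stored for next time

msgs = [{"role": "user", "content": "What is 2 + 2?"}]
_, hit1 = cached_call(msgs)  # first call: miss
_, hit2 = cached_call(msgs)  # identical request: hit
```

Because the key includes the seed and temperature, changing any generation parameter produces a fresh cache entry rather than a stale hit.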

Azure OpenAI Support

To use Azure OpenAI, set the following environment variables and pass use_azure=True:
import os

os.environ["AZURE_OPENAI_API_KEY"] = "your-api-key"
os.environ["OPENAI_API_VERSION"] = "2024-02-15-preview"
os.environ["AZURE_OPENAI_ENDPOINT"] = "https://your-resource.openai.azure.com/"

llm = CacheOpenAI(
    cache_dir="./cache",
    llm_name="gpt-4",
    use_azure=True
)
Location: src/remem/llm/openai_gpt.py:178

Special Model Support

Qwen3 Thinking Mode

For Qwen3 models, you can enable thinking mode:
response, metadata, _ = llm.infer(
    messages,
    enable_thinking=True
)
Location: src/remem/llm/openai_gpt.py:218

JSON Output Mode

Request JSON-formatted responses:
response, metadata, _ = llm.infer(
    messages,
    response_format={"type": "json_object"}
)

import json
data = json.loads(response)
The implementation automatically strips markdown code fences if present.
Location: src/remem/llm/openai_gpt.py:238

Configuration Details

The _init_llm_config() method sets up default generation parameters:
{
    "llm_name": "gpt-4o-mini",
    "llm_base_url": "https://api.openai.com/v1",
    "generate_params": {
        "model": "gpt-4o-mini",
        "n": 1,
        "seed": 0,
        "temperature": 0.0
    }
}
Location: src/remem/llm/openai_gpt.py:196

Error Handling

The client includes automatic retry logic:
self.openai_client = OpenAI(
    base_url=self.llm_base_url,
    api_key=api_key,
    timeout=60,
    max_retries=5
)
Failed API calls are logged with error details.
Location: src/remem/llm/openai_gpt.py:194
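The general retry pattern can be illustrated with a self-contained sketch. This is an assumption about the shape of the behavior only; the OpenAI SDK handles retries internally via the max_retries argument, and with_retries and flaky below are hypothetical helpers.

```python
import time

def with_retries(fn, max_retries=5, base_delay=0.0):
    # Retry transient failures up to max_retries times with exponential backoff.
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            time.sleep(base_delay * 2 ** attempt)

calls = {"n": 0}

def flaky():
    # Simulated endpoint that fails twice before succeeding.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = with_retries(flaky)  # succeeds on the third attempt
```

In practice you rarely need your own wrapper; tuning the client's timeout and max_retries arguments covers most transient-failure scenarios.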
