
Overview

CacheOpenAI is an OpenAI-compatible LLM client that implements response caching using SQLite. It supports both the standard OpenAI API and Azure OpenAI, and automatically deduplicates identical requests to reduce API costs.

Class Definition

from remem.llm.openai_gpt import CacheOpenAI
Location: src/remem/llm/openai_gpt.py:120

Initialization

From Experiment Config

@classmethod
def from_experiment_config(cls, global_config: BaseConfig)
Create an instance from a global configuration object.
Parameters:
  • global_config (BaseConfig, required): Global configuration containing LLM settings
Example:
from remem.utils.config_utils import BaseConfig
from remem.llm.openai_gpt import CacheOpenAI

config = BaseConfig()
llm = CacheOpenAI.from_experiment_config(config)
Location: src/remem/llm/openai_gpt.py:123

Direct Initialization

def __init__(
    self,
    cache_dir,
    cache_filename: str = None,
    llm_name: str = "gpt-4o-mini",
    api_key: str = None,
    llm_base_url: str = None,
    **kwargs,
) -> None
Parameters:
  • cache_dir (str, required): Directory where SQLite cache files will be stored
  • cache_filename (str, default: None): Custom cache filename. If None, defaults to {llm_name}_cache.sqlite
  • llm_name (str, default: "gpt-4o-mini"): Name of the OpenAI model to use (e.g., "gpt-4o", "gpt-4o-mini", "gpt-3.5-turbo")
  • api_key (str, default: None): OpenAI API key. If None, reads from the OPENAI_API_KEY environment variable
  • llm_base_url (str, default: None): Base URL for the OpenAI API endpoint
  • **kwargs (dict): Additional configuration options:
    • num_gen_choices (int): Number of completions to generate (default: 1)
    • seed (int): Random seed for reproducibility (default: 0)
    • temperature (float): Sampling temperature (default: 0.0)
    • use_azure (bool): Use Azure OpenAI instead of the standard API (default: False)
Example:
llm = CacheOpenAI(
    cache_dir="./llm_cache",
    llm_name="gpt-4o-mini",
    api_key="sk-...",
    llm_base_url="https://api.openai.com/v1",
    temperature=0.7,
    seed=42
)
Location: src/remem/llm/openai_gpt.py:129

Core Methods

infer

@cache_response
def infer(self, messages: List[TextChatMessage], **kwargs) -> Tuple[List[TextChatMessage], dict]
Perform synchronous inference with automatic caching.
Parameters:
  • messages (List[TextChatMessage], required): List of chat messages. Each message is a dictionary with role and content keys
  • **kwargs (dict): Optional generation parameters that override defaults:
    • model (str): Override the model name
    • temperature (float): Override sampling temperature
    • seed (int): Override random seed
    • response_format (dict): Specify JSON output format
    • enable_thinking (bool): Enable thinking mode for Qwen3 models
Returns:
  • response (Tuple[str, dict, bool]): A tuple containing:
    • response_message (str): The LLM's generated response text
    • metadata (dict): Contains prompt (original input messages), response (generated text), prompt_tokens (number of tokens in the prompt), completion_tokens (number of tokens in the completion), and finish_reason (why generation stopped)
    • cache_hit (bool): Whether the response was retrieved from cache
Example:
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
]

response, metadata, cache_hit = llm.infer(messages)
print(f"Response: {response}")
print(f"Cache hit: {cache_hit}")
print(f"Tokens used: {metadata['prompt_tokens']} + {metadata['completion_tokens']}")
Location: src/remem/llm/openai_gpt.py:210

batch_infer

def batch_infer(
    self, messages_list: List[List[TextChatMessage]], max_workers: int = 10, **kwargs
) -> List[Tuple[List[TextChatMessage], dict, bool]]
Run inference on multiple inputs in parallel while preserving cache integrity.
Parameters:
  • messages_list (List[List[TextChatMessage]], required): A list of message sequences to send
  • max_workers (int, default: 10): Number of threads to use for parallel processing
  • **kwargs (dict): Additional parameters passed through to infer()
Returns:
  • results (List[Tuple[str, dict, bool]]): A list of (response, metadata, cache_hit) tuples in the same order as input
Example:
messages_batch = [
    [{"role": "user", "content": "Translate 'hello' to French"}],
    [{"role": "user", "content": "Translate 'goodbye' to Spanish"}],
    [{"role": "user", "content": "Translate 'thank you' to German"}]
]

results = llm.batch_infer(messages_batch, max_workers=3)

for i, (response, metadata, cache_hit) in enumerate(results):
    print(f"Request {i}: {response} (cached: {cache_hit})")
Location: src/remem/llm/openai_gpt.py:247

Caching Mechanism

The @cache_response decorator automatically caches responses based on:
  • Input messages
  • Model name
  • Seed value
  • Temperature
  • Response format
Cache Key Generation:
key_data = {
    "messages": messages,
    "model": model,
    "seed": seed,
    "temperature": temperature,
    "response_format": response_format,
}
key_hash = hashlib.sha256(json.dumps(key_data, sort_keys=True).encode()).hexdigest()
Location: src/remem/llm/openai_gpt.py:23
Features:
  • SQLite database for persistent caching
  • File-based locking for concurrent access
  • Automatic stale lock cleanup
  • Cache hit tracking in response metadata
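The deduplication mechanism can be sketched in a few lines. This is a simplified illustration, not the library's actual implementation: an in-memory SQLite table keyed by the SHA-256 hash of the request parameters, with a stand-in string in place of the real API call.

```python
import hashlib
import json
import sqlite3

def make_cache_key(messages, model, seed, temperature, response_format=None):
    # Same key scheme as above: hash the request parameters deterministically.
    key_data = {
        "messages": messages,
        "model": model,
        "seed": seed,
        "temperature": temperature,
        "response_format": response_format,
    }
    return hashlib.sha256(json.dumps(key_data, sort_keys=True).encode()).hexdigest()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cache (key TEXT PRIMARY KEY, response TEXT)")

def cached_call(messages, model="gpt-4o-mini", seed=0, temperature=0.0):
    key = make_cache_key(messages, model, seed, temperature)
    row = conn.execute("SELECT response FROM cache WHERE key = ?", (key,)).fetchone()
    if row is not None:
        return row[0], True                     # cache hit: skip the API call
    response = f"<LLM response for {key[:8]}>"  # stand-in for the real API call
    conn.execute("INSERT INTO cache VALUES (?, ?)", (key, response))
    return response, False                      # cache miss: stored for next time

msgs = [{"role": "user", "content": "What is 2 + 2?"}]
_, hit1 = cached_call(msgs)  # first call: miss
_, hit2 = cached_call(msgs)  # identical request: hit
```

Because the key includes the seed and temperature, changing any generation parameter produces a fresh cache entry rather than a stale hit.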

Azure OpenAI Support

To use Azure OpenAI, set the following environment variables and pass use_azure=True:
import os

os.environ["AZURE_OPENAI_API_KEY"] = "your-api-key"
os.environ["OPENAI_API_VERSION"] = "2024-02-15-preview"
os.environ["AZURE_OPENAI_ENDPOINT"] = "https://your-resource.openai.azure.com/"

llm = CacheOpenAI(
    cache_dir="./cache",
    llm_name="gpt-4",
    use_azure=True
)
Location: src/remem/llm/openai_gpt.py:178

Special Model Support

Qwen3 Thinking Mode

For Qwen3 models, you can enable thinking mode:
response, metadata, _ = llm.infer(
    messages,
    enable_thinking=True
)
Location: src/remem/llm/openai_gpt.py:218

JSON Output Mode

Request JSON-formatted responses:
response, metadata, _ = llm.infer(
    messages,
    response_format={"type": "json_object"}
)

import json
data = json.loads(response)
The implementation automatically strips markdown code fences if present.
Location: src/remem/llm/openai_gpt.py:238

Configuration Details

The _init_llm_config() method sets up default generation parameters:
{
    "llm_name": "gpt-4o-mini",
    "llm_base_url": "https://api.openai.com/v1",
    "generate_params": {
        "model": "gpt-4o-mini",
        "n": 1,
        "seed": 0,
        "temperature": 0.0
    }
}
Location: src/remem/llm/openai_gpt.py:196

Error Handling

The client includes automatic retry logic:
self.openai_client = OpenAI(
    base_url=self.llm_base_url,
    api_key=api_key,
    timeout=60,
    max_retries=5
)
Failed API calls are logged with error details.
Location: src/remem/llm/openai_gpt.py:194
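The general retry pattern can be illustrated with a self-contained sketch. This is an assumption about the shape of the behavior only; the OpenAI SDK handles retries internally via the max_retries argument, and with_retries and flaky below are hypothetical helpers.

```python
import time

def with_retries(fn, max_retries=5, base_delay=0.0):
    # Retry transient failures up to max_retries times with exponential backoff.
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            time.sleep(base_delay * 2 ** attempt)

calls = {"n": 0}

def flaky():
    # Simulated endpoint that fails twice before succeeding.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = with_retries(flaky)  # succeeds on the third attempt
```

In practice you rarely need your own wrapper; tuning the client's timeout and max_retries arguments covers most transient-failure scenarios.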
