
Overview

The BaseConfig dataclass is the central configuration object for the ReMem framework. It controls all aspects of the system, including LLM settings, embedding models, graph construction, retrieval, and evaluation parameters.

Constructor

@dataclass
class BaseConfig:
    # All parameters have sensible defaults

All parameters are optional with default values. You can override any subset of parameters.

Example

from remem.utils.config_utils import BaseConfig

# Use all defaults
config = BaseConfig()

# Override specific parameters
config = BaseConfig(
    llm_name="gpt-4o",
    embedding_model_name="nvidia/NV-Embed-v2",
    retrieval_top_k=20,
    qa_top_k=5
)

LLM Parameters

Configuration for language model behavior and API settings.

llm_name

llm_name
str
default:"gpt-4o-mini"
Name of the language model to use for general inference.
config = BaseConfig(llm_name="gpt-4o")

extract_llm_label

extract_llm_label
Optional[str]
default:"None"
Label for the LLM used in information extraction. Defaults to llm_name if not specified.

qa_llm_label

qa_llm_label
Optional[str]
default:"None"
Label for the LLM used in question answering. Defaults to llm_name if not specified.
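Because both labels fall back to llm_name, you only need to set them when extraction and QA should use different models. A minimal sketch:

```python
from remem.utils.config_utils import BaseConfig

# Use a cheaper model for bulk extraction and a stronger one for QA
config = BaseConfig(
    llm_name="gpt-4o-mini",           # fallback for any unset label
    extract_llm_label="gpt-4o-mini",  # information extraction
    qa_llm_label="gpt-4o",            # question answering
)
```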

llm_base_url

llm_base_url
Optional[str]
default:"None"
Base URL for the LLM API. If None, uses the default OpenAI service.
config = BaseConfig(
    llm_name="llama-3-70b",
    llm_base_url="http://localhost:8000/v1"
)

max_new_tokens

max_new_tokens
int
default:"2048"
Maximum number of new tokens to generate in each inference call.

num_gen_choices

num_gen_choices
int
default:"1"
Number of chat completion choices to generate for each input message.

seed

seed
Optional[int]
default:"None"
Random seed for reproducibility.

temperature

temperature
float
default:"0"
Sampling temperature for LLM generation. 0 means deterministic.

extract_format

extract_format
Optional[Literal["json_object", "json_schema"]]
default:"None"
Response format specification for extraction tasks.

use_azure

use_azure
bool
default:"False"
Whether to use Azure OpenAI service instead of standard OpenAI.

max_num_seqs

max_num_seqs
int
default:"256"
Maximum number of sequences to generate for vLLM offline mode.

max_model_len

max_model_len
int
default:"4096"
Maximum context length (in tokens) for the model.

max_retries

max_retries
int
default:"10"
Maximum number of retry attempts for asynchronous API calls.

Storage & Indexing Parameters

force_openie_from_scratch

force_openie_from_scratch
bool
default:"False"
If True, ignores existing OpenIE results and rebuilds from scratch.

force_index_from_scratch

force_index_from_scratch
bool
default:"False"
If True, ignores all existing storage files and graph data and rebuilds from scratch.
Setting this to True will delete all previously indexed data and embeddings.

save_openie

save_openie
bool
default:"True"
Whether to save OpenIE extraction results to disk.

save_dir

save_dir
Optional[str]
default:"None"
Directory to save all related files. If not specified:
  • For dataset-specific runs: outputs/{dataset}/
  • For general use: outputs/
config = BaseConfig(save_dir="/path/to/my/output")

Text Preprocessing Parameters

text_preprocessor_class_name

text_preprocessor_class_name
str
default:"TextPreprocessor"
Name of the text preprocessor class to use.

preprocess_encoder_name

preprocess_encoder_name
str
default:"gpt-4o"
Name of the encoder for tokenization during document chunking.

preprocess_chunk_overlap_token_size

preprocess_chunk_overlap_token_size
int
default:"128"
Number of overlapping tokens between consecutive chunks.

preprocess_chunk_max_token_size

preprocess_chunk_max_token_size
Optional[int]
default:"None"
Maximum token size for each chunk. If None, treats the entire document as a single chunk.
config = BaseConfig(
    preprocess_chunk_max_token_size=512,
    preprocess_chunk_overlap_token_size=64
)

preprocess_chunk_func

preprocess_chunk_func
str
default:"by_token"
Chunking function to use for document preprocessing.

Information Extraction Parameters

extract_method

extract_method
Literal["openie", "episodic", "episodic_gist", "temporal"]
default:"openie"
Information extraction method:
  • "openie": Standard open information extraction (entities and triples)
  • "episodic": Episodic memory extraction for conversations
  • "episodic_gist": Episodic extraction with gist summaries
  • "temporal": Temporal event extraction
# For conversation data
config = BaseConfig(extract_method="episodic_gist")

# For standard documents
config = BaseConfig(extract_method="openie")

llm_infer_mode

llm_infer_mode
Literal["offline", "online"]
default:"online"
Inference mode for LLM calls:
  • "online": Use API-based LLM calls
  • "offline": Use vLLM for local batch inference

skip_graph

skip_graph
bool
default:"False"
Whether to skip graph construction. Set to True when running vLLM offline indexing for the first time.

vllm_tensor_parallel_size

vllm_tensor_parallel_size
int
default:"2"
Tensor parallel size for vLLM offline mode.
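Putting the vLLM-related parameters together: per the skip_graph note above, a first offline pass typically runs extraction only, deferring graph construction to a later pass. A sketch, where the model name is a hypothetical locally served model:

```python
from remem.utils.config_utils import BaseConfig

# Pass 1: offline extraction with vLLM; defer graph construction
config = BaseConfig(
    llm_name="meta-llama/Llama-3.1-70B-Instruct",  # hypothetical local model
    llm_infer_mode="offline",
    skip_graph=True,               # build the graph in a later pass
    max_num_seqs=256,              # vLLM batch width
    max_model_len=4096,            # context length in tokens
    vllm_tensor_parallel_size=2,   # split the model across 2 GPUs
)
```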

Embedding Parameters

embedding_model_name

embedding_model_name
str
default:"nvidia/NV-Embed-v2"
Name of the embedding model to use.
config = BaseConfig(embedding_model_name="text-embedding-3-large")

embedding_batch_size

embedding_batch_size
int
default:"16"
Batch size for embedding model calls.

embedding_return_as_normalized

embedding_return_as_normalized
bool
default:"True"
Whether to normalize encoded embeddings.

embedding_max_seq_len

embedding_max_seq_len
int
default:"2048"
Maximum sequence length for the embedding model.

Graph Construction Parameters

concatenate_gists_per_chunk

concatenate_gists_per_chunk
bool
default:"False"
For episodic_gist method:
  • False: Each gist becomes a separate node
  • True: All gists in a chunk are joined into one node

split_verbatim_per_chunk

split_verbatim_per_chunk
bool
default:"True"
For conversation data:
  • True: Split multi-message conversations into individual message nodes
  • False: Keep each verbatim chunk as a single node

synonymy_edge_topk

synonymy_edge_topk
int
default:"2047"
Number of nearest neighbors (k) for KNN retrieval when building synonymy edges between entities.

synonymy_edge_query_batch_size

synonymy_edge_query_batch_size
int
default:"1000"
Batch size for query embeddings during synonymy edge construction.

synonymy_edge_key_batch_size

synonymy_edge_key_batch_size
int
default:"10000"
Batch size for key embeddings during synonymy edge construction.

synonymy_edge_sim_threshold

synonymy_edge_sim_threshold
float
default:"0.8"
Similarity threshold (0-1) for including candidate synonymy edges.
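The four synonymy parameters trade edge recall against index size: a higher similarity threshold and a lower top-k produce fewer, higher-precision synonymy edges. A sketch of a stricter-than-default setup:

```python
from remem.utils.config_utils import BaseConfig

# Stricter synonymy edges: fewer neighbor candidates, higher similarity bar
config = BaseConfig(
    synonymy_edge_topk=100,           # consider fewer nearest neighbors per entity
    synonymy_edge_sim_threshold=0.9,  # keep only near-duplicate entity pairs
    synonymy_edge_query_batch_size=1000,
    synonymy_edge_key_batch_size=10000,
)
```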

is_directed_graph

is_directed_graph
bool
default:"False"
Whether to construct a directed or undirected knowledge graph.

graph_type

graph_type
Literal["dpr_only", "facts_and_sim", "facts_and_sim_passage_node_unidirectional"]
default:"facts_and_sim_passage_node_unidirectional"
Type of graph to construct:
  • "dpr_only": Dense passage retrieval only (no graph)
  • "facts_and_sim": Graph with facts and similarity edges
  • "facts_and_sim_passage_node_unidirectional": Facts, similarities, and unidirectional passage edges

Retrieval Parameters

linking_top_k

linking_top_k
int
default:"5"
Number of linked nodes to consider at each retrieval step.

retrieval_top_k

retrieval_top_k
int
default:"200"
Number of documents to retrieve for each query.
config = BaseConfig(
    retrieval_top_k=100,  # Retrieve 100 documents
    qa_top_k=5            # Use top 5 for QA
)

damping

damping
float
default:"0.5"
Damping factor for Personalized PageRank algorithm.

passage_node_weight

passage_node_weight
float
default:"0.05"
Multiplicative weight factor for passage nodes in PageRank.
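damping and passage_node_weight jointly shape the Personalized PageRank scores: in standard PPR, lower damping keeps probability mass closer to the query's seed nodes, and a small passage weight keeps passage nodes from dominating entity nodes. For example:

```python
from remem.utils.config_utils import BaseConfig

# Keep PageRank mass close to the query's seed nodes
config = BaseConfig(
    damping=0.5,               # lower values stay nearer the seeds
    passage_node_weight=0.05,  # down-weight passage nodes vs. entity nodes
    linking_top_k=5,           # linked nodes considered per step
    retrieval_top_k=200,       # documents returned per query
)
```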

rerank_dspy_file_path

rerank_dspy_file_path
Optional[str]
default:"None"
Path to a DSPy reranker model file for fact filtering.

Question Answering Parameters

qa_top_k

qa_top_k
int
default:"5"
Number of top-ranked documents to feed to the QA model.

qa_passage_prefix

qa_passage_prefix
str
default:"Wikipedia Title: "
Prefix to add before each passage in the QA context.

qa_prompt_template

qa_prompt_template
Optional[str]
default:"None"
Name of the prompt template to use for QA tasks.

qa_reader

qa_reader
Literal["remem", "tiser"]
default:"remem"
QA reader implementation to use.

Agent Parameters

For episodic and temporal extraction methods with agentic reasoning.

agent_fixed_tools

agent_fixed_tools
bool
default:"False"
Controls the agent's toolset:
  • True: Agent uses only semantic_retrieve + output_answer
  • False: Agent can select from the full toolset

agent_max_steps

agent_max_steps
int
default:"5"
Maximum reasoning steps for the agent. For fixed_tools mode:
  • 1 = semantic_retrieve only
  • 2 = semantic_retrieve + output_answer

agent_fixed_retrieval_tool

agent_fixed_retrieval_tool
str
default:"semantic_retrieve"
Which retrieval tool to use in fixed_tools mode: "semantic_retrieve" or "lexical_retrieve".
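The three agent parameters combine as follows in fixed-tools mode, where two steps cover one retrieval call plus the final answer:

```python
from remem.utils.config_utils import BaseConfig

# Restrict the agent to a single retrieve-then-answer loop
config = BaseConfig(
    extract_method="episodic",                     # agentic methods only
    agent_fixed_tools=True,                        # no free tool selection
    agent_max_steps=2,                             # retrieve, then answer
    agent_fixed_retrieval_tool="lexical_retrieve", # swap in lexical retrieval
)
```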

Evaluation Parameters

do_eval_retrieval

do_eval_retrieval
bool
default:"True"
Whether to perform evaluation on retrieval results.

do_eval_qa

do_eval_qa
bool
default:"True"
Whether to perform evaluation on QA results.
Evaluation requires gold-standard data (gold_docs for retrieval, gold_answers for QA).

Dataset Parameters

dataset

dataset
Optional[str]
default:"None"
Name of the dataset being used. If specified, customizes the save directory and potentially the prompt templates.
config = BaseConfig(dataset="musique")
# save_dir will be: outputs/musique/

corpus_len

corpus_len
Optional[int]
default:"None"
Length of the corpus to use (for testing with subsets).
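corpus_len is useful for smoke tests on a small slice of a dataset, often combined with a cheaper LLM. A sketch:

```python
from remem.utils.config_utils import BaseConfig

# Quick smoke test on the first 100 corpus documents
config = BaseConfig(
    dataset="musique",       # save_dir becomes outputs/musique/
    corpus_len=100,          # index only a subset of the corpus
    llm_name="gpt-4o-mini",  # cheaper model for iteration
)
```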

Complete Configuration Example

from remem.utils.config_utils import BaseConfig

# Production configuration for conversation data
config = BaseConfig(
    # LLM settings
    llm_name="gpt-4o",
    temperature=0,
    max_new_tokens=2048,
    
    # Extraction settings
    extract_method="episodic_gist",
    llm_infer_mode="online",
    
    # Embedding settings
    embedding_model_name="nvidia/NV-Embed-v2",
    embedding_batch_size=32,
    
    # Preprocessing
    preprocess_chunk_max_token_size=512,
    preprocess_chunk_overlap_token_size=64,
    
    # Graph construction
    synonymy_edge_sim_threshold=0.85,
    is_directed_graph=False,
    concatenate_gists_per_chunk=True,
    split_verbatim_per_chunk=True,
    
    # Retrieval
    retrieval_top_k=100,
    linking_top_k=10,
    damping=0.5,
    passage_node_weight=0.05,
    
    # QA
    qa_top_k=5,
    qa_passage_prefix="Context: ",
    
    # Agent (for episodic methods)
    agent_fixed_tools=False,
    agent_max_steps=5,
    
    # Evaluation
    do_eval_retrieval=True,
    do_eval_qa=True,
    
    # Storage
    save_dir="outputs/my_experiment",
    force_index_from_scratch=False
)

print(config.save_dir)  # outputs/my_experiment

Configuration Best Practices

For Document QA

BaseConfig(
    extract_method="openie",
    preprocess_chunk_max_token_size=512,
    retrieval_top_k=200,
    qa_top_k=5
)

For Conversations

BaseConfig(
    extract_method="episodic_gist",
    split_verbatim_per_chunk=True,
    concatenate_gists_per_chunk=True
)

For Fast Experimentation

BaseConfig(
    llm_name="gpt-4o-mini",
    embedding_batch_size=64,
    retrieval_top_k=50
)

For Production

BaseConfig(
    llm_name="gpt-4o",
    max_retries=10,
    do_eval_retrieval=True,
    do_eval_qa=True,
    save_openie=True
)
