
Overview

Indexing is the first step in using REMem. The index() method processes your documents, extracts structured information (entities, facts, and relationships), creates embeddings, and builds a knowledge graph for efficient retrieval.

Basic Usage

1. Initialize ReMem

Create a ReMem instance with your configuration:
from remem import ReMem
from remem.utils.config_utils import BaseConfig

config = BaseConfig(
    llm_name="gpt-4o-mini",
    embedding_model_name="nvidia/NV-Embed-v2",
    dataset="my_dataset"
)

remem = ReMem(global_config=config, working_dir="./remem_data")

2. Prepare Documents

Format your documents as a list of strings:
docs = [
    "Machine learning is a subset of artificial intelligence.",
    "Neural networks are inspired by biological neurons.",
    "Deep learning uses multiple layers of neural networks."
]

3. Index Documents

Call the index() method to process and store your documents:
remem.index(docs)
This performs:
  • Document chunking (if needed)
  • Information extraction (entities, facts)
  • Embedding generation
  • Knowledge graph construction

Complete Example from main.py

Here’s a real-world example from the REMem codebase:
main.py
import json
from remem import ReMem
from remem.utils.config_utils import BaseConfig

# Load corpus
corpus_path = "reproduce/dataset/musique_corpus.json"
with open(corpus_path, "r") as f:
    corpus = json.load(f)

# Format documents
docs = [f"{doc['title']}\n{doc['text']}" for doc in corpus]

# Configure ReMem
config = BaseConfig(
    llm_name="gpt-4o-mini",
    embedding_model_name="nvidia/NV-Embed-v2",
    dataset="musique",
    force_index_from_scratch=False,  # Reuse existing index if available
    retrieval_top_k=200,
    qa_top_k=5
)

remem = ReMem(global_config=config, working_dir="./outputs/musique")

# Index documents
remem.index(docs)

Extraction Methods

REMem supports multiple information extraction strategies via extract_method:

OpenIE (Default)

Extracts subject-predicate-object triples:
config = BaseConfig(
    extract_method="openie",  # Default
    llm_name="gpt-4o-mini"
)

Episodic

Optimized for conversational/temporal data:
config = BaseConfig(
    extract_method="episodic",
    llm_name="gpt-4o-mini"
)

Episodic Gist

Extracts gists, facts, entities, and metadata:
config = BaseConfig(
    extract_method="episodic_gist",
    llm_name="gpt-4o-mini"
)

Temporal

Captures time-sensitive relationships:
config = BaseConfig(
    extract_method="temporal",
    llm_name="gpt-4o-mini"
)

Incremental Indexing

By default, REMem indexes incrementally: calling index() again adds the new documents to the existing graph:
# First batch
docs_batch_1 = ["Document 1", "Document 2"]
remem.index(docs_batch_1)

# Second batch (incremental)
docs_batch_2 = ["Document 3", "Document 4"]
remem.index(docs_batch_2)  # Adds to existing graph
Set force_index_from_scratch=True in BaseConfig to rebuild the entire index from scratch. This will ignore any existing embeddings and graph data.
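A minimal sketch of that rebuild option, reusing the config fields shown earlier on this page (dataset and model names here are placeholders):

```python
from remem import ReMem
from remem.utils.config_utils import BaseConfig

# Discard any cached embeddings and graph data, then rebuild from scratch
config = BaseConfig(
    llm_name="gpt-4o-mini",
    embedding_model_name="nvidia/NV-Embed-v2",
    dataset="my_dataset",
    force_index_from_scratch=True,
)

remem = ReMem(global_config=config, working_dir="./remem_data")
remem.index(docs_batch_1 + docs_batch_2)  # re-processes every document
```

Note that a full rebuild re-runs extraction and embedding for all documents, so prefer incremental indexing unless the corpus or config has changed.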

Indexing Performance

Online vs Offline Mode

Control LLM inference mode with llm_infer_mode:
# Online mode (default) - real-time API calls
config = BaseConfig(
    llm_infer_mode="online",
    llm_name="gpt-4o-mini"
)

# Offline mode - batch processing with vLLM
config = BaseConfig(
    llm_infer_mode="offline",
    llm_name="meta-llama/Llama-3.3-70B-Instruct",
    vllm_tensor_parallel_size=2
)

Batch Sizes

Adjust embedding batch size for optimal performance:
config = BaseConfig(
    embedding_batch_size=16,  # Default: 16
    embedding_model_name="nvidia/NV-Embed-v2"
)

Chunking Configuration

Control document preprocessing with chunking parameters:
config = BaseConfig(
    preprocess_chunk_max_token_size=512,  # Max tokens per chunk
    preprocess_chunk_overlap_token_size=128,  # Overlap between chunks
    preprocess_encoder_name="gpt-4o"  # Tokenizer for chunking
)
If preprocess_chunk_max_token_size=None, documents are treated as single chunks without splitting.
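For example, a sketch of a config that disables splitting entirely (same parameter as above):

```python
from remem.utils.config_utils import BaseConfig

# With no max chunk size, each input document is indexed as a single chunk
config = BaseConfig(
    preprocess_chunk_max_token_size=None,
)
```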

Graph Construction Parameters

Customize graph building behavior:
config = BaseConfig(
    # Synonymy edge construction
    synonymy_edge_topk=2047,  # Number of neighbors for synonym detection
    synonymy_edge_sim_threshold=0.8,  # Similarity threshold for synonyms
    
    # Graph type
    is_directed_graph=False,  # Use undirected graph (default)
    graph_type="facts_and_sim_passage_node_unidirectional"
)

Saving OpenIE Results

Persist extracted information for analysis or reuse:
config = BaseConfig(
    save_openie=True,  # Default: saves extraction results
    force_openie_from_scratch=False  # Reuse existing OpenIE results
)

What Gets Created

After indexing, REMem creates the following in your working_dir:
  • chunk_embeddings/ - Embeddings for document chunks
  • entity_embeddings/ - Embeddings for extracted entities
  • fact_embeddings/ - Embeddings for extracted facts
  • graph.pkl - Knowledge graph structure
  • openie_results_*.json - Extracted entities and relationships (if save_openie=True)
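As a quick sanity check after indexing, you can look for these artifacts on disk. The helper below is not part of REMem; it is a small sketch that only assumes the artifact names listed above (the openie_results_*.json file is omitted because its exact name varies):

```python
from pathlib import Path

# Artifacts REMem is expected to create under working_dir after indexing,
# per the list above (fixed names only).
EXPECTED_ARTIFACTS = [
    "chunk_embeddings",
    "entity_embeddings",
    "fact_embeddings",
    "graph.pkl",
]

def missing_artifacts(working_dir):
    """Return the expected artifact names not yet present in working_dir."""
    root = Path(working_dir)
    return [name for name in EXPECTED_ARTIFACTS if not (root / name).exists()]
```

If missing_artifacts("./remem_data") returns an empty list, indexing completed and persisted its outputs.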

Next Steps

Retrieval

Learn how to retrieve relevant passages

Configuration

Explore all configuration options
