Overview

The OpenIE class performs open information extraction by identifying named entities and extracting semantic triples (subject-predicate-object relationships) from text passages.

Class: OpenIE

Constructor

from remem.information_extraction.openie_openai import OpenIE
from remem.llm.openai_gpt import CacheOpenAI

llm_model = CacheOpenAI(model="gpt-4")
extractor = OpenIE(llm_model=llm_model)
Parameters

llm_model (CacheOpenAI, required)
The language model instance used for entity and triple extraction.

Methods

batch_openie()

Conduct batch OpenIE synchronously using multi-threading for both NER and triple extraction.
def batch_openie(
    self,
    chunks: Dict[str, ChunkInfo]
) -> Tuple[Dict[str, NerRawOutput], Dict[str, TripleRawOutput]]

Parameters

chunks (Dict[str, ChunkInfo], required)
Dictionary of chunks to process. Each key is a chunk ID (a hash of the chunk content), and each value contains:
  • num_tokens (int): Number of tokens in the chunk
  • content (str): The text content to extract from
  • chunk_order (List[Tuple]): Ordering information
  • full_doc_ids (List[str]): Associated document IDs
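
Since chunk IDs are described as hashes of the chunk, a chunks dictionary might be built along these lines. This is a hedged sketch: the specific hash function (MD5 here) and the literal field values are assumptions for illustration, not the library's actual hashing scheme.

```python
from hashlib import md5

# Build a ChunkInfo-style entry keyed by a content hash (hash choice is an assumption).
content = "Albert Einstein was born in Ulm, Germany."
chunk_id = md5(content.encode("utf-8")).hexdigest()

chunks = {
    chunk_id: {
        "num_tokens": 10,           # token count for the chunk
        "content": content,         # text to extract from
        "chunk_order": [(0, 1)],    # ordering information
        "full_doc_ids": ["doc_123"] # associated document IDs
    }
}
```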

Returns

A tuple containing two dictionaries:
  1. NER Results (Dict[str, NerRawOutput]): Named entity recognition results
    • chunk_id (str): The chunk identifier
    • response (str): Raw LLM response
    • unique_entities (List[str]): List of unique entities found
    • metadata (dict): Token usage and cache hit information
  2. Triple Results (Dict[str, TripleRawOutput]): Extracted knowledge graph triples
    • chunk_id (str): The chunk identifier
    • response (str): Raw LLM response
    • triples (List[Tuple]): List of (subject, predicate, object) triples
    • metadata (dict): Token usage and cache hit information
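
The field layout above can be modeled as simple dataclasses. This is a rough sketch derived only from the Returns description, not the library's actual class definitions:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class NerRawOutput:
    chunk_id: str                 # the chunk identifier
    response: str                 # raw LLM response
    unique_entities: List[str]    # unique entities found
    metadata: dict = field(default_factory=dict)  # token usage / cache hits

@dataclass
class TripleRawOutput:
    chunk_id: str
    response: str
    triples: List[Tuple[str, str, str]]  # (subject, predicate, object)
    metadata: dict = field(default_factory=dict)

ner = NerRawOutput("chunk_1", "{...}", ["Albert Einstein", "Ulm"], {"cache_hit": True})
triple = TripleRawOutput("chunk_1", "{...}", [("Albert Einstein", "was born in", "Ulm")])
```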

Example Usage

from remem.information_extraction.openie_openai import OpenIE, ChunkInfo
from remem.llm.openai_gpt import CacheOpenAI

# Initialize the extractor
llm_model = CacheOpenAI(model="gpt-4")
extractor = OpenIE(llm_model=llm_model)

# Prepare chunks
chunks = {
    "chunk_1": {
        "num_tokens": 150,
        "content": "Albert Einstein was born in Ulm, Germany. He developed the theory of relativity.",
        "chunk_order": [(0, 1)],
        "full_doc_ids": ["doc_123"]
    }
}

# Extract entities and triples
ner_results, triple_results = extractor.batch_openie(chunks)

# Access results
for chunk_id, ner in ner_results.items():
    print(f"Entities in {chunk_id}: {ner.unique_entities}")
    # Output: ["Albert Einstein", "Ulm", "Germany", "theory of relativity"]

for chunk_id, triples in triple_results.items():
    print(f"Triples in {chunk_id}:")
    for triple in triples.triples:
        print(f"  {triple}")
    # Output:
    #   ("Albert Einstein", "was born in", "Ulm")
    #   ("Albert Einstein", "was born in", "Germany")
    #   ("Albert Einstein", "developed", "theory of relativity")

Extraction Process

The OpenIE extraction follows a two-stage pipeline:
  1. Named Entity Recognition (NER)
    • Uses the ner prompt template
    • Extracts unique named entities from each chunk
    • Deduplicates entities while preserving order
    • Returns entities as a list of strings
  2. Triple Extraction
    • Uses the triple_extraction prompt template
    • Takes extracted entities as input
    • Identifies relationships between entities
    • Returns structured (subject, predicate, object) triples
    • Filters invalid triples automatically
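
The deduplication and filtering steps above can be sketched as plain helpers. These are illustrative stand-ins, not the library's internal functions; the exact validity checks applied to triples are assumptions:

```python
from typing import List, Tuple

def dedupe_preserving_order(entities: List[str]) -> List[str]:
    # Drop repeated entities while keeping first-seen order.
    seen = set()
    out = []
    for entity in entities:
        if entity not in seen:
            seen.add(entity)
            out.append(entity)
    return out

def filter_triples(raw: List[tuple]) -> List[Tuple[str, str, str]]:
    # Keep only well-formed (subject, predicate, object) triples
    # with three non-empty string fields.
    return [
        tuple(t) for t in raw
        if len(t) == 3 and all(isinstance(x, str) and x.strip() for x in t)
    ]

entities = dedupe_preserving_order(["Ulm", "Einstein", "Ulm"])
triples = filter_triples([("Einstein", "was born in", "Ulm"), ("malformed",)])
```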

Performance Considerations

  • Uses ThreadPoolExecutor for parallel processing of multiple chunks
  • Displays progress bars for both NER and triple extraction phases
  • Tracks token usage and cache hits via metadata
  • Handles malformed JSON responses with automatic fixing
  • Continues processing even if individual chunks fail
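
The multi-threaded fan-out described above follows the standard `concurrent.futures` pattern. A minimal sketch, where `extract_chunk` is a stand-in for the per-chunk LLM call (not a real library function):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def extract_chunk(chunk_id: str, content: str) -> tuple:
    # Stand-in for an LLM extraction call on a single chunk.
    return chunk_id, content.upper()

chunks = {"chunk_1": "first passage", "chunk_2": "second passage"}
results = {}

# Submit every chunk to the pool and collect results as they finish,
# so slow chunks do not block fast ones.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(extract_chunk, cid, text) for cid, text in chunks.items()]
    for future in as_completed(futures):
        cid, output = future.result()
        results[cid] = output
```

Thread-based parallelism is a good fit here because each worker spends most of its time waiting on network I/O to the LLM, not on CPU work.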

Error Handling

The extractor includes robust error handling:
  • Catches and logs exceptions during NER or triple extraction
  • Returns empty results with error metadata when extraction fails
  • Stores error messages in the metadata field of output objects
  • Attempts to fix broken JSON responses due to length limits
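
The catch-log-continue behavior above can be sketched as a wrapper around the extraction call. The helper names and result shape here are hypothetical, chosen only to illustrate the pattern:

```python
def safe_extract(chunk_id: str, content: str, extract) -> dict:
    try:
        entities = extract(content)
        return {"chunk_id": chunk_id, "unique_entities": entities, "metadata": {}}
    except Exception as exc:
        # Return an empty result with the error recorded in metadata,
        # instead of aborting the whole batch.
        return {"chunk_id": chunk_id, "unique_entities": [], "metadata": {"error": str(exc)}}

def flaky_extract(text: str):
    # Stand-in extractor that fails on empty input.
    if not text:
        raise ValueError("empty chunk")
    return [text.split()[0]]

ok = safe_extract("c1", "Einstein developed relativity", flaky_extract)
bad = safe_extract("c2", "", flaky_extract)
```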