EntityExtractor

Overview

The EntityExtractor class provides a unified interface for extracting entities from text using either cloud-based (Gemini) or local (Ollama) models. It eliminates code duplication by supporting all four entity types through a single generic implementation.

Class Definition

from src.engine.extractors import EntityExtractor

extractor = EntityExtractor(entity_type: str, domain: str = "guantanamo")

Constructor Parameters

entity_type

str

required

The type of entity to extract. Supported values:

"people" - Person entities
"organizations" - Organization entities
"locations" - Location entities
"events" - Event entities

domain

str

default:"guantanamo"

The domain configuration to use. Determines which prompts and schemas are loaded.

Raises

ValueError - If entity_type is not one of the supported values
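
The check behind this error could look roughly like the sketch below. It is an illustration, not the project's code: `SUPPORTED_ENTITY_TYPES` and `validate_entity_type` are hypothetical names, and the wording of the error message is an assumption.

```python
# Hypothetical constant: the four values listed under entity_type above.
SUPPORTED_ENTITY_TYPES = ("people", "organizations", "locations", "events")


def validate_entity_type(entity_type: str) -> str:
    """Return entity_type unchanged, or raise ValueError if it is unsupported."""
    if entity_type not in SUPPORTED_ENTITY_TYPES:
        raise ValueError(
            f"Unsupported entity_type {entity_type!r}; "
            f"expected one of: {', '.join(SUPPORTED_ENTITY_TYPES)}"
        )
    return entity_type
```

Failing at construction time means a typo such as "person" surfaces before any model call is made.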

Methods

extract_cloud()

Extract entities using cloud-based models (Gemini via Google AI).

entities = extractor.extract_cloud(
    text: str,
    model: str = CLOUD_MODEL,
    temperature: float = 0,
    repair_hint: Optional[str] = None
) -> List[Dict[str, Any]]

text

str

required

The article text to extract entities from

model

str

default:CLOUD_MODEL

The cloud model to use (defaults to gemini-2.0-flash-exp)

temperature

float

default:0

Sampling temperature for generation (0 = deterministic)

repair_hint

Optional[str]

default:None

Optional suffix appended to the system prompt on retry attempts. Used by QC retry logic to guide the model toward fixing specific issues (e.g., “avoid generic names”).

return

List[Dict[str, Any]]

List of extracted entities as dictionaries with type-specific fields
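
To make the repair_hint behavior concrete, here is a minimal sketch of appending a hint as a suffix to a system prompt. `build_system_prompt` is a hypothetical helper, and the blank-line separator is an assumption; the source only states that the hint is appended.

```python
from typing import Optional


def build_system_prompt(base_prompt: str, repair_hint: Optional[str] = None) -> str:
    """Append repair_hint to base_prompt as a suffix when a hint is given."""
    if not repair_hint:
        # First attempt: no hint, so the prompt is used unchanged.
        return base_prompt
    # Retry: separate the hint from the base prompt with a blank line (assumed).
    return f"{base_prompt.rstrip()}\n\n{repair_hint.strip()}"
```

Leaving the prompt untouched when no hint is passed keeps first-attempt calls identical to calls made before the retry logic existed.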

Example

from src.engine.extractors import EntityExtractor

extractor = EntityExtractor("people", domain="guantanamo")

text = """
Major General Geoffrey Miller was appointed commander of 
Joint Task Force Guantanamo in November 2002.
"""

people = extractor.extract_cloud(
    text=text,
    model="gemini-2.0-flash-exp",
    temperature=0
)

# Returns:
# [
#   {
#     "name": "Geoffrey Miller",
#     "alternative_names": ["Major General Geoffrey Miller"],
#     "aliases": []
#   }
# ]
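
Because the result is a plain list of dictionaries, downstream code can use it without any model classes. The sketch below gathers every known name for one person, assuming the field names shown in the example output above; `all_names` is a hypothetical helper.

```python
from typing import Any, Dict, Set


def all_names(person: Dict[str, Any]) -> Set[str]:
    """Collect the primary name plus any alternative names and aliases."""
    names = {person["name"]}
    # The optional list fields default to empty if the model omitted them.
    names.update(person.get("alternative_names", []))
    names.update(person.get("aliases", []))
    return names


person = {
    "name": "Geoffrey Miller",
    "alternative_names": ["Major General Geoffrey Miller"],
    "aliases": [],
}
print(sorted(all_names(person)))
# ['Geoffrey Miller', 'Major General Geoffrey Miller']
```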

extract_local()

Extract entities using local models (Ollama with structured output via Instructor).

entities = extractor.extract_local(
    text: str,
    model: str = OLLAMA_MODEL,
    temperature: float = 0,
    repair_hint: Optional[str] = None
) -> List[Dict[str, Any]]

text

str

required

The article text to extract entities from

model

str

default:OLLAMA_MODEL

The local model to use (defaults to llama3.3:70b-instruct-q8_0)

temperature

float

default:0

Sampling temperature for generation (0 = deterministic)

repair_hint

Optional[str]

default:None

Optional suffix appended to the system prompt on retry. See extract_cloud() for details.

return

List[Dict[str, Any]]

List of extracted entities as dictionaries

extract_local() uses the same List[Entity] response model as cloud extraction, so the prompts, which teach the model to emit bare JSON arrays, parse correctly in both modes.
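
To illustrate what "bare JSON array" means, the sketch below accepts a top-level array of objects and rejects a wrapped object such as {"people": [...]}. It uses only the standard library as a stand-in for the Pydantic/Instructor validation the extractor performs; `parse_entity_array` is a hypothetical name.

```python
import json
from typing import Any, Dict, List


def parse_entity_array(raw: str) -> List[Dict[str, Any]]:
    """Parse a model response that should be a bare JSON array of objects."""
    data = json.loads(raw)
    if not isinstance(data, list):
        # A wrapper object like {"people": [...]} is not a bare array.
        raise ValueError(f"expected a JSON array, got {type(data).__name__}")
    for item in data:
        if not isinstance(item, dict):
            raise ValueError("every array element must be a JSON object")
    return data
```

With a shared response shape, one prompt and one parser can serve both the cloud and the local code paths.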

extract()

Convenience method that routes to either extract_cloud() or extract_local() based on model type.

entities = extractor.extract(
    text: str,
    model_type: str = "gemini",
    model: Optional[str] = None,
    temperature: float = 0
) -> List[Dict[str, Any]]

text

str

required

The article text to extract entities from

model_type

str

default:"gemini"

Either "gemini" for cloud or "ollama" for local

model

Optional[str]

default:None

Specific model to use. If None, uses the appropriate default for the model type.

temperature

float

default:0

Sampling temperature for generation

return

List[Dict[str, Any]]

List of extracted entities as dictionaries

Example

extractor = EntityExtractor("organizations", domain="guantanamo")

# Cloud extraction
orgs_cloud = extractor.extract(
    text=article_text,
    model_type="gemini"
)

# Local extraction
orgs_local = extractor.extract(
    text=article_text,
    model_type="ollama"
)
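
The routing that extract() performs can be sketched as a small dispatch function. This is an illustration under stated assumptions: `route_extraction`, `cloud_fn`, and `local_fn` are hypothetical names, the two default model strings are the ones named in this document, and raising ValueError for an unknown model_type is an assumption (the source does not say what happens in that case).

```python
from typing import Any, Callable, Dict, List, Optional

CLOUD_MODEL = "gemini-2.0-flash-exp"         # cloud default named in this document
OLLAMA_MODEL = "llama3.3:70b-instruct-q8_0"  # local default named in this document

Entities = List[Dict[str, Any]]
ExtractFn = Callable[[str, str, float], Entities]


def route_extraction(
    text: str,
    model_type: str = "gemini",
    model: Optional[str] = None,
    temperature: float = 0,
    *,
    cloud_fn: ExtractFn,
    local_fn: ExtractFn,
) -> Entities:
    """Dispatch to the cloud or local backend, filling in the default model."""
    if model_type == "gemini":
        return cloud_fn(text, model or CLOUD_MODEL, temperature)
    if model_type == "ollama":
        return local_fn(text, model or OLLAMA_MODEL, temperature)
    # Assumed behavior: reject anything other than the two documented values.
    raise ValueError(f"Unknown model_type {model_type!r}; expected 'gemini' or 'ollama'")
```

Resolving the default inside the router is what lets callers pass model=None and still get the right model for each backend.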

Entity Type Configuration

Internally, the extractor maps each entity type to a getter function for its Pydantic model and to the name of its list attribute:

ENTITY_MODEL_GETTERS = {
    "people": get_person_model,
    "organizations": get_organization_model,
    "locations": get_location_model,
    "events": get_event_model,
}

ENTITY_LIST_ATTRIBUTES = {
    "people": "people",
    "organizations": "organizations",
    "locations": "locations",
    "events": "events",
}

Models are loaded dynamically from configs/<domain>/types/ using the domain configuration system.
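
Mapping to getter functions rather than to classes defers model loading until an entity type is actually requested. The sketch below shows that pattern with stand-ins only: the `Stub*` dataclasses and getters are hypothetical, and the real getters load Pydantic models from the domain configuration and may take different arguments.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class StubPerson:
    name: str
    aliases: List[str] = field(default_factory=list)


@dataclass
class StubOrganization:
    name: str


load_log: List[str] = []  # records which getters have actually run


def stub_person_getter() -> type:
    load_log.append("people")
    return StubPerson


def stub_organization_getter() -> type:
    load_log.append("organizations")
    return StubOrganization


# The registry stores functions, not classes, so nothing is loaded yet.
MODEL_GETTERS: Dict[str, Callable[[], type]] = {
    "people": stub_person_getter,
    "organizations": stub_organization_getter,
}


def resolve_model(entity_type: str) -> type:
    """Look up the getter for entity_type and call it to obtain the model class."""
    return MODEL_GETTERS[entity_type]()
```

Only the getter for the requested type runs, so building an extractor for "people" never touches the organization model.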

Convenience Factory Functions

For backward compatibility, entity-specific factory functions are provided:

from src.engine.extractors import (
    create_people_extractor,
    create_organizations_extractor,
    create_locations_extractor,
    create_events_extractor
)

people_extractor = create_people_extractor(domain="guantanamo")
org_extractor = create_organizations_extractor(domain="guantanamo")
loc_extractor = create_locations_extractor(domain="guantanamo")
event_extractor = create_events_extractor(domain="guantanamo")

Implementation Details

Each factory function simply returns EntityExtractor(entity_type, domain) with the appropriate type:

def create_people_extractor(domain: str = "guantanamo") -> EntityExtractor:
    return EntityExtractor("people", domain)

Integration with QC Retry

The repair_hint parameter supports a single retry when extraction quality control (QC) flags severe issues:

# First attempt
raw_entities = extractor.extract_cloud(text=article_text)
cleaned, qc_report = run_extraction_qc("people", raw_entities, domain)

# Retry with repair hint if QC flags issues
if should_retry_extraction(qc_report.flags):
    hint = build_repair_hint("people", qc_report.flags)
    raw_entities_v2 = extractor.extract_cloud(
        text=article_text,
        repair_hint=hint  # Guides model to fix specific issues
    )

See ArticleProcessor for the full retry orchestration logic.
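
The retry flow above can be wrapped as one function that takes the extraction and QC steps as callables. This is a sketch, not the ArticleProcessor logic: `extract_with_single_retry` and its parameter names are hypothetical, QC is simplified to return a list of flags, and keeping the retry result unconditionally is an assumption.

```python
from typing import Any, Callable, Dict, List, Optional, Sequence, Tuple

Entities = List[Dict[str, Any]]


def extract_with_single_retry(
    text: str,
    extract_fn: Callable[[str, Optional[str]], Entities],
    qc_fn: Callable[[Entities], Tuple[Entities, Sequence[str]]],
    should_retry_fn: Callable[[Sequence[str]], bool],
    build_hint_fn: Callable[[Sequence[str]], str],
) -> Tuple[Entities, bool]:
    """Run extraction once, then retry at most once if QC flags warrant it.

    Returns the cleaned entities and whether a retry happened.
    """
    # First attempt: no repair hint.
    cleaned, flags = qc_fn(extract_fn(text, None))
    if not should_retry_fn(flags):
        return cleaned, False
    # Single retry: turn the QC flags into a hint and extract again.
    hint = build_hint_fn(flags)
    cleaned_retry, _ = qc_fn(extract_fn(text, hint))
    # Assumption: the retry result replaces the first attempt outright.
    return cleaned_retry, True
```

Capping the loop at one retry bounds the cost per article at two model calls, however poor the first result is.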

Source Location

src/engine/extractors.py
