For organizations and locations, the variant-collapsing step detects when extracted entities refer to the same real-world entity via:
- Acronym matching (“DoD” ↔ “Department of Defense”)
- Substring containment
- Equivalence groups from config.yaml
From src/utils/quality_controls.py:127-215:
```python
def _collapse_within_article_variants(
    *,
    entity_type: str,
    entities: List[Dict[str, Any]],
    domain: str = "guantanamo",
) -> Tuple[List[Dict[str, Any]], int]:
    """Collapse name variants within a single article's extraction results.

    For organizations and locations, detects when two extracted entities
    likely refer to the same real-world entity (via acronym matching,
    substring containment, or configured equivalence groups) and merges
    them — keeping the more canonical name (proper nouns over descriptive
    phrases) and adding the other to an ``aliases`` list.
    """
```
Example:
```
Extracted: ["DoD", "Department of Defense", "Pentagon"]
Collapsed: ["Department of Defense" {aliases: ["DoD", "Pentagon"]}]
```
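The acronym and substring checks behind this collapse can be sketched as follows. This is a minimal illustration: `acronym_of` and `likely_same_entity` are hypothetical helper names, not the internals of `_collapse_within_article_variants`, and the configured equivalence groups (which would link "Pentagon" to "Department of Defense") are omitted.

```python
# Words that contribute a lowercase letter to an acronym, so that
# "Department of Defense" yields "DoD" rather than "DOD".
STOPWORDS = {"of", "the", "and", "for"}

def acronym_of(name: str) -> str:
    """Build an acronym from a multi-word name; single words have none."""
    parts = name.split()
    if len(parts) < 2:
        return ""
    return "".join(
        w[0].lower() if w.lower() in STOPWORDS else w[0].upper() for w in parts
    )

def likely_same_entity(a: str, b: str) -> bool:
    # Acronym match in either direction, or substring containment.
    return (
        a == acronym_of(b)
        or b == acronym_of(a)
        or a.lower() in b.lower()
        or b.lower() in a.lower()
    )
```

Under these rules, "DoD" matches "Department of Defense" via the acronym check; "Pentagon" would only merge via an equivalence group from config.yaml.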
```python
class ExtractionQCReport(BaseModel):
    """Summary of what extraction QC did to a batch of entities."""

    input_count: int = 0
    dropped_missing_required: int = 0
    deduped: int = 0
    output_count: int = 0
    flags: List[str] = Field(default_factory=list)
    fixes: Dict[str, Any] = Field(default_factory=dict)
```
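How the counts fit together, as a hypothetical illustration: entities dropped for missing fields or deduplicated are subtracted from the input to give the output. The 50% threshold for `high_drop_rate` is an assumption for this sketch, not taken from the source.

```python
# Counts mirroring the repair-hint example below: 12 in, 8 dropped, 2 deduped.
input_count, dropped_missing_required, deduped = 12, 8, 2
output_count = input_count - dropped_missing_required - deduped

# Assumed flagging rule: more than half the entities dropped is severe.
flags = ["high_drop_rate"] if dropped_missing_required / input_count > 0.5 else []
print(output_count, flags)  # 2 ['high_drop_rate']
```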
From test coverage in tests/test_extraction_retry.py, the retry flow:
1. Initial extraction → QC detects severe issue
2. Build repair hint describing what went wrong
3. Retry extraction with augmented prompt including hint
4. QC again → if still severe, accept the best attempt
Example repair hint:
```
Previous extraction had issues:
- High drop rate: 8/12 entities missing required fields
- Many duplicates: 6/12 entities were duplicates

Please:
- Ensure all required fields are populated
- Avoid extracting the same entity multiple times with slight variations
```
Retry happens at most once to avoid infinite loops. After retry, even if QC still fails, the best available result is used.
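The single-retry flow can be sketched as a toy model. `run_qc`, `extract_with_retry`, and the "keep whichever attempt yields more entities" tiebreak are assumptions for illustration, not the real `article_processor` code.

```python
def run_qc(entities):
    # Toy QC: flags a severe issue when nothing survived extraction.
    flags = ["zero_entities"] if not entities else []
    return entities, flags

def extract_with_retry(extract_fn):
    """Run extraction, retrying at most once with a repair hint."""
    entities = extract_fn(hint=None)
    entities, flags = run_qc(entities)
    if flags:
        # Build a repair hint from the QC flags and retry exactly once.
        hint = "Previous extraction had issues: " + ", ".join(flags)
        retried = extract_fn(hint=hint)
        retried, _ = run_qc(retried)
        # Even if QC still flags issues, keep the best available attempt.
        if len(retried) >= len(entities):
            entities = retried
    return entities

# First attempt returns nothing; the retry succeeds.
attempts = iter([[], [{"name": "DoD"}]])
result = extract_with_retry(lambda hint: next(attempts))
print(result)  # [{'name': 'DoD'}]
```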
```python
def filter_entities_by_article_relevance(
    *,
    entity_type: str,
    entities: List[Dict[str, Any]],
    article_text: str,
    domain: str = "guantanamo",
    require_mention: bool = True,
) -> Tuple[List[Dict[str, Any]], EntityRelevanceReport]:
    """Filter out entities whose names don't appear in the source article text.

    For each entity, builds a set of "needles" from its name, aliases,
    computed acronyms, and domain equivalence group variants. If none of
    those needles appear in the article text, the entity is dropped as
    likely hallucinated.
    """
```
```python
for needle in needles:
    needle_lower = needle.lower()
    if len(needle_lower) <= 3:
        # Short needles: use word boundary to avoid matching inside words
        pattern = r"\b" + re.escape(needle_lower) + r"\b"
        if re.search(pattern, article_lower):
            found = True
            break
    else:
        if needle_lower in article_lower:
            found = True
            break
```
Example: “CIA” must match as a full word, not inside “appreciation”.
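The matching rule can be exercised in isolation. `mentioned` is a hypothetical wrapper around the loop body above, kept just for demonstration:

```python
import re

def mentioned(needle: str, text: str) -> bool:
    """Short needles (<= 3 chars) must match as whole words; longer ones
    may match anywhere as substrings."""
    needle_l, text_l = needle.lower(), text.lower()
    if len(needle_l) <= 3:
        return re.search(r"\b" + re.escape(needle_l) + r"\b", text_l) is not None
    return needle_l in text_l

print(mentioned("CIA", "A CIA report was released."))       # True
print(mentioned("CIA", "We appreciate your appreciation."))  # False
```

Without the word-boundary guard, the substring "cia" inside "appreciation" would count as a mention and the entity would survive the relevance filter.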
```python
def verify_profile_grounding(
    *,
    profile_text: str,
    article_texts: Dict[str, str],
    model_type: str = "gemini",
    max_article_chars: int = 12000,
    max_claim_chars: int = 600,
    min_grounding_score: float = 0.7,
) -> GroundingReport:
    """Verify that profile claims are supported by their cited sources.

    Extracts citations from profile text, groups them by article, and uses
    an LLM call per article to verify each claim. Returns a GroundingReport
    with per-claim details and summary statistics.
    """
```
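The summary score compared against `min_grounding_score` might be computed as the fraction of claims the LLM verified. The exact scoring rule is an assumption for this sketch; the real implementation returns a full `GroundingReport` rather than a bare float.

```python
def grounding_score(claim_results):
    """claim_results: list of (claim_text, supported_bool) pairs, one per
    cited claim. Score is the supported fraction; an empty profile passes."""
    if not claim_results:
        return 1.0
    return sum(ok for _, ok in claim_results) / len(claim_results)

score = grounding_score([
    ("detained in 2002", True),
    ("released in 2008", True),
    ("worked at the Pentagon", False),
])
print(round(score, 3))  # 0.667 — below min_grounding_score=0.7, so flagged
```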
QC is invoked from src/engine/article_processor.py during extraction:
```python
# After LLM extraction
for entity_type in ["people", "organizations", "locations", "events"]:
    entities, qc_report = run_extraction_qc(
        entity_type=entity_type,
        entities=raw_entities,
        domain=self.domain,
    )
    # Check for severe issues
    if "high_drop_rate" in qc_report.flags or "zero_entities" in qc_report.flags:
        # Trigger retry with repair hint
        entities = self._retry_extraction(entity_type, article, qc_report)
```