
Overview

The quality controls module provides deterministic validation and filtering functions that run after LLM generation. These checks enforce required fields, normalize names, detect duplicates, flag suspicious outputs, and verify profile grounding.

Key Principles:
  • Deterministic: All checks are rule-based (no LLM calls)
  • Non-destructive: QC functions never raise exceptions, only drop invalid items and report
  • Actionable: Reports include flags and fix descriptions for debugging
Source: src/utils/quality_controls.py

Extraction QC

run_extraction_qc

from src.utils.quality_controls import run_extraction_qc

cleaned, report = run_extraction_qc(
    entity_type="people",
    entities=[{"name": "Abdul Rahman Ahmed", "nationality": "Yemeni"}],
    domain="guantanamo",
    min_name_len=2
)
Run deterministic QC on a batch of extracted entities.

Parameters:
  • entity_type (str, required): Entity type: "people", "organizations", "locations", or "events".
  • entities (List[Dict[str, Any]], required): List of entity dictionaries from LLM extraction.
  • domain (str, default "guantanamo"): Domain name for loading schema and equivalence groups.
  • min_name_len (int, default 2): Minimum name length. Names shorter than this are flagged (but not dropped).

Returns:
  • cleaned_entities (List[Dict[str, Any]]): Validated and deduplicated entities.
  • report (ExtractionQCReport): Summary of QC actions (see below).

ExtractionQCReport

class ExtractionQCReport(BaseModel):
    input_count: int
    dropped_missing_required: int
    deduped: int
    output_count: int
    flags: List[str]
    fixes: Dict[str, Any]

QC Pipeline Steps

The extraction QC pipeline runs these checks in order:
Step 1: Required Field Check

Drops entities missing required fields (derived from the Pydantic schema). Required fields by type:
  • people: name
  • organizations: name
  • locations: name
  • events: title, description, event_type, start_date
Step 2: Name Normalization

Normalizes names/titles:
  • Strip leading/trailing whitespace
  • Collapse multiple whitespace to single space
  • Unicode NFC normalization
Step 3: Short Name Flagging

Flags (but does not drop) names shorter than min_name_len.
Step 4: Exact Deduplication

Removes exact duplicates within the same article based on:
  • people: (name,)
  • organizations: (name, type)
  • locations: (name, type)
  • events: (title, start_date)
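The per-type dedup keys above can be sketched as a simple tuple-based pass. This is a hypothetical illustration of the documented behavior, not the actual implementation in quality_controls.py:

```python
from typing import Any, Dict, List

# Dedup key fields per entity type, mirroring the list above.
DEDUP_KEYS = {
    "people": ("name",),
    "organizations": ("name", "type"),
    "locations": ("name", "type"),
    "events": ("title", "start_date"),
}

def dedupe_exact(entity_type: str, entities: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the first occurrence of each key tuple; drop exact duplicates."""
    seen = set()
    kept = []
    for entity in entities:
        key = tuple(entity.get(field) for field in DEDUP_KEYS[entity_type])
        if key in seen:
            continue  # exact duplicate within this article
        seen.add(key)
        kept.append(entity)
    return kept
```

Note that two organizations with the same name but different types survive this step; only fully identical key tuples are collapsed.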
Step 5: Variant Consolidation

Collapses name variants for organizations and locations:
  • Acronym matching (e.g., “ICRC” ↔ “International Committee of the Red Cross”)
  • Substring containment
  • Domain equivalence groups
Keeps the more canonical name (proper nouns over descriptions) and adds the other to aliases.
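The acronym check can be illustrated with a short sketch. This is a hypothetical helper for illustration only; the real matcher also handles substring containment and domain equivalence groups:

```python
def compute_acronym(name: str) -> str:
    """Build an acronym from the significant words of a name."""
    stop_words = {"of", "the", "and", "for", "in"}
    words = [w for w in name.split() if w.lower() not in stop_words]
    return "".join(w[0].upper() for w in words)

print(compute_acronym("International Committee of the Red Cross"))  # ICRC
```

When the computed acronym of one entity equals another entity's name, the pair is treated as variants of the same organization or location.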
Step 6: Low-Quality Name Detection

Flags if multiple entities have generic/descriptive names (e.g., “the camp”, “the facility”).

Example: Variant Consolidation

from src.utils.quality_controls import run_extraction_qc

entities = [
    {"name": "International Committee of the Red Cross", "type": "humanitarian"},
    {"name": "ICRC", "type": "humanitarian"},
    {"name": "Red Cross", "type": "humanitarian"},
]

cleaned, report = run_extraction_qc(
    entity_type="organizations",
    entities=entities,
    domain="guantanamo"
)

# Output:
# cleaned = [
#   {
#     "name": "International Committee of the Red Cross",
#     "type": "humanitarian",
#     "aliases": ["ICRC", "Red Cross"]
#   }
# ]
# report.deduped = 2
# report.fixes = {"collapsed_variants": 2}

Entity Relevance Filtering

filter_entities_by_article_relevance

from src.utils.quality_controls import filter_entities_by_article_relevance

kept, report = filter_entities_by_article_relevance(
    entity_type="organizations",
    entities=[
        {"name": "Red Cross", "aliases": ["ICRC"]},
        {"name": "Fictional Org"},  # Hallucinated
    ],
    article_text="The ICRC visited the camp in 2005.",
    domain="guantanamo",
    require_mention=True
)

# kept = [{"name": "Red Cross", "aliases": ["ICRC"]}]
# report.dropped = 1
Filter out entities whose names don’t appear in the source article text.

Parameters:
  • entity_type (str, required): Entity type: "people", "organizations", "locations", or "events".
  • entities (List[Dict[str, Any]], required): Entities to validate.
  • article_text (str, required): Full article text to search for mentions.
  • domain (str, default "guantanamo"): Domain name for loading equivalence groups.
  • require_mention (bool, default True): Whether to enforce mention validation. If False, returns all entities.

Mention Detection Strategy

For each entity, builds a “needle set” from:
  1. Canonical name
  2. Aliases (from within-article dedup)
  3. Computed acronym (for orgs/locations)
  4. Domain equivalence group variants
If any needle appears in the article text (case-insensitive), the entity is kept. Short needles (≤3 chars) get special handling:
  • Matched with a word-boundary regex (e.g., \bUN\b) to avoid false positives
  • Example: “UN” matches “The UN peacekeepers” but not “under”
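A minimal sketch of this matching logic, assuming the behavior described above (the exact implementation may differ):

```python
import re

def needle_in_text(needle: str, text: str) -> bool:
    """Case-insensitive mention check; short needles require word boundaries."""
    if len(needle) <= 3:
        # \bUN\b matches "The UN peacekeepers" but not "under"
        return re.search(rf"\b{re.escape(needle)}\b", text, re.IGNORECASE) is not None
    return needle.lower() in text.lower()
```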

Profile QC

run_profile_qc

from src.utils.quality_controls import run_profile_qc

profile = {
    "text": "Abdul Rahman Ahmed was detained at Guantanamo Bay. ^[art-001]",
    "tags": ["detention"],
    "confidence": 0.85
}

fixed_profile, report = run_profile_qc(
    profile=profile,
    min_text_len=50,
    min_tags=1,
    require_citations=True
)
Run deterministic QC on a generated profile dictionary.

Parameters:
  • profile (Dict[str, Any], required): Profile dictionary with text, tags, and confidence fields.
  • min_text_len (int, default 100): Minimum text length in characters. Profiles shorter than this fail QC.
  • min_tags (int, default 2): Minimum number of tags. Tags default to ["needs-review"] if below this threshold.
  • require_citations (bool, default True): Whether to flag profiles with no citations.

Returns:
  • profile (Dict[str, Any]): Profile with fixes applied (confidence clamped, tags defaulted).
  • report (ProfileQCReport): Summary of QC checks (see below).

ProfileQCReport

class ProfileQCReport(BaseModel):
    text_length: int
    citation_count: int
    tag_count: int
    confidence: Optional[float]
    passed: bool
    flags: List[str]
    fixes: Dict[str, Any]

Citation Pattern

Citations are detected using the regex pattern:
CITATION_RE = re.compile(r"\^\[([^\]\s]+)\]")
Format: ^[article_id] where article_id is non-empty and contains no whitespace. Example:
Abdul Rahman Ahmed was detained at Guantanamo Bay in 2002. ^[art-001]
He was released in 2009. ^[art-045]
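For example, applying the documented pattern to profile text extracts the cited article IDs:

```python
import re

# The citation regex as documented above.
CITATION_RE = re.compile(r"\^\[([^\]\s]+)\]")

text = (
    "Abdul Rahman Ahmed was detained at Guantanamo Bay in 2002. ^[art-001]\n"
    "He was released in 2009. ^[art-045]"
)
print(CITATION_RE.findall(text))  # ['art-001', 'art-045']
```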

Profile Grounding Verification

verify_profile_grounding

from src.utils.quality_controls import verify_profile_grounding

profile_text = """
Abdul Rahman Ahmed was detained in 2002. ^[art-001]
He was released in 2009. ^[art-045]
"""

article_texts = {
    "art-001": "Abdul Rahman Ahmed, a Yemeni national, was captured in 2002...",
    "art-045": "Ahmed was released from Guantanamo Bay in January 2009...",
}

report = verify_profile_grounding(
    profile_text=profile_text,
    article_texts=article_texts,
    model_type="gemini",
    max_article_chars=12000,
    max_claim_chars=600,
    min_grounding_score=0.7
)

print(f"Grounding score: {report.grounding_score:.1%}")  # 100%
print(f"Passed: {report.passed}")  # True
Verify that profile claims are supported by their cited sources.

This function makes LLM calls (one per cited article). Use sparingly in production.

Parameters:
  • profile_text (str, required): Profile text with citation markers.
  • article_texts (Dict[str, str], required): Mapping of article IDs to full article text.
  • model_type (str, default "gemini"): LLM backend: "gemini" (cloud) or "ollama" (local).
  • max_article_chars (int, default 12000): Maximum article text length to send to the LLM (truncates longer articles).
  • max_claim_chars (int, default 600): Maximum claim text length (truncates longer claims).
  • min_grounding_score (float, default 0.7): Minimum grounding score to pass QC (0.0 to 1.0).

Returns:
  • report (GroundingReport): Detailed verification report (see below).

GroundingReport

class GroundingReport(BaseModel):
    profile_text_hash: str
    total_citations: int
    verified: int
    unverified: int
    missing_source: int
    grounding_score: Optional[float]
    passed: bool
    flags: List[str]
    verifications: List[ClaimVerification]
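The grounding score appears to be the fraction of citations whose claims were verified, consistent with the worked example later in this section (2 of 3 verified gives 66.7%). The sketch below is an assumption for illustration; the exact formula lives in quality_controls.py:

```python
from typing import Optional

def grounding_score(verified: int, total_citations: int) -> Optional[float]:
    """Fraction of citations verified; None when there is nothing to score."""
    if total_citations == 0:
        return None
    return verified / total_citations

print(f"{grounding_score(2, 3):.1%}")  # 66.7%
```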

ClaimVerification

class ClaimVerification(BaseModel):
    article_id: str
    citation: str
    claim: str
    support_level: SupportLevel
    reasoning: Optional[str]

Example: Grounding Verification

from src.utils.quality_controls import verify_profile_grounding

profile_text = """
Mohamedou Ould Slahi was detained at Guantanamo Bay. ^[slahi-memoir]
He wrote a bestselling memoir about his experience. ^[slahi-memoir]
He was released in 2016. ^[missing-article]
"""

article_texts = {
    "slahi-memoir": "Mohamedou Ould Slahi published Guantanamo Diary in 2015...",
    # "missing-article" not provided
}

report = verify_profile_grounding(
    profile_text=profile_text,
    article_texts=article_texts,
    min_grounding_score=0.7
)

print(f"Total citations: {report.total_citations}")  # 3
print(f"Verified: {report.verified}")  # 2
print(f"Missing source: {report.missing_source}")  # 1
print(f"Grounding score: {report.grounding_score:.1%}")  # 66.7%
print(f"Passed: {report.passed}")  # False (< 0.7)
print(f"Flags: {report.flags}")  # ["missing_sources", "low_grounding_score"]

for v in report.verifications:
    print(f"{v.citation}: {v.support_level} - {v.reasoning}")

Utility Functions

normalize_name

from src.utils.quality_controls import normalize_name

name = normalize_name("  Abdul  Rahman   Ahmed  ")
print(name)  # "Abdul Rahman Ahmed"
Normalize an entity name:
  • Strip leading/trailing whitespace
  • Collapse multiple whitespace to single space
  • Unicode NFC normalization
Automatically called by run_extraction_qc() — you typically don’t need to call this directly.
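The three steps can be sketched as follows (an illustration of the documented behavior, not the exact source of normalize_name):

```python
import re
import unicodedata

def normalize_name_sketch(name: str) -> str:
    name = unicodedata.normalize("NFC", name)  # Unicode NFC normalization
    name = re.sub(r"\s+", " ", name)           # collapse runs of whitespace
    return name.strip()                        # strip leading/trailing whitespace
```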

Constants

Default Thresholds

from src.constants import (
    QC_MIN_NAME_LENGTH,
    PROFILE_QC_MIN_TEXT_LENGTH,
    PROFILE_QC_MIN_TAG_COUNT,
)

print(QC_MIN_NAME_LENGTH)  # 2
print(PROFILE_QC_MIN_TEXT_LENGTH)  # 100
print(PROFILE_QC_MIN_TAG_COUNT)  # 2
These constants define default QC thresholds used throughout the pipeline.

Integration Example

Complete Extraction Pipeline
from src.utils.quality_controls import (
    run_extraction_qc,
    filter_entities_by_article_relevance,
)
from src.engine.extractors import extract_entities

# 1. Extract entities (LLM call)
raw_entities = extract_entities(
    text=article_text,
    entity_type="organizations",
    domain="guantanamo"
)

# 2. Run extraction QC
cleaned, qc_report = run_extraction_qc(
    entity_type="organizations",
    entities=raw_entities,
    domain="guantanamo"
)

if qc_report.flags:
    print(f"QC flags: {qc_report.flags}")

# 3. Filter by article relevance
kept, relevance_report = filter_entities_by_article_relevance(
    entity_type="organizations",
    entities=cleaned,
    article_text=article_text,
    domain="guantanamo"
)

print(f"Extracted: {qc_report.input_count}")
print(f"After QC: {qc_report.output_count}")
print(f"After relevance: {relevance_report.output_count}")
