Domain Structure
A domain configuration consists of these files:

Creating a New Domain
1. Initialize the domain from the template, which creates configs/palestine_food_history/ with generic categories and prompts.
2. Define entity types: edit the categories/*.yaml files to specify relevant entity types and tags for your research domain.
3. Customize extraction prompts: edit the prompts/*.md files to provide domain-specific extraction instructions.

Main Configuration (config.yaml)
The main configuration file controls data paths, thresholds, and performance settings.

Basic Settings
- Unique identifier for this research domain (lowercase, underscores allowed)
- Brief description of the research focus
- Path to the source articles Parquet file
- Directory for entity Parquet files and cache
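A sketch of what these basic settings might look like in config.yaml. The key names here are illustrative assumptions, not the project's actual schema; check the template for the real keys:

```yaml
# Illustrative key names -- consult the template config.yaml for the actual schema
domain: palestine_food_history            # unique identifier (lowercase, underscores)
description: "Food history in Palestinian press coverage"
articles_path: data/articles.parquet      # source articles Parquet file
output_dir: data/entities/                # entity Parquet files and cache
```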
Deduplication Configuration
Control how entities are deduplicated during the merge stage.

similarity_thresholds
Cosine similarity thresholds (0.0-1.0) for embedding-based merge decisions. Higher values require stronger similarity before merging.

Recommendations:
- People: 0.80-0.85 (names vary widely, require high confidence)
- Organizations: 0.75-0.80 (moderate variation)
- Locations: 0.78-0.82 (geographic specificity important)
- Events: 0.70-0.78 (temporal context helps disambiguation)
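Applying the recommendations above, a per-type threshold block might look like this (the nesting with a default fallback is an assumption about the schema; only the similarity_thresholds key name comes from the docs):

```yaml
similarity_thresholds:
  default: 0.78
  people: 0.82          # names vary widely; require high confidence
  organizations: 0.78   # moderate variation
  locations: 0.80       # geographic specificity is important
  events: 0.74          # temporal context helps disambiguation
```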
lexical_blocking
Fast fuzzy string matching to narrow merge candidates before expensive embedding/LLM checks.
- enabled: Enable/disable lexical blocking (default: true)
- threshold: RapidFuzz score 0-100 (default: 60)
- max_candidates: Maximum entities to pass to the embedding stage (default: 50)
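The blocking step can be pictured as a cheap filter in front of the embedding stage. The sketch below is illustrative only, not the project's actual implementation; it uses the standard library's difflib as a stand-in for RapidFuzz's 0-100 scoring, and the function names are invented:

```python
from difflib import SequenceMatcher

def fuzzy_score(a: str, b: str) -> float:
    """Rough 0-100 similarity score (stand-in for RapidFuzz's fuzz.ratio)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

def lexical_block(name, candidates, threshold=60, max_candidates=50):
    """Keep only candidates whose score clears the threshold,
    best matches first, capped at max_candidates."""
    scored = [(c, fuzzy_score(name, c)) for c in candidates]
    passed = [(c, s) for c, s in scored if s >= threshold]
    passed.sort(key=lambda cs: cs[1], reverse=True)
    return [c for c, _ in passed[:max_candidates]]

candidates = ["Dept. of Defense", "Department of Defence", "World Health Organization"]
# Unrelated names are filtered out before any embedding/LLM call is spent on them
print(lexical_block("Department of Defense", candidates))
```

Only the survivors of this filter move on to the expensive embedding and LLM checks, which is why the default threshold is deliberately permissive.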
See src/engine/mergers.py (uses the RapidFuzz library).

name_variants.equivalence_groups
Define known aliases for instant blocking without LLM calls. When any name in a group is encountered, all variants are considered equivalent.

Use cases:
- Government agencies: ["DoD", "Department of Defense", "Pentagon"]
- Acronyms: ["ACLU", "American Civil Liberties Union"]
- Geographic variants: ["U.S.", "United States", "USA"]
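In config.yaml these groups would plausibly be written as a list of lists under the name_variants.equivalence_groups key named above (the list-of-lists shape is an assumption):

```yaml
name_variants:
  equivalence_groups:
    - ["DoD", "Department of Defense", "Pentagon"]
    - ["ACLU", "American Civil Liberties Union"]
    - ["U.S.", "United States", "USA"]
```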
See src/utils/name_variants.py:names_likely_same.

Performance Configuration
Tuning guidelines:
- extract_workers: Set to number of CPU cores for local mode, or 8-16 for cloud APIs
- extract_per_article: Usually 4 (all entity types in parallel), set to 1 for debugging
- llm_in_flight: Balance between throughput and API rate limits (Gemini: 16-32)
- ollama_in_flight: Keep low (1-2) to avoid overwhelming local GPU
- embed_batch_size: Larger batches reduce API calls but increase latency (32-128)
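A config fragment following those guidelines might look like the sketch below. The key names come from the guidelines themselves; the flat layout and the chosen values are assumptions:

```yaml
# Values follow the tuning guidelines above; adjust for your hardware and API limits
extract_workers: 8        # CPU cores locally, or 8-16 for cloud APIs
extract_per_article: 4    # all entity types in parallel; set to 1 for debugging
llm_in_flight: 16         # Gemini: 16-32
ollama_in_flight: 2       # keep low to avoid overwhelming a local GPU
embed_batch_size: 64      # 32-128; larger = fewer API calls, more latency
```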
Caching Configuration
Cache invalidation: Bump cache.extraction.version when you:
- Change extraction prompts
- Modify entity type definitions
- Upgrade to a new LLM model
- Change temperature or other generation parameters
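The versioned cache setting might look like this (only the cache.extraction.version path is documented above; the enabled field is an assumed example):

```yaml
cache:
  extraction:
    enabled: true   # illustrative; field name assumed
    version: 2      # bump on any prompt, category, model, or temperature change
```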
Merge Evidence Configuration
Control how evidence text is built for embedding-based similarity. See src/engine/mergers.py:_build_evidence_text.
Evidence text format:
Category Definitions (categories/*.yaml)
Each entity type has a YAML file defining valid types and tags.

Structure
Category File Conventions
Naming:
- File names: people.yaml, organizations.yaml, locations.yaml, events.yaml
- Type keys: person_types, organization_types, location_types, event_types
- Tag keys: person_tags, event_tags (organizations/locations don’t have tags)
- description: Clear, concise definition of the type/tag
- examples: 2-3 real examples from your domain
- Use lowercase with underscores: detention_facility, military_operation
- Be specific to your domain: “detainee” vs. generic “prisoner”
- Include an “other” type as a catch-all
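Putting the conventions together, a category file might be structured like this sketch. The person_types/person_tags keys and the description/examples fields come from the conventions above; the exact nesting and the placeholder examples are assumptions:

```yaml
# people.yaml -- illustrative structure, not the project's verbatim schema
person_types:
  detainee:
    description: Person held in custody at the detention facility
    examples: [example_name_1, example_name_2]   # placeholders
  other:
    description: Catch-all for people who fit no other type
    examples: []
person_tags:
  civil_rights:
    description: Involved in civil rights advocacy or litigation
```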
Example: Guantanamo Domain Categories
People (configs/guantanamo/categories/people.yaml):
- Types: detainee, military, government, lawyer, journalist, other
- Tags: civil_rights, immigration, defense, prosecution, policy, medical, intelligence, academic, religious, family, activist

The organizations, locations, and events files follow the same pattern.

Extraction Prompts (prompts/*.md)
Prompts are written in Markdown and use placeholders for runtime interpolation.

Entity Extraction Prompt
Available DomainConfig Methods
- Load the main config.yaml
- Load category YAML (people.yaml, organizations.yaml, etc.)
- Load extraction prompt Markdown
- Load profile prompt (“generation”, “update”, “reflection”)
- Get similarity threshold with fallback to default
- Get lexical blocking settings with per-type overrides
- Get worker/thread counts and queue limits
- Get cache settings (extraction, embeddings, match_check, articles)
- Get equivalence groups and acronym stopwords
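Two of the behaviors above, threshold fallback and per-type lexical blocking overrides, can be sketched in a few lines. This is an illustrative toy, not the project's DomainConfig class; the method names, config keys, and defaults are all invented for the example:

```python
from typing import Any

class DomainConfig:
    """Toy sketch; real method names and config keys may differ."""

    def __init__(self, config: dict):
        self.config = config

    def similarity_threshold(self, entity_type: str) -> float:
        # Per-type threshold, falling back to a domain-wide default
        thresholds = self.config.get("similarity_thresholds", {})
        return thresholds.get(entity_type, thresholds.get("default", 0.78))

    def lexical_blocking(self, entity_type: str) -> dict:
        # Base settings merged with any per-type overrides
        base = dict(self.config.get("lexical_blocking", {}))
        overrides = base.pop("per_type", {}).get(entity_type, {})
        return {**base, **overrides}

config: dict[str, Any] = {
    "similarity_thresholds": {"people": 0.82, "default": 0.78},
    "lexical_blocking": {"enabled": True, "threshold": 60, "max_candidates": 50,
                         "per_type": {"people": {"threshold": 70}}},
}
dc = DomainConfig(config)
print(dc.similarity_threshold("people"))           # 0.82
print(dc.similarity_threshold("events"))           # 0.78 (falls back to default)
print(dc.lexical_blocking("people")["threshold"])  # 70 (per-type override)
```

The point of the accessor layer is that pipeline code never touches raw YAML: every lookup goes through a method that applies defaults and overrides consistently.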
Example: Custom Domain Configuration
Here’s how you might configure a domain for Soviet-Afghan War research:

Best Practices
Start with the template
Always initialize new domains from the template. The template provides a working baseline with sensible defaults.
Test incrementally
Process a small batch (2-5 articles) after each configuration change. Review extraction quality before processing the full corpus.
Be specific in prompts
Generic prompts produce generic results. Include:
- Domain context: “You are analyzing articles about X…”
- Real examples: Show the LLM what good output looks like
- Edge cases: Address common errors (“Don’t extract ‘military bases’ as a name…”)
Tune thresholds empirically
Default similarity thresholds (0.75-0.82) work well for many domains, but:
- Increase for domains with high name ambiguity (common names, many variants)
- Decrease for domains with consistent naming (technical terms, proper nouns)
- Monitor merge decisions in logs to identify false positives/negatives
Version extraction cache
Bump cache.extraction.version when you change:
- Extraction prompts (even minor wording changes)
- Category definitions (types or tags)
- LLM model or temperature
Next Steps
Processing Pipeline
Learn how domain configurations drive the 5-stage pipeline
Entity Types
Understand entity structure and required fields
Quickstart
Process your first batch of articles with a custom domain
System Architecture
Explore the producer-consumer model and evidence-first merge