Overview
Entity extraction is the first step in building a knowledge graph. It identifies named entities (people, organizations, locations, concepts, etc.) in your document chunks. Arcana provides two built-in implementations:
- NER (Named Entity Recognition) - Fast, local, using Bumblebee ML models
- LLM - Flexible, accurate, using your configured language model
How Entity Extraction Works
During graph building (typically during ingest), each chunk is processed:
lib/arcana/graph/graph_builder.ex:39
Entity Extractors
NER Extractor (Default)
Uses Bumblebee’s dslim/distilbert-NER model for fast, local entity extraction.
Configuration:
Supported Entity Types:
- person - Individual people
- organization - Companies, institutions
- location - Geographic places
- concept - Miscellaneous named entities
- other - Fallback type
lib/arcana/graph/entity_extractor/ner.ex:24
Label Mapping:
The NER model outputs BIO tags which are mapped to types:
lib/arcana/graph/entity_extractor/ner.ex:78
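As a rough illustration of the label mapping, the sketch below strips the BIO prefix from a tag and looks up the matching type. It assumes the model emits CoNLL-style tags (PER, ORG, LOC, MISC). The module name LabelMapSketch and the function to_type/1 are hypothetical and are not taken from ner.ex.

```elixir
defmodule LabelMapSketch do
  # Hypothetical helper, not Arcana's actual code.
  # Assumes CoNLL-style tags: PER, ORG, LOC, MISC.
  @types %{
    "PER" => "person",
    "ORG" => "organization",
    "LOC" => "location",
    "MISC" => "concept"
  }

  # Strip the "B-" (begin) or "I-" (inside) prefix, then look up the type.
  # Unknown tags fall back to "other".
  def to_type("B-" <> tag), do: Map.get(@types, tag, "other")
  def to_type("I-" <> tag), do: Map.get(@types, tag, "other")

  # Labels without a BIO prefix also fall back to "other". Tokens tagged
  # "O" (outside any entity) would normally be dropped before this step.
  def to_type(_label), do: "other"
end
```

With this sketch, both "B-PER" and "I-PER" resolve to "person", so the begin and inside tokens of one name end up with the same type.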
Advantages:
- ⚡ Fast - No LLM calls required
- 💰 Cost-effective - Runs locally
- 🔒 Private - No external API calls
- ⚙️ Reliable - Consistent output
Limitations:
- Limited to 4 entity types
- May miss domain-specific entities
- Less context-aware than LLMs
LLM Extractor
Uses your configured LLM for flexible, context-aware entity extraction.
Configuration:
Supported Entity Types:
- person - People, including titles (“Dr. Jane Smith”, “CEO John Doe”)
- organization - Companies, institutions, governments, teams
- location - Geographic places, addresses, facilities
- event - Conferences, incidents, historical moments
- concept - Abstract ideas, theories, methodologies
- technology - Products, tools, software, hardware
- role - Job titles, positions
- publication - Papers, books, articles
- media - Movies, songs, artworks
- award - Awards, certifications, honors
- standard - Specifications, protocols, regulations
- language - Programming or natural languages
- other - Entities that don’t fit above categories
lib/arcana/graph/entity_extractor/llm.ex:50
Advantages:
- 🎯 Accurate - Understands context
- 🔧 Flexible - Supports 12+ entity types
- 🎓 Domain-aware - Recognizes specialized terms
- 📝 Descriptive - Can include entity descriptions
Limitations:
- 🐌 Slower - Requires LLM calls
- 💸 Costly - LLM API fees
- 🎲 Non-deterministic - Output may vary
Custom Extractors
Implement the Arcana.Graph.EntityExtractor behaviour for custom extraction:
lib/arcana/graph/entity_extractor.ex:71
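A minimal sketch of what a custom extractor could look like. The callback name extract/2 and the {:ok, entities} return shape are assumptions here, so check the behaviour definition before relying on them. The module MyApp.CapitalizedWordExtractor is hypothetical, and the @behaviour line is left as a comment so the example compiles without Arcana installed.

```elixir
defmodule MyApp.CapitalizedWordExtractor do
  # Hypothetical example module. In a real project you would add
  #   @behaviour Arcana.Graph.EntityExtractor
  # here. The callback name extract/2 and the {:ok, entities} return
  # shape are assumptions; see lib/arcana/graph/entity_extractor.ex.

  # Treats every run of capitalized words (e.g. "Jane Smith") as an entity.
  def extract(text, _opts) when is_binary(text) do
    pattern = ~r/\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*/

    entities =
      pattern
      |> Regex.scan(text, return: :index)
      |> Enum.map(fn [{start, len}] ->
        %{
          name: binary_part(text, start, len),
          # A naive extractor cannot tell types apart, so use the fallback.
          type: "other",
          # Regex indexes are byte offsets; they equal character offsets
          # only for ASCII text.
          span_start: start,
          span_end: start + len
        }
      end)

    {:ok, entities}
  end
end
```

A regex keeps the example short; a real extractor would more likely call a domain dictionary or an external service and map its output to the same entity maps.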
Entity Format
All extractors must return entities as maps with:
Required Fields:
- :name (string) - The entity name
- :type (string) - Entity type as string (e.g., “person”, “organization”)
Optional Fields:
- :span_start (integer) - Character offset where entity starts in text
- :span_end (integer) - Character offset where entity ends
- :score (float) - Confidence score (0.0-1.0)
- :description (string) - Brief description of the entity
lib/arcana/graph/entity_extractor.ex:55
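The field rules above can be expressed as a small validity check, which may help when testing a custom extractor. This is a sketch only: the module EntityFormatSketch is hypothetical, it assumes every field other than :name and :type is optional, and Arcana itself may validate entities differently or not at all.

```elixir
defmodule EntityFormatSketch do
  # Hypothetical validator, not part of Arcana.

  # Returns true when the map has the required string fields and any
  # optional fields that are present have the documented types.
  def valid?(%{name: name, type: type} = entity)
      when is_binary(name) and is_binary(type) do
    optional_ok?(entity, :span_start, &is_integer/1) and
      optional_ok?(entity, :span_end, &is_integer/1) and
      optional_ok?(entity, :score, &score?/1) and
      optional_ok?(entity, :description, &is_binary/1)
  end

  # Anything without both required fields is rejected.
  def valid?(_other), do: false

  # An absent optional field is fine; a present one must pass the check.
  defp optional_ok?(entity, key, check) do
    case Map.fetch(entity, key) do
      {:ok, value} -> check.(value)
      :error -> true
    end
  end

  # Confidence scores are floats in the range 0.0 to 1.0.
  defp score?(value), do: is_float(value) and value >= 0.0 and value <= 1.0
end
```

For example, a map with only :name and :type passes, while one with a :score of 1.5 or a missing :type does not.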
Real Examples from Source
Example 1: NER Extraction
From lib/arcana/graph/entity_extractor/ner.ex:40:
Example 2: LLM Prompt
From lib/arcana/graph/entity_extractor/llm.ex:91:
Example 3: Batch Processing
From lib/arcana/graph/entity_extractor.ex:131:
Configuration Options
Inline Function
Module with Options
Per-Call Override
Performance Considerations
NER Extractor:
- ~50-100ms per chunk (local inference)
- Memory: ~500MB for model
- Parallelizable: Yes (multiple servings)
LLM Extractor:
- ~500-2000ms per chunk (API latency)
- Cost: ~$0.001-0.01 per chunk (varies by model)
- Parallelizable: Yes (concurrent API calls)
Optimization Tips:
- Use NER for initial extraction, LLM for refinement
- Implement extract_batch/2 for batch API calls
- Cache entities by chunk hash
- Use concurrent processing (see lib/arcana/graph.ex:361)
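Caching by chunk hash and concurrent processing can be combined in a few lines. The sketch below is one possible shape, not Arcana's implementation: the module CachedExtractionSketch is hypothetical, extract_fun stands in for whichever extractor you call, and the cache is a plain map that you pass back in on the next call. The concurrency limit and timeout are arbitrary example values.

```elixir
defmodule CachedExtractionSketch do
  # Hypothetical helper, not part of Arcana. `extract_fun` stands in for
  # any extractor call and is expected to return {:ok, entities}.

  # Extracts entities for each chunk concurrently, skipping chunks whose
  # text was already processed. `cache` maps a content hash to entities;
  # the updated cache is returned so it can be reused on the next call.
  def extract_all(chunks, extract_fun, cache \\ %{}) do
    # Keep one text per hash, then drop hashes that are already cached.
    pending =
      chunks
      |> Map.new(fn text -> {hash(text), text} end)
      |> Map.drop(Map.keys(cache))

    # Run the remaining extractions in parallel tasks.
    fresh =
      pending
      |> Task.async_stream(
        fn {key, text} ->
          {:ok, entities} = extract_fun.(text)
          {key, entities}
        end,
        max_concurrency: 4,
        timeout: 30_000
      )
      |> Map.new(fn {:ok, pair} -> pair end)

    cache = Map.merge(cache, fresh)

    # Return results in the same order as the input chunks.
    results = Enum.map(chunks, fn text -> Map.fetch!(cache, hash(text)) end)
    {results, cache}
  end

  # SHA-256 of the chunk text; requires OTP's :crypto application.
  defp hash(text), do: :crypto.hash(:sha256, text)
end
```

Duplicate chunks within one call are extracted only once, which matters most for the LLM extractor where each call has a cost. A persistent store (ETS, a database table) would replace the map if the cache needs to outlive the process.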
Next Steps
- Relationships - Extract relationships between entities
- Communities - Detect entity communities
- Search - Use entities for graph search