Overview
This example demonstrates how to extract structured data from unstructured documents using AXON’s probe operation, epistemic types, and validation. It shows how to handle missing fields, ensure data quality, and maintain confidence scores.
Use Case
Extract structured information from:
Resumes and CVs
Invoices and receipts
Product descriptions
News articles
Customer feedback
Research papers
Complete Code
// AXON Example — Data Extraction
// Extract structured data with validation and confidence tracking
persona DataExtractor {
domain: ["information extraction", "NLP", "data processing"]
tone: precise
confidence_threshold: 0.80
}
context ExtractionMode {
memory: none
language: "en"
depth: thorough
max_tokens: 2048
temperature: 0.1
}
anchor NoGuessing {
require: text_evidence
confidence_floor: 0.75
unknown_response: "Field not found in source text"
on_violation: raise AnchorBreachError
}
type Email where matches(pattern: email_regex)
type PhoneNumber where matches(pattern: phone_regex)
type Currency(0.0..1000000.0)
type Person {
name: FactualClaim,
email: Email?,
phone: PhoneNumber?,
location: FactualClaim?
}
type Experience {
company: FactualClaim,
role: FactualClaim,
duration: FactualClaim,
description: Opinion?
}
type ResumeData {
person: Person,
experiences: List<Experience>,
skills: List<FactualClaim>,
education: List<FactualClaim>,
confidence: ConfidenceScore
}
flow ExtractResume(doc: Document) -> ResumeData {
step ExtractPerson {
given: doc
probe doc for [name, email, phone, location]
output: Person
}
validate ExtractPerson.output against PersonSchema {
if confidence < 0.80 -> refine(max_attempts: 2)
if name == ∅ -> raise MissingRequiredFieldError
}
step ExtractExperience {
given: doc
ask: "Extract all work experience entries with company, role, and duration"
output: List<Experience>
}
step ExtractSkills {
given: doc
probe doc for [skills, technologies, certifications]
output: List<FactualClaim>
}
step ExtractEducation {
given: doc
ask: "Extract educational background"
output: List<FactualClaim>
}
weave [
ExtractPerson.output,
ExtractExperience.output,
ExtractSkills.output,
ExtractEducation.output
] into ResumeData {
format: StructuredReport
}
}
run ExtractResume(resumeDoc)
as DataExtractor
within ExtractionMode
constrained_by [NoGuessing]
on_failure: retry(backoff: linear)
output_to: "extracted.json"
effort: medium
Key Components
persona DataExtractor {
domain: ["information extraction", "NLP", "data processing"]
tone: precise
confidence_threshold: 0.80
}
Defines an extraction specialist:
Domain: information extraction, NLP, data processing
Tone: precise (exact, no embellishment)
High threshold: 0.80 for accurate extraction
context ExtractionMode {
memory: none
language: "en"
depth: thorough
max_tokens: 2048
temperature: 0.1
}
Configured for extraction:
Stateless: no memory (each extraction is independent)
Thorough: careful examination of the source document
Very low temperature: 0.1 for near-deterministic extraction
Anchor: NoGuessing
anchor NoGuessing {
require: text_evidence
confidence_floor: 0.75
unknown_response: "Field not found in source text"
on_violation: raise AnchorBreachError
}
Prevents hallucination:
Requires: evidence from the source text
Minimum confidence: 0.75
Explicit unknowns: "Field not found" instead of guessing
Never let the LLM “fill in” missing fields with plausible guesses. Use the NoGuessing anchor to ensure all extractions cite source text.
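For example, if the source document contained no phone number or location, an extraction constrained by NoGuessing would return the configured unknown_response for those fields instead of a guess (hypothetical output fragment; the email address is illustrative):

```json
{
  "name": "John Smith",
  "email": "john.smith@example.com",
  "phone": "Field not found in source text",
  "location": "Field not found in source text"
}
```

Downstream consumers can then treat the sentinel string as a missing value rather than silently ingesting a fabricated one.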
Custom Types with Validation
type Email where matches(pattern: email_regex)
type PhoneNumber where matches(pattern: phone_regex)
Refinement types with pattern matching:
Compile-time guarantee of format
Runtime validation
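The example assumes email_regex and phone_regex are defined elsewhere. One hypothetical way to bind them (the pattern keyword and the regexes themselves are assumptions, not part of the example above):

```axon
// Hypothetical pattern declarations — syntax sketch only
pattern email_regex = /^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$/
pattern phone_regex = /^\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}$/

type Email where matches(pattern: email_regex)
type PhoneNumber where matches(pattern: phone_regex)
```

Whatever the declaration mechanism, the point is that extracted values failing the pattern are rejected at validation time rather than passed through.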
type Currency(0.0..1000000.0)
Range-constrained for monetary values.
type Person {
name: FactualClaim,
email: Email?,
phone: PhoneNumber?,
location: FactualClaim?
}
Structured person data:
name: Required factual claim
email, phone, location: Optional validated fields
type Experience {
company: FactualClaim,
role: FactualClaim,
duration: FactualClaim,
description: Opinion?
}
Work experience entry:
Company, role, duration: Facts
Description: Opinion (subjective characterization)
Extraction Flow
The ExtractResume flow runs a four-step extraction pipeline:
Step 1: ExtractPerson
step ExtractPerson {
given: doc
probe doc for [name, email, phone, location]
output: Person
}
Uses probe for targeted field extraction.
Validation
validate ExtractPerson.output against PersonSchema {
if confidence < 0.80 -> refine(max_attempts: 2)
if name == ∅ -> raise MissingRequiredFieldError
}
Ensures:
High confidence (≥0.80)
Required field (name) present
Step 2: ExtractExperience
step ExtractExperience {
given: doc
ask: "Extract all work experience entries with company, role, and duration"
output: List<Experience>
}
Extracts multiple experience entries.
Step 3: ExtractSkills
step ExtractSkills {
given: doc
probe doc for [skills, technologies, certifications]
output: List<FactualClaim>
}
Probes for skills-related fields.
Step 4: ExtractEducation
step ExtractEducation {
given: doc
ask: "Extract educational background"
output: List<FactualClaim>
}
Extracts education as list of facts.
Synthesis
weave [
ExtractPerson.output,
ExtractExperience.output,
ExtractSkills.output,
ExtractEducation.output
] into ResumeData {
format: StructuredReport
}
Combines all extractions into structured output.
Usage
# Validate
axon check data_extraction.axon
# Compile
axon compile data_extraction.axon
# Execute
axon run data_extraction.axon --backend anthropic --trace
Example Input
John Smith
Senior Software Engineer
john.smith@example.com | (555) 123-4567 | San Francisco, CA
EXPERIENCE
TechCorp Inc. | Senior Software Engineer | 2020 - Present
- Led team of 5 engineers building cloud infrastructure
- Designed and implemented microservices architecture
- Reduced deployment time by 60%
StartupXYZ | Software Engineer | 2018 - 2020
- Developed full-stack web applications using React and Node.js
- Implemented CI/CD pipelines
SKILLS
Python, JavaScript, React, Node.js, Docker, Kubernetes, AWS, PostgreSQL
EDUCATION
B.S. Computer Science, Stanford University, 2018
Example Output
{
  "type": "ResumeData",
  "person": {
    "name": "John Smith",
    "email": "john.smith@example.com",
    "phone": "(555) 123-4567",
    "location": "San Francisco, CA"
  },
  "experiences": [
    {
      "company": "TechCorp Inc.",
      "role": "Senior Software Engineer",
      "duration": "2020 - Present",
      "description": "Led team of 5 engineers building cloud infrastructure"
    },
    {
      "company": "StartupXYZ",
      "role": "Software Engineer",
      "duration": "2018 - 2020",
      "description": "Developed full-stack web applications"
    }
  ],
  "skills": ["Python", "JavaScript", "React", "Node.js", "Docker", "Kubernetes", "AWS", "PostgreSQL"],
  "education": ["B.S. Computer Science, Stanford University, 2018"],
  "confidence": 0.92
}
Advanced Patterns
Invoice Extraction
type Invoice {
invoice_number: FactualClaim,
date: FactualClaim,
vendor: FactualClaim,
total: Currency,
items: List<LineItem>,
confidence: ConfidenceScore
}
type LineItem {
description: FactualClaim,
quantity: Integer,
price: Currency,
total: Currency
}
flow ExtractInvoice(doc: Document) -> Invoice {
step ExtractHeader {
given: doc
probe doc for [invoice_number, date, vendor, total]
output: InvoiceHeader
}
step ExtractLineItems {
given: doc
ask: "Extract all line items with description, quantity, and price"
output: List<LineItem>
}
validate ExtractLineItems.output against LineItemSchema {
if any_price < 0.0 -> raise InvalidDataError
if sum(items.total) != header.total -> warn "Total mismatch"
}
weave [ExtractHeader.output, ExtractLineItems.output] into Invoice
}
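Like the resume flow, the invoice flow needs a run directive. A sketch reusing the persona, context, and anchor defined earlier (invoiceDoc is an assumed Document input):

```axon
run ExtractInvoice(invoiceDoc)
  as DataExtractor
  within ExtractionMode
  constrained_by [NoGuessing]
  output_to: "invoice.json"
```

The same NoGuessing anchor applies, so a missing invoice_number or total surfaces as an explicit unknown rather than a plausible fabrication.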
Product Extraction
type Product {
name: FactualClaim,
price: Currency,
description: FactualClaim,
features: List<FactualClaim>,
reviews: Opinion?,
availability: FactualClaim
}
flow ExtractProduct(doc: Document) -> Product {
step ExtractBasics {
given: doc
probe doc for [name, price, description, availability]
output: ProductBasics
}
step ExtractFeatures {
given: doc
ask: "List all product features and specifications"
output: List<FactualClaim>
}
step ExtractReviews {
given: doc
ask: "Summarize customer reviews and opinions"
output: Opinion
}
weave [ExtractBasics.output, ExtractFeatures.output, ExtractReviews.output] into Product
}
Batch Extraction
flow ExtractBatch(docs: List<Document>) -> List<ResumeData> {
step ExtractAll {
given: docs
ask: "Extract resume data from each document"
output: List<ResumeData>
}
validate ExtractAll.output against BatchSchema {
if any_confidence < 0.75 -> refine(max_attempts: 1)
}
}
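A batch run can then be invoked over a list of documents, following the same run syntax as the single-document flow (resumeDocs is an assumed List<Document>):

```axon
run ExtractBatch(resumeDocs)
  as DataExtractor
  within ExtractionMode
  constrained_by [NoGuessing]
  on_failure: retry(backoff: linear)
  output_to: "batch_extracted.json"
```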
Incremental Extraction
context IncrementalMode {
memory: session
language: "en"
depth: thorough
}
flow IncrementalExtract(doc: Document) -> ExtractedData {
recall("previous extractions") from SessionMemory
step Extract {
given: [doc, PreviousExtractions]
ask: "Extract new information, avoiding duplicates"
output: NewData
}
remember(NewData) -> SessionMemory
}
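Note that IncrementalExtract depends on recall and remember, so it should run within IncrementalMode (memory: session) rather than the stateless ExtractionMode. A sketch, assuming a newDoc input:

```axon
run IncrementalExtract(newDoc)
  as DataExtractor
  within IncrementalMode
  constrained_by [NoGuessing]
```

Running it within ExtractionMode would discard SessionMemory between calls, defeating the duplicate-avoidance step.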
Best Practices
1. Use Probe for Targeted Extraction
// ✅ Good: Targeted field extraction
step Extract {
probe doc for [name, email, phone]
output: Person
}
// ❌ Less efficient: Open-ended
step Extract {
ask: "Extract person information"
output: Person
}
2. Validate Required Fields
validate Person against PersonSchema {
if name == ∅ -> raise MissingRequiredFieldError
if email == ∅ -> warn "Email not found"
}
3. Use Optional Types for Missing Data
type Person {
name: FactualClaim, // Required
email: Email?, // Optional
phone: PhoneNumber? // Optional
}
4. Apply Range Constraints
type Price(0.0..1000000.0) // Prevent negative or absurd prices
type Quantity(1..10000) // Reasonable quantity range
5. Use Very Low Temperature
context ExtractionMode {
temperature: 0.1 // Near-deterministic extraction
}
6. Require Text Evidence
anchor NoGuessing {
require: text_evidence
unknown_response: "Field not found"
}
Related Examples
Contract Analyzer: legal contract analysis with entity extraction
Sentiment Analysis: analyze text sentiment with confidence tracking
Multi-Step Reasoning: complex reasoning with chain-of-thought
Language Reference
Flow: probe and extraction operations
Types: refinement types and validation
Anchor: prevent hallucination
Persona: define extraction specialists