
Overview

This example demonstrates how to extract structured data from unstructured documents using AXON’s probe operation, epistemic types, and validation. It shows how to handle missing fields, enforce data quality, and track confidence scores.

Use Case

Extract structured information from:
  • Resumes and CVs
  • Invoices and receipts
  • Product descriptions
  • News articles
  • Customer feedback
  • Research papers

Complete Code

data_extraction.axon
// AXON Example — Data Extraction
// Extract structured data with validation and confidence tracking

persona DataExtractor {
  domain: ["information extraction", "NLP", "data processing"]
  tone: precise
  confidence_threshold: 0.80
}

context ExtractionMode {
  memory: none
  language: "en"
  depth: thorough
  max_tokens: 2048
  temperature: 0.1
}

anchor NoGuessing {
  require: text_evidence
  confidence_floor: 0.75
  unknown_response: "Field not found in source text"
  on_violation: raise AnchorBreachError
}

type Email where matches(pattern: email_regex)

type PhoneNumber where matches(pattern: phone_regex)

type Currency(0.0..1000000.0)

type Person {
  name: FactualClaim,
  email: Email?,
  phone: PhoneNumber?,
  location: FactualClaim?
}

type Experience {
  company: FactualClaim,
  role: FactualClaim,
  duration: FactualClaim,
  description: Opinion?
}

type ResumeData {
  person: Person,
  experiences: List<Experience>,
  skills: List<FactualClaim>,
  education: List<FactualClaim>,
  confidence: ConfidenceScore
}

flow ExtractResume(doc: Document) -> ResumeData {
  step ExtractPerson {
    given: doc
    probe doc for [name, email, phone, location]
    output: Person
  }
  
  validate ExtractPerson.output against PersonSchema {
    if confidence < 0.80 -> refine(max_attempts: 2)
    if name == ∅ -> raise MissingRequiredFieldError
  }
  
  step ExtractExperience {
    given: doc
    ask: "Extract all work experience entries with company, role, and duration"
    output: List<Experience>
  }
  
  step ExtractSkills {
    given: doc
    probe doc for [skills, technologies, certifications]
    output: SkillMap
  }
  
  step ExtractEducation {
    given: doc
    ask: "Extract educational background"
    output: List<FactualClaim>
  }
  
  weave [
    ExtractPerson.output,
    ExtractExperience.output,
    ExtractSkills.output,
    ExtractEducation.output
  ] into ResumeData {
    format: StructuredReport
  }
}

run ExtractResume(resumeDoc)
  as DataExtractor
  within ExtractionMode
  constrained_by [NoGuessing]
  on_failure: retry(backoff: linear)
  output_to: "extracted.json"
  effort: medium

Key Components

Persona: DataExtractor

persona DataExtractor {
  domain: ["information extraction", "NLP", "data processing"]
  tone: precise
  confidence_threshold: 0.80
}
Defines an extraction specialist:
  • Domain: Information extraction, NLP, data processing
  • Tone: Precise (exact, no embellishment)
  • High threshold: 0.80 for accurate extraction

Context: ExtractionMode

context ExtractionMode {
  memory: none
  language: "en"
  depth: thorough
  max_tokens: 2048
  temperature: 0.1
}
Configured for extraction:
  • Stateless: No memory (each extraction independent)
  • Thorough: Careful examination
  • Very low temperature: 0.1 for near-deterministic extraction

Anchor: NoGuessing

anchor NoGuessing {
  require: text_evidence
  confidence_floor: 0.75
  unknown_response: "Field not found in source text"
  on_violation: raise AnchorBreachError
}
Prevents hallucination:
  • Requires: Evidence from source text
  • Minimum confidence: 0.75
  • Explicit unknowns: “Field not found” instead of guessing
Never let the LLM “fill in” missing fields with plausible guesses. Use the NoGuessing anchor to ensure all extractions cite source text.

Custom Types with Validation

type Email where matches(pattern: email_regex)
type PhoneNumber where matches(pattern: phone_regex)
Refinement types with pattern matching:
  • Compile-time guarantee of format
  • Runtime validation
type Currency(0.0..1000000.0)
Range-constrained for monetary values.
type Person {
  name: FactualClaim,
  email: Email?,
  phone: PhoneNumber?,
  location: FactualClaim?
}
Structured person data:
  • name: Required factual claim
  • email, phone, location: Optional validated fields
type Experience {
  company: FactualClaim,
  role: FactualClaim,
  duration: FactualClaim,
  description: Opinion?
}
Work experience entry:
  • Company, role, duration: Facts
  • Description: Opinion (subjective characterization)

Flow: ExtractResume

Four-step extraction pipeline.

Step 1: ExtractPerson
step ExtractPerson {
  given: doc
  probe doc for [name, email, phone, location]
  output: Person
}
Uses probe for targeted field extraction.

Validation
validate ExtractPerson.output against PersonSchema {
  if confidence < 0.80 -> refine(max_attempts: 2)
  if name == ∅ -> raise MissingRequiredFieldError
}
Ensures:
  • High confidence (≥0.80)
  • Required field (name) present
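The validate block's control flow can be sketched in Python as a retry loop plus a hard failure, with extract_person standing in (hypothetically) for re-running the ExtractPerson step:

```python
# Sketch of the validate block's control flow (Python, not AXON). The
# extract_person callable is a hypothetical stand-in for re-running the
# ExtractPerson step on refine.
class MissingRequiredFieldError(Exception):
    pass

def validate_person(extract_person, threshold=0.80, max_attempts=2):
    result = extract_person()
    attempts = 0
    # refine: re-run the step while confidence is below the threshold
    while result["confidence"] < threshold and attempts < max_attempts:
        result = extract_person()
        attempts += 1
    # name is required; an empty value raises, mirroring `name == ∅`
    if not result.get("name"):
        raise MissingRequiredFieldError("name")
    return result

# Simulated extractor whose confidence improves on the refinement pass
outputs = iter([
    {"name": "John Smith", "confidence": 0.70},
    {"name": "John Smith", "confidence": 0.91},
])
person = validate_person(lambda: next(outputs))
print(person["confidence"])  # 0.91 after one refine attempt
```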
Step 2: ExtractExperience
step ExtractExperience {
  given: doc
  ask: "Extract all work experience entries with company, role, and duration"
  output: List<Experience>
}
Extracts multiple experience entries.

Step 3: ExtractSkills
step ExtractSkills {
  given: doc
  probe doc for [skills, technologies, certifications]
  output: SkillMap
}
Probes for skills-related fields.

Step 4: ExtractEducation
step ExtractEducation {
  given: doc
  ask: "Extract educational background"
  output: List<FactualClaim>
}
Extracts education as a list of facts.

Synthesis
weave [
  ExtractPerson.output,
  ExtractExperience.output,
  ExtractSkills.output,
  ExtractEducation.output
] into ResumeData {
  format: StructuredReport
}
Combines all extractions into structured output.
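A minimal Python sketch of what weave produces, assuming the overall confidence is the minimum of the step confidences (the flow above does not state an aggregation rule, so that part is an assumption):

```python
# Minimal sketch of weave (Python, not AXON): merge the four step outputs
# into one ResumeData-shaped dict. Carrying the minimum step confidence
# forward is an assumed aggregation rule.
def weave(person, experiences, skills, education, confidences):
    return {
        "type": "ResumeData",
        "person": person,
        "experiences": experiences,
        "skills": skills,
        "education": education,
        "confidence": min(confidences),
    }

resume = weave(
    person={"name": "John Smith"},
    experiences=[{"company": "TechCorp Inc.", "role": "Senior Software Engineer"}],
    skills=["Python", "Docker"],
    education=["B.S. Computer Science, Stanford University, 2018"],
    confidences=[0.95, 0.92, 0.94, 0.97],
)
print(resume["confidence"])  # 0.92, the weakest step bounds the whole record
```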

Usage

Run Extraction

# Validate
axon check data_extraction.axon

# Compile
axon compile data_extraction.axon

# Execute
axon run data_extraction.axon --backend anthropic --trace

Example Input (Resume)

John Smith
Senior Software Engineer
[email protected] | (555) 123-4567 | San Francisco, CA

EXPERIENCE

TechCorp Inc. | Senior Software Engineer | 2020 - Present
- Led team of 5 engineers building cloud infrastructure
- Designed and implemented microservices architecture
- Reduced deployment time by 60%

StartupXYZ | Software Engineer | 2018 - 2020
- Developed full-stack web applications using React and Node.js
- Implemented CI/CD pipelines

SKILLS
Python, JavaScript, React, Node.js, Docker, Kubernetes, AWS, PostgreSQL

EDUCATION
B.S. Computer Science, Stanford University, 2018

Example Output

{
  "type": "ResumeData",
  "person": {
    "name": "John Smith",
    "email": "[email protected]",
    "phone": "(555) 123-4567",
    "location": "San Francisco, CA"
  },
  "experiences": [
    {
      "company": "TechCorp Inc.",
      "role": "Senior Software Engineer",
      "duration": "2020 - Present",
      "description": "Led team of 5 engineers building cloud infrastructure"
    },
    {
      "company": "StartupXYZ",
      "role": "Software Engineer",
      "duration": "2018 - 2020",
      "description": "Developed full-stack web applications"
    }
  ],
  "skills": [
    "Python",
    "JavaScript",
    "React",
    "Node.js",
    "Docker",
    "Kubernetes",
    "AWS",
    "PostgreSQL"
  ],
  "education": [
    "B.S. Computer Science, Stanford University, 2018"
  ],
  "confidence": 0.92
}

Advanced Patterns

Invoice Extraction

type Invoice {
  invoice_number: FactualClaim,
  date: FactualClaim,
  vendor: FactualClaim,
  total: Currency,
  items: List<LineItem>,
  confidence: ConfidenceScore
}

type LineItem {
  description: FactualClaim,
  quantity: Integer,
  price: Currency,
  total: Currency
}

flow ExtractInvoice(doc: Document) -> Invoice {
  step ExtractHeader {
    given: doc
    probe doc for [invoice_number, date, vendor, total]
    output: InvoiceHeader
  }
  
  step ExtractLineItems {
    given: doc
    ask: "Extract all line items with description, quantity, and price"
    output: List<LineItem>
  }
  
  validate ExtractLineItems.output against LineItemSchema {
    if any_price < 0.0 -> raise InvalidDataError
    if sum(items.total) != header.total -> warn "Total mismatch"
  }
  
  weave [ExtractHeader.output, ExtractLineItems.output] into Invoice
}
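The line-item validation amounts to two checks, sketched here in Python: a negative price is a hard error, while a header/items total mismatch only warns. Comparing floating-point totals with a small tolerance is my assumption, not something the AXON snippet specifies.

```python
import math
import warnings

# Sketch of the invoice line-item checks (Python, not AXON).
def check_invoice(header_total, items):
    # if any_price < 0.0 -> raise InvalidDataError
    if any(item["price"] < 0.0 for item in items):
        raise ValueError("InvalidDataError: negative price")
    # if sum(items.total) != header.total -> warn "Total mismatch"
    items_total = sum(item["total"] for item in items)
    if not math.isclose(items_total, header_total, abs_tol=0.01):
        warnings.warn(f"Total mismatch: header {header_total} vs items {items_total}")
    return items_total

items = [
    {"description": "Widget", "quantity": 2, "price": 9.99, "total": 19.98},
    {"description": "Gadget", "quantity": 1, "price": 5.00, "total": 5.00},
]
check_invoice(24.98, items)  # totals reconcile, so no warning is raised
```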

Product Data Extraction

type Product {
  name: FactualClaim,
  price: Currency,
  description: FactualClaim,
  features: List<FactualClaim>,
  reviews: Opinion?,
  availability: FactualClaim
}

flow ExtractProduct(doc: Document) -> Product {
  step ExtractBasics {
    given: doc
    probe doc for [name, price, description, availability]
    output: ProductBasics
  }
  
  step ExtractFeatures {
    given: doc
    ask: "List all product features and specifications"
    output: List<FactualClaim>
  }
  
  step ExtractReviews {
    given: doc
    ask: "Summarize customer reviews and opinions"
    output: Opinion
  }
  
  weave [ExtractBasics.output, ExtractFeatures.output, ExtractReviews.output] into Product
}

Multi-Document Extraction

flow ExtractBatch(docs: List<Document>) -> List<ResumeData> {
  step ExtractAll {
    given: docs
    ask: "Extract resume data from each document"
    output: List<ResumeData>
  }
  
  validate ExtractAll.output against BatchSchema {
    if any_confidence < 0.75 -> refine(max_attempts: 1)
  }
}

Incremental Extraction with Memory

context IncrementalMode {
  memory: session
  language: "en"
  depth: thorough
}

flow IncrementalExtract(doc: Document) -> ExtractedData {
  recall("previous extractions") from SessionMemory
  
  step Extract {
    given: [doc, PreviousExtractions]
    ask: "Extract new information, avoiding duplicates"
    output: NewData
  }
  
  remember(NewData) -> SessionMemory
}
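The recall/remember cycle behaves like a growing session set, sketched below in Python; deduplicating by exact value is an assumed rule, since the flow only says "avoiding duplicates".

```python
# Sketch of the recall/remember cycle (Python, not AXON): session memory
# accumulates prior extractions so overlapping documents only yield new
# items. Dedup-by-exact-value is an assumed rule.
session_memory: set[str] = set()

def incremental_extract(doc_items: list[str]) -> list[str]:
    previous = session_memory                # recall(...) from SessionMemory
    new_data = [x for x in doc_items if x not in previous]
    session_memory.update(new_data)          # remember(NewData) -> SessionMemory
    return new_data

print(incremental_extract(["Python", "Docker"]))      # both are new
print(incremental_extract(["Docker", "Kubernetes"]))  # only Kubernetes is new
```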

Best Practices

1. Use Probe for Targeted Extraction

// ✅ Good: Targeted field extraction
step Extract {
  probe doc for [name, email, phone]
  output: Person
}

// ❌ Less efficient: Open-ended
step Extract {
  ask: "Extract person information"
  output: Person
}

2. Validate Required Fields

validate Person against PersonSchema {
  if name == ∅ -> raise MissingRequiredFieldError
  if email == ∅ -> warn "Email not found"
}

3. Use Optional Types for Missing Data

type Person {
  name: FactualClaim,      // Required
  email: Email?,           // Optional
  phone: PhoneNumber?      // Optional
}

4. Apply Range Constraints

type Price(0.0..1000000.0)  // Prevent negative or absurd prices
type Quantity(1..10000)     // Reasonable quantity range

5. Use Very Low Temperature

context ExtractionMode {
  temperature: 0.1  // Near-deterministic extraction
}

6. Require Text Evidence

anchor NoGuessing {
  require: text_evidence
  unknown_response: "Field not found"
}

Related Examples

  • Contract Analyzer: Legal contract analysis with entity extraction
  • Sentiment Analysis: Analyze text sentiment with confidence tracking
  • Multi-Step Reasoning: Complex reasoning with chain-of-thought

Related Concepts

  • Flow — Probe and extraction operations
  • Types — Refinement types and validation
  • Anchor — Prevent hallucination
  • Persona — Define extraction specialists
