
Overview

This example demonstrates how to extract structured data from unstructured documents using AXON’s probe operation, epistemic types, and validation. It shows how to handle missing fields, enforce data quality, and track confidence scores.

Use Case

Extract structured information from:
  • Resumes and CVs
  • Invoices and receipts
  • Product descriptions
  • News articles
  • Customer feedback
  • Research papers

Complete Code

data_extraction.axon
// AXON Example — Data Extraction
// Extract structured data with validation and confidence tracking

persona DataExtractor {
  domain: ["information extraction", "NLP", "data processing"]
  tone: precise
  confidence_threshold: 0.80
}

context ExtractionMode {
  memory: none
  language: "en"
  depth: thorough
  max_tokens: 2048
  temperature: 0.1
}

anchor NoGuessing {
  require: text_evidence
  confidence_floor: 0.75
  unknown_response: "Field not found in source text"
  on_violation: raise AnchorBreachError
}

type Email where matches(pattern: email_regex)

type PhoneNumber where matches(pattern: phone_regex)

type Currency(0.0..1000000.0)

type Person {
  name: FactualClaim,
  email: Email?,
  phone: PhoneNumber?,
  location: FactualClaim?
}

type Experience {
  company: FactualClaim,
  role: FactualClaim,
  duration: FactualClaim,
  description: Opinion?
}

type ResumeData {
  person: Person,
  experiences: List<Experience>,
  skills: List<FactualClaim>,
  education: List<FactualClaim>,
  confidence: ConfidenceScore
}

flow ExtractResume(doc: Document) -> ResumeData {
  step ExtractPerson {
    given: doc
    probe doc for [name, email, phone, location]
    output: Person
  }
  
  validate ExtractPerson.output against PersonSchema {
    if confidence < 0.80 -> refine(max_attempts: 2)
    if name == ∅ -> raise MissingRequiredFieldError
  }
  
  step ExtractExperience {
    given: doc
    ask: "Extract all work experience entries with company, role, and duration"
    output: List<Experience>
  }
  
  step ExtractSkills {
    given: doc
    probe doc for [skills, technologies, certifications]
    output: SkillMap
  }
  
  step ExtractEducation {
    given: doc
    ask: "Extract educational background"
    output: List<FactualClaim>
  }
  
  weave [
    ExtractPerson.output,
    ExtractExperience.output,
    ExtractSkills.output,
    ExtractEducation.output
  ] into ResumeData {
    format: StructuredReport
  }
}

run ExtractResume(resumeDoc)
  as DataExtractor
  within ExtractionMode
  constrained_by [NoGuessing]
  on_failure: retry(backoff: linear)
  output_to: "extracted.json"
  effort: medium

Key Components

Persona: DataExtractor

persona DataExtractor {
  domain: ["information extraction", "NLP", "data processing"]
  tone: precise
  confidence_threshold: 0.80
}
Defines an extraction specialist:
  • Domain: Information extraction, NLP, data processing
  • Tone: Precise (exact, no embellishment)
  • High threshold: 0.80 for accurate extraction

Context: ExtractionMode

context ExtractionMode {
  memory: none
  language: "en"
  depth: thorough
  max_tokens: 2048
  temperature: 0.1
}
Configured for extraction:
  • Stateless: No memory (each extraction independent)
  • Thorough: Careful examination
  • Very low temperature: 0.1 for near-deterministic extraction

Anchor: NoGuessing

anchor NoGuessing {
  require: text_evidence
  confidence_floor: 0.75
  unknown_response: "Field not found in source text"
  on_violation: raise AnchorBreachError
}
Prevents hallucination:
  • Requires: Evidence from source text
  • Minimum confidence: 0.75
  • Explicit unknowns: “Field not found” instead of guessing
Never let the LLM “fill in” missing fields with plausible guesses. Use the NoGuessing anchor to ensure all extractions cite source text.

Custom Types with Validation

type Email where matches(pattern: email_regex)
type PhoneNumber where matches(pattern: phone_regex)
Refinement types with pattern matching:
  • Compile-time guarantee of format
  • Runtime validation
type Currency(0.0..1000000.0)
Range-constrained for monetary values.
type Person {
  name: FactualClaim,
  email: Email?,
  phone: PhoneNumber?,
  location: FactualClaim?
}
Structured person data:
  • name: Required factual claim
  • email, phone, location: Optional validated fields
type Experience {
  company: FactualClaim,
  role: FactualClaim,
  duration: FactualClaim,
  description: Opinion?
}
Work experience entry:
  • Company, role, duration: Facts
  • Description: Opinion (subjective characterization)

Flow: ExtractResume

Four-step extraction pipeline.

Step 1: ExtractPerson
step ExtractPerson {
  given: doc
  probe doc for [name, email, phone, location]
  output: Person
}
Uses probe for targeted field extraction.

Validation
validate ExtractPerson.output against PersonSchema {
  if confidence < 0.80 -> refine(max_attempts: 2)
  if name == ∅ -> raise MissingRequiredFieldError
}
Ensures:
  • High confidence (≥0.80)
  • Required field (name) present
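The validate block's control flow can be sketched in Python as a retry loop plus a hard failure, with extract_person standing in (hypothetically) for re-running the ExtractPerson step:

```python
# Sketch of the validate block's control flow (Python, not AXON). The
# extract_person callable is a hypothetical stand-in for re-running the
# ExtractPerson step on refine.
class MissingRequiredFieldError(Exception):
    pass

def validate_person(extract_person, threshold=0.80, max_attempts=2):
    result = extract_person()
    attempts = 0
    # refine: re-run the step while confidence is below the threshold
    while result["confidence"] < threshold and attempts < max_attempts:
        result = extract_person()
        attempts += 1
    # name is required; an empty value raises, mirroring `name == ∅`
    if not result.get("name"):
        raise MissingRequiredFieldError("name")
    return result

# Simulated extractor whose confidence improves on the refinement pass
outputs = iter([
    {"name": "John Smith", "confidence": 0.70},
    {"name": "John Smith", "confidence": 0.91},
])
person = validate_person(lambda: next(outputs))
print(person["confidence"])  # 0.91 after one refine attempt
```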
Step 2: ExtractExperience
step ExtractExperience {
  given: doc
  ask: "Extract all work experience entries with company, role, and duration"
  output: List<Experience>
}
Extracts multiple experience entries.

Step 3: ExtractSkills
step ExtractSkills {
  given: doc
  probe doc for [skills, technologies, certifications]
  output: SkillMap
}
Probes for skills-related fields.

Step 4: ExtractEducation
step ExtractEducation {
  given: doc
  ask: "Extract educational background"
  output: List<FactualClaim>
}
Extracts education as a list of facts.

Synthesis
weave [
  ExtractPerson.output,
  ExtractExperience.output,
  ExtractSkills.output,
  ExtractEducation.output
] into ResumeData {
  format: StructuredReport
}
Combines all extractions into structured output.
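A minimal Python sketch of what weave produces, assuming the overall confidence is the minimum of the step confidences (the flow above does not state an aggregation rule, so that part is an assumption):

```python
# Minimal sketch of weave (Python, not AXON): merge the four step outputs
# into one ResumeData-shaped dict. Carrying the minimum step confidence
# forward is an assumed aggregation rule.
def weave(person, experiences, skills, education, confidences):
    return {
        "type": "ResumeData",
        "person": person,
        "experiences": experiences,
        "skills": skills,
        "education": education,
        "confidence": min(confidences),
    }

resume = weave(
    person={"name": "John Smith"},
    experiences=[{"company": "TechCorp Inc.", "role": "Senior Software Engineer"}],
    skills=["Python", "Docker"],
    education=["B.S. Computer Science, Stanford University, 2018"],
    confidences=[0.95, 0.92, 0.94, 0.97],
)
print(resume["confidence"])  # 0.92, the weakest step bounds the whole record
```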

Usage

Run Extraction

# Validate
axon check data_extraction.axon

# Compile
axon compile data_extraction.axon

# Execute
axon run data_extraction.axon --backend anthropic --trace

Example Input (Resume)

John Smith
Senior Software Engineer
[email protected] | (555) 123-4567 | San Francisco, CA

EXPERIENCE

TechCorp Inc. | Senior Software Engineer | 2020 - Present
- Led team of 5 engineers building cloud infrastructure
- Designed and implemented microservices architecture
- Reduced deployment time by 60%

StartupXYZ | Software Engineer | 2018 - 2020
- Developed full-stack web applications using React and Node.js
- Implemented CI/CD pipelines

SKILLS
Python, JavaScript, React, Node.js, Docker, Kubernetes, AWS, PostgreSQL

EDUCATION
B.S. Computer Science, Stanford University, 2018

Example Output

{
  "type": "ResumeData",
  "person": {
    "name": "John Smith",
    "email": "[email protected]",
    "phone": "(555) 123-4567",
    "location": "San Francisco, CA"
  },
  "experiences": [
    {
      "company": "TechCorp Inc.",
      "role": "Senior Software Engineer",
      "duration": "2020 - Present",
      "description": "Led team of 5 engineers building cloud infrastructure"
    },
    {
      "company": "StartupXYZ",
      "role": "Software Engineer",
      "duration": "2018 - 2020",
      "description": "Developed full-stack web applications"
    }
  ],
  "skills": [
    "Python",
    "JavaScript",
    "React",
    "Node.js",
    "Docker",
    "Kubernetes",
    "AWS",
    "PostgreSQL"
  ],
  "education": [
    "B.S. Computer Science, Stanford University, 2018"
  ],
  "confidence": 0.92
}

Advanced Patterns

Invoice Extraction

type Invoice {
  invoice_number: FactualClaim,
  date: FactualClaim,
  vendor: FactualClaim,
  total: Currency,
  items: List<LineItem>,
  confidence: ConfidenceScore
}

type LineItem {
  description: FactualClaim,
  quantity: Integer,
  price: Currency,
  total: Currency
}

flow ExtractInvoice(doc: Document) -> Invoice {
  step ExtractHeader {
    given: doc
    probe doc for [invoice_number, date, vendor, total]
    output: InvoiceHeader
  }
  
  step ExtractLineItems {
    given: doc
    ask: "Extract all line items with description, quantity, and price"
    output: List<LineItem>
  }
  
  validate ExtractLineItems.output against LineItemSchema {
    if any_price < 0.0 -> raise InvalidDataError
    if sum(items.total) != header.total -> warn "Total mismatch"
  }
  
  weave [ExtractHeader.output, ExtractLineItems.output] into Invoice
}
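The line-item validation amounts to two checks, sketched here in Python: a negative price is a hard error, while a header/items total mismatch only warns. Comparing floating-point totals with a small tolerance is my assumption, not something the AXON snippet specifies.

```python
import math
import warnings

# Sketch of the invoice line-item checks (Python, not AXON).
def check_invoice(header_total, items):
    # if any_price < 0.0 -> raise InvalidDataError
    if any(item["price"] < 0.0 for item in items):
        raise ValueError("InvalidDataError: negative price")
    # if sum(items.total) != header.total -> warn "Total mismatch"
    items_total = sum(item["total"] for item in items)
    if not math.isclose(items_total, header_total, abs_tol=0.01):
        warnings.warn(f"Total mismatch: header {header_total} vs items {items_total}")
    return items_total

items = [
    {"description": "Widget", "quantity": 2, "price": 9.99, "total": 19.98},
    {"description": "Gadget", "quantity": 1, "price": 5.00, "total": 5.00},
]
check_invoice(24.98, items)  # totals reconcile, so no warning is raised
```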

Product Data Extraction

type Product {
  name: FactualClaim,
  price: Currency,
  description: FactualClaim,
  features: List<FactualClaim>,
  reviews: Opinion?,
  availability: FactualClaim
}

flow ExtractProduct(doc: Document) -> Product {
  step ExtractBasics {
    given: doc
    probe doc for [name, price, description, availability]
    output: ProductBasics
  }
  
  step ExtractFeatures {
    given: doc
    ask: "List all product features and specifications"
    output: List<FactualClaim>
  }
  
  step ExtractReviews {
    given: doc
    ask: "Summarize customer reviews and opinions"
    output: Opinion
  }
  
  weave [ExtractBasics.output, ExtractFeatures.output, ExtractReviews.output] into Product
}

Multi-Document Extraction

flow ExtractBatch(docs: List<Document>) -> List<ResumeData> {
  step ExtractAll {
    given: docs
    ask: "Extract resume data from each document"
    output: List<ResumeData>
  }
  
  validate ExtractAll.output against BatchSchema {
    if any_confidence < 0.75 -> refine(max_attempts: 1)
  }
}

Incremental Extraction with Memory

context IncrementalMode {
  memory: session
  language: "en"
  depth: thorough
}

flow IncrementalExtract(doc: Document) -> ExtractedData {
  recall("previous extractions") from SessionMemory
  
  step Extract {
    given: [doc, PreviousExtractions]
    ask: "Extract new information, avoiding duplicates"
    output: NewData
  }
  
  remember(NewData) -> SessionMemory
}
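The recall/remember cycle behaves like a growing session set, sketched below in Python; deduplicating by exact value is an assumed rule, since the flow only says "avoiding duplicates".

```python
# Sketch of the recall/remember cycle (Python, not AXON): session memory
# accumulates prior extractions so overlapping documents only yield new
# items. Dedup-by-exact-value is an assumed rule.
session_memory: set[str] = set()

def incremental_extract(doc_items: list[str]) -> list[str]:
    previous = session_memory                # recall(...) from SessionMemory
    new_data = [x for x in doc_items if x not in previous]
    session_memory.update(new_data)          # remember(NewData) -> SessionMemory
    return new_data

print(incremental_extract(["Python", "Docker"]))      # both are new
print(incremental_extract(["Docker", "Kubernetes"]))  # only Kubernetes is new
```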

Best Practices

1. Use Probe for Targeted Extraction

// ✅ Good: Targeted field extraction
step Extract {
  probe doc for [name, email, phone]
  output: Person
}

// ❌ Less efficient: Open-ended
step Extract {
  ask: "Extract person information"
  output: Person
}

2. Validate Required Fields

validate Person against PersonSchema {
  if name == ∅ -> raise MissingRequiredFieldError
  if email == ∅ -> warn "Email not found"
}

3. Use Optional Types for Missing Data

type Person {
  name: FactualClaim,      // Required
  email: Email?,           // Optional
  phone: PhoneNumber?      // Optional
}

4. Apply Range Constraints

type Price(0.0..1000000.0)  // Prevent negative or absurd prices
type Quantity(1..10000)     // Reasonable quantity range

5. Use Very Low Temperature

context ExtractionMode {
  temperature: 0.1  // Near-deterministic extraction
}

6. Require Text Evidence

anchor NoGuessing {
  require: text_evidence
  unknown_response: "Field not found"
}

Related Examples

  • Contract Analyzer: Legal contract analysis with entity extraction
  • Sentiment Analysis: Analyze text sentiment with confidence tracking
  • Multi-Step Reasoning: Complex reasoning with chain-of-thought

Related Concepts

  • Flow — Probe and extraction operations
  • Types — Refinement types and validation
  • Anchor — Prevent hallucination
  • Persona — Define extraction specialists
