Skip to main content

Overview

Zerox can extract structured data from documents using JSON schemas. This is particularly useful for invoices, receipts, forms, and other documents where you need specific fields extracted in a consistent format.

Use Case

Perfect for:
  • Automating accounts payable processing
  • Extracting data from receipts and invoices
  • Processing tax documents
  • Building structured databases from unstructured documents
  • Form data extraction

Complete Example

import { zerox } from "zerox";
import { ModelOptions, ModelProvider } from "zerox/node-zerox/dist/types";

// Define the schema for invoice data extraction
const invoiceSchema = {
  type: "object",
  properties: {
    invoice_number: {
      type: "string",
      description: "The invoice number",
    },
    date: {
      type: "string",
      description: "Invoice date",
    },
    vendor_name: {
      type: "string",
      description: "Name of the vendor/seller",
    },
    customer_name: {
      type: "string",
      description: "Name of the customer/buyer",
    },
    items: {
      type: "array",
      description: "Line items on the invoice",
      items: {
        type: "object",
        properties: {
          description: { type: "string" },
          quantity: { type: "number" },
          unit_price: { type: "number" },
          total: { type: "number" },
        },
        required: ["description", "quantity", "unit_price", "total"],
      },
    },
    subtotal: {
      type: "number",
      description: "Subtotal before tax and discounts",
    },
    tax: {
      type: "number",
      description: "Tax amount",
    },
    discount: {
      type: "number",
      description: "Discount amount",
    },
    total: {
      type: "number",
      description: "Final total amount",
    },
  },
  required: [
    "invoice_number",
    "date",
    "vendor_name",
    "items",
    "total",
  ],
};

// Extract data from invoice
const result = await zerox({
  filePath: "https://omni-demo-data.s3.amazonaws.com/test/invoice.pdf",
  credentials: {
    apiKey: process.env.OPENAI_API_KEY || "",
  },
  model: ModelOptions.OPENAI_GPT_4O,
  modelProvider: ModelProvider.OPENAI,
  schema: invoiceSchema,
  extractOnly: true, // Skip OCR, only extract structured data
});

console.log("Extracted invoice data:", JSON.stringify(result.extracted, null, 2));

Expected Output

{
  "extracted": {
    "invoice_number": "INV-2024-001234",
    "date": "2024-03-05",
    "vendor_name": "Acme Corporation",
    "customer_name": "John Smith",
    "items": [
      {
        "description": "Professional Services - March 2024",
        "quantity": 40,
        "unit_price": 150.00,
        "total": 6000.00
      },
      {
        "description": "Cloud Hosting (Monthly)",
        "quantity": 1,
        "unit_price": 299.99,
        "total": 299.99
      }
    ],
    "subtotal": 6299.99,
    "tax": 629.99,
    "discount": 0,
    "total": 6929.98
  },
  "completionTime": 3245,
  "inputTokens": 1842,
  "outputTokens": 156
}

Extract Different Document Types

Receipt Extraction

const receiptSchema = {
  type: "object",
  properties: {
    merchant_name: { type: "string" },
    date: { type: "string" },
    time: { type: "string" },
    items: {
      type: "array",
      items: {
        type: "object",
        properties: {
          name: { type: "string" },
          price: { type: "number" },
        },
      },
    },
    total: { type: "number" },
    payment_method: { type: "string" },
  },
  required: ["merchant_name", "date", "total"],
};

Property Report Extraction

const propertySchema = {
  type: "object",
  properties: {
    commercial_office: {
      type: "object",
      properties: {
        average: { type: "string" },
        median: { type: "string" },
      },
      required: ["average", "median"],
    },
    transactions_by_quarter: {
      type: "array",
      items: {
        type: "object",
        properties: {
          quarter: { type: "string" },
          transactions: { type: "integer" },
        },
        required: ["quarter", "transactions"],
      },
    },
    year: { type: "integer" },
  },
  required: ["commercial_office", "transactions_by_quarter", "year"],
};

Per-Page Extraction

For multi-page documents, extract data from each page separately:
const result = await zerox({
  filePath: "multi-page-invoices.pdf",
  credentials: { apiKey: process.env.OPENAI_API_KEY || "" },
  schema: invoiceSchema,
  extractOnly: true,
  extractPerPage: ["invoice_number", "date", "total"], // Extract these fields per page
});

// Result will have extracted data organized by page
console.log(result.extracted);

Hybrid Extraction

Combine OCR and image-based extraction for better accuracy:
const result = await zerox({
  filePath: "complex-invoice.pdf",
  credentials: { apiKey: process.env.OPENAI_API_KEY || "" },
  schema: invoiceSchema,
  enableHybridExtraction: true, // Uses both markdown and images
});

Custom Extraction Prompts

Guide the extraction with custom instructions:
const result = await zerox({
  filePath: "invoice.pdf",
  credentials: { apiKey: process.env.OPENAI_API_KEY || "" },
  schema: invoiceSchema,
  extractOnly: true,
  extractionPrompt: `
    Extract invoice data with these specific rules:
    - Parse dates in ISO 8601 format (YYYY-MM-DD)
    - Convert all currencies to USD
    - Round all amounts to 2 decimal places
    - Extract line items in order of appearance
  `,
});

Tips and Best Practices

Schema Design: Make required fields only those that MUST be present. Optional fields allow the model to handle variations in document formats.
Field Descriptions: Add detailed descriptions to schema properties to guide the model in understanding what data to extract.
Model Selection: Use gpt-4o for complex invoices with unusual layouts. Use gpt-4o-mini for simple, standard format invoices to save costs.
Direct Image Extraction: Set directImageExtraction: true to extract directly from images instead of OCR text, which can be more accurate for tables and structured data.
Validation: Always validate extracted data against your schema and check for missing required fields before using in production systems.
JSON Schema Standard: Zerox uses the JSON Schema standard. Ensure your schema is valid before extraction.

Using Different Models for Extraction

You can use a different model for extraction than OCR:
const result = await zerox({
  filePath: "invoice.pdf",
  credentials: { apiKey: process.env.OPENAI_API_KEY || "" },
  // Use mini for OCR
  model: ModelOptions.OPENAI_GPT_4O_MINI,
  // Use full model for extraction
  extractionModel: ModelOptions.OPENAI_GPT_4O,
  extractionModelProvider: ModelProvider.OPENAI,
  schema: invoiceSchema,
});

Build docs developers (and LLMs) love