Schema Extraction

Overview

Zerox can extract structured data from a document when you provide a JSON schema. This is useful when you need specific fields from invoices, receipts, forms, or other structured documents rather than a plain markdown conversion.

Basic Schema Extraction

To extract structured data, provide a JSON schema that defines the structure of the data you want to extract:

import { zerox } from 'zerox';

const result = await zerox({
  filePath: './invoice.pdf',
  openaiAPIKey: process.env.OPENAI_API_KEY,
  schema: {
    type: 'object',
    properties: {
      invoiceNumber: {
        type: 'string',
        description: 'The invoice number'
      },
      date: {
        type: 'string',
        description: 'Invoice date in YYYY-MM-DD format'
      },
      total: {
        type: 'number',
        description: 'Total amount'
      },
      items: {
        type: 'array',
        items: {
          type: 'object',
          properties: {
            description: { type: 'string' },
            quantity: { type: 'number' },
            price: { type: 'number' }
          }
        }
      }
    },
    required: ['invoiceNumber', 'total']
  }
});

console.log(result.extracted);
// {
//   invoiceNumber: 'INV-2024-001',
//   date: '2024-03-05',
//   total: 1250.00,
//   items: [
//     { description: 'Product A', quantity: 2, price: 500.00 },
//     { description: 'Product B', quantity: 1, price: 250.00 }
//   ]
// }
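
Model output can contain misread numbers, so it can be worth cross-checking extracted values before using them. The sketch below shows one way to do that for the invoice shape above. totalMatchesItems is a hypothetical helper, not part of zerox, and it assumes that price is a unit price and that the invoice has no tax or discount lines.

```javascript
// Hypothetical helper (not part of zerox): returns true when the
// line items add up to the extracted total, within a small tolerance.
// Assumes `price` is a unit price and there are no tax/discount lines.
function totalMatchesItems(extracted, tolerance = 0.01) {
  const items = Array.isArray(extracted.items) ? extracted.items : [];
  const sum = items.reduce(
    (acc, item) => acc + (item.quantity ?? 0) * (item.price ?? 0),
    0
  );
  return Math.abs(sum - extracted.total) <= tolerance;
}

// Sample data copied from the example output above.
const sampleInvoice = {
  invoiceNumber: 'INV-2024-001',
  date: '2024-03-05',
  total: 1250.0,
  items: [
    { description: 'Product A', quantity: 2, price: 500.0 },
    { description: 'Product B', quantity: 1, price: 250.0 }
  ]
};

console.log(totalMatchesItems(sampleInvoice)); // true
```

A failed check does not say which value is wrong, only that the document deserves a second look.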

Per-Page Extraction with extractPerPage

For multi-page documents, you can use extractPerPage to specify which fields should be extracted from each page individually, while other fields are extracted from the entire document.

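This is useful when some information appears on every page (like line items) while other information appears only once in the document (like the invoice number). In the result, each per-page field is returned as an array of { page, value } entries, one per page.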

import { zerox } from 'zerox';

const result = await zerox({
  filePath: './multi-page-invoice.pdf',
  openaiAPIKey: process.env.OPENAI_API_KEY,
  schema: {
    type: 'object',
    properties: {
      // Document-level fields (extracted once from all pages)
      invoiceNumber: {
        type: 'string',
        description: 'The invoice number'
      },
      vendorName: {
        type: 'string',
        description: 'Name of the vendor'
      },
      // Page-level fields (extracted from each page)
      lineItems: {
        type: 'array',
        description: 'Items found on this page',
        items: {
          type: 'object',
          properties: {
            name: { type: 'string' },
            quantity: { type: 'number' },
            price: { type: 'number' }
          }
        }
      }
    }
  },
  // Extract lineItems from each page individually
  extractPerPage: ['lineItems']
});

// Result structure:
console.log(result.extracted);
// {
//   invoiceNumber: 'INV-2024-001',  // Extracted once from all pages
//   vendorName: 'Acme Corp',         // Extracted once from all pages
//   lineItems: [                     // Extracted per page
//     { page: 1, value: [...] },
//     { page: 2, value: [...] },
//     { page: 3, value: [...] }
//   ]
// }
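
Downstream code often wants a single list of line items rather than one list per page. The sketch below flattens the { page, value } entries shown above while keeping the page number on each item. flattenPerPage is a hypothetical helper, and the sample items are made up for illustration.

```javascript
// Hypothetical helper (not part of zerox): merges per-page results of the
// form [{ page, value: [...] }, ...] into one flat array, tagging each
// item with the page it came from.
function flattenPerPage(perPageField) {
  return perPageField.flatMap(({ page, value }) =>
    (value ?? []).map((item) => ({ ...item, page }))
  );
}

// Illustrative data in the shape of result.extracted.lineItems.
const perPageLineItems = [
  { page: 1, value: [{ name: 'Widget', quantity: 2, price: 10 }] },
  {
    page: 2,
    value: [
      { name: 'Gadget', quantity: 1, price: 25 },
      { name: 'Cable', quantity: 3, price: 5 }
    ]
  },
  { page: 3, value: [] } // a page with no line items
];

const allItems = flattenPerPage(perPageLineItems);
console.log(allItems.length);  // 3
console.log(allItems[1].name); // Gadget
console.log(allItems[1].page); // 2
```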

Extract-Only Mode

When you only need extracted data and don’t need the OCR markdown output, use extractOnly mode. This skips the OCR step and directly processes the images for extraction, which is faster and uses fewer tokens.

Extract-only mode requires a schema. It cannot be used with maintainFormat as it skips the OCR step entirely.

import { zerox } from 'zerox';

const result = await zerox({
  filePath: './receipt.pdf',
  openaiAPIKey: process.env.OPENAI_API_KEY,
  extractOnly: true,  // Skip OCR, only extract
  schema: {
    type: 'object',
    properties: {
      merchant: { type: 'string' },
      date: { type: 'string' },
      total: { type: 'number' },
      paymentMethod: { type: 'string' }
    },
    required: ['merchant', 'total']
  }
});

// result.pages will be empty
// result.extracted contains the structured data
console.log(result.extracted);
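
The two constraints above (a schema is required, and maintainFormat is not allowed) can be checked before making any API calls. This is a minimal sketch: checkExtractOnlyOptions is a hypothetical pre-flight helper, and zerox itself may report these problems differently.

```javascript
// Hypothetical pre-flight check (not part of zerox). Returns a list of
// human-readable problems; an empty list means the options look compatible.
function checkExtractOnlyOptions(options) {
  const problems = [];
  if (options.extractOnly && !options.schema) {
    problems.push('extractOnly requires a schema');
  }
  if (options.extractOnly && options.maintainFormat) {
    problems.push('extractOnly cannot be combined with maintainFormat');
  }
  return problems;
}

const receiptSchema = {
  type: 'object',
  properties: { merchant: { type: 'string' }, total: { type: 'number' } }
};

// Valid: extract-only with a schema.
console.log(checkExtractOnlyOptions({ extractOnly: true, schema: receiptSchema }).length); // 0

// Invalid: no schema, and maintainFormat is also set.
console.log(checkExtractOnlyOptions({ extractOnly: true, maintainFormat: true }).length); // 2
```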

When to Use Extract-Only Mode

Use extract-only mode when:

You only need structured data, not the full text
You are processing receipts, forms, or other simple documents
You want to minimize token usage and cost
You need faster processing times

Don’t use extract-only mode when:

You need both markdown output and extracted data
You want to use maintainFormat for tables
You need to preserve document formatting

Custom Extraction Prompts

Use the extractionPrompt option to give the model extra guidance, such as date formats or how to handle missing fields:

import { zerox } from 'zerox';

const result = await zerox({
  filePath: './medical-form.pdf',
  openaiAPIKey: process.env.OPENAI_API_KEY,
  schema: {
    type: 'object',
    properties: {
      patientName: { type: 'string' },
      dateOfBirth: { type: 'string' },
      diagnosis: { type: 'string' }
    }
  },
  extractionPrompt: `Extract medical information from this form. 
    Use YYYY-MM-DD format for dates. 
    If a field is unclear or missing, return null.`
});
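
A prompt instruction such as "Use YYYY-MM-DD format for dates" is a request, not a guarantee, so it can help to verify the format after extraction. The sketch below checks a value against the convention used in the prompt above (an ISO date string or null). isIsoDateOrNull is a hypothetical helper, not part of zerox.

```javascript
// Hypothetical helper (not part of zerox): accepts null (field missing)
// or a real calendar date written as YYYY-MM-DD.
function isIsoDateOrNull(value) {
  if (value === null) return true;
  if (typeof value !== 'string' || !/^\d{4}-\d{2}-\d{2}$/.test(value)) {
    return false;
  }
  const parsed = new Date(`${value}T00:00:00Z`);
  // Reject unparseable dates and dates that roll over (e.g. February 30).
  return (
    !Number.isNaN(parsed.getTime()) &&
    parsed.toISOString().slice(0, 10) === value
  );
}

console.log(isIsoDateOrNull('1985-07-14')); // true
console.log(isIsoDateOrNull(null));         // true
console.log(isIsoDateOrNull('07/14/1985')); // false
console.log(isIsoDateOrNull('2024-02-30')); // false
```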

Schema Best Practices

Use Descriptive Field Names and Descriptions

Clear descriptions help the model understand what to extract:

const schema = {
  type: 'object',
  properties: {
    // Better than just: date: { type: 'string' }
    effectiveDate: {
      type: 'string',
      description: 'The date when the contract becomes effective, in YYYY-MM-DD format'
    }
  }
};

Specify Required Fields

Use the required array to mark the fields that must be present in the extracted output:

const schema = {
  type: 'object',
  properties: {
    invoiceNumber: { type: 'string' },
    total: { type: 'number' },
    notes: { type: 'string' }
  },
  required: ['invoiceNumber', 'total']  // notes is optional
};
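
Marking a field as required does not help if the calling code never checks for it. The sketch below lists which required fields are absent from an extracted object. missingRequiredFields is a hypothetical helper, not part of zerox, and it only looks at top-level fields.

```javascript
// Hypothetical helper (not part of zerox): returns the names of required
// top-level fields that are missing or null in the extracted data.
function missingRequiredFields(schema, extracted) {
  const required = schema.required ?? [];
  return required.filter(
    (field) => extracted?.[field] === undefined || extracted[field] === null
  );
}

const invoiceSchema = {
  type: 'object',
  properties: {
    invoiceNumber: { type: 'string' },
    total: { type: 'number' },
    notes: { type: 'string' }
  },
  required: ['invoiceNumber', 'total']
};

// Illustrative output in which the model did not find a total.
const partialResult = { invoiceNumber: 'INV-2024-001', notes: 'Net 30' };

console.log(missingRequiredFields(invoiceSchema, partialResult)); // [ 'total' ]
```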

Use Appropriate Data Types

Match the JSON schema type to your data:

const properties = {
  quantity: { type: 'number' },      // For numeric values
  isActive: { type: 'boolean' },     // For true/false
  tags: {                            // For arrays
    type: 'array',
    items: { type: 'string' }
  },
  metadata: { type: 'object' }       // For nested objects
};
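
To see why the declared type matters, compare the types in a schema with the values that come back. The sketch below is a deliberately simplified check, not a full JSON Schema validator: it handles only a single type keyword on top-level properties, and both function names are hypothetical.

```javascript
// Maps a JavaScript value to the JSON Schema type name it corresponds to.
function jsonTypeOf(value) {
  if (value === null) return 'null';
  if (Array.isArray(value)) return 'array';
  return typeof value; // 'string', 'number', 'boolean', 'object', ...
}

// Hypothetical helper (not part of zerox): reports top-level fields whose
// value does not match the `type` declared in the schema's properties.
function typeMismatches(properties, data) {
  return Object.entries(properties)
    .filter(([key, spec]) => key in data && jsonTypeOf(data[key]) !== spec.type)
    .map(([key, spec]) => `${key}: expected ${spec.type}, got ${jsonTypeOf(data[key])}`);
}

const exampleProperties = {
  quantity: { type: 'number' },
  isActive: { type: 'boolean' },
  tags: { type: 'array', items: { type: 'string' } },
  metadata: { type: 'object' }
};

// Illustrative data where quantity came back as a string.
const exampleData = { quantity: '3', isActive: true, tags: ['urgent'], metadata: {} };

const mismatches = typeMismatches(exampleProperties, exampleData);
console.log(mismatches.length); // 1
console.log(mismatches[0]);     // quantity: expected number, got string
```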

Error Handling

Extraction may fail if the schema is invalid or the document doesn’t contain the expected data. With errorMode set to 'throw', failures raise an exception, so wrap the call in try/catch and still check the result:

import { zerox } from 'zerox';

// mySchema is a JSON schema object defined elsewhere in your code
try {
  const result = await zerox({
    filePath: './document.pdf',
    openaiAPIKey: process.env.OPENAI_API_KEY,
    schema: mySchema,
    errorMode: 'throw'  // Throws errors instead of ignoring them
  });

  if (result.summary.extracted) {
    console.log(`Successful extractions: ${result.summary.extracted.successful}`);
    console.log(`Failed extractions: ${result.summary.extracted.failed}`);
  }

  if (result.extracted) {
    // Process extracted data
  } else {
    console.error('Extraction returned no data');
  }
} catch (error) {
  console.error('Extraction failed:', error);
}
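
When extraction results feed an automated pipeline, the summary counts can be turned into a single status line for logs. This is a minimal sketch that assumes the summary.extracted shape used above, with successful and failed counts. describeExtraction is a hypothetical helper, not part of zerox.

```javascript
// Hypothetical helper (not part of zerox): turns the summary counts
// into a short status string suitable for logging.
function describeExtraction(summary) {
  if (!summary || !summary.extracted) {
    return 'no extraction was attempted';
  }
  const { successful, failed } = summary.extracted;
  const total = successful + failed;
  if (total === 0) {
    return 'no extraction requests were made';
  }
  return `${successful}/${total} extraction requests succeeded`;
}

console.log(describeExtraction({ extracted: { successful: 3, failed: 1 } }));
// 3/4 extraction requests succeeded
console.log(describeExtraction({}));
// no extraction was attempted
```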
