Overview

Zerox can extract structured data from documents when you provide a JSON schema. By default, extraction runs on the OCR text; it can also analyze the page images directly with a vision model, or use both together (see Direct Image Extraction and Hybrid Extraction).

Basic Extraction

Provide a JSON schema to extract specific fields:
import { zerox } from 'zerox-ocr';

const result = await zerox({
  filePath: 'invoice.pdf',
  credentials: { apiKey: process.env.OPENAI_API_KEY },
  schema: {
    type: "object",
    properties: {
      invoiceNumber: { type: "string" },
      invoiceDate: { type: "string" },
      totalAmount: { type: "number" },
      vendor: { type: "string" }
    }
  }
});

// Access extracted data
console.log(result.extracted);
// { invoiceNumber: "INV-001", invoiceDate: "2024-03-15", ... }
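
This page does not state the static type of result.extracted, so it can be safer to treat it as untyped until checked. The following is a minimal sketch of a type guard for the invoice schema above. InvoiceFields, isInvoiceFields, and the sample object are illustrative names and data, not part of Zerox.

```typescript
// Hypothetical shape matching the invoice schema above; not exported by Zerox.
interface InvoiceFields {
  invoiceNumber: string;
  invoiceDate: string;
  totalAmount: number;
  vendor: string;
}

// Type guard: true only when every field is present with the expected primitive type.
function isInvoiceFields(value: unknown): value is InvoiceFields {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return (
    typeof record.invoiceNumber === "string" &&
    typeof record.invoiceDate === "string" &&
    typeof record.totalAmount === "number" &&
    typeof record.vendor === "string"
  );
}

// Sample data standing in for result.extracted (made up for this sketch).
const sampleExtracted: unknown = {
  invoiceNumber: "INV-001",
  invoiceDate: "2024-03-15",
  totalAmount: 1500,
  vendor: "Example Vendor",
};

if (isInvoiceFields(sampleExtracted)) {
  // Inside this branch the fields are typed, so number methods are available.
  console.log(sampleExtracted.totalAmount.toFixed(2));
}
```

A guard like this keeps a missing or mistyped field from propagating silently into downstream code.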

Schema Parameters

schema (object)
JSON Schema object defining the structure to extract. Supports:
  • Primitive types: string, number, boolean
  • Complex types: object, array
  • Nested structures

extractionPrompt (string)
Custom prompt to guide the extraction process.

Extraction Modes

Full Document Extraction (Default)

By default, extraction analyzes the entire document and returns a single result:
const result = await zerox({
  filePath: 'contract.pdf',
  credentials: { apiKey: process.env.OPENAI_API_KEY },
  schema: {
    type: "object",
    properties: {
      parties: {
        type: "array",
        items: { type: "string" }
      },
      effectiveDate: { type: "string" },
      terms: { type: "string" }
    }
  }
});

console.log(result.extracted);
// { parties: ["Company A", "Company B"], effectiveDate: "2024-01-01", ... }

Per-Page Extraction

Use extractPerPage to extract specific fields from each page individually:
const result = await zerox({
  filePath: 'multi-page-invoice.pdf',
  credentials: { apiKey: process.env.OPENAI_API_KEY },
  schema: {
    type: "object",
    properties: {
      // Full document fields
      invoiceNumber: { type: "string" },
      totalAmount: { type: "number" },
      
      // Per-page fields
      lineItems: {
        type: "array",
        items: {
          type: "object",
          properties: {
            description: { type: "string" },
            quantity: { type: "number" },
            price: { type: "number" }
          }
        }
      }
    }
  },
  extractPerPage: ['lineItems']  // Extract lineItems from each page
});

// Result structure
console.log(result.extracted);
/*
{
  invoiceNumber: "INV-001",  // Full document
  totalAmount: 1500.00,       // Full document
  lineItems: [                // Per-page extraction
    { page: 1, value: [...] },
    { page: 2, value: [...] },
    { page: 3, value: [...] }
  ]
}
*/

extractPerPage (string[])
Array of schema property names to extract per page rather than from the full document. Each listed field is returned as an array of { page, value } entries, one per page.
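
Per-page results often need to be merged back into one list. The sketch below assumes the { page, value } entry shape shown in the result structure above. PerPageEntry, LineItem, flattenLineItems, and the sample data are illustrative and not part of Zerox.

```typescript
// Assumed shape of one per-page entry, based on the result structure above.
interface PerPageEntry<T> {
  page: number;
  value: T;
}

// Mirrors the lineItems item schema from the example.
interface LineItem {
  description: string;
  quantity: number;
  price: number;
}

// Merge the per-page arrays into one list, tagging each item with its source page.
function flattenLineItems(
  entries: PerPageEntry<LineItem[]>[]
): (LineItem & { page: number })[] {
  const merged: (LineItem & { page: number })[] = [];
  for (const entry of entries) {
    for (const item of entry.value) {
      merged.push({ ...item, page: entry.page });
    }
  }
  return merged;
}

// Sample data standing in for result.extracted.lineItems (made up for this sketch).
const samplePages: PerPageEntry<LineItem[]>[] = [
  { page: 1, value: [{ description: "Widget", quantity: 2, price: 10 }] },
  { page: 2, value: [] },
  { page: 3, value: [{ description: "Gadget", quantity: 1, price: 25 }] },
];

const allItems = flattenLineItems(samplePages);
const lineTotal = allItems.reduce((sum, item) => sum + item.quantity * item.price, 0);
console.log(allItems.length, lineTotal);
```

Keeping the page number on each item makes it possible to trace a suspicious value back to the page it came from.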

Extract-Only Mode

Skip OCR and only perform extraction by setting extractOnly: true:
const result = await zerox({
  filePath: 'form.pdf',
  credentials: { apiKey: process.env.OPENAI_API_KEY },
  extractOnly: true,  // Skip OCR, only extract data
  schema: {
    type: "object",
    properties: {
      firstName: { type: "string" },
      lastName: { type: "string" },
      email: { type: "string" },
      phoneNumber: { type: "string" }
    }
  }
});

// result.pages will be empty
// result.extracted contains the data
console.log(result.extracted);

extractOnly (boolean, default: false)
When true, skips OCR and only performs extraction. Requires schema. The pages array will be empty; only extracted is populated.

Note: In extract-only mode, directImageExtraction is enabled automatically, so the vision model analyzes the page images directly rather than OCR text.

Direct Image Extraction

By default, extraction runs on the OCR text. Set directImageExtraction: true to have the model analyze the page images instead:
const result = await zerox({
  filePath: 'infographic.pdf',
  credentials: { apiKey: process.env.OPENAI_API_KEY },
  directImageExtraction: true,  // Extract from images directly
  schema: {
    type: "object",
    properties: {
      chartTitle: { type: "string" },
      dataPoints: {
        type: "array",
        items: {
          type: "object",
          properties: {
            label: { type: "string" },
            value: { type: "number" }
          }
        }
      }
    }
  }
});

directImageExtraction (boolean, default: false)
When true, extraction analyzes images directly rather than OCR text. Useful for charts, infographics, and visual data.

Hybrid Extraction

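Set enableHybridExtraction: true to give the extraction model both the OCR text and the page images, which can help when a document mixes prose with charts or other visual content: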
const result = await zerox({
  filePath: 'report.pdf',
  credentials: { apiKey: process.env.OPENAI_API_KEY },
  enableHybridExtraction: true,  // Use both text and images
  schema: {
    type: "object",
    properties: {
      title: { type: "string" },
      summary: { type: "string" },
      charts: {
        type: "array",
        items: {
          type: "object",
          properties: {
            name: { type: "string" },
            data: { type: "string" }
          }
        }
      }
    }
  }
});

enableHybridExtraction (boolean, default: false)
When true, provides both OCR text and images to the extraction model. Requires schema. Cannot be combined with directImageExtraction or extractOnly.
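
The flag rules from the extract-only, direct image, and hybrid sections can be summarized as a pre-flight check. This is a sketch built only from the statements on this page; Zerox may enforce these rules differently or report them with other messages. ExtractionFlags, checkExtractionFlags, and extractionInput are hypothetical helpers.

```typescript
// Only option names documented on this page; the interface itself is illustrative.
interface ExtractionFlags {
  schema?: Record<string, unknown>;
  extractOnly?: boolean;
  directImageExtraction?: boolean;
  enableHybridExtraction?: boolean;
}

type ExtractionInput = "ocr-text" | "images" | "ocr-text-and-images";

// Return a list of rule violations; an empty list means the combination is allowed.
function checkExtractionFlags(flags: ExtractionFlags): string[] {
  const problems: string[] = [];
  if (flags.extractOnly && !flags.schema) {
    problems.push("extractOnly requires schema");
  }
  if (flags.enableHybridExtraction) {
    if (!flags.schema) {
      problems.push("enableHybridExtraction requires schema");
    }
    if (flags.directImageExtraction) {
      problems.push("enableHybridExtraction cannot be combined with directImageExtraction");
    }
    if (flags.extractOnly) {
      problems.push("enableHybridExtraction cannot be combined with extractOnly");
    }
  }
  return problems;
}

// What the extraction model receives, according to the descriptions above.
function extractionInput(flags: ExtractionFlags): ExtractionInput {
  if (flags.enableHybridExtraction) return "ocr-text-and-images";
  // extractOnly enables directImageExtraction automatically.
  if (flags.extractOnly || flags.directImageExtraction) return "images";
  return "ocr-text";
}

console.log(checkExtractionFlags({ schema: {}, enableHybridExtraction: true, extractOnly: true }));
console.log(extractionInput({ schema: {}, extractOnly: true }));
```

Running a check like this before calling zerox surfaces conflicting options without spending an API request.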

Custom Extraction Models

Use a different model, credentials, or LLM parameters for extraction than for OCR:
const result = await zerox({
  filePath: 'document.pdf',
  
  // OCR model
  model: 'gpt-4o-mini',
  credentials: { apiKey: process.env.OPENAI_API_KEY },
  
  // Extraction model (more powerful)
  extractionModel: 'gpt-4o',
  extractionCredentials: { apiKey: process.env.OPENAI_API_KEY },
  extractionLlmParams: {
    temperature: 0.1,
    maxTokens: 4000
  },
  
  schema: {
    type: "object",
    properties: {
      summary: { type: "string" },
      keyPoints: {
        type: "array",
        items: { type: "string" }
      }
    }
  }
});

extractionModel (string)
Model to use for extraction. Defaults to the same as model.

extractionModelProvider (string)
Provider for the extraction model. Defaults to the same as modelProvider.

extractionCredentials (object)
Credentials for the extraction model. Defaults to the same as credentials.

extractionLlmParams (object)
LLM parameters for extraction. Defaults to the same as llmParams.
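
The defaulting rules above can be read as a simple fallback chain. The sketch below assumes each extraction option falls back as a whole to its OCR counterpart (for example, extractionLlmParams replaces llmParams rather than merging with it); the page does not say which, so treat that as an assumption. ModelOptions and resolveExtractionSettings are hypothetical names.

```typescript
// Option names are taken from this page; the value types are assumptions.
interface ModelOptions {
  model?: string;
  modelProvider?: string;
  credentials?: Record<string, string | undefined>;
  llmParams?: Record<string, unknown>;
  extractionModel?: string;
  extractionModelProvider?: string;
  extractionCredentials?: Record<string, string | undefined>;
  extractionLlmParams?: Record<string, unknown>;
}

// Each extraction setting falls back to the matching OCR setting when unset.
function resolveExtractionSettings(options: ModelOptions) {
  return {
    model: options.extractionModel ?? options.model,
    modelProvider: options.extractionModelProvider ?? options.modelProvider,
    credentials: options.extractionCredentials ?? options.credentials,
    llmParams: options.extractionLlmParams ?? options.llmParams,
  };
}

const resolvedSettings = resolveExtractionSettings({
  model: "gpt-4o-mini",
  credentials: { apiKey: "sk-placeholder" }, // placeholder, not a real key
  extractionModel: "gpt-4o",
  extractionLlmParams: { temperature: 0.1, maxTokens: 4000 },
});

// model and llmParams come from the extraction options; credentials fall back.
console.log(resolvedSettings);
```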

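Complex Schema Example

A schema can nest objects and arrays, restrict a field to a fixed set of values with enum, and mark top-level fields as required:
const result = await zerox({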
  filePath: 'financial-report.pdf',
  credentials: { apiKey: process.env.OPENAI_API_KEY },
  schema: {
    type: "object",
    properties: {
      company: {
        type: "object",
        properties: {
          name: { type: "string" },
          fiscalYear: { type: "string" },
          sector: { type: "string" }
        }
      },
      financials: {
        type: "object",
        properties: {
          revenue: { type: "number" },
          netIncome: { type: "number" },
          eps: { type: "number" }
        }
      },
      risks: {
        type: "array",
        items: {
          type: "object",
          properties: {
            category: { type: "string" },
            description: { type: "string" },
            severity: {
              type: "string",
              enum: ["low", "medium", "high"]
            }
          }
        }
      }
    },
    required: ["company", "financials"]
  }
});

console.log(result.extracted);
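
This page does not say whether Zerox validates the extracted data against required or enum, so a light check after extraction can be useful. The following sketch covers only those two keywords and is not a full JSON Schema validator. missingRequiredFields, hasValidSeverity, and the sample report are illustrative.

```typescript
// Minimal slice of a JSON Schema object: only the keyword this check reads.
interface SchemaWithRequired {
  required?: string[];
}

// List the required top-level keys that are absent or null in the data.
function missingRequiredFields(
  schema: SchemaWithRequired,
  data: Record<string, unknown>
): string[] {
  return (schema.required ?? []).filter(
    (key) => data[key] === undefined || data[key] === null
  );
}

// Values allowed by the severity enum in the schema above.
const allowedSeverities = ["low", "medium", "high"];

function hasValidSeverity(risk: { severity?: unknown }): boolean {
  return (
    typeof risk.severity === "string" &&
    allowedSeverities.indexOf(risk.severity) !== -1
  );
}

// Sample data standing in for result.extracted (made up for this sketch).
const sampleReport: Record<string, unknown> = {
  company: { name: "Example Co", fiscalYear: "2023", sector: "Retail" },
  risks: [{ category: "Market", description: "Demand may fall", severity: "critical" }],
};

console.log(missingRequiredFields({ required: ["company", "financials"] }, sampleReport));
console.log(hasValidSeverity({ severity: "critical" }));
```

For anything beyond these two checks, a dedicated JSON Schema validation library is likely a better fit than hand-written code.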

Custom Extraction Prompt

Provide guidance for the extraction process:
const result = await zerox({
  filePath: 'medical-record.pdf',
  credentials: { apiKey: process.env.OPENAI_API_KEY },
  extractionPrompt: `
    Extract patient information with attention to:
    - Date formats should be YYYY-MM-DD
    - Medication names should be generic (not brand names)
    - Dosages should include units (mg, ml, etc.)
  `,
  schema: {
    type: "object",
    properties: {
      patientId: { type: "string" },
      visitDate: { type: "string" },
      medications: {
        type: "array",
        items: {
          type: "object",
          properties: {
            name: { type: "string" },
            dosage: { type: "string" },
            frequency: { type: "string" }
          }
        }
      }
    }
  }
});
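
A prompt guides the model but does not guarantee the output format, so it can help to verify the formatted fields afterwards. This sketch checks the two formats requested in the prompt above; isIsoDate and hasDosageUnit are hypothetical helpers, and the dosage pattern only recognizes a number followed by a plain alphabetic unit.

```typescript
// True when the text is a valid calendar date written as YYYY-MM-DD.
// Years below 0100 are rejected because Date.UTC maps them to 1900-1999.
function isIsoDate(text: string): boolean {
  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(text);
  if (!match) return false;
  const year = Number(match[1]);
  const month = Number(match[2]);
  const day = Number(match[3]);
  // Build the date and confirm it did not roll over (e.g. Feb 30 becomes Mar 1).
  const date = new Date(Date.UTC(year, month - 1, day));
  return (
    date.getUTCFullYear() === year &&
    date.getUTCMonth() === month - 1 &&
    date.getUTCDate() === day
  );
}

// True when the dosage is a number followed by an alphabetic unit such as mg or ml.
function hasDosageUnit(dosage: string): boolean {
  return /^\d+(\.\d+)?\s*[a-zA-Z]+$/.test(dosage.trim());
}

console.log(isIsoDate("2024-03-15"), isIsoDate("03/15/2024"));
console.log(hasDosageUnit("500 mg"), hasDosageUnit("500"));
```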

Extraction Output

The extracted data is available in result.extracted, and request statistics for the extraction step are in result.summary.extracted:
const result = await zerox({
  filePath: 'document.pdf',
  credentials: { apiKey: process.env.OPENAI_API_KEY },
  schema: { /* ... */ }
});

// Extracted data
console.log(result.extracted);

// Extraction statistics
console.log(result.summary.extracted);
/*
{
  successful: 5,  // Number of successful extraction requests
  failed: 0       // Number of failed extraction requests
}
*/
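
The counts in the summary can drive a simple quality gate. The sketch below assumes the { successful, failed } shape shown above; RequestCounts and extractionFailureRate are illustrative names, and any threshold you apply is your own choice.

```typescript
// Assumed shape of result.summary.extracted, per the example above.
interface RequestCounts {
  successful: number;
  failed: number;
}

// Fraction of extraction requests that failed; 0 when no requests were made.
function extractionFailureRate(counts: RequestCounts): number {
  const total = counts.successful + counts.failed;
  return total === 0 ? 0 : counts.failed / total;
}

// Sample counts standing in for result.summary.extracted.
const sampleCounts: RequestCounts = { successful: 3, failed: 1 };

if (extractionFailureRate(sampleCounts) > 0) {
  console.warn(
    `${sampleCounts.failed} extraction request(s) failed; extracted data may be incomplete`
  );
}
```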
