Overview
Zerox can extract structured data from documents when you provide a JSON schema. This feature uses vision models to extract information directly, without first converting the document to markdown.
Provide a JSON schema to extract specific fields:
import { zerox } from 'zerox-ocr';
const result = await zerox({
filePath: 'invoice.pdf',
credentials: { apiKey: process.env.OPENAI_API_KEY },
schema: {
type: "object",
properties: {
invoiceNumber: { type: "string" },
invoiceDate: { type: "string" },
totalAmount: { type: "number" },
vendor: { type: "string" }
}
}
});
// Access extracted data
console.log(result.extracted);
// { invoiceNumber: "INV-001", invoiceDate: "2024-03-15", ... }
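# Python version of the example above
import asyncio
import os

from zerox import zerox


async def main():
    # `await` is only valid inside a coroutine, so the call is wrapped in main()
    result = await zerox(
        file_path="invoice.pdf",
        credentials={"api_key": os.getenv("OPENAI_API_KEY")},
        schema={
            "type": "object",
            "properties": {
                "invoice_number": {"type": "string"},
                "invoice_date": {"type": "string"},
                "total_amount": {"type": "number"},
                "vendor": {"type": "string"},
            },
        },
    )
    # Access extracted data
    print(result.extracted)


asyncio.run(main())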
Schema Parameters
schema: JSON Schema object defining the structure to extract. Supports:
- Primitive types: string, number, boolean
- Complex types: object, array
- Nested structures
extractionPrompt: Custom prompt to guide the extraction process.
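To make the supported schema types concrete, here is a minimal sketch of a checker that tests whether a returned value matches a schema built from those types. The `Schema` type and the `conforms` function are hypothetical helpers written for illustration; they are not part of Zerox, and they cover only the types listed above.

```typescript
// Minimal subset of JSON Schema covering only the types listed above.
type Schema =
  | { type: "string" | "number" | "boolean" }
  | { type: "array"; items: Schema }
  | { type: "object"; properties: Record<string, Schema> };

// Hypothetical helper: returns true when `value` matches `schema`.
// Missing object properties are tolerated, because this sketch does not model `required`.
function conforms(schema: Schema, value: unknown): boolean {
  switch (schema.type) {
    case "string":
    case "number":
    case "boolean":
      return typeof value === schema.type;
    case "array": {
      const items = schema.items;
      return Array.isArray(value) && value.every((entry) => conforms(items, entry));
    }
    case "object": {
      if (typeof value !== "object" || value === null || Array.isArray(value)) {
        return false;
      }
      const record = value as Record<string, unknown>;
      return Object.entries(schema.properties).every(
        ([key, sub]) => !(key in record) || conforms(sub, record[key])
      );
    }
  }
}
```

A check like this can run on result.extracted before the data is passed downstream.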
By default, extraction analyzes the entire document and returns a single result:
const result = await zerox({
filePath: 'contract.pdf',
credentials: { apiKey: process.env.OPENAI_API_KEY },
schema: {
type: "object",
properties: {
parties: {
type: "array",
items: { type: "string" }
},
effectiveDate: { type: "string" },
terms: { type: "string" }
}
}
});
console.log(result.extracted);
// { parties: ["Company A", "Company B"], effectiveDate: "2024-01-01", ... }
Use extractPerPage to extract specific fields from each page individually:
const result = await zerox({
filePath: 'multi-page-invoice.pdf',
credentials: { apiKey: process.env.OPENAI_API_KEY },
schema: {
type: "object",
properties: {
// Full document fields
invoiceNumber: { type: "string" },
totalAmount: { type: "number" },
// Per-page fields
lineItems: {
type: "array",
items: {
type: "object",
properties: {
description: { type: "string" },
quantity: { type: "number" },
price: { type: "number" }
}
}
}
}
},
extractPerPage: ['lineItems'] // Extract lineItems from each page
});
// Result structure
console.log(result.extracted);
/*
{
invoiceNumber: "INV-001", // Full document
totalAmount: 1500.00, // Full document
lineItems: [ // Per-page extraction
{ page: 1, value: [...] },
{ page: 2, value: [...] },
{ page: 3, value: [...] }
]
}
*/
extractPerPage: Array of schema property names to extract per page rather than from the full document. Each listed field is returned as an array of { page, value } entries, one entry per page.
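Because per-page fields come back as one { page, value } entry per page, downstream code often needs them as a single flat list. The sketch below assumes the result shape shown in the example above; `PageValue` and `flattenPerPage` are hypothetical names introduced here, not Zerox exports.

```typescript
// Shape of one per-page entry, as shown in the result structure above.
interface PageValue<T> {
  page: number;
  value: T;
}

// Hypothetical helper: merges per-page arrays into one list,
// tagging each item with the page it was extracted from.
function flattenPerPage<T extends object>(
  perPage: PageValue<T[]>[]
): Array<T & { page: number }> {
  const flat: Array<T & { page: number }> = [];
  for (const { page, value } of perPage) {
    for (const item of value) {
      flat.push({ ...item, page });
    }
  }
  return flat;
}
```

For the invoice example, flattenPerPage(result.extracted.lineItems) would yield every line item in page order, each carrying its page number.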
Skip OCR and only perform extraction by setting extractOnly: true:
const result = await zerox({
filePath: 'form.pdf',
credentials: { apiKey: process.env.OPENAI_API_KEY },
extractOnly: true, // Skip OCR, only extract data
schema: {
type: "object",
properties: {
firstName: { type: "string" },
lastName: { type: "string" },
email: { type: "string" },
phoneNumber: { type: "string" }
}
}
});
// result.pages will be empty
// result.extracted contains the data
console.log(result.extracted);
extractOnly: When true, skips OCR and only performs extraction. Requires schema to be provided. The pages array will be empty and only extracted will be populated.
In extract-only mode, directImageExtraction is automatically enabled, meaning the vision model directly analyzes images rather than extracted text.
By default, extraction runs on the OCR text. Set directImageExtraction: true to have the model read the page images instead:
const result = await zerox({
filePath: 'infographic.pdf',
credentials: { apiKey: process.env.OPENAI_API_KEY },
directImageExtraction: true, // Extract from images directly
schema: {
type: "object",
properties: {
chartTitle: { type: "string" },
dataPoints: {
type: "array",
items: {
type: "object",
properties: {
label: { type: "string" },
value: { type: "number" }
}
}
}
}
}
});
directImageExtraction: When true, extraction analyzes images directly rather than OCR text. Useful for charts, infographics, and visual data.
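Hybrid extraction gives the extraction model both the OCR text and the page images, which can help when a document mixes prose with charts or other visual content: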
const result = await zerox({
filePath: 'report.pdf',
credentials: { apiKey: process.env.OPENAI_API_KEY },
enableHybridExtraction: true, // Use both text and images
schema: {
type: "object",
properties: {
title: { type: "string" },
summary: { type: "string" },
charts: {
type: "array",
items: {
type: "object",
properties: {
name: { type: "string" },
data: { type: "string" }
}
}
}
}
}
});
enableHybridExtraction: When true, provides both OCR text and images to the extraction model. Requires schema to be provided. Cannot be used with directImageExtraction or extractOnly.
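The rules for combining extractOnly, directImageExtraction, and enableHybridExtraction can be summarized in a small pre-flight check. This is a sketch of the documented constraints only; `checkExtractionFlags` and `extractionInput` are hypothetical helpers, and the library's own validation may differ in wording and behavior.

```typescript
// Subset of the options discussed above, for illustration.
interface ExtractionFlags {
  schema?: object;
  extractOnly?: boolean;
  directImageExtraction?: boolean;
  enableHybridExtraction?: boolean;
}

type ExtractionInput = "text" | "image" | "hybrid";

// Hypothetical helper: lists rule violations; an empty array means the combination is allowed.
function checkExtractionFlags(flags: ExtractionFlags): string[] {
  const problems: string[] = [];
  if (flags.extractOnly && !flags.schema) {
    problems.push("extractOnly requires schema");
  }
  if (flags.enableHybridExtraction) {
    if (!flags.schema) {
      problems.push("enableHybridExtraction requires schema");
    }
    if (flags.directImageExtraction) {
      problems.push("enableHybridExtraction cannot be used with directImageExtraction");
    }
    if (flags.extractOnly) {
      problems.push("enableHybridExtraction cannot be used with extractOnly");
    }
  }
  return problems;
}

// Hypothetical helper: what the extraction model receives under the rules above.
function extractionInput(flags: ExtractionFlags): ExtractionInput {
  if (flags.enableHybridExtraction) return "hybrid";
  // Extract-only mode enables direct image extraction automatically.
  if (flags.directImageExtraction || flags.extractOnly) return "image";
  return "text"; // default: extraction runs on OCR text
}
```

Running the check before calling zerox surfaces an invalid combination early, without spending any model requests.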
Use different models and credentials for extraction:
const result = await zerox({
filePath: 'document.pdf',
// OCR model
model: 'gpt-4o-mini',
credentials: { apiKey: process.env.OPENAI_API_KEY },
// Extraction model (more powerful)
extractionModel: 'gpt-4o',
extractionCredentials: { apiKey: process.env.OPENAI_API_KEY },
extractionLlmParams: {
temperature: 0.1,
maxTokens: 4000
},
schema: {
type: "object",
properties: {
summary: { type: "string" },
keyPoints: {
type: "array",
items: { type: "string" }
}
}
}
});
extractionModel: Model to use for extraction. Defaults to the value of model.
Provider for the extraction model. Defaults to the value of modelProvider.
extractionCredentials: Credentials for the extraction model. Defaults to the value of credentials.
extractionLlmParams: LLM parameters for extraction. Defaults to the value of llmParams.
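The fallback behavior described above can be pictured as a simple resolution step: each extraction setting is used if present, otherwise the corresponding OCR setting applies. The sketch below is illustrative rather than the library's code. `resolveExtractionSettings` is a hypothetical helper, and the option name `extractionModelProvider` is an assumption, since the text above does not name the provider option.

```typescript
interface Credentials {
  apiKey?: string;
}

type LlmParams = Record<string, unknown>;

interface ModelOptions {
  model?: string;
  modelProvider?: string;
  credentials?: Credentials;
  llmParams?: LlmParams;
  extractionModel?: string;
  extractionModelProvider?: string; // assumed name, not confirmed by the text above
  extractionCredentials?: Credentials;
  extractionLlmParams?: LlmParams;
}

// Hypothetical helper: applies the documented defaults for the extraction step.
function resolveExtractionSettings(options: ModelOptions) {
  return {
    model: options.extractionModel ?? options.model,
    modelProvider: options.extractionModelProvider ?? options.modelProvider,
    credentials: options.extractionCredentials ?? options.credentials,
    llmParams: options.extractionLlmParams ?? options.llmParams,
  };
}
```

With only extractionModel set, for instance, the extraction step would reuse the OCR credentials and llmParams.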
Complex Schema Example
const result = await zerox({
filePath: 'financial-report.pdf',
credentials: { apiKey: process.env.OPENAI_API_KEY },
schema: {
type: "object",
properties: {
company: {
type: "object",
properties: {
name: { type: "string" },
fiscalYear: { type: "string" },
sector: { type: "string" }
}
},
financials: {
type: "object",
properties: {
revenue: { type: "number" },
netIncome: { type: "number" },
eps: { type: "number" }
}
},
risks: {
type: "array",
items: {
type: "object",
properties: {
category: { type: "string" },
description: { type: "string" },
severity: {
type: "string",
enum: ["low", "medium", "high"]
}
}
}
}
},
required: ["company", "financials"]
}
});
console.log(result.extracted);
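The schema above marks company and financials as required. The text does not say whether missing required fields cause an error, so a defensive check on the returned object may be worthwhile. The helper below is a hypothetical sketch, not a Zerox function, and it inspects only the top-level required list.

```typescript
// Hypothetical helper: returns the top-level required keys that are absent or null in `data`.
function missingRequired(
  schema: { required?: string[] },
  data: Record<string, unknown>
): string[] {
  const required = schema.required ?? [];
  return required.filter((key) => data[key] === undefined || data[key] === null);
}
```

Calling missingRequired(schema, result.extracted) and treating a non-empty result as a failed extraction keeps incomplete records out of later processing.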
Use extractionPrompt to give the model formatting and normalization guidance for the extracted values:
const result = await zerox({
filePath: 'medical-record.pdf',
credentials: { apiKey: process.env.OPENAI_API_KEY },
extractionPrompt: `
Extract patient information with attention to:
- Date formats should be YYYY-MM-DD
- Medication names should be generic (not brand names)
- Dosages should include units (mg, ml, etc.)
`,
schema: {
type: "object",
properties: {
patientId: { type: "string" },
visitDate: { type: "string" },
medications: {
type: "array",
items: {
type: "object",
properties: {
name: { type: "string" },
dosage: { type: "string" },
frequency: { type: "string" }
}
}
}
}
}
});
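A prompt guides the model but does not guarantee the output format, so checking values such as visitDate after extraction can be useful. The sketch below validates the YYYY-MM-DD format requested in the prompt above; `isIsoDate` is a hypothetical helper, not part of Zerox.

```typescript
// Hypothetical helper: true only for real calendar dates written as YYYY-MM-DD.
function isIsoDate(text: string): boolean {
  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(text);
  if (!match) return false;
  const year = Number(match[1]);
  const month = Number(match[2]);
  const day = Number(match[3]);
  // Date.UTC rolls invalid dates over (Feb 30 becomes Mar 1), so compare the parts back.
  const date = new Date(Date.UTC(year, month - 1, day));
  return (
    date.getUTCFullYear() === year &&
    date.getUTCMonth() === month - 1 &&
    date.getUTCDate() === day
  );
}
```

A value that fails the check can be flagged for review or sent through another extraction attempt.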
The extracted data is available in result.extracted, and counts of successful and failed extraction requests are reported in result.summary.extracted:
const result = await zerox({
filePath: 'document.pdf',
credentials: { apiKey: process.env.OPENAI_API_KEY },
schema: { /* ... */ }
});
// Extracted data
console.log(result.extracted);
// Extraction statistics
console.log(result.summary.extracted);
/*
{
successful: 5, // Number of successful extraction requests
failed: 0 // Number of failed extraction requests
}
*/
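The counts in result.summary.extracted make it possible to detect partial failures before trusting result.extracted. The sketch below assumes the summary shape shown above; `extractionSucceeded` is a hypothetical helper, and treating a missing summary as a failure is an assumption of this sketch.

```typescript
// Shape of the extraction statistics shown above.
interface ExtractionStats {
  successful: number;
  failed: number;
}

// Hypothetical helper: true when at least one extraction request succeeded and none failed.
function extractionSucceeded(summary: { extracted?: ExtractionStats | null }): boolean {
  const stats = summary.extracted;
  if (!stats) return false; // assumption: no stats means extraction did not run
  return stats.failed === 0 && stats.successful > 0;
}
```

A caller might retry or log the document when extractionSucceeded(result.summary) returns false.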
Next Steps