Zerox can extract structured data from documents when you provide a JSON schema. This is useful when you need specific fields from invoices, receipts, forms, or other structured documents rather than a markdown conversion of the whole document.
For multi-page documents, you can use extractPerPage to specify which fields should be extracted from each page individually, while other fields are extracted from the entire document.
This is useful when certain information appears on every page (like line items) while other information appears once in the document (like invoice number).
```typescript
import { zerox } from 'zerox';

const result = await zerox({
  filePath: './multi-page-invoice.pdf',
  openaiAPIKey: process.env.OPENAI_API_KEY,
  schema: {
    type: 'object',
    properties: {
      // Document-level fields (extracted once from all pages)
      invoiceNumber: { type: 'string', description: 'The invoice number' },
      vendorName: { type: 'string', description: 'Name of the vendor' },
      // Page-level fields (extracted from each page)
      lineItems: {
        type: 'array',
        description: 'Items found on this page',
        items: {
          type: 'object',
          properties: {
            name: { type: 'string' },
            quantity: { type: 'number' },
            price: { type: 'number' }
          }
        }
      }
    }
  },
  // Extract lineItems from each page individually
  extractPerPage: ['lineItems']
});

// Result structure:
console.log(result.extracted);
// {
//   invoiceNumber: 'INV-2024-001', // Extracted once from all pages
//   vendorName: 'Acme Corp',       // Extracted once from all pages
//   lineItems: [                   // Extracted per page
//     { page: 1, value: [...] },
//     { page: 2, value: [...] },
//     { page: 3, value: [...] }
//   ]
// }
```
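Per-page fields come back as a list of `{ page, value }` entries, so downstream code often needs one combined list. The following is a minimal sketch of that post-processing step. It assumes the result shape shown in the example; `flattenPerPage` and the sample data are hypothetical and not part of Zerox.

```typescript
// Shape of one per-page entry, as shown in the result structure of the example.
interface PerPageEntry<T> {
  page: number;
  value: T[];
}

interface LineItem {
  name: string;
  quantity: number;
  price: number;
}

// Hypothetical helper (not part of Zerox): merge per-page arrays into one
// list in page order, tagging each item with the page it came from.
function flattenPerPage<T extends object>(
  entries: PerPageEntry<T>[]
): Array<T & { page: number }> {
  const sorted = entries.slice().sort((a, b) => a.page - b.page);
  const merged: Array<T & { page: number }> = [];
  for (const entry of sorted) {
    for (const item of entry.value) {
      merged.push({ ...item, page: entry.page });
    }
  }
  return merged;
}

// Made-up sample data standing in for result.extracted.lineItems.
const sampleLineItems: PerPageEntry<LineItem>[] = [
  { page: 2, value: [{ name: "Bolt", quantity: 10, price: 0.5 }] },
  {
    page: 1,
    value: [
      { name: "Widget", quantity: 2, price: 4 },
      { name: "Gear", quantity: 1, price: 3 },
    ],
  },
];

const allItems = flattenPerPage(sampleLineItems);
const invoiceTotal = allItems.reduce(
  (sum, item) => sum + item.quantity * item.price,
  0
);
console.log(allItems, invoiceTotal);
```

Keeping the page number on each item preserves the link back to the source page, which helps when a value needs to be checked by hand.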
When you only need extracted data and don’t need the OCR markdown output, use extractOnly mode. This skips the OCR step and directly processes the images for extraction, which is faster and uses fewer tokens.
Extract-only mode requires a schema. It cannot be used with maintainFormat as it skips the OCR step entirely.
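The two constraints above can be checked before any document is processed. This is only a sketch: it assumes `extractOnly` and `maintainFormat` are boolean options, models only the option names mentioned in this section, and `checkExtractOnlyOptions` is a hypothetical helper rather than a Zerox function.

```typescript
// Only the options discussed in this section are modeled here (assumption:
// extractOnly and maintainFormat are boolean flags).
interface ExtractionOptions {
  schema?: Record<string, unknown>;
  extractOnly?: boolean;
  maintainFormat?: boolean;
}

// Hypothetical pre-flight check (not part of Zerox): returns a list of
// problems, empty when the combination of options is acceptable.
function checkExtractOnlyOptions(options: ExtractionOptions): string[] {
  const problems: string[] = [];
  if (!options.extractOnly) {
    return problems; // the constraints only apply in extract-only mode
  }
  if (!options.schema) {
    problems.push("extractOnly requires a schema");
  }
  if (options.maintainFormat) {
    problems.push("extractOnly cannot be combined with maintainFormat");
  }
  return problems;
}

const receiptSchema = {
  type: "object",
  properties: { total: { type: "number" } },
};

// Valid: extract-only mode with a schema and without maintainFormat.
const validProblems = checkExtractOnlyOptions({
  schema: receiptSchema,
  extractOnly: true,
});

// Invalid: no schema, and maintainFormat is also set.
const invalidProblems = checkExtractOnlyOptions({
  extractOnly: true,
  maintainFormat: true,
});

console.log(validProblems, invalidProblems);
```

Once the check returns no problems, the same options would be passed to `zerox` together with `filePath` and the API key.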
You can customize the extraction prompt to guide the model:
```typescript
import { zerox } from 'zerox';

const result = await zerox({
  filePath: './medical-form.pdf',
  openaiAPIKey: process.env.OPENAI_API_KEY,
  schema: {
    type: 'object',
    properties: {
      patientName: { type: 'string' },
      dateOfBirth: { type: 'string' },
      diagnosis: { type: 'string' }
    }
  },
  extractionPrompt: `Extract medical information from this form. Use YYYY-MM-DD format for dates. If a field is unclear or missing, return null.`
});
```
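A prompt can ask for a date format and for null on missing fields, but the model may not always comply, so it can be worth checking the extracted values afterwards. The sketch below assumes the extracted object mirrors the schema with null allowed for each field; `isIsoDate`, `reviewExtraction`, and the sample record are hypothetical and not part of Zerox.

```typescript
// Shape implied by the schema in the example, with null allowed as the
// custom prompt requests (an assumption about the returned data).
interface MedicalExtraction {
  patientName: string | null;
  dateOfBirth: string | null;
  diagnosis: string | null;
}

// True only for real calendar dates written as YYYY-MM-DD.
function isIsoDate(value: string): boolean {
  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(value);
  if (!match) return false;
  const year = Number(match[1]);
  const month = Number(match[2]);
  const day = Number(match[3]);
  // Round-trip through Date to reject values such as 2023-02-30.
  const date = new Date(Date.UTC(year, month - 1, day));
  return (
    date.getUTCFullYear() === year &&
    date.getUTCMonth() === month - 1 &&
    date.getUTCDate() === day
  );
}

// Hypothetical helper: list the fields that need manual review.
function reviewExtraction(record: MedicalExtraction): string[] {
  const issues: string[] = [];
  const fields: Array<keyof MedicalExtraction> = [
    "patientName",
    "dateOfBirth",
    "diagnosis",
  ];
  for (const field of fields) {
    if (record[field] === null) issues.push(`${field} is missing`);
  }
  if (record.dateOfBirth !== null && !isIsoDate(record.dateOfBirth)) {
    issues.push("dateOfBirth is not in YYYY-MM-DD format");
  }
  return issues;
}

// Made-up record standing in for result.extracted.
const sampleIssues = reviewExtraction({
  patientName: "Jane Doe",
  dateOfBirth: "03/04/1990",
  diagnosis: null,
});
console.log(sampleIssues);
```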