The maintainFormat option keeps formatting consistent for tables that span multiple pages. This matters for documents where a table starts on one page and continues on the pages that follow.
Performance Impact: maintainFormat processes pages sequentially (one at a time) rather than in parallel, making it slower than default processing. Only use this when you need consistent table formatting across pages.
Ideal for documents where tables continue across multiple pages:
import { zerox } from 'zerox';

const result = await zerox({
  filePath: './financial-statement.pdf',
  openaiAPIKey: process.env.OPENAI_API_KEY,
  maintainFormat: true
});
// Without maintainFormat, each page might format the table differently
// With maintainFormat, all pages use the same column structure
Balance sheets, income statements, or reports with structured data:
const result = await zerox({
  filePath: './quarterly-report.pdf',
  openaiAPIKey: process.env.OPENAI_API_KEY,
  maintainFormat: true
});
// Preserves the hierarchical structure and formatting of financial tables
// ❌ Cannot use with extractOnly
const result = await zerox({
  filePath: './table.pdf',
  openaiAPIKey: process.env.OPENAI_API_KEY,
  extractOnly: true,
  maintainFormat: true, // Error: incompatible
  schema: mySchema
});
// Error: Maintain format is only supported in OCR mode
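Since the conflict above only surfaces once zerox is called, it can be convenient to catch it earlier in your own code. The following is a minimal sketch of such a check; `assertMaintainFormatCompatible` is a hypothetical helper written for this page, not part of the zerox API, and it only encodes the rule stated above.

```javascript
// Hypothetical pre-flight check (not a zerox function).
// Mirrors the documented rule: maintainFormat cannot be used with extractOnly.
function assertMaintainFormatCompatible(options) {
  if (options.maintainFormat && options.extractOnly) {
    throw new Error(
      'maintainFormat cannot be combined with extractOnly; remove one of them'
    );
  }
  return options; // returned unchanged so the call can be inlined
}

// Usage: validate the options object before passing it to zerox.
const safeOptions = assertMaintainFormatCompatible({
  filePath: './table.pdf',
  maintainFormat: true
});
```

Returning the options object unchanged lets the check wrap an existing call site without restructuring it.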
Without maintainFormat (parallel):
const result = await zerox({
  filePath: './10-page-table.pdf',
  openaiAPIKey: process.env.OPENAI_API_KEY,
  concurrency: 5 // Processes 5 pages at a time
});
console.log(`Completed in ${result.completionTime}ms`);
// Typical: 15-30 seconds for 10 pages
With maintainFormat (sequential):
const result = await zerox({
  filePath: './10-page-table.pdf',
  openaiAPIKey: process.env.OPENAI_API_KEY,
  maintainFormat: true // Processes pages one at a time
  // concurrency setting is ignored when maintainFormat is true
});
console.log(`Completed in ${result.completionTime}ms`);
// Typical: 45-90 seconds for 10 pages (3-5x slower)
The concurrency parameter is ignored when maintainFormat is enabled because pages must be processed sequentially to maintain context.
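To see why carrying context forces sequential processing, consider the sketch below. It is illustrative only and is not zerox's implementation: `processPage` and `fakeOcr` are hypothetical stand-ins for an OCR call, and the sketch assumes each page is given the previous page's output as its formatting context.

```javascript
// Illustrative only: processPage stands in for an OCR call; it is not a zerox API.
async function processSequentially(pages, processPage) {
  const results = [];
  let previous = null; // output of the previous page, passed on as context
  for (const page of pages) {
    // Each call must wait for the one before it, so the loop cannot run in parallel.
    previous = await processPage(page, previous);
    results.push(previous);
  }
  return results;
}

async function processInParallel(pages, processPage) {
  // All pages start at once, so no page can see its neighbour's output.
  return Promise.all(pages.map((page) => processPage(page, null)));
}

// Fake OCR step: reuses the previous page's column layout when one is available.
const fakeOcr = async (page, previous) => ({
  page,
  columns: previous ? previous.columns : `columns-from-page-${page}`
});

// Usage (inside an async function):
//   const sequential = await processSequentially([1, 2, 3], fakeOcr);
//   const parallel = await processInParallel([1, 2, 3], fakeOcr);
```

In the sequential run every page inherits the layout chosen on page 1, while in the parallel run each page picks its own. That dependency on the previous result is what rules out concurrency.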
You can combine maintainFormat with schema extraction, as long as extractOnly is not set, to get structured data from formatted tables:
import { zerox } from 'zerox';

const result = await zerox({
  filePath: './product-catalog.pdf',
  openaiAPIKey: process.env.OPENAI_API_KEY,
  maintainFormat: true, // Ensures consistent table formatting
  schema: {
    type: 'object',
    properties: {
      products: {
        type: 'array',
        description: 'All products from the catalog',
        items: {
          type: 'object',
          properties: {
            sku: { type: 'string' },
            name: { type: 'string' },
            price: { type: 'number' },
            stock: { type: 'number' }
          }
        }
      }
    }
  }
});
// The OCR output (result.pages) has consistent table formatting
// AND you get structured data (result.extracted)
console.log('Total products:', result.extracted.products.length);
When combining maintainFormat with schema extraction, the OCR step maintains formatting while the extraction step pulls structured data. This gives you both well-formatted markdown and structured JSON.
With sequential processing, if one page fails, subsequent pages are not processed:
import { zerox, ErrorMode } from 'zerox';

const result = await zerox({
  filePath: './table.pdf',
  openaiAPIKey: process.env.OPENAI_API_KEY,
  maintainFormat: true,
  errorMode: ErrorMode.IGNORE, // Continue even if a page fails
  maxRetries: 3 // Retry failed pages
});

// Report any pages that failed
result.pages.forEach(page => {
  if (page.status === 'ERROR') {
    console.error(`Page ${page.page} failed: ${page.error}`);
  }
});
When maintainFormat is enabled and a page fails, processing stops at that page to avoid formatting inconsistencies. Use errorMode: ErrorMode.IGNORE and maxRetries to handle failures gracefully.
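When a run ends early, it helps to know where it stopped. The helper below is a small sketch for summarizing a result's pages; `splitPagesByStatus` is a hypothetical name, and it assumes only the `page`, `status`, and `error` fields used in the example above.

```javascript
// Hypothetical helper (not part of zerox): separates failed pages from the rest.
// Assumes each entry has `page`, `status`, and optionally `error`.
function splitPagesByStatus(pages) {
  const failed = pages.filter((p) => p.status === 'ERROR');
  const succeeded = pages.filter((p) => p.status !== 'ERROR');
  return {
    failed,
    succeeded,
    // Page number of the first failure, or null if every page succeeded
    firstFailedPage: failed.length > 0 ? failed[0].page : null
  };
}

// Usage with a result returned by zerox:
//   const { failed, firstFailedPage } = splitPagesByStatus(result.pages);
//   if (firstFailedPage !== null) {
//     console.warn(`Processing stopped at page ${firstFailedPage}`);
//   }
```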
For documents with isolated tables on each page, use default processing:
// ❌ Slow: Each page is independent
const slowResult = await zerox({
  filePath: './separate-invoices.pdf', // Each page is a separate invoice
  openaiAPIKey: process.env.OPENAI_API_KEY,
  maintainFormat: true // Unnecessary
});

// ✅ Fast: Process in parallel
const fastResult = await zerox({
  filePath: './separate-invoices.pdf',
  openaiAPIKey: process.env.OPENAI_API_KEY,
  concurrency: 10
});
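If the same code path handles both kinds of document, the choice can be made in one place. This is a minimal sketch under that assumption; `processingOptionsFor` and its `tablesSpanPages` flag are hypothetical names, and only `maintainFormat` and `concurrency` come from the options described on this page.

```javascript
// Hypothetical helper (not part of zerox): picks one of the two modes described above.
function processingOptionsFor({ tablesSpanPages, concurrency = 10 }) {
  return tablesSpanPages
    ? { maintainFormat: true } // sequential; a concurrency value would be ignored
    : { concurrency }; // independent pages can be processed in parallel
}

// Usage: spread the chosen options into the zerox call.
//   const result = await zerox({
//     filePath: './separate-invoices.pdf',
//     openaiAPIKey: process.env.OPENAI_API_KEY,
//     ...processingOptionsFor({ tablesSpanPages: false })
//   });
```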