## Overview
When you need to process multiple documents, Zerox provides built-in concurrency control and batch processing capabilities. This example shows how to efficiently process many documents in parallel.
## Use Case
Perfect for:
- Processing document archives
- Bulk invoice processing
- Migrating document libraries to markdown
- Building searchable document databases
- Automated document ingestion pipelines
## Basic Batch Processing

```typescript
import { zerox } from "zerox";

// List of documents to process
const documents = [
  "https://example.com/doc1.pdf",
  "https://example.com/doc2.pdf",
  "./local/doc3.pdf",
  "./local/doc4.pdf",
];

// Process documents in parallel with controlled concurrency
async function batchProcessDocuments() {
  // Process all documents
  const promises = documents.map(async (filePath, index) => {
    try {
      console.log(`Processing document ${index + 1}/${documents.length}: ${filePath}`);

      const result = await zerox({
        filePath,
        credentials: {
          apiKey: process.env.OPENAI_API_KEY || "",
        },
        concurrency: 5, // Process 5 pages at a time per document
        outputDir: "./output", // Save each document's output
      });

      console.log(`✓ Completed ${result.fileName} (${result.pages.length} pages)`);
      return {
        success: true,
        fileName: result.fileName,
        filePath,
        pages: result.pages.length,
        inputTokens: result.inputTokens,
        outputTokens: result.outputTokens,
        completionTime: result.completionTime,
      };
    } catch (error) {
      console.error(`✗ Failed to process ${filePath}:`, error);
      return {
        success: false,
        filePath,
        error: error instanceof Error ? error.message : String(error),
      };
    }
  });

  const allResults = await Promise.all(promises);

  // Separate successful and failed results
  const successful = allResults.filter((r) => r.success);
  const failed = allResults.filter((r) => !r.success);

  // Print summary
  console.log("\n=== Batch Processing Summary ===");
  console.log(`Total documents: ${documents.length}`);
  console.log(`Successful: ${successful.length}`);
  console.log(`Failed: ${failed.length}`);

  if (successful.length > 0) {
    const totalPages = successful.reduce((sum, r) => sum + r.pages, 0);
    const totalTokens = successful.reduce(
      (sum, r) => sum + r.inputTokens + r.outputTokens,
      0
    );
    console.log(`Total pages processed: ${totalPages}`);
    console.log(`Total tokens used: ${totalTokens}`);
  }

  if (failed.length > 0) {
    console.log("\nFailed documents:");
    failed.forEach((f) => console.log(`  - ${f.filePath}: ${f.error}`));
  }

  return { successful, failed };
}

// Run batch processing
batchProcessDocuments()
  .then(({ failed }) => {
    console.log("\nBatch processing complete!");
    process.exit(failed.length > 0 ? 1 : 0);
  })
  .catch((error) => {
    console.error("Batch processing failed:", error);
    process.exit(1);
  });
```
## Rate-Limited Batch Processing

Control how many documents are processed concurrently:
```typescript
import { zerox } from "zerox";
import pLimit from "p-limit";

// Limit to 3 documents processing at once
const limit = pLimit(3);

const documents = [
  /* ... your document list ... */
];

async function rateLimitedBatch() {
  const promises = documents.map((filePath) =>
    limit(async () => {
      return await zerox({
        filePath,
        credentials: { apiKey: process.env.OPENAI_API_KEY || "" },
        concurrency: 5, // Pages per document
      });
    })
  );

  return await Promise.all(promises);
}

rateLimitedBatch().then((results) => {
  console.log(`Processed ${results.length} documents`);
});
```
## Processing a Local Directory
```typescript
import { zerox } from "zerox";
import fs from "fs/promises";
import path from "path";

async function processDirectory(dirPath: string) {
  // Find all PDF files in the directory
  const files = await fs.readdir(dirPath);
  const pdfFiles = files
    .filter((file) => file.toLowerCase().endsWith(".pdf"))
    .map((file) => path.join(dirPath, file));

  console.log(`Found ${pdfFiles.length} PDF files`);

  // Process each file
  const results = await Promise.all(
    pdfFiles.map(async (filePath) => {
      try {
        const result = await zerox({
          filePath,
          credentials: { apiKey: process.env.OPENAI_API_KEY || "" },
          outputDir: "./output",
        });
        return { success: true, fileName: result.fileName };
      } catch (error) {
        return {
          success: false,
          filePath,
          error: error instanceof Error ? error.message : String(error),
        };
      }
    })
  );

  return results;
}

processDirectory("./documents").then((results) => {
  const successful = results.filter((r) => r.success).length;
  console.log(`Successfully processed ${successful}/${results.length} documents`);
});
```
## Progress Tracking
```typescript
import { zerox } from "zerox";

class ProgressTracker {
  total: number;
  completed: number;
  failed: number;

  constructor(total: number) {
    this.total = total;
    this.completed = 0;
    this.failed = 0;
  }

  update(success: boolean) {
    if (success) {
      this.completed++;
    } else {
      this.failed++;
    }
    this.print();
  }

  print() {
    const processed = this.completed + this.failed;
    const percentage = ((processed / this.total) * 100).toFixed(1);
    console.log(
      `Progress: ${processed}/${this.total} (${percentage}%) - ` +
        `✓ ${this.completed} successful, ✗ ${this.failed} failed`
    );
  }
}

async function batchWithProgress(documents: string[]) {
  const tracker = new ProgressTracker(documents.length);

  const promises = documents.map(async (filePath) => {
    try {
      await zerox({
        filePath,
        credentials: { apiKey: process.env.OPENAI_API_KEY || "" },
      });
      tracker.update(true);
    } catch {
      tracker.update(false);
    }
  });

  await Promise.all(promises);
}
```
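Progress can also be persisted to disk so an interrupted batch resumes where it left off. Below is a minimal sketch using a JSON manifest; the `processed.json` path and the `resumableBatch` helper are assumptions for illustration, not part of Zerox:

```typescript
import fs from "fs/promises";

// Manifest file recording which documents have already been processed.
const MANIFEST = "./processed.json";

async function loadProcessed(): Promise<Set<string>> {
  try {
    return new Set(JSON.parse(await fs.readFile(MANIFEST, "utf8")));
  } catch {
    return new Set(); // First run: no manifest yet
  }
}

async function markProcessed(done: Set<string>, filePath: string) {
  done.add(filePath);
  await fs.writeFile(MANIFEST, JSON.stringify([...done], null, 2));
}

// Runs `worker` (e.g. a zerox() call) for each document, skipping any
// that a previous run already finished.
async function resumableBatch(
  documents: string[],
  worker: (filePath: string) => Promise<void>
) {
  const done = await loadProcessed();
  for (const filePath of documents) {
    if (done.has(filePath)) continue; // Completed in an earlier run
    await worker(filePath);
    await markProcessed(done, filePath);
  }
}
```

This version processes documents sequentially for simplicity; it can be combined with the chunking or rate-limiting patterns above if you need parallelism.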
## Batch Extraction

Extract structured data from multiple documents:
```typescript
import { zerox } from "zerox";

const invoiceSchema = {
  type: "object",
  properties: {
    invoice_number: { type: "string" },
    total: { type: "number" },
    date: { type: "string" },
  },
};

async function batchExtraction(documents: string[]) {
  const extractedData = await Promise.all(
    documents.map(async (filePath) => {
      const result = await zerox({
        filePath,
        credentials: { apiKey: process.env.OPENAI_API_KEY || "" },
        schema: invoiceSchema,
        extractOnly: true,
      });
      return {
        filePath,
        data: result.extracted,
      };
    })
  );

  return extractedData;
}
```
## Tips and Best Practices

- **Concurrency Tuning**: Adjust the `concurrency` parameter based on your API rate limits and available resources. Higher values process faster but use more tokens simultaneously.
- **Error Recovery**: Implement retry logic for failed documents. Network issues and transient API errors often resolve on retry.
- **Cost Management**: Use `gpt-4o-mini` for batch processing to significantly reduce costs while maintaining good quality for most documents.
- **Memory Management**: Set `cleanup: true` (the default) to automatically remove temporary files after each document is processed.
- **Progress Persistence**: Save progress to disk so you can resume batch jobs if they're interrupted. Record which files have been processed and skip them on restart.
- **API Rate Limits**: Be mindful of your API provider's rate limits. Use rate limiting (shown above) to avoid hitting limits when processing many documents.
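The error-recovery tip can be sketched as a small retry helper with exponential backoff. `withRetry` is a generic helper written for this example, not part of the Zerox API; wrap it around a `zerox()` call:

```typescript
// Retry an async task with exponential backoff (1s, 2s, 4s, ...).
// Transient failures (network blips, rate-limit responses) often succeed
// on a later attempt; the last error is re-thrown if all attempts fail.
async function withRetry<T>(
  task: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 1000
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts) {
        const delay = baseDelayMs * 2 ** (attempt - 1);
        await new Promise((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError;
}

// Usage (sketch):
// const result = await withRetry(() =>
//   zerox({ filePath, credentials: { apiKey: process.env.OPENAI_API_KEY || "" } })
// );
```

Permanent errors (a malformed file, an invalid API key) will still fail after the final attempt, so keep the per-document `try`/`catch` from the batch examples above.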
## Estimated Processing Times
- Small documents (1-5 pages): 2-5 seconds per document
- Medium documents (10-20 pages): 10-30 seconds per document
- Large documents (50+ pages): 1-3 minutes per document
With `concurrency: 10` (ten pages processed in parallel within each document), you can process approximately:
- 100 small documents in ~5 minutes
- 100 medium documents in ~20 minutes
- 100 large documents in ~2 hours
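These figures follow from simple arithmetic, assuming documents are processed one after another while pages within each document run in parallel. A quick sanity check (`estimateBatchMinutes` is a throwaway helper for this calculation):

```typescript
// Rough end-to-end runtime for a sequential batch:
// total minutes = (document count × average seconds per document) / 60
function estimateBatchMinutes(docCount: number, avgSecondsPerDoc: number): number {
  return (docCount * avgSecondsPerDoc) / 60;
}

// 100 small documents at ~3 seconds each
console.log(estimateBatchMinutes(100, 3)); // → 5 minutes

// 100 large documents at ~72 seconds (1.2 minutes) each
console.log(estimateBatchMinutes(100, 72)); // → 120 minutes (~2 hours)
```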
## Cost Estimation

Using `gpt-4o-mini` (typical costs per document):

- Small document: $0.01–$0.03
- Medium document: $0.05–$0.10
- Large document: $0.20–$0.50

Using `gpt-4o`:

- Small document: $0.10–$0.20
- Medium document: $0.40–$0.80
- Large document: $1.50–$3.00