## Overview
When you need to process multiple documents, Zerox provides built-in concurrency control and batch processing capabilities. This example shows how to efficiently process many documents in parallel.
## Use Case
Perfect for:
- Processing document archives
- Bulk invoice processing
- Migrating document libraries to markdown
- Building searchable document databases
- Automated document ingestion pipelines
## Basic Batch Processing

```typescript
import { zerox } from "zerox";

// List of documents to process
const documents = [
  "https://example.com/doc1.pdf",
  "https://example.com/doc2.pdf",
  "./local/doc3.pdf",
  "./local/doc4.pdf",
];

// Process documents in parallel with controlled concurrency
async function batchProcessDocuments() {
  // Process all documents
  const promises = documents.map(async (filePath, index) => {
    try {
      console.log(`Processing document ${index + 1}/${documents.length}: ${filePath}`);

      const result = await zerox({
        filePath,
        credentials: {
          apiKey: process.env.OPENAI_API_KEY || "",
        },
        concurrency: 5, // Process 5 pages at a time per document
        outputDir: "./output", // Save each document's output
      });

      console.log(`✓ Completed ${result.fileName} (${result.pages.length} pages)`);
      return {
        success: true,
        fileName: result.fileName,
        filePath,
        pages: result.pages.length,
        inputTokens: result.inputTokens,
        outputTokens: result.outputTokens,
        completionTime: result.completionTime,
      };
    } catch (error) {
      console.error(`✗ Failed to process ${filePath}:`, error);
      return {
        success: false,
        filePath,
        error: error instanceof Error ? error.message : String(error),
      };
    }
  });

  const allResults = await Promise.all(promises);

  // Separate successful and failed results
  const successful = allResults.filter((r) => r.success);
  const failed = allResults.filter((r) => !r.success);

  // Print summary
  console.log("\n=== Batch Processing Summary ===");
  console.log(`Total documents: ${documents.length}`);
  console.log(`Successful: ${successful.length}`);
  console.log(`Failed: ${failed.length}`);

  if (successful.length > 0) {
    const totalPages = successful.reduce((sum, r) => sum + r.pages, 0);
    const totalTokens = successful.reduce(
      (sum, r) => sum + r.inputTokens + r.outputTokens,
      0
    );
    console.log(`Total pages processed: ${totalPages}`);
    console.log(`Total tokens used: ${totalTokens}`);
  }

  if (failed.length > 0) {
    console.log("\nFailed documents:");
    failed.forEach((f) => console.log(`  - ${f.filePath}: ${f.error}`));
  }

  return { successful, failed };
}

// Run batch processing
batchProcessDocuments()
  .then(({ failed }) => {
    console.log("\nBatch processing complete!");
    process.exit(failed.length > 0 ? 1 : 0);
  })
  .catch((error) => {
    console.error("Batch processing failed:", error);
    process.exit(1);
  });
```
## Rate-Limited Batch Processing

Control how many documents are processed concurrently:
```typescript
import { zerox } from "zerox";
import pLimit from "p-limit";

// Limit to 3 documents processing at once
const limit = pLimit(3);

const documents = [
  /* ... your document list ... */
];

async function rateLimitedBatch() {
  const promises = documents.map((filePath) =>
    limit(async () => {
      return await zerox({
        filePath,
        credentials: { apiKey: process.env.OPENAI_API_KEY || "" },
        concurrency: 5, // Pages per document
      });
    })
  );

  return await Promise.all(promises);
}

rateLimitedBatch().then((results) => {
  console.log(`Processed ${results.length} documents`);
});
```
## Processing a Local Directory
```typescript
import { zerox } from "zerox";
import fs from "fs/promises";
import path from "path";

async function processDirectory(dirPath: string) {
  // Find all PDF files in the directory
  const files = await fs.readdir(dirPath);
  const pdfFiles = files
    .filter((file) => file.toLowerCase().endsWith(".pdf"))
    .map((file) => path.join(dirPath, file));

  console.log(`Found ${pdfFiles.length} PDF files`);

  // Process each file
  const results = await Promise.all(
    pdfFiles.map(async (filePath) => {
      try {
        const result = await zerox({
          filePath,
          credentials: { apiKey: process.env.OPENAI_API_KEY || "" },
          outputDir: "./output",
        });
        return { success: true, fileName: result.fileName };
      } catch (error) {
        return {
          success: false,
          filePath,
          error: error instanceof Error ? error.message : String(error),
        };
      }
    })
  );

  return results;
}

processDirectory("./documents").then((results) => {
  const successful = results.filter((r) => r.success).length;
  console.log(`Successfully processed ${successful}/${results.length} documents`);
});
```
## Progress Tracking
```typescript
import { zerox } from "zerox";

class ProgressTracker {
  total: number;
  completed: number;
  failed: number;

  constructor(total: number) {
    this.total = total;
    this.completed = 0;
    this.failed = 0;
  }

  update(success: boolean) {
    if (success) {
      this.completed++;
    } else {
      this.failed++;
    }
    this.print();
  }

  print() {
    const processed = this.completed + this.failed;
    const percentage = ((processed / this.total) * 100).toFixed(1);
    console.log(
      `Progress: ${processed}/${this.total} (${percentage}%) - ` +
        `✓ ${this.completed} successful, ✗ ${this.failed} failed`
    );
  }
}

async function batchWithProgress(documents: string[]) {
  const tracker = new ProgressTracker(documents.length);

  const promises = documents.map(async (filePath) => {
    try {
      await zerox({
        filePath,
        credentials: { apiKey: process.env.OPENAI_API_KEY || "" },
      });
      tracker.update(true);
    } catch {
      tracker.update(false);
    }
  });

  await Promise.all(promises);
}
```
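Progress can also be persisted to disk so an interrupted batch resumes where it left off. Below is a minimal sketch using a JSON manifest; the `processed.json` path and the `resumableBatch` helper are assumptions for illustration, not part of Zerox:

```typescript
import fs from "fs/promises";

// Manifest file recording which documents have already been processed.
const MANIFEST = "./processed.json";

async function loadProcessed(): Promise<Set<string>> {
  try {
    return new Set(JSON.parse(await fs.readFile(MANIFEST, "utf8")));
  } catch {
    return new Set(); // First run: no manifest yet
  }
}

async function markProcessed(done: Set<string>, filePath: string) {
  done.add(filePath);
  await fs.writeFile(MANIFEST, JSON.stringify([...done], null, 2));
}

// Runs `worker` (e.g. a zerox() call) for each document, skipping any
// that a previous run already finished.
async function resumableBatch(
  documents: string[],
  worker: (filePath: string) => Promise<void>
) {
  const done = await loadProcessed();
  for (const filePath of documents) {
    if (done.has(filePath)) continue; // Completed in an earlier run
    await worker(filePath);
    await markProcessed(done, filePath);
  }
}
```

This version processes documents sequentially for simplicity; it can be combined with the chunking or rate-limiting patterns above if you need parallelism.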
## Batch Extraction

Extract structured data from multiple documents:
```typescript
import { zerox } from "zerox";

const invoiceSchema = {
  type: "object",
  properties: {
    invoice_number: { type: "string" },
    total: { type: "number" },
    date: { type: "string" },
  },
};

async function batchExtraction(documents: string[]) {
  const extractedData = await Promise.all(
    documents.map(async (filePath) => {
      const result = await zerox({
        filePath,
        credentials: { apiKey: process.env.OPENAI_API_KEY || "" },
        schema: invoiceSchema,
        extractOnly: true,
      });
      return {
        filePath,
        data: result.extracted,
      };
    })
  );

  return extractedData;
}
```
## Tips and Best Practices

- **Concurrency Tuning**: Adjust the `concurrency` parameter based on your API rate limits and available resources. Higher values process faster but use more tokens simultaneously.
- **Error Recovery**: Implement retry logic for failed documents. Network issues and transient API errors often resolve on retry.
- **Cost Management**: Use `gpt-4o-mini` for batch processing to significantly reduce costs while maintaining good quality for most documents.
- **Memory Management**: Set `cleanup: true` (the default) to automatically remove temporary files after each document is processed.
- **Progress Persistence**: Save progress to disk so you can resume batch jobs if they're interrupted. Record which files have been processed and skip them on restart.
- **API Rate Limits**: Be mindful of your API provider's rate limits. Use rate limiting (shown above) to avoid hitting limits when processing many documents.
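The error-recovery tip can be sketched as a small retry helper with exponential backoff. `withRetry` is a generic helper written for this example, not part of the Zerox API; wrap it around a `zerox()` call:

```typescript
// Retry an async task with exponential backoff (1s, 2s, 4s, ...).
// Transient failures (network blips, rate-limit responses) often succeed
// on a later attempt; the last error is re-thrown if all attempts fail.
async function withRetry<T>(
  task: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 1000
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts) {
        const delay = baseDelayMs * 2 ** (attempt - 1);
        await new Promise((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError;
}

// Usage (sketch):
// const result = await withRetry(() =>
//   zerox({ filePath, credentials: { apiKey: process.env.OPENAI_API_KEY || "" } })
// );
```

Permanent errors (a malformed file, an invalid API key) will still fail after the final attempt, so keep the per-document `try`/`catch` from the batch examples above.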
## Estimated Processing Times
- Small documents (1-5 pages): 2-5 seconds per document
- Medium documents (10-20 pages): 10-30 seconds per document
- Large documents (50+ pages): 1-3 minutes per document
With `concurrency: 10` (ten pages processed in parallel within each document), you can process approximately:
- 100 small documents in ~5 minutes
- 100 medium documents in ~20 minutes
- 100 large documents in ~2 hours
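These figures follow from simple arithmetic, assuming documents are processed one after another while pages within each document run in parallel. A quick sanity check (`estimateBatchMinutes` is a throwaway helper for this calculation):

```typescript
// Rough end-to-end runtime for a sequential batch:
// total minutes = (document count × average seconds per document) / 60
function estimateBatchMinutes(docCount: number, avgSecondsPerDoc: number): number {
  return (docCount * avgSecondsPerDoc) / 60;
}

// 100 small documents at ~3 seconds each
console.log(estimateBatchMinutes(100, 3)); // → 5 minutes

// 100 large documents at ~72 seconds (1.2 minutes) each
console.log(estimateBatchMinutes(100, 72)); // → 120 minutes (~2 hours)
```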
## Cost Estimation

Using `gpt-4o-mini` (typical costs per document):

- Small document: $0.01–$0.03
- Medium document: $0.05–$0.10
- Large document: $0.20–$0.50

Using `gpt-4o`:

- Small document: $0.10–$0.20
- Medium document: $0.40–$0.80
- Large document: $1.50–$3.00