Skip to main content

Overview

The most common use case for Zerox is converting PDF documents into markdown format. This example demonstrates basic PDF-to-markdown conversion with proper output structure.

Use Case

Perfect for:
  • Digitizing printed documents
  • Converting academic papers or textbooks
  • Extracting content from reports and presentations
  • Building searchable document repositories
  • Creating markdown versions of legacy PDFs

Basic Example

import { zerox } from "zerox";
import path from "path";
import fs from "fs";

// Convert PDF to markdown
const result = await zerox({
  filePath: "https://omni-demo-data.s3.amazonaws.com/test/cs101.pdf",
  credentials: {
    apiKey: process.env.OPENAI_API_KEY || "",
  },
  // Optional: Save output to file
  outputDir: "./output",
});

// Access the markdown content
console.log("File processed:", result.fileName);
console.log("Total pages:", result.pages.length);
console.log("Processing time:", result.completionTime, "ms");

// Get markdown from all pages
const fullMarkdown = result.pages
  .map((page) => page.content)
  .join("\n\n---\n\n");

console.log(fullMarkdown);

Expected Output Structure

The output will contain structured markdown with properly formatted:
# Document Title

## Section Heading

Paragraph content with **bold** and *italic* text.

### Subsection

- Bullet point lists
- Properly formatted
- With hierarchies

| Column 1 | Column 2 | Column 3 |
|----------|----------|----------|
| Data 1   | Data 2   | Data 3   |

<page_number>1</page_number>

Output Object

{
  "completionTime": 8432,
  "fileName": "cs101",
  "inputTokens": 36877,
  "outputTokens": 515,
  "pages": [
    {
      "page": 1,
      "content": "# Computer Science 101\n\nThis is the markdown content...",
      "contentLength": 2333
    }
  ],
  "extracted": null,
  "summary": {
    "totalPages": 1,
    "ocr": {
      "successful": 1,
      "failed": 0
    },
    "extracted": null
  }
}

Processing Specific Pages

You can convert only specific pages instead of the entire document:
const result = await zerox({
  filePath: "document.pdf",
  credentials: { apiKey: process.env.OPENAI_API_KEY || "" },
  // Convert only pages 1, 3, and 5
  pagesToConvertAsImages: [1, 3, 5],
});

Tips and Best Practices

Performance: For large documents, increase concurrency to process more pages in parallel (default is 10).
Quality: Use higher-tier models like gpt-4o for better accuracy on complex layouts or documents with mathematical notation.
Consistency: Enable maintainFormat: true if you need consistent formatting across pages, especially for documents with tables spanning multiple pages. Note: this processes pages sequentially and is slower.
Cost Optimization: Use gpt-4o-mini for simple documents to reduce API costs while maintaining good quality.
Local Files: For local files, use absolute paths:
  • Node.js: path.resolve(__dirname, "./document.pdf")
  • Python: os.path.abspath("./document.pdf")

Handling Tables and Charts

Zerox automatically:
  • Converts tables to HTML or markdown table format
  • Interprets charts and infographics into structured data
  • Preserves headers, footers, and page numbers
  • Wraps logos and watermarks in special tags

Error Handling

import { zerox, ErrorMode } from "zerox";

try {
  const result = await zerox({
    filePath: "document.pdf",
    credentials: { apiKey: process.env.OPENAI_API_KEY || "" },
    errorMode: ErrorMode.THROW, // Throw on errors instead of ignoring
  });
  
  // Check for failed pages
  const failedPages = result.pages.filter(
    (page) => page.status === "ERROR"
  );
  
  if (failedPages.length > 0) {
    console.error("Failed pages:", failedPages);
  }
} catch (error) {
  console.error("Conversion failed:", error);
}

Build docs developers (and LLMs) love