What is Zerox?

Zerox is a vision-powered OCR SDK that converts documents into clean Markdown for AI ingestion. Instead of a traditional OCR engine, Zerox uses vision language models (GPT-4 Vision, Claude, Gemini, etc.) to interpret document layouts, tables, charts, and complex formatting. This makes it well suited to AI applications that need structured document data.

Try the Demo

Test Zerox with your documents in the hosted demo

How It Works

Zerox follows a simple four-step process:

  1. Pass in a file: provide a PDF, DOCX, image, or other supported document format via a file path or URL.
  2. Convert to images: Zerox converts your document into a series of high-quality images, one per page.
  3. Extract with vision models: each image is sent to your chosen vision model (GPT-4o, Claude, Gemini, etc.) with instructions to extract Markdown.
  4. Aggregate and return: the Markdown responses are aggregated and returned with metadata (token counts, completion time, etc.).
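The four steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Zerox's implementation. Steps 1 and 2 are assumed to have already produced one image per page, and the vision-model call is replaced by a caller-supplied function. `run_pipeline`, `extract_markdown`, `fake_model`, and the `"ERROR"` status value are hypothetical names chosen for this sketch. The returned dictionary mirrors the `pages` and `summary` fields from the example under Output Format.

```python
from typing import Any, Callable, Dict, List


def run_pipeline(
    page_images: List[bytes],
    extract_markdown: Callable[[bytes], str],
) -> Dict[str, Any]:
    """Steps 3 and 4: extract Markdown from each page image, then aggregate.

    Assumes steps 1 and 2 (loading the file, rendering one image per page)
    already produced `page_images`. `extract_markdown` is a hypothetical
    stand-in for the vision-model call.
    """
    pages: List[Dict[str, Any]] = []
    for number, image in enumerate(page_images, start=1):
        try:
            # Step 3: one model call per page image.
            markdown = extract_markdown(image)
            pages.append({"page": number, "content": markdown,
                          "contentLength": len(markdown), "status": "SUCCESS"})
        except Exception as exc:
            # Record the failed page instead of aborting the whole document.
            # "ERROR" is a hypothetical status value for this sketch.
            pages.append({"page": number, "content": "", "contentLength": 0,
                          "status": "ERROR", "error": str(exc)})

    # Step 4: aggregate the per-page results into a summary.
    successful = sum(1 for p in pages if p["status"] == "SUCCESS")
    return {
        "pages": pages,
        "summary": {
            "totalPages": len(pages),
            "ocr": {"successful": successful, "failed": len(pages) - successful},
        },
    }


def fake_model(image: bytes) -> str:
    """Hypothetical model stub: fails on blank pages, else echoes a heading."""
    if not image:
        raise ValueError("blank page")
    return "# " + image.decode()


result = run_pipeline([b"Page one", b""], fake_model)
print(result["summary"])
# {'totalPages': 2, 'ocr': {'successful': 1, 'failed': 1}}
```

Keeping per-page failures in the result lets a caller retry only the pages that failed instead of reprocessing the whole document.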

Key Features

Multi-Format Support

Process PDFs, Word documents, images (PNG, JPEG, HEIC), spreadsheets, presentations, and more, across 20+ document formats

Multiple Model Providers

Works with OpenAI, Azure OpenAI, AWS Bedrock, Google Gemini, and Vertex AI

Structured Data Extraction

Extract specific fields using JSON schemas instead of full document conversion

Format Preservation

Maintain consistent formatting across pages, which is especially useful for tables and other structured content

Smart Image Processing

Automatic orientation correction, edge trimming, and image compression

Concurrent Processing

Process multiple pages in parallel with configurable concurrency limits
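Bounded parallelism of this kind is commonly implemented with a semaphore. The sketch below is a minimal Python illustration of that general technique using `asyncio.Semaphore`. It is not Zerox's code. `process_pages`, `extract`, `concurrency`, and `fake_extract` are hypothetical names for this sketch. A fake async extractor stands in for the vision-model API call and records how many calls are in flight at once.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple


async def process_pages(
    pages: List[str],
    extract: Callable[[str], Awaitable[str]],
    concurrency: int = 10,
) -> List[str]:
    """Run `extract` on every page with at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(page: str) -> str:
        async with semaphore:  # waits while `concurrency` calls are active
            return await extract(page)

    # gather() returns results in input order, so they line up with page numbers.
    return list(await asyncio.gather(*(worker(p) for p in pages)))


async def demo() -> Tuple[List[str], int]:
    active = 0
    peak = 0

    async def fake_extract(page: str) -> str:
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # stand-in for a model API call
        active -= 1
        return page.upper()

    results = await process_pages(["a", "b", "c", "d", "e"], fake_extract, concurrency=2)
    return results, peak


results, peak = asyncio.run(demo())
print(results, "peak in flight:", peak)
# ['A', 'B', 'C', 'D', 'E'] peak in flight: 2
```

A higher limit finishes large documents faster, at the cost of more simultaneous API requests and a greater chance of hitting provider rate limits.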

When to Use Zerox

Zerox is a good fit for:
  • Converting documents for RAG (Retrieval-Augmented Generation) systems
  • Extracting data from invoices, receipts, and forms
  • Processing research papers, reports, and technical documents
  • Building document analysis pipelines
  • Creating searchable archives from scanned documents

Zerox is less suitable for:
  • Real-time OCR on mobile devices (every page requires an API call to a vision model)
  • Offline-only environments (model API access requires an internet connection)
  • Extremely high-volume processing on tight budgets (vision model API costs add up)
  • Simple text extraction where traditional OCR is sufficient

Output Format

Zerox returns structured data with per-page content, token usage, and metadata:
{
  "completionTime": 10038,
  "fileName": "invoice_36258",
  "inputTokens": 25543,
  "outputTokens": 210,
  "pages": [
    {
      "page": 1,
      "content": "# INVOICE # 36258\n**Date:** Mar 06 2012...",
      "contentLength": 747,
      "status": "SUCCESS"
    }
  ],
  "summary": {
    "totalPages": 1,
    "ocr": {
      "successful": 1,
      "failed": 0
    }
  }
}
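Downstream code typically needs only a few fields from this result. The helpers below are a minimal sketch that assumes the result is available as a dictionary with the camelCase keys shown above. The helper names are hypothetical. An SDK may expose these fields under different names, for example as attributes rather than dictionary keys.

```python
import json
from typing import Any, Dict, List


def combine_pages(result: Dict[str, Any]) -> str:
    """Join the Markdown of successful pages, in page order."""
    pages = sorted(result["pages"], key=lambda p: p["page"])
    return "\n\n".join(p["content"] for p in pages if p["status"] == "SUCCESS")


def failed_page_numbers(result: Dict[str, Any]) -> List[int]:
    """Page numbers whose status is anything other than SUCCESS."""
    return [p["page"] for p in result["pages"] if p["status"] != "SUCCESS"]


def total_tokens(result: Dict[str, Any]) -> int:
    """Input plus output tokens, useful for tracking model API cost."""
    return result["inputTokens"] + result["outputTokens"]


# A trimmed copy of the example output above.
raw = """{
  "completionTime": 10038,
  "fileName": "invoice_36258",
  "inputTokens": 25543,
  "outputTokens": 210,
  "pages": [
    {"page": 1, "content": "# INVOICE # 36258", "contentLength": 747, "status": "SUCCESS"}
  ]
}"""
result = json.loads(raw)
print(total_tokens(result))         # 25753
print(failed_page_numbers(result))  # []
```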

Available SDKs

Zerox is available as both a Node.js and a Python package:
  • Node.js: npm install zerox
  • Python: pip install py-zerox

Supported Models

Zerox works with vision models from multiple providers:
| Provider      | Models                                                             |
|---------------|--------------------------------------------------------------------|
| OpenAI        | GPT-4o, GPT-4o-mini, GPT-4.1, GPT-4.1-mini                         |
| Azure OpenAI  | GPT-4o, GPT-4o-mini, GPT-4.1, GPT-4.1-mini                         |
| AWS Bedrock   | Claude 3 Opus, Claude 3 Sonnet, Claude 3 Haiku (multiple versions) |
| Google Gemini | Gemini 1.5 (Flash, Flash-8B, Pro), Gemini 2.0 (Flash, Flash-Lite)  |
| Vertex AI     | Gemini models (Python only)                                        |
All vision models from these providers are supported. Refer to each provider’s documentation for the most up-to-date model availability.

Supported File Types

Zerox supports 20+ document formats including:
  • Documents: PDF, DOC, DOCX, ODT, RTF, TXT, HTML, XML
  • Spreadsheets: XLS, XLSX, ODS, CSV, TSV
  • Presentations: PPT, PPTX, ODP
  • Images: PNG, JPG, JPEG, HEIC
  • Other: WPS, WPD
For files that are neither images nor PDFs, Zerox first uses LibreOffice to convert them to PDF and then renders the PDF to images. Make sure LibreOffice is installed on your system if you need to process these formats.
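The routing rule described above can be sketched as a small dispatch on the file extension. This is an illustration only. `conversion_route` and `libreoffice_available` are hypothetical helpers, and the extension groups are copied from the list above. The check assumes LibreOffice's command-line binary is on `PATH` as `soffice`; some Linux packages also provide a `libreoffice` command.

```python
import shutil
from pathlib import Path

# Extension groups taken from the supported-format list above.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".heic"}
LIBREOFFICE_EXTENSIONS = {
    ".doc", ".docx", ".odt", ".rtf", ".txt", ".html", ".xml",  # documents
    ".xls", ".xlsx", ".ods", ".csv", ".tsv",                   # spreadsheets
    ".ppt", ".pptx", ".odp",                                   # presentations
    ".wps", ".wpd",                                            # other
}


def conversion_route(filename: str) -> str:
    """Return "image", "pdf", or "libreoffice" for a supported file name."""
    ext = Path(filename).suffix.lower()
    if ext in IMAGE_EXTENSIONS:
        return "image"        # already an image: no conversion step
    if ext == ".pdf":
        return "pdf"          # render pages straight to images
    if ext in LIBREOFFICE_EXTENSIONS:
        return "libreoffice"  # convert to PDF first, then to images
    raise ValueError(f"unsupported file type: {ext or '(none)'}")


def libreoffice_available() -> bool:
    """True if a LibreOffice command-line binary is found on PATH."""
    return shutil.which("soffice") is not None or shutil.which("libreoffice") is not None
```

Checking `libreoffice_available()` before accepting a DOCX or XLSX upload lets a pipeline fail fast with a clear message instead of erroring partway through conversion.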

Next Steps

Quickstart

Get started with Zerox in 5 minutes

API Reference

Explore all configuration options
