
Overview

The Data Loading API provides tools for building fine-tuning datasets from OpenAI-style batch responses, managing PDF caching, and preparing data for vision-language model training. It handles S3 and local file systems, parallel processing, and automatic data validation.

Main Functions

build_finetuning_dataset

Build a complete fine-tuning dataset from OpenAI batch response JSONL files.
Parameters:
  response_glob_path (str, required): Glob pattern for response JSONL files (supports S3 paths like s3://bucket/path/*.jsonl)
  pdf_cache_location (str | None): Local directory for caching PDFs (defaults to ~/.cache/olmocr_pdfs)
  num_proc (int, default: 32): Number of parallel processes for dataset operations

Returns:
  dataset (Dataset): HuggingFace Dataset with columns:
    • s3_path: S3 path to PDF
    • page_num: Page number (1-indexed)
    • response: OCR response text
    • finish_reason: Completion status
    • local_pdf_path: Local cached PDF path
Source: olmocr/train/dataloader.py:136
from olmocr.train.dataloader import build_finetuning_dataset

# Load from S3
dataset = build_finetuning_dataset(
    response_glob_path="s3://my-bucket/responses/*.jsonl",
    pdf_cache_location="/mnt/cache/pdfs",
    num_proc=64
)

print(f"Loaded {len(dataset)} training examples")
print(dataset[0])
The function automatically:
  • Filters out entries with finish_reason != "stop" (errors and length overruns)
  • Downloads and caches PDFs locally
  • Validates that anchor text can be generated
  • Removes pages with missing PDF data

list_dataset_files

List files matching a glob pattern from S3 or local filesystem.
Parameters:
  s3_glob_path (str, required): Glob pattern (e.g., s3://bucket/path/*.jsonl or ./data/*.json)

Returns:
  files (list[str]): List of matching file paths
Source: olmocr/train/dataloader.py:24
from olmocr.train.dataloader import list_dataset_files

# List S3 files with wildcard
files = list_dataset_files("s3://bucket/data/batch_*.jsonl")
print(f"Found {len(files)} files")

load_jsonl_into_ds

Load JSONL files into a HuggingFace Dataset.
Parameters:
  s3_glob_path (str, required): Glob pattern for JSONL files
  first_n_files (int | None): Limit to the first N files (useful for testing)

Returns:
  dataset (Dataset): HuggingFace Dataset with a train split
Source: olmocr/train/dataloader.py:52
from olmocr.train.dataloader import load_jsonl_into_ds

# Load all files
dataset = load_jsonl_into_ds("s3://bucket/responses/*.jsonl")

# Load first 10 files for testing
test_dataset = load_jsonl_into_ds(
    "s3://bucket/responses/*.jsonl",
    first_n_files=10
)

extract_openai_batch_response

Parse OpenAI batch API response format into structured fields.
Parameters:
  example (dict, required): Raw JSONL entry from an OpenAI batch response

Returns:
  parsed (dict): Dictionary with keys:
    • s3_path: Extracted S3 path
    • page_num: Extracted page number
    • response: Response content
    • finish_reason: Completion status
Source: olmocr/train/dataloader.py:70
from olmocr.train.dataloader import extract_openai_batch_response

raw_entry = {
    "custom_id": "s3://bucket/doc.pdf-5",
    "response": {
        "body": {
            "choices": [{
                "message": {"content": "Extracted text..."},
                "finish_reason": "stop"
            }]
        }
    }
}

parsed = extract_openai_batch_response(raw_entry)
print(parsed)
# {'s3_path': 's3://bucket/doc.pdf', 'page_num': 5, 
#  'response': 'Extracted text...', 'finish_reason': 'stop'}

cache_s3_files

Download and cache S3 PDFs to local storage with file locking.
Parameters:
  dataset (Dataset, required): Dataset with an s3_path column
  pdf_cache_location (str, required): Local directory for cached files
  num_proc (int, default: 32): Number of parallel download processes

Returns:
  dataset (Dataset): Dataset with an added local_pdf_path column
Source: olmocr/train/dataloader.py:116
from olmocr.train.dataloader import cache_s3_files
from datasets import Dataset

dataset = Dataset.from_dict({
    "s3_path": [
        "s3://bucket/doc1.pdf",
        "s3://bucket/doc2.pdf"
    ]
})

# Cache PDFs locally
cached_dataset = cache_s3_files(
    dataset,
    pdf_cache_location="/mnt/cache/pdfs",
    num_proc=16
)

print(cached_dataset[0]["local_pdf_path"])
# /mnt/cache/pdfs/bucket__doc1.pdf
Uses file locking to prevent corruption during parallel downloads. Multiple processes can safely cache the same file.

Helper Functions

_cache_s3_file

Internal function to download a single S3 file with locking.
Parameters:
  s3_path (str, required): S3 path to download
  local_cache_dir (str, required): Local cache directory

Returns:
  local_file_path (str): Path to the cached file
Source: olmocr/train/dataloader.py:92
from olmocr.train.dataloader import _cache_s3_file

local_path = _cache_s3_file(
    "s3://bucket/document.pdf",
    "/mnt/cache/pdfs"
)

print(local_path)
# /mnt/cache/pdfs/bucket__document.pdf
File names are generated as {bucket}__{key_with_underscores} to avoid path conflicts.
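The naming scheme can be illustrated with a small stdlib-only sketch (a hypothetical re-implementation for illustration, not the library's actual code):

```python
# Hypothetical sketch of the {bucket}__{key_with_underscores} naming scheme.
from urllib.parse import urlparse

def cache_filename(s3_path: str) -> str:
    # Split "s3://bucket/key/with/slashes.pdf" into bucket and key,
    # then flatten the key's slashes into underscores.
    parsed = urlparse(s3_path)
    bucket = parsed.netloc
    key = parsed.path.lstrip("/")
    return f"{bucket}__{key.replace('/', '_')}"

print(cache_filename("s3://bucket/path/to/document.pdf"))
# bucket__path_to_document.pdf
```

Flattening the key this way keeps every cached file in a single directory while still encoding its full S3 origin in the name.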

Data Processing Pipeline

1. List Files: Find all JSONL files matching the glob pattern on S3 or the local filesystem.
2. Load JSONL: Parse JSONL files into HuggingFace Dataset format.
3. Extract Fields: Parse the OpenAI batch response format and extract s3_path, page_num, response, and finish_reason.
4. Filter Valid Entries: Keep only entries with finish_reason == "stop" (successful completions).
5. Cache PDFs: Download referenced PDFs to the local cache with parallel processing.
6. Validate Pages: Filter out pages where anchor text generation fails.
7. Return Dataset: Provide a ready-to-use dataset with local PDF paths.
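The extraction and filtering steps (3 and 4) can be sketched with plain dictionaries standing in for the Dataset (a simplified outline; the real pipeline applies the same logic via .map() and .filter()):

```python
# Sketch of steps 3-4: parse each raw batch entry's custom_id and
# response body, then keep only successful ("stop") completions.
def parse_entry(raw: dict) -> dict:
    # custom_id looks like "s3://bucket/file.pdf-{page}"; split on the last "-".
    s3_path, _, page = raw["custom_id"].rpartition("-")
    choice = raw["response"]["body"]["choices"][0]
    return {
        "s3_path": s3_path,
        "page_num": int(page),
        "response": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
    }

raw_entries = [
    {"custom_id": "s3://bucket/a.pdf-1",
     "response": {"body": {"choices": [{"message": {"content": "text"},
                                        "finish_reason": "stop"}]}}},
    {"custom_id": "s3://bucket/a.pdf-2",
     "response": {"body": {"choices": [{"message": {"content": ""},
                                        "finish_reason": "length"}]}}},
]

kept = [e for e in map(parse_entry, raw_entries) if e["finish_reason"] == "stop"]
print(len(kept))  # 1
```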

Dataset Schema

The final dataset contains the following columns:
s3_path (str): S3 path to the source PDF document
page_num (int): Page number within the PDF (1-indexed)
response (str): OCR response text from the model
finish_reason (str): Completion status ("stop" for successful, "length" for truncated)
local_pdf_path (str): Path to the locally cached PDF file
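An illustrative record matching this schema (all values are made up):

```python
# Example record shape; the values below are fabricated for illustration.
record = {
    "s3_path": "s3://bucket/doc.pdf",
    "page_num": 3,
    "response": "Extracted page text...",
    "finish_reason": "stop",
    "local_pdf_path": "/mnt/cache/pdfs/bucket__doc.pdf",
}

# Every record carries exactly these five columns.
assert set(record) == {
    "s3_path", "page_num", "response", "finish_reason", "local_pdf_path"
}
```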

Integration with Training

The dataloader integrates with the training pipeline through make_dataset in train/utils.py:
from olmocr.train.utils import make_dataset
from olmocr.train.core.config import TrainConfig
from transformers import AutoProcessor

# Load processor
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    trust_remote_code=True
)

# Build datasets with transforms
train_dataset, valid_dataset = make_dataset(config, processor)

# Datasets are ready for training
print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(valid_dataset)}")
Source: olmocr/train/utils.py:50

Custom ID Format

OpenAI batch responses use custom_id to identify pages.

Format: s3://bucket/path/file.pdf-{page_number}
Example: s3://ai2-pdfs/39ce/3db4.pdf-4 refers to page 4
from olmocr.s3_utils import parse_custom_id

s3_path, page_num = parse_custom_id("s3://bucket/doc.pdf-7")
print(s3_path)   # "s3://bucket/doc.pdf"
print(page_num)  # 7

File Caching Strategy

Cache Directory Structure

~/.cache/olmocr_pdfs/
├── bucket1__path_to_file1.pdf
├── bucket1__path_to_file1.pdf.lock
├── bucket2__another_file.pdf
└── bucket2__another_file.pdf.lock

Locking Mechanism

  • Each file has an associated .lock file
  • FileLock ensures atomic downloads
  • Multiple processes can safely access the cache
  • Already cached files are reused (not re-downloaded)
from filelock import FileLock
import os

lock_file = "/cache/file.pdf.lock"

with FileLock(lock_file):
    if not os.path.exists("/cache/file.pdf"):
        # Download file
        download_from_s3(...)
    else:
        # File already exists, skip download
        pass

AWS Configuration

For S3 access, configure AWS credentials:
export DS_AWS_ACCESS_KEY_ID="your-access-key"
export DS_AWS_SECRET_ACCESS_KEY="your-secret-key"
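A minimal sketch, assuming the library simply reads these variables from the environment before constructing its S3 client (the helper name below is hypothetical):

```python
# Hypothetical helper: collect DS_-prefixed credentials from the environment.
import os

def ds_credentials() -> dict:
    return {
        "aws_access_key_id": os.environ.get("DS_AWS_ACCESS_KEY_ID"),
        "aws_secret_access_key": os.environ.get("DS_AWS_SECRET_ACCESS_KEY"),
    }
```

Using a DS_ prefix keeps these credentials separate from any default AWS_ variables already configured on the machine.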

Data Validation

The dataloader validates each page can generate anchor text:
from olmocr.prompts.anchor import get_anchor_text
from olmocr.data.renderpdf import get_pdf_media_box_width_height

def _can_create_anchor_text(example):
    try:
        anchor_text = get_anchor_text(
            example["local_pdf_path"],
            example["page_num"],
            pdf_engine="pdfreport",
            target_length=4000
        )
        _ = get_pdf_media_box_width_height(
            example["local_pdf_path"],
            example["page_num"]
        )
        return anchor_text is not None
    except Exception:  # treat any parsing/rendering failure as an invalid page
        return False

# Applied during build_finetuning_dataset
dataset = dataset.filter(_can_create_anchor_text, num_proc=32)
Source: olmocr/train/dataloader.py:154

Performance Optimization

Parallelism:
  • Increase num_proc for faster dataset operations
  • The default of 32 works well for most systems
  • Balance against available CPU cores and I/O bandwidth
  • S3 downloads benefit from higher parallelism

Cache storage:
  • Use fast local storage (SSD) for the cache directory
  • A shared cache across multiple training runs saves bandwidth
  • Monitor cache size and clean old files periodically
  • Consider mounting S3 with s3fs for large datasets

Early filtering:
  • Filter early to reduce processing time
  • finish_reason == "stop" filtering removes ~5-10% of data
  • Anchor text validation catches corrupted PDFs
  • Apply additional filters before caching PDFs
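One simple heuristic for choosing num_proc, sketched here as a hypothetical helper (not part of the library): cap the default at the machine's CPU count so small machines are not oversubscribed.

```python
# Hypothetical helper: cap the default worker count at the CPU count.
import os

def pick_num_proc(default: int = 32) -> int:
    return min(default, os.cpu_count() or 1)
```

For I/O-bound S3 downloads a higher value than the CPU count can still pay off, so treat this as a starting point rather than a rule.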

Error Handling

The dataloader handles common errors gracefully:

Invalid S3 Paths

try:
    dataset = build_finetuning_dataset(
        response_glob_path="s3://invalid-bucket/path/*.jsonl"
    )
except ValueError as e:
    print(f"Invalid S3 path: {e}")

Missing PDFs

# Pages with missing PDFs are filtered out
# Check logs for:
# "Could not generate anchor text for file"

Corrupted Files

# Silently skipped during validation
# Monitor dataset size before/after filtering
initial_size = len(raw_dataset)
final_size = len(filtered_dataset)
print(f"Filtered {initial_size - final_size} invalid pages")

Advanced Usage

Custom Data Sources

from olmocr.train.dataloader import (
    load_jsonl_into_ds,
    extract_openai_batch_response,
    cache_s3_files
)

# Load from custom location
raw_data = load_jsonl_into_ds("s3://custom-bucket/data/*.jsonl")

# Apply custom transformations
processed = raw_data.map(
    extract_openai_batch_response,
    remove_columns=raw_data.column_names,
    num_proc=64
)

# Custom filtering
filtered = processed.filter(
    lambda x: len(x["response"]) > 100,  # Min length
    num_proc=32
)

# Cache PDFs
final_dataset = cache_s3_files(
    filtered,
    pdf_cache_location="/fast/ssd/cache",
    num_proc=64
)

Combining Multiple Sources

from datasets import concatenate_datasets
from olmocr.train.dataloader import build_finetuning_dataset

# Load multiple sources
source1 = build_finetuning_dataset("s3://bucket1/data/*.jsonl")
source2 = build_finetuning_dataset("s3://bucket2/data/*.jsonl")

# Combine
combined = concatenate_datasets([source1, source2])

# Shuffle
combined = combined.shuffle(seed=42)

print(f"Combined dataset: {len(combined)} examples")

Best Practices

Glob Patterns

Use specific patterns to avoid loading unwanted files:
  • batch_*.jsonl instead of *.jsonl
  • responses/2024-*/*.jsonl for date-based organization
  • Test with first_n_files before full load

Cache Management

Maintain a healthy cache:
  • Monitor disk usage regularly
  • Clean old files after training completes
  • Use dedicated cache directory per project
  • Consider cache size in disk space planning

Validation

Always validate data quality:
  • Check finish_reason == "stop" ratio
  • Verify anchor text generation succeeds
  • Inspect random samples before training
  • Track filtered page counts
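The finish_reason ratio check can be sketched with plain dicts standing in for the dataset (a hypothetical helper, not library code):

```python
# Hypothetical helper: share of pages that completed successfully.
def stop_ratio(entries: list) -> float:
    if not entries:
        return 0.0
    stops = sum(1 for e in entries if e["finish_reason"] == "stop")
    return stops / len(entries)

parsed = [
    {"finish_reason": "stop"},
    {"finish_reason": "stop"},
    {"finish_reason": "length"},
    {"finish_reason": "stop"},
]

print(f"{stop_ratio(parsed):.0%} of pages completed successfully")
# 75% of pages completed successfully
```

A ratio far below the ~90-95% suggested by the performance notes above is a signal to inspect the batch job before training on its output.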

Performance

Optimize for your infrastructure:
  • Adjust num_proc based on CPU count
  • Use local SSD for cache when possible
  • Monitor network bandwidth for S3 downloads
  • Profile dataset loading times
