
Overview

Test datasets are essential for evaluating OCR model performance. The buildtestset.py script extracts individual PDF pages and renders them as images for manual annotation and testing.
Unlike silver data generation, which uses AI-generated labels, test sets typically require human annotation to produce ground-truth labels.

Test Set Creation Process

The test set creation workflow:
  1. Sample PDFs - Select documents from your corpus
  2. Sample pages - Extract specific pages from each PDF
  3. Extract single-page PDFs - Create one PDF per page
  4. Render images - Convert pages to PNG format (1024px)
  5. Manual annotation - Add ground truth labels (external process)

Using buildtestset.py

Basic Usage

python -m olmocr.data.buildtestset \
  --glob_path "data/*.pdf" \
  --num_sample_docs 2000 \
  --max_sample_pages 1 \
  --output_dir sampled_pages_output

Arguments

glob_path (string)
  Local or S3 path glob (e.g., *.pdf or s3://bucket/pdfs/*.pdf)

path_list (string)
  Path to a file containing PDF paths, one per line

num_sample_docs (integer, default: 2000)
  Number of PDF documents to sample

first_n_pages (integer, default: 0)
  Always sample the first N pages of each PDF

max_sample_pages (integer, default: 1)
  Maximum number of pages to sample per PDF

no_filter (boolean)
  Disable filtering and process all listed PDFs

output_dir (string, default: sampled_pages_output)
  Output directory for extracted PDFs and PNGs

reservoir_size (integer, default: 10 × num_sample_docs)
  Size of the reservoir used when sampling paths

S3 Support

Process PDFs directly from S3:
python -m olmocr.data.buildtestset \
  --glob_path "s3://my-bucket/documents/*.pdf" \
  --num_sample_docs 500 \
  --output_dir test_data
The script automatically downloads from S3 to /tmp, processes locally, and creates output files.

Page Sampling Strategy

The script uses the same intelligent sampling as silver data generation (buildtestset.py:20-34):
import random
from typing import List

def sample_pdf_pages(num_pages: int, first_n_pages: int, max_sample_pages: int) -> List[int]:
    """
    Returns a list of sampled page indices (1-based).
    - Always include the first_n_pages (or all pages if num_pages < first_n_pages).
    - Randomly sample the remaining pages up to a total of max_sample_pages.
    """
    if num_pages <= first_n_pages:
        return list(range(1, num_pages + 1))
    sample_pages = list(range(1, first_n_pages + 1))
    remaining_pages = list(range(first_n_pages + 1, num_pages + 1))
    if remaining_pages:
        random_pick = min(max_sample_pages - first_n_pages, len(remaining_pages))
        sample_pages += random.sample(remaining_pages, random_pick)
    return sample_pages
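To see the strategy in action, the snippet below copies the function so it runs standalone, seeds the RNG, and samples from hypothetical documents:

```python
import random
from typing import List


def sample_pdf_pages(num_pages: int, first_n_pages: int, max_sample_pages: int) -> List[int]:
    # Same logic as the function above, copied here so this demo is self-contained.
    if num_pages <= first_n_pages:
        return list(range(1, num_pages + 1))
    sample_pages = list(range(1, first_n_pages + 1))
    remaining_pages = list(range(first_n_pages + 1, num_pages + 1))
    if remaining_pages:
        random_pick = min(max_sample_pages - first_n_pages, len(remaining_pages))
        sample_pages += random.sample(remaining_pages, random_pick)
    return sample_pages


random.seed(0)
print(sample_pdf_pages(10, 0, 1))  # one random page from a 10-page PDF
print(sample_pdf_pages(10, 2, 3))  # pages 1-2 guaranteed, plus one random later page
print(sample_pdf_pages(2, 5, 1))   # short doc: all pages returned
```

Note that with the defaults (first_n_pages=0, max_sample_pages=1), exactly one uniformly random page is drawn per PDF.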

Common Sampling Patterns

# Get one random page from each PDF
python -m olmocr.data.buildtestset \
  --glob_path "data/*.pdf" \
  --first_n_pages 0 \
  --max_sample_pages 1

Output Format

File Structure

For each sampled page, the script creates two files:
sampled_pages_output/
├── document1_page1.pdf     # Single-page PDF
├── document1_page1.png     # Rendered PNG image
├── document1_page3.pdf
├── document1_page3.png
├── document2_page1.pdf
├── document2_page1.png
└── ...

Naming Convention

Files are named using the pattern:
{base_pdf_name}_page{page_number}.{extension}
For example, from research_paper.pdf page 5:
  • research_paper_page5.pdf
  • research_paper_page5.png
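A short helper reproduces this convention; `output_names` here is illustrative, not a function from the script:

```python
import os


def output_names(pdf_path: str, page_number: int) -> tuple:
    """Build the {base}_page{n}.pdf/.png names used for sampled pages."""
    base = os.path.splitext(os.path.basename(pdf_path))[0]
    return f"{base}_page{page_number}.pdf", f"{base}_page{page_number}.png"


print(output_names("papers/research_paper.pdf", 5))
# ('research_paper_page5.pdf', 'research_paper_page5.png')
```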

Single-Page PDF Extraction

The script extracts individual pages into separate PDFs (buildtestset.py:49-59):
from pypdf import PdfReader, PdfWriter

def extract_single_page_pdf(input_pdf_path: str, page_number: int, output_pdf_path: str) -> None:
    """
    Extracts exactly one page (page_number, 1-based) from input_pdf_path
    and writes to output_pdf_path.
    """
    reader = PdfReader(input_pdf_path)
    writer = PdfWriter()
    # Page numbers in PdfReader are 0-based
    writer.add_page(reader.pages[page_number - 1])
    with open(output_pdf_path, "wb") as f:
        writer.write(f)
This preserves the original PDF quality and formatting.

Image Rendering

Pages are rendered to PNG at 1024px (longest dimension):
# Render single-page PDF to PNG
b64png = render_pdf_to_base64png(
    single_pdf_path, 
    page_num=0,  # Single page, so always 0
    target_longest_image_dim=1024
)

# Decode and save
with open(single_png_path, "wb") as pngf:
    pngf.write(base64.b64decode(b64png))
The 1024px resolution balances file size with visual quality for annotation tools.
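The target dimension constrains the longest side only; the other side follows from the aspect ratio. A sketch of the scaling arithmetic (not the renderer itself):

```python
def scaled_dims(width: float, height: float, target_longest: int = 1024) -> tuple:
    """Scale so the longest side equals target_longest, preserving aspect ratio."""
    scale = target_longest / max(width, height)
    return round(width * scale), round(height * scale)


# A US Letter page (612x792 pt) rendered at 1024 px on the long side:
print(scaled_dims(612, 792))  # (791, 1024)
```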

Processing Pipeline

The complete processing flow for each PDF (buildtestset.py:62-108):
  1. Download if S3 - If the PDF is on S3, download it to /tmp
  2. Apply filters - Run through PdfFilter unless --no_filter is set
  3. Sample pages - Determine which pages to extract
  4. Extract PDFs - Create a single-page PDF for each sampled page
  5. Render images - Convert each single-page PDF to PNG
def process_pdf(pdf_path: str, first_n_pages: int, max_sample_pages: int, 
                no_filter: bool, output_dir: str):
    # Download from S3 if needed
    if pdf_path.startswith("s3://"):
        local_pdf_path = os.path.join("/tmp", os.path.basename(pdf_path))
        fetch_s3_file(pdf_path, local_pdf_path)
    else:
        local_pdf_path = pdf_path
    
    # Filter check
    if (not no_filter) and pdf_filter.filter_out_pdf(local_pdf_path):
        return False
    
    # Sample pages
    reader = PdfReader(local_pdf_path)
    num_pages = len(reader.pages)
    sampled_pages = sample_pdf_pages(num_pages, first_n_pages, max_sample_pages)
    
    # Process each page, naming outputs {base}_page{n}.pdf/.png
    base_name = os.path.splitext(os.path.basename(local_pdf_path))[0]
    for page_num in sampled_pages:
        single_pdf_path = os.path.join(output_dir, f"{base_name}_page{page_num}.pdf")
        single_png_path = os.path.join(output_dir, f"{base_name}_page{page_num}.png")

        # Extract single-page PDF
        extract_single_page_pdf(local_pdf_path, page_num, single_pdf_path)

        # Render to PNG
        b64png = render_pdf_to_base64png(single_pdf_path, 0, 1024)
        with open(single_png_path, "wb") as pngf:
            pngf.write(base64.b64decode(b64png))
    
    return True

Filtering Integration

By default, test set generation applies the same filters as silver data:
  • Language detection (English only by default)
  • Form detection (excludes PDF forms)
  • SEO spam detection
See the Filtering guide for configuration details.
If you want a truly random test set, use --no_filter to disable all filtering.

Parallel Processing

The script uses ProcessPoolExecutor for concurrent processing (buildtestset.py:185-201):
pdfs_with_output = 0
with ProcessPoolExecutor() as executor:
    futures = {}
    for pdf_path in pdf_paths:
        future = executor.submit(
            process_pdf, 
            pdf_path, 
            args.first_n_pages, 
            args.max_sample_pages, 
            args.no_filter, 
            args.output_dir
        )
        futures[future] = pdf_path
    
    for future in tqdm(as_completed(futures), total=len(futures)):
        if future.result():
            pdfs_with_output += 1
        if pdfs_with_output >= args.num_sample_docs:
            executor.shutdown(cancel_futures=True)
            break
This maximizes throughput by processing multiple PDFs simultaneously.
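The early-shutdown pattern (stop once enough successes accumulate, cancel the rest) can be isolated in a small sketch. It uses threads instead of processes so it runs anywhere without pickling concerns, and a dummy work function stands in for process_pdf:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_process(n: int) -> bool:
    # Stand-in for process_pdf: even inputs "succeed".
    return n % 2 == 0


target_successes = 3
successes = 0
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = {executor.submit(fake_process, n): n for n in range(100)}
    for future in as_completed(futures):
        if future.result():
            successes += 1
        if successes >= target_successes:
            # Stop scheduling the remaining work once we have enough output.
            executor.shutdown(cancel_futures=True)
            break

print(successes)  # 3
```

The `cancel_futures=True` flag (Python 3.9+) drops queued tasks that have not started, which is why the script can oversubmit and still stop promptly at num_sample_docs.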

Reservoir Sampling

For large PDF collections, reservoir sampling ensures uniform random selection (buildtestset.py:129-177):
pdf_paths = []
n = 0  # total items seen

for path in glob.iglob(args.glob_path, recursive=True):
    n += 1
    if len(pdf_paths) < args.reservoir_size:
        pdf_paths.append(path)
    else:
        s = random.randint(1, n)
        if s <= args.reservoir_size:
            pdf_paths[s - 1] = path

random.shuffle(pdf_paths)
This allows sampling from arbitrarily large datasets without loading all paths into memory.
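A self-contained version of the same algorithm, run over a synthetic stream of paths (hypothetical names, not the script's CLI), shows the key property: memory stays bounded at the reservoir size regardless of stream length:

```python
import random


def reservoir_sample(stream, k: int) -> list:
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if len(reservoir) < k:
            reservoir.append(item)
        else:
            # Item n survives with probability k/n, replacing a random slot.
            s = random.randint(1, n)
            if s <= k:
                reservoir[s - 1] = item
    return reservoir


random.seed(42)
sample = reservoir_sample((f"doc{i}.pdf" for i in range(100_000)), k=10)
print(len(sample))  # 10
```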

Example Workflow

Complete workflow for creating a test set:
# 1. Generate test pages
python -m olmocr.data.buildtestset \
  --glob_path "s3://my-bucket/pdfs/**/*.pdf" \
  --num_sample_docs 1000 \
  --max_sample_pages 2 \
  --output_dir test_pages

# 2. Review output
ls test_pages/ | head
# document1_page1.pdf
# document1_page1.png
# document1_page2.pdf
# document1_page2.png
# ...

# 3. Upload for annotation
aws s3 sync test_pages/ s3://annotation-bucket/test-set/

# 4. Annotate using your preferred tool
# - Label Studio
# - Prodigy
# - Custom annotation interface

# 5. Use annotations for evaluation

Test Set Size Recommendations

Recommended test set sizes:
  • Quick validation: 100-200 pages
  • Standard evaluation: 500-1000 pages
  • Comprehensive testing: 2000-5000 pages
  • Statistical significance: 10,000+ pages
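These sizes can be sanity-checked with the usual normal-approximation margin of error for a proportion (a rule of thumb, not something the script computes): at roughly 90% page accuracy, 1000 pages pins the estimate to about ±1.9%.

```python
import math


def accuracy_margin(p: float, n: int, z: float = 1.96) -> float:
    """95% normal-approximation margin of error for an accuracy estimate."""
    return z * math.sqrt(p * (1 - p) / n)


# Measuring ~90% accuracy at various test-set sizes:
for n in (200, 1000, 5000):
    print(n, round(accuracy_margin(0.9, n), 3))
```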
Consider diversity:
  • Document types (papers, forms, books)
  • Languages (if multilingual)
  • Quality levels (scanned vs born-digital)
  • Layout complexity (simple text vs complex tables)

Common Issues

No output files: check whether your PDFs are being filtered out:
# Try with filtering disabled
python -m olmocr.data.buildtestset \
  --glob_path "data/*.pdf" \
  --no_filter \
  --output_dir test_output
If this produces output, your PDFs are being filtered. See Filtering.
S3 access errors: ensure AWS credentials are configured:
aws configure
# Or set environment variables:
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-east-1
High memory usage on large collections: reduce the reservoir size or process in batches:
python -m olmocr.data.buildtestset \
  --glob_path "data/*.pdf" \
  --reservoir_size 5000 \
  --num_sample_docs 1000

Next Steps

Silver Data

Generate training data with GPT-4o

Filtering

Configure PDF filtering options
