
Overview

Test datasets are essential for evaluating OCR model performance. The buildtestset.py script extracts individual PDF pages and renders them as images for manual annotation and testing.
Unlike silver data generation, which uses AI-generated labels, test sets typically require human annotation to produce ground-truth labels.

Test Set Creation Process

The test set creation workflow:
  1. Sample PDFs - Select documents from your corpus
  2. Sample pages - Extract specific pages from each PDF
  3. Extract single-page PDFs - Create one PDF per page
  4. Render images - Convert pages to PNG format (1024px)
  5. Manual annotation - Add ground truth labels (external process)

Using buildtestset.py

Basic Usage

python -m olmocr.data.buildtestset \
  --glob_path "data/*.pdf" \
  --num_sample_docs 2000 \
  --max_sample_pages 1 \
  --output_dir sampled_pages_output

Arguments

glob_path (string)
  Local or S3 path glob (e.g., *.pdf or s3://bucket/pdfs/*.pdf)

path_list (string)
  Path to a file containing PDF paths, one per line

num_sample_docs (integer, default: 2000)
  Number of PDF documents to sample

first_n_pages (integer, default: 0)
  Always sample the first N pages of each PDF

max_sample_pages (integer, default: 1)
  Maximum number of pages to sample per PDF

no_filter (boolean)
  Disable filtering and process all listed PDFs

output_dir (string, default: sampled_pages_output)
  Output directory for extracted PDFs and PNGs

reservoir_size (integer, default: 10 × num_sample_docs)
  Size of the reservoir used when sampling paths

S3 Support

Process PDFs directly from S3:
python -m olmocr.data.buildtestset \
  --glob_path "s3://my-bucket/documents/*.pdf" \
  --num_sample_docs 500 \
  --output_dir test_data
The script automatically downloads from S3 to /tmp, processes locally, and creates output files.

Page Sampling Strategy

The script uses the same intelligent sampling as silver data generation (buildtestset.py:20-34):
import random
from typing import List

def sample_pdf_pages(num_pages: int, first_n_pages: int, max_sample_pages: int) -> List[int]:
    """
    Returns a list of sampled page indices (1-based).
    - Always include the first_n_pages (or all pages if num_pages < first_n_pages).
    - Randomly sample the remaining pages up to a total of max_sample_pages.
    """
    if num_pages <= first_n_pages:
        return list(range(1, num_pages + 1))
    sample_pages = list(range(1, first_n_pages + 1))
    remaining_pages = list(range(first_n_pages + 1, num_pages + 1))
    if remaining_pages:
        random_pick = min(max_sample_pages - first_n_pages, len(remaining_pages))
        sample_pages += random.sample(remaining_pages, random_pick)
    return sample_pages
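To see the strategy in action, the snippet below copies the function so it runs standalone, seeds the RNG, and samples from hypothetical documents:

```python
import random
from typing import List


def sample_pdf_pages(num_pages: int, first_n_pages: int, max_sample_pages: int) -> List[int]:
    # Same logic as the function above, copied here so this demo is self-contained.
    if num_pages <= first_n_pages:
        return list(range(1, num_pages + 1))
    sample_pages = list(range(1, first_n_pages + 1))
    remaining_pages = list(range(first_n_pages + 1, num_pages + 1))
    if remaining_pages:
        random_pick = min(max_sample_pages - first_n_pages, len(remaining_pages))
        sample_pages += random.sample(remaining_pages, random_pick)
    return sample_pages


random.seed(0)
print(sample_pdf_pages(10, 0, 1))  # one random page from a 10-page PDF
print(sample_pdf_pages(10, 2, 3))  # pages 1-2 guaranteed, plus one random later page
print(sample_pdf_pages(2, 5, 1))   # short doc: all pages returned
```

Note that with the defaults (first_n_pages=0, max_sample_pages=1), exactly one uniformly random page is drawn per PDF.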

Common Sampling Patterns

# Get one random page from each PDF
python -m olmocr.data.buildtestset \
  --glob_path "data/*.pdf" \
  --first_n_pages 0 \
  --max_sample_pages 1

Output Format

File Structure

For each sampled page, the script creates two files:
sampled_pages_output/
├── document1_page1.pdf     # Single-page PDF
├── document1_page1.png     # Rendered PNG image
├── document1_page3.pdf
├── document1_page3.png
├── document2_page1.pdf
├── document2_page1.png
└── ...

Naming Convention

Files are named using the pattern:
{base_pdf_name}_page{page_number}.{extension}
For example, from research_paper.pdf page 5:
  • research_paper_page5.pdf
  • research_paper_page5.png
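A short helper reproduces this convention; `output_names` here is illustrative, not a function from the script:

```python
import os


def output_names(pdf_path: str, page_number: int) -> tuple:
    """Build the {base}_page{n}.pdf/.png names used for sampled pages."""
    base = os.path.splitext(os.path.basename(pdf_path))[0]
    return f"{base}_page{page_number}.pdf", f"{base}_page{page_number}.png"


print(output_names("papers/research_paper.pdf", 5))
# ('research_paper_page5.pdf', 'research_paper_page5.png')
```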

Single-Page PDF Extraction

The script extracts individual pages into separate PDFs (buildtestset.py:49-59):
from pypdf import PdfReader, PdfWriter

def extract_single_page_pdf(input_pdf_path: str, page_number: int, output_pdf_path: str) -> None:
    """
    Extracts exactly one page (page_number, 1-based) from input_pdf_path
    and writes to output_pdf_path.
    """
    reader = PdfReader(input_pdf_path)
    writer = PdfWriter()
    # Page numbers in PdfReader are 0-based
    writer.add_page(reader.pages[page_number - 1])
    with open(output_pdf_path, "wb") as f:
        writer.write(f)
This preserves the original PDF quality and formatting.

Image Rendering

Pages are rendered to PNG at 1024px (longest dimension):
# Render single-page PDF to PNG
b64png = render_pdf_to_base64png(
    single_pdf_path, 
    page_num=0,  # Single page, so always 0
    target_longest_image_dim=1024
)

# Decode and save
with open(single_png_path, "wb") as pngf:
    pngf.write(base64.b64decode(b64png))
The 1024px resolution balances file size with visual quality for annotation tools.
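The target dimension constrains the longest side only; the other side follows from the aspect ratio. A sketch of the scaling arithmetic (not the renderer itself):

```python
def scaled_dims(width: float, height: float, target_longest: int = 1024) -> tuple:
    """Scale so the longest side equals target_longest, preserving aspect ratio."""
    scale = target_longest / max(width, height)
    return round(width * scale), round(height * scale)


# A US Letter page (612x792 pt) rendered at 1024 px on the long side:
print(scaled_dims(612, 792))  # (791, 1024)
```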

Processing Pipeline

The complete processing flow for each PDF (buildtestset.py:62-108):
  1. Download if S3 - If the PDF is on S3, download it to /tmp
  2. Apply filters - Run through PdfFilter unless --no_filter is set
  3. Sample pages - Determine which pages to extract
  4. Extract PDFs - Create a single-page PDF for each sampled page
  5. Render images - Convert each single-page PDF to PNG
def process_pdf(pdf_path: str, first_n_pages: int, max_sample_pages: int, 
                no_filter: bool, output_dir: str):
    # Download from S3 if needed
    if pdf_path.startswith("s3://"):
        local_pdf_path = os.path.join("/tmp", os.path.basename(pdf_path))
        fetch_s3_file(pdf_path, local_pdf_path)
    else:
        local_pdf_path = pdf_path
    
    # Filter check
    if (not no_filter) and pdf_filter.filter_out_pdf(local_pdf_path):
        return False
    
    # Sample pages
    reader = PdfReader(local_pdf_path)
    num_pages = len(reader.pages)
    sampled_pages = sample_pdf_pages(num_pages, first_n_pages, max_sample_pages)
    
    # Process each page, naming outputs {base}_page{n}.pdf/.png
    base_name = os.path.splitext(os.path.basename(local_pdf_path))[0]
    for page_num in sampled_pages:
        single_pdf_path = os.path.join(output_dir, f"{base_name}_page{page_num}.pdf")
        single_png_path = os.path.join(output_dir, f"{base_name}_page{page_num}.png")

        # Extract single-page PDF
        extract_single_page_pdf(local_pdf_path, page_num, single_pdf_path)

        # Render to PNG
        b64png = render_pdf_to_base64png(single_pdf_path, 0, 1024)
        with open(single_png_path, "wb") as pngf:
            pngf.write(base64.b64decode(b64png))
    
    return True

Filtering Integration

By default, test set generation applies the same filters as silver data:
  • Language detection (English only by default)
  • Form detection (excludes PDF forms)
  • SEO spam detection
See the Filtering guide for configuration details.
If you want a truly random test set, use --no_filter to disable all filtering.

Parallel Processing

The script uses ProcessPoolExecutor for concurrent processing (buildtestset.py:185-201):
pdfs_with_output = 0
with ProcessPoolExecutor() as executor:
    futures = {}
    for pdf_path in pdf_paths:
        future = executor.submit(
            process_pdf, 
            pdf_path, 
            args.first_n_pages, 
            args.max_sample_pages, 
            args.no_filter, 
            args.output_dir
        )
        futures[future] = pdf_path
    
    for future in tqdm(as_completed(futures), total=len(futures)):
        if future.result():
            pdfs_with_output += 1
        if pdfs_with_output >= args.num_sample_docs:
            executor.shutdown(cancel_futures=True)
            break
This maximizes throughput by processing multiple PDFs simultaneously.
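The early-shutdown pattern (stop once enough successes accumulate, cancel the rest) can be isolated in a small sketch. It uses threads instead of processes so it runs anywhere without pickling concerns, and a dummy work function stands in for process_pdf:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_process(n: int) -> bool:
    # Stand-in for process_pdf: even inputs "succeed".
    return n % 2 == 0


target_successes = 3
successes = 0
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = {executor.submit(fake_process, n): n for n in range(100)}
    for future in as_completed(futures):
        if future.result():
            successes += 1
        if successes >= target_successes:
            # Stop scheduling the remaining work once we have enough output.
            executor.shutdown(cancel_futures=True)
            break

print(successes)  # 3
```

The `cancel_futures=True` flag (Python 3.9+) drops queued tasks that have not started, which is why the script can oversubmit and still stop promptly at num_sample_docs.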

Reservoir Sampling

For large PDF collections, reservoir sampling ensures uniform random selection (buildtestset.py:129-177):
pdf_paths = []
n = 0  # total items seen

for path in glob.iglob(args.glob_path, recursive=True):
    n += 1
    if len(pdf_paths) < args.reservoir_size:
        pdf_paths.append(path)
    else:
        s = random.randint(1, n)
        if s <= args.reservoir_size:
            pdf_paths[s - 1] = path

random.shuffle(pdf_paths)
This allows sampling from arbitrarily large datasets without loading all paths into memory.
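A self-contained version of the same algorithm, run over a synthetic stream of paths (hypothetical names, not the script's CLI), shows the key property: memory stays bounded at the reservoir size regardless of stream length:

```python
import random


def reservoir_sample(stream, k: int) -> list:
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if len(reservoir) < k:
            reservoir.append(item)
        else:
            # Item n survives with probability k/n, replacing a random slot.
            s = random.randint(1, n)
            if s <= k:
                reservoir[s - 1] = item
    return reservoir


random.seed(42)
sample = reservoir_sample((f"doc{i}.pdf" for i in range(100_000)), k=10)
print(len(sample))  # 10
```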

Example Workflow

Complete workflow for creating a test set:
# 1. Generate test pages
python -m olmocr.data.buildtestset \
  --glob_path "s3://my-bucket/pdfs/**/*.pdf" \
  --num_sample_docs 1000 \
  --max_sample_pages 2 \
  --output_dir test_pages

# 2. Review output
ls test_pages/ | head
# document1_page1.pdf
# document1_page1.png
# document1_page2.pdf
# document1_page2.png
# ...

# 3. Upload for annotation
aws s3 sync test_pages/ s3://annotation-bucket/test-set/

# 4. Annotate using your preferred tool
# - Label Studio
# - Prodigy
# - Custom annotation interface

# 5. Use annotations for evaluation

Test Set Size Recommendations

Recommended test set sizes:
  • Quick validation: 100-200 pages
  • Standard evaluation: 500-1000 pages
  • Comprehensive testing: 2000-5000 pages
  • Statistical significance: 10,000+ pages
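These sizes can be sanity-checked with the usual normal-approximation margin of error for a proportion (a rule of thumb, not something the script computes): at roughly 90% page accuracy, 1000 pages pins the estimate to about ±1.9%.

```python
import math


def accuracy_margin(p: float, n: int, z: float = 1.96) -> float:
    """95% normal-approximation margin of error for an accuracy estimate."""
    return z * math.sqrt(p * (1 - p) / n)


# Measuring ~90% accuracy at various test-set sizes:
for n in (200, 1000, 5000):
    print(n, round(accuracy_margin(0.9, n), 3))
```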
Consider diversity:
  • Document types (papers, forms, books)
  • Languages (if multilingual)
  • Quality levels (scanned vs born-digital)
  • Layout complexity (simple text vs complex tables)

Common Issues

No output files: check whether your PDFs are being filtered out:
# Try with filtering disabled
python -m olmocr.data.buildtestset \
  --glob_path "data/*.pdf" \
  --no_filter \
  --output_dir test_output
If this produces output, your PDFs are being filtered. See Filtering.
S3 access errors: ensure AWS credentials are configured:
aws configure
# Or set environment variables:
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-east-1
High memory usage on large collections: reduce the reservoir size or process in batches:
python -m olmocr.data.buildtestset \
  --glob_path "data/*.pdf" \
  --reservoir_size 5000 \
  --num_sample_docs 1000

Next Steps

Silver Data

Generate training data with GPT-4o

Filtering

Configure PDF filtering options
