Test datasets are essential for evaluating OCR model performance. The buildtestset.py script extracts individual PDF pages and renders them as images for manual annotation and testing.
Unlike silver data generation which uses AI, test sets typically require human annotation for ground truth labels.
The script uses the same intelligent sampling as silver data generation (buildtestset.py:20-34):
```python
def sample_pdf_pages(num_pages: int, first_n_pages: int, max_sample_pages: int) -> List[int]:
    """
    Returns a list of sampled page indices (1-based).
    - Always include the first_n_pages (or all pages if num_pages < first_n_pages).
    - Randomly sample the remaining pages up to a total of max_sample_pages.
    """
    if num_pages <= first_n_pages:
        return list(range(1, num_pages + 1))

    sample_pages = list(range(1, first_n_pages + 1))
    remaining_pages = list(range(first_n_pages + 1, num_pages + 1))

    if remaining_pages:
        random_pick = min(max_sample_pages - first_n_pages, len(remaining_pages))
        sample_pages += random.sample(remaining_pages, random_pick)

    return sample_pages
```
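As a quick sanity check, the function's guarantees can be exercised standalone. The function body is restated here (with its imports) so the snippet runs on its own:

```python
import random
from typing import List

def sample_pdf_pages(num_pages: int, first_n_pages: int, max_sample_pages: int) -> List[int]:
    # Mirror of the sampling logic above, restated for a runnable sanity check.
    if num_pages <= first_n_pages:
        return list(range(1, num_pages + 1))
    sample_pages = list(range(1, first_n_pages + 1))
    remaining_pages = list(range(first_n_pages + 1, num_pages + 1))
    if remaining_pages:
        random_pick = min(max_sample_pages - first_n_pages, len(remaining_pages))
        sample_pages += random.sample(remaining_pages, random_pick)
    return sample_pages

# A 10-page PDF with first_n_pages=2, max_sample_pages=5:
pages = sample_pdf_pages(num_pages=10, first_n_pages=2, max_sample_pages=5)
assert pages[:2] == [1, 2]                  # first pages always included
assert len(pages) == 5                      # capped at max_sample_pages
assert all(1 <= p <= 10 for p in pages)     # all indices are valid (1-based)

# A short PDF just returns every page:
assert sample_pdf_pages(num_pages=1, first_n_pages=2, max_sample_pages=5) == [1]
```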
```shell
# Get one random page from each PDF
python -m olmocr.data.buildtestset \
    --glob_path "data/*.pdf" \
    --first_n_pages 0 \
    --max_sample_pages 1
```

```shell
# Get only the first page (often has title/abstract)
python -m olmocr.data.buildtestset \
    --glob_path "data/*.pdf" \
    --first_n_pages 1 \
    --max_sample_pages 1
```

```shell
# Get first 2 pages + 3 random pages (5 total)
python -m olmocr.data.buildtestset \
    --glob_path "data/*.pdf" \
    --first_n_pages 2 \
    --max_sample_pages 5
```
Pages are rendered to PNG at 1024px (longest dimension):
```python
# Render single-page PDF to PNG
b64png = render_pdf_to_base64png(
    single_pdf_path,
    page_num=0,  # Single page, so always 0
    target_longest_image_dim=1024,
)

# Decode and save
with open(single_png_path, "wb") as pngf:
    pngf.write(base64.b64decode(b64png))
```
The 1024px resolution balances file size with visual quality for annotation tools.
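The scaling arithmetic behind `target_longest_image_dim` can be sketched as follows; `scale_to_longest_dim` is an illustrative helper for this document, not a function from the script:

```python
def scale_to_longest_dim(width: int, height: int, target_longest: int = 1024) -> tuple:
    # Scale so the longest side equals target_longest, preserving aspect ratio.
    scale = target_longest / max(width, height)
    return round(width * scale), round(height * scale)

# A US-Letter page at 72 dpi is 612x792 points; the longer side maps to 1024.
print(scale_to_longest_dim(612, 792))   # -> (791, 1024)

# A landscape page is scaled by its width instead.
print(scale_to_longest_dim(2048, 1024)) # -> (1024, 512)
```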
The script uses ProcessPoolExecutor for concurrent processing (buildtestset.py:185-201):
```python
with ProcessPoolExecutor() as executor:
    futures = {}
    for pdf_path in pdf_paths:
        future = executor.submit(
            process_pdf,
            pdf_path,
            args.first_n_pages,
            args.max_sample_pages,
            args.no_filter,
            args.output_dir,
        )
        futures[future] = pdf_path

    for future in tqdm(as_completed(futures), total=len(futures)):
        if future.result():
            pdfs_with_output += 1
            if pdfs_with_output >= args.num_sample_docs:
                executor.shutdown(cancel_futures=True)
                break
```
This maximizes throughput by processing multiple PDFs simultaneously.
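The submit / `as_completed` / early-shutdown pattern generalizes beyond PDFs. Here is a minimal standalone sketch, using `ThreadPoolExecutor` so the snippet runs anywhere without pickling concerns (the script itself uses `ProcessPoolExecutor`), with a stand-in `work` function in place of `process_pdf`:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def work(i: int) -> bool:
    # Stand-in for process_pdf: pretend every other input yields output.
    return i % 2 == 0

target = 3  # analogous to args.num_sample_docs
done = 0
with ThreadPoolExecutor() as executor:
    futures = {executor.submit(work, i): i for i in range(20)}
    for future in as_completed(futures):
        if future.result():
            done += 1
            if done >= target:
                # Cancel tasks that have not started yet, then stop early.
                executor.shutdown(cancel_futures=True)
                break
print(done)  # -> 3
```

The `cancel_futures=True` flag (Python 3.9+) drops queued-but-unstarted work, so the pool does not waste effort once enough documents have produced output.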
For large PDF collections, reservoir sampling ensures uniform random selection (buildtestset.py:129-177):
```python
pdf_paths = []
n = 0  # total items seen
for path in glob.iglob(args.glob_path, recursive=True):
    n += 1
    if len(pdf_paths) < args.reservoir_size:
        pdf_paths.append(path)
    else:
        s = random.randint(1, n)
        if s <= args.reservoir_size:
            pdf_paths[s - 1] = path
random.shuffle(pdf_paths)
```
This allows sampling from arbitrarily large datasets without loading all paths into memory.
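The same Algorithm R logic can be extracted as a generic function over any iterable; the name `reservoir_sample` is illustrative, not from the script:

```python
import random

def reservoir_sample(stream, k: int) -> list:
    # Classic Algorithm R: after processing n items, each item is in the
    # sample with probability k/n, using O(k) memory regardless of n.
    sample = []
    n = 0
    for item in stream:
        n += 1
        if len(sample) < k:
            sample.append(item)
        else:
            s = random.randint(1, n)
            if s <= k:
                sample[s - 1] = item
    return sample

# Works on a stream of a million items without materializing it.
picked = reservoir_sample(range(1_000_000), 10)
assert len(picked) == 10
assert all(0 <= x < 1_000_000 for x in picked)
```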
If this produces output, your PDFs are being filtered. See Filtering.
S3 download errors
Ensure AWS credentials are configured:
```shell
aws configure
# Or set environment variables:
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-east-1
```