Overview
The Data Loading API provides tools for building fine-tuning datasets from OpenAI-style batch responses, managing PDF caching, and preparing data for vision-language model training. It handles S3 and local file systems, parallel processing, and automatic data validation.Main Functions
build_finetuning_dataset
Build a complete fine-tuning dataset from OpenAI batch response JSONL files.Glob pattern for response JSONL files (supports S3 paths like
s3://bucket/path/*.jsonl)Local directory for caching PDFs (defaults to
~/.cache/olmocr_pdfs)Number of parallel processes for dataset operations
HuggingFace Dataset with columns:
s3_path: S3 path to PDFpage_num: Page number (1-indexed)response: OCR response textfinish_reason: Completion statuslocal_pdf_path: Local cached PDF path
olmocr/train/dataloader.py:136
The function automatically:
- Filters entries with
finish_reason != "stop"(errors and overruns) - Downloads and caches PDFs locally
- Validates that anchor text can be generated
- Removes pages with missing PDF data
list_dataset_files
List files matching a glob pattern from S3 or local filesystem.Glob pattern (e.g.,
s3://bucket/path/*.jsonl or ./data/*.json)List of matching file paths
olmocr/train/dataloader.py:24
load_jsonl_into_ds
Load JSONL files into a HuggingFace Dataset.Glob pattern for JSONL files
Limit to first N files (useful for testing)
HuggingFace Dataset with train split
olmocr/train/dataloader.py:52
extract_openai_batch_response
Parse OpenAI batch API response format into structured fields.Raw JSONL entry from OpenAI batch response
Dictionary with keys:
s3_path: Extracted S3 pathpage_num: Extracted page numberresponse: Response contentfinish_reason: Completion status
olmocr/train/dataloader.py:70
cache_s3_files
Download and cache S3 PDFs to local storage with file locking.Dataset with
s3_path columnLocal directory for cached files
Number of parallel download processes
Dataset with added
local_pdf_path columnolmocr/train/dataloader.py:116
Helper Functions
_cache_s3_file
Internal function to download a single S3 file with locking.S3 path to download
Local cache directory
Path to cached file
olmocr/train/dataloader.py:92
File names are generated as
{bucket}__{key_with_underscores} to avoid path conflicts.Data Processing Pipeline
Extract Fields
Parse OpenAI batch response format and extract s3_path, page_num, response, finish_reason
Dataset Schema
The final dataset contains the following columns:S3 path to source PDF document
Page number within PDF (1-indexed)
OCR response text from the model
Completion status (“stop” for successful, “length” for truncated)
Path to locally cached PDF file
Integration with Training
The dataloader integrates with the training pipeline throughmake_dataset in train/utils.py:
olmocr/train/utils.py:50
Custom ID Format
OpenAI batch responses usecustom_id to identify pages:
Format: s3://bucket/path/file.pdf-{page_number}
Example: s3://ai2-pdfs/39ce/3db4.pdf-4 refers to page 4
File Caching Strategy
Cache Directory Structure
Locking Mechanism
- Each file has an associated
.lockfile - FileLock ensures atomic downloads
- Multiple processes can safely access the cache
- Already cached files are reused (not re-downloaded)
AWS Configuration
For S3 access, configure AWS credentials:Data Validation
The dataloader validates each page can generate anchor text:olmocr/train/dataloader.py:154
Performance Optimization
Parallel Processing
Parallel Processing
- Increase
num_procfor faster dataset operations - Default of 32 works well for most systems
- Balance against available CPU cores and I/O bandwidth
- S3 downloads benefit from higher parallelism
Caching Strategy
Caching Strategy
- Use fast local storage (SSD) for cache directory
- Shared cache across multiple training runs saves bandwidth
- Monitor cache size and clean old files periodically
- Consider mounting S3 with s3fs for large datasets
Data Filtering
Data Filtering
- Filter early to reduce processing time
finish_reason == "stop"removes ~5-10% of data- Anchor text validation catches corrupted PDFs
- Apply additional filters before caching PDFs
Error Handling
The dataloader handles common errors gracefully:Invalid S3 Paths
Missing PDFs
Corrupted Files
Advanced Usage
Custom Data Sources
Combining Multiple Sources
Best Practices
Glob Patterns
Use specific patterns to avoid loading unwanted files:
batch_*.jsonlinstead of*.jsonlresponses/2024-*/*.jsonlfor date-based organization- Test with
first_n_filesbefore full load
Cache Management
Maintain a healthy cache:
- Monitor disk usage regularly
- Clean old files after training completes
- Use dedicated cache directory per project
- Consider cache size in disk space planning
Validation
Always validate data quality:
- Check
finish_reason == "stop"ratio - Verify anchor text generation succeeds
- Inspect random samples before training
- Track filtered page counts
Performance
Optimize for your infrastructure:
- Adjust
num_procbased on CPU count - Use local SSD for cache when possible
- Monitor network bandwidth for S3 downloads
- Profile dataset loading times
See Also
- Training API - Model training with loaded datasets
- Evaluation API - Evaluating model outputs
- Data Preparation Guide - Preparing training data