Introduction
olmOCR supports fine-tuning vision-language models for document OCR and understanding tasks. The training pipeline uses LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning and includes support for distributed training, experiment tracking, and checkpoint management.
Supported Models
olmOCR supports fine-tuning two model families:
- Qwen2-VL: Efficient vision-language models (2B and 7B variants)
- Molmo: Advanced multimodal models from AI2 (Molmo-7B-O)
Training Approach
LoRA Fine-Tuning
olmOCR uses LoRA (Low-Rank Adaptation) for efficient fine-tuning. LoRA adds trainable rank-decomposition matrices to model layers while keeping the original weights frozen. Benefits:
- Reduces memory requirements significantly
- Faster training with fewer parameters
- Easy to merge adapters back into base model
- Multiple adapters can be trained for different tasks
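The core idea can be sketched in a few lines of NumPy. This is a conceptual illustration only, not the olmOCR implementation (which applies LoRA via the peft library): a frozen weight matrix W is augmented with a trainable low-rank update scaled by alpha/r.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, r, alpha = 64, 64, 8, 16  # r << d_in: the low-rank bottleneck

W = rng.normal(size=(d_in, d_out))     # frozen pretrained weight
A = rng.normal(size=(d_in, r)) * 0.01  # trainable down-projection
B = np.zeros((r, d_out))               # trainable up-projection, zero-initialized

def lora_forward(x):
    # Original path plus low-rank update; only A and B receive gradients.
    return x @ W + (alpha / r) * (x @ A @ B)

x = rng.normal(size=(2, d_in))
# With B zero-initialized, the adapted model starts out identical to the base model.
assert np.allclose(lora_forward(x), x @ W)
```

Here the trainable parameter count is r * (d_in + d_out) = 1,024, versus 4,096 for the full weight matrix, which is where the memory savings come from.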
train/train.py
Training Pipeline
The training pipeline consists of several key stages:
Data Loading
Load and preprocess training data from S3 or local storage. PDF pages are cached locally and converted to images with anchor text.
Model Initialization
Load the base model (Qwen2-VL or Molmo) with optional flash attention and apply LoRA adapters.
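A sketch of this initialization step, assuming the Hugging Face `from_pretrained` API. The model name and flash-attention toggle are illustrative, and the actual loading call is left as a comment because it downloads several GB of weights:

```python
def build_load_kwargs(use_flash_attention: bool) -> dict:
    """Assemble from_pretrained keyword arguments for the base model."""
    # torch_dtype and attn_implementation are standard from_pretrained kwargs.
    kwargs = {"torch_dtype": "bfloat16"}
    if use_flash_attention:
        kwargs["attn_implementation"] = "flash_attention_2"
    return kwargs

# Illustrative only -- requires a GPU and downloads the model weights:
# from transformers import Qwen2VLForConditionalGeneration
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-7B-Instruct", **build_load_kwargs(use_flash_attention=True)
# )
# LoRA adapters are then applied on top, e.g. with peft's get_peft_model.
```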
Training Loop
Train using HuggingFace Trainer with gradient accumulation, checkpointing, and evaluation.
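Gradient accumulation is the key mechanism here: optimizer steps happen only every N micro-batches, so the effective batch size is per-device batch size x accumulation steps x number of GPUs. A framework-free sketch of the control flow that HuggingFace Trainer performs internally:

```python
def train_epoch(batches, accumulation_steps):
    """Count optimizer steps for a stream of micro-batches (control-flow sketch only)."""
    optimizer_steps = 0
    for i, _batch in enumerate(batches, start=1):
        # loss = model(_batch).loss / accumulation_steps  # scale before backward
        # loss.backward()                                 # gradients accumulate
        if i % accumulation_steps == 0:
            # optimizer.step(); optimizer.zero_grad()
            optimizer_steps += 1
    return optimizer_steps

# 32 micro-batches with accumulation of 4 -> 8 optimizer steps
assert train_epoch(range(32), accumulation_steps=4) == 8
```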
WandB Integration
olmOCR integrates with Weights & Biases for experiment tracking and visualization.
Configuration
Logged Metrics
The training loop automatically logs:
- Training loss
- Evaluation loss per validation dataset
- Learning rate schedule
- Gradient norms
- LoRA configuration
- Training hyperparameters
- Beaker job information (if running on Beaker)
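A sketch of assembling the run configuration to attach to a WandB run. The field names here are illustrative, not olmOCR's actual config keys, and the `wandb.init` call is shown as a comment so the sketch runs without the library:

```python
def build_run_config(lora_rank, lora_alpha, learning_rate, beaker_job_id=None):
    """Collect hyperparameters and job metadata for experiment tracking."""
    config = {
        "lora/rank": lora_rank,
        "lora/alpha": lora_alpha,
        "train/learning_rate": learning_rate,
    }
    if beaker_job_id is not None:  # present only when running on Beaker
        config["beaker/job_id"] = beaker_job_id
    return config

# import wandb
# wandb.init(project="olmocr-train", config=build_run_config(8, 16, 1e-4))
```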
train/train.py
Beaker Integration
olmOCR supports distributed training on AI2’s Beaker cluster.
Running on Beaker
Beaker integration provides:
- Automatic job tracking and metadata
- Links to Beaker jobs in WandB runs
- S3 checkpoint synchronization
- Multi-GPU distributed training
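Job metadata can be picked up from the process environment. The sketch below assumes Beaker exposes an identifier via a `BEAKER_JOB_ID` environment variable and uses a hypothetical link format; check your cluster's documentation for the exact names:

```python
import os

def beaker_job_info(env=os.environ):
    """Return Beaker job metadata if running inside a Beaker job, else None."""
    # BEAKER_JOB_ID is assumed to be set inside a Beaker job (an assumption).
    job_id = env.get("BEAKER_JOB_ID")
    if job_id is None:
        return None
    # Hypothetical URL format, used to surface the job link in a WandB run.
    return {"job_id": job_id, "url": f"https://beaker.org/jobs/{job_id}"}
```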
train/train.py
Distributed Training
The training script handles distributed setup automatically. Only the main process will:
- Log to WandB
- Save checkpoints
- Display progress bars
train/utils.py
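Launchers such as torchrun export `RANK` and `WORLD_SIZE` to every worker, and the usual pattern is to gate side effects on rank 0. A minimal sketch of that pattern (the actual helpers live in train/utils.py):

```python
import os

def get_rank() -> int:
    # torchrun and accelerate export RANK; default to 0 for single-process runs.
    return int(os.environ.get("RANK", "0"))

def is_main_process() -> bool:
    return get_rank() == 0

if is_main_process():
    pass  # log to WandB, save checkpoints, show progress bars
```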
Data Processing
Dataset Format
olmOCR expects training data in OpenAI batch response format.
Data Preparation
For each training example:
- Extract PDF page from S3 or local cache
- Generate anchor text from PDF structure
- Render PDF page to image at target resolution
- Create model-specific input format (Qwen2-VL or Molmo)
- Tokenize and prepare labels
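The model-specific input step can be sketched for Qwen2-VL, whose processor consumes a chat-style message list mixing image and text content. The prompt wording here is illustrative, not olmOCR's actual prompt:

```python
def build_qwen2vl_messages(image, anchor_text: str):
    """Pair the rendered page image with its anchor text in chat format."""
    # Illustrative prompt, not the exact olmOCR prompt.
    prompt = (
        "Below is the image of one page of a PDF document, along with raw text "
        "extracted from it. Return the page's full text.\n" + anchor_text
    )
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": prompt},
            ],
        }
    ]

# The processor then renders the chat template and tokenizes, e.g.:
# text = processor.apply_chat_template(messages, add_generation_prompt=True)
```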
train/dataprep.py
Installation
Install training dependencies:
- torch and torchvision
- transformers (>=4.45.1)
- peft for LoRA
- accelerate for distributed training
- datasets for data loading
- wandb for experiment tracking
- s3fs for S3 access
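The dependencies above can be installed with pip (quoting protects the `>=` specifier from the shell; pin additional versions as needed):

```shell
pip install torch torchvision "transformers>=4.45.1" peft accelerate datasets wandb s3fs
```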
Next Steps
Fine-tune Qwen2-VL
Learn how to fine-tune Qwen2-VL models
Fine-tune Molmo
Train Molmo models for document understanding
Configuration
Explore all training configuration options
Data Preparation
Prepare your own training data