Overview

olmOCR training is configured using YAML files and the TrainConfig dataclass. This page documents all available configuration options.

Configuration Structure

The configuration is organized into nested sections:
model:        # Model loading configuration
lora:         # LoRA adapter configuration
aws:          # AWS S3 credentials
wandb:        # Weights & Biases logging
train_data:   # Training dataset configuration
valid_data:   # Validation dataset configuration
generate:     # Generation parameters
hparams:      # Training hyperparameters
save:         # Checkpoint saving
max_workers:  # Data loading workers

Model Configuration

ModelConfig

Controls how the model is loaded and initialized.
train/core/config.py
@dataclass
class ModelConfig:
    name_or_path: str
    arch: str
    dtype: str = "bfloat16"
    use_flash_attn: bool = False
    trust_remote_code: bool = False
    low_cpu_mem_usage: bool = False
    fast_tokenizer: bool = True
    model_revision: Optional[str] = None
name_or_path
string
required
The model name or path to load. Must be compatible with HuggingFace transformers. Examples:
  • Qwen/Qwen2-VL-7B-Instruct
  • allenai/Molmo-7B-O-0924
  • /path/to/local/model
arch
string
default:"causal"
The model architecture type. Options: causal, vllm
dtype
string
default:"bfloat16"
Precision for model weights. Options: bfloat16, float16, float32
use_flash_attn
boolean
default:"false"
Whether to use flash attention for faster training. Requires compatible GPU (Ampere or newer).
trust_remote_code
boolean
default:"false"
Whether to trust remote code when loading models. Set to true for models requiring custom code.
model_revision
string
Specific model revision/commit to use from HuggingFace Hub.

Example

model:
  name_or_path: Qwen/Qwen2-VL-7B-Instruct
  arch: causal
  dtype: bfloat16
  use_flash_attn: true
  trust_remote_code: false
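As a rough mental model, these fields end up as keyword arguments to `from_pretrained`. The sketch below is illustrative, not the actual training code; the helper name and the exact plumbing are assumptions, though `attn_implementation="flash_attention_2"` is how recent transformers versions select flash attention.

```python
# Sketch (assumption, not the real loader): mapping ModelConfig fields
# onto plausible transformers from_pretrained keyword arguments.
def from_pretrained_kwargs(cfg: dict) -> dict:
    kwargs = {
        "torch_dtype": cfg.get("dtype", "bfloat16"),  # bfloat16 / float16 / float32
        "trust_remote_code": cfg.get("trust_remote_code", False),
        "low_cpu_mem_usage": cfg.get("low_cpu_mem_usage", False),
        "revision": cfg.get("model_revision"),        # None -> default branch
    }
    if cfg.get("use_flash_attn", False):
        # Recent transformers select flash attention via attn_implementation
        kwargs["attn_implementation"] = "flash_attention_2"
    return kwargs

kwargs = from_pretrained_kwargs({"dtype": "bfloat16", "use_flash_attn": True})
```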

LoRA Configuration

LoraConfig

Configures Low-Rank Adaptation for parameter-efficient fine-tuning.
train/core/config.py
@dataclass
class LoraConfig:
    rank: int = 16
    alpha: int = 16
    dropout: float = 0.05
    bias: str = "none"
    task_type: str = TaskType.CAUSAL_LM
    target_modules: List[str] = [...]
rank
integer
default:"16"
The rank of the LoRA decomposition. Higher values = more parameters and capacity. Recommended values:
  • 16: Lightweight, good for simple tasks
  • 32: Balanced (recommended for most use cases)
  • 64: Maximum capacity for complex tasks
alpha
integer
default:"16"
LoRA scaling parameter. Typically set equal to rank. Formula: scaling = alpha / rank
dropout
float
default:"0.05"
Dropout probability for LoRA layers. Helps prevent overfitting.
bias
string
default:"none"
Bias configuration. Options: none, all, lora_only
task_type
string
default:"CAUSAL_LM"
The task type for PEFT. Use CAUSAL_LM for language modeling.
target_modules
list[string]
required
List of module names to apply LoRA adapters to. Supports regex patterns.

Target Modules Examples

lora:
  rank: 32
  alpha: 32
  dropout: 0.05
  task_type: CAUSAL_LM
  target_modules:
    # Language model
    - q_proj
    - k_proj
    - v_proj
    - o_proj
    - gate_proj
    - up_proj
    - down_proj
    # Vision encoder
    - visual.blocks.[0-9]+.attn.qkv
    - visual.blocks.[0-9]+.attn.proj
    - visual.blocks.[0-9]+.mlp.fc1
    - visual.blocks.[0-9]+.mlp.fc2
    # Vision merger
    - visual.merger.mlp.0
    - visual.merger.mlp.2
Set lora: null or omit the section entirely to perform full fine-tuning (not recommended due to memory requirements).
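Two details above can be sanity-checked in plain Python: the scaling factor (alpha / rank) and how regex-style target_modules entries match module names. The module names below are the illustrative ones from the example; the matching sketch mirrors the "supports regex patterns" behavior, not PEFT internals.

```python
import re

# Scaling factor applied to the LoRA update: scaling = alpha / rank.
rank, alpha = 32, 32
scaling = alpha / rank  # 1.0 when alpha == rank, as recommended

# target_modules entries are matched as regex patterns against module names,
# so one entry can cover every vision transformer block.
pattern = r"visual.blocks.[0-9]+.attn.qkv"
assert re.fullmatch(pattern, "visual.blocks.3.attn.qkv")
assert re.fullmatch(pattern, "visual.blocks.27.attn.qkv")
assert not re.fullmatch(pattern, "visual.merger.mlp.0")
```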

Data Configuration

DataConfig

Configures training and validation datasets.
train/core/config.py
@dataclass
class DataConfig:
    sources: List[SourceConfig]
    seed: int = 42
    cache_location: Optional[str] = None
    metric_for_best_model: Optional[str] = None
seed
integer
default:"42"
Random seed for data shuffling and augmentation.
cache_location
string
Local directory to cache downloaded PDFs. Improves data loading speed. Example: /data/pdf_cache
metric_for_best_model
string
Metric name for selecting the best checkpoint. Format: {source_name}_loss. Example: validation_data_loss
sources
list[SourceConfig]
required
List of data sources to load.

SourceConfig

Configures individual data sources.
train/core/config.py
@dataclass
class SourceConfig:
    name: str
    response_glob_path: str
    target_longest_image_dim: List[int]
    target_anchor_text_len: List[int]
name
string
required
Name identifier for this data source.
response_glob_path
string
required
Glob pattern for OpenAI batch response JSON files. Supports S3 and local paths. Examples:
  • s3://bucket/train/*.json
  • /data/responses/*.json
target_longest_image_dim
list[int]
required
Image resolution(s) to which PDF pages are rendered. A value is selected at random from the list during training. Examples:
  • [1024] - Fixed 1024px
  • [768, 1024, 1280] - Random augmentation
target_anchor_text_len
list[int]
required
Target length(s) for anchor text extraction. A value is selected at random from the list. Examples:
  • [6000] - Fixed 6000 characters
  • [4000, 6000, 8000] - Variable length

Example

train_data:
  seed: 1337
  cache_location: /data/pdfs
  sources:
    - name: arxiv_papers
      response_glob_path: s3://bucket/arxiv_train/*.json
      target_longest_image_dim: [1024]
      target_anchor_text_len: [6000]
    - name: books
      response_glob_path: s3://bucket/books_train/*.json
      target_longest_image_dim: [768, 1024]
      target_anchor_text_len: [4000, 6000]

valid_data:
  cache_location: /data/pdfs
  metric_for_best_model: arxiv_papers_loss
  sources:
    - name: arxiv_papers
      response_glob_path: s3://bucket/arxiv_eval/*.json
      target_longest_image_dim: [1024]
      target_anchor_text_len: [6000]
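Multi-value lists like `target_longest_image_dim: [768, 1024]` enable per-sample augmentation: each time a page is prepared, one value is drawn from each list. A minimal sketch of that sampling, assuming the draw is seeded from `train_data.seed` (the seeding scheme here is illustrative, not the actual data loader):

```python
import random

# Seeded RNG; using train_data.seed (1337) here is an assumption for illustration.
rng = random.Random(1337)

target_longest_image_dim = [768, 1024]
target_anchor_text_len = [4000, 6000]

# One value is drawn per sample, giving resolution/length augmentation.
image_dim = rng.choice(target_longest_image_dim)
anchor_len = rng.choice(target_anchor_text_len)
```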

Hyperparameters

HyperparamConfig

Controls training dynamics.
train/core/config.py
@dataclass
class HyperparamConfig:
    batch_size: int = 8
    eval_batch_size: Optional[int] = None
    learning_rate: float = 2e-5
    max_steps: int = -1
    gradient_accumulation_steps: int = 1
    gradient_checkpointing: bool = False
    warmup_steps: int = 0
    warmup_ratio: float = 0.0
    weight_decay: float = 0.0
    clip_grad_norm: float = 0.0
    optim: str = "adamw_torch"
    lr_scheduler: str = "linear"
    log_every_steps: int = 5
    eval_every_steps: int = 100
    find_unused_parameters: bool = False
batch_size
integer
default:"8"
Batch size per GPU. For vision models, typically set to 1.
eval_batch_size
integer
Evaluation batch size. Defaults to same as batch_size.
learning_rate
float
default:"2e-5"
Initial learning rate for the optimizer. Recommended values:
  • 1e-4: Conservative, stable
  • 3e-4: More aggressive
  • 5e-5: Very conservative
max_steps
integer
default:"-1"
Maximum number of training steps. -1 trains for full epochs.
gradient_accumulation_steps
integer
default:"1"
Number of steps to accumulate gradients. Effective batch size = batch_size * gradient_accumulation_steps * num_gpus
gradient_checkpointing
boolean
default:"false"
Enable gradient checkpointing to reduce memory at the cost of ~20% slower training.
warmup_steps
integer
default:"0"
Number of warmup steps. Mutually exclusive with warmup_ratio.
warmup_ratio
float
default:"0.0"
Fraction of training for warmup. E.g., 0.03 = 3% warmup.
weight_decay
float
default:"0.0"
Weight decay coefficient for regularization. Typical: 0.01
clip_grad_norm
float
default:"0.0"
Maximum gradient norm. 0.0 disables clipping. Typical: 1.0
optim
string
default:"adamw_torch"
Optimizer to use. Options: adamw_torch, adamw_hf, sgd, adafactor
lr_scheduler
string
default:"linear"
Learning rate scheduler. Options: linear, cosine, constant, polynomial
log_every_steps
integer
default:"5"
Log training metrics every N steps.
eval_every_steps
integer
default:"100"
Run evaluation every N steps.
find_unused_parameters
boolean
default:"false"
Whether DDP should search for unused parameters. Required for Molmo; should be false for Qwen2-VL.

Example

hparams:
  batch_size: 1
  eval_batch_size: 1
  gradient_accumulation_steps: 4
  gradient_checkpointing: true
  clip_grad_norm: 1.0
  learning_rate: 1e-4
  max_steps: 10000
  log_every_steps: 10
  eval_every_steps: 100
  optim: adamw_torch
  lr_scheduler: cosine
  weight_decay: 0.01
  warmup_ratio: 0.03
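With the example values above, the effective batch size and warmup length work out as follows. The GPU count of 8 is an assumption for illustration; it is not part of the config.

```python
# Effective batch size = batch_size * gradient_accumulation_steps * num_gpus
batch_size = 1
gradient_accumulation_steps = 4
num_gpus = 8  # assumed cluster size, not a config field

effective_batch = batch_size * gradient_accumulation_steps * num_gpus
# effective_batch == 32

# warmup_ratio: 0.03 means 3% of training is spent warming up.
max_steps = 10_000
warmup_ratio = 0.03
warmup_steps = int(max_steps * warmup_ratio)
# warmup_steps == 300
```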

Generation Configuration

GenerateConfig

Controls sequence length and generation parameters.
train/core/config.py
@dataclass
class GenerateConfig:
    max_length: int = 4096
    temperature: float = 0.2
    top_k: int = 50
    top_p: float = 1.0
    num_beams: int = 1
max_length
integer
default:"4096"
Maximum sequence length for training. Significantly affects memory usage. Common values:
  • 4096: Standard for most documents
  • 8192: Long documents
  • 2048: Memory-constrained settings
temperature
float
default:"0.2"
Sampling temperature (used during inference, not training).

Example

generate:
  max_length: 8192
  temperature: 0.2

Save Configuration

SaveConfig

Controls checkpoint saving behavior.
train/core/config.py
@dataclass
class SaveConfig:
    path: str = "./results"
    limit: Optional[int] = None
    save_every_steps: int = "${hparams.eval_every_steps}"
path
string
default:"./results"
Output directory for checkpoints. Supports S3 paths. Examples:
  • s3://bucket/models/
  • /data/checkpoints/
limit
integer
Maximum number of checkpoints to keep. Older checkpoints are deleted.
save_every_steps
integer
Save checkpoint every N steps. Supports OmegaConf interpolation.

Example

save:
  path: s3://my-bucket/experiments/run-001/
  save_every_steps: 1000
  limit: 5  # Keep only 5 most recent checkpoints
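The default `save_every_steps: "${hparams.eval_every_steps}"` uses OmegaConf interpolation, so checkpoints are saved at the evaluation interval unless overridden (as in the example above). The stdlib-only sketch below illustrates the lookup semantics of a single `${a.b.c}` reference; it is not the OmegaConf implementation.

```python
import re

cfg = {
    "hparams": {"eval_every_steps": 100},
    "save": {"path": "./results", "save_every_steps": "${hparams.eval_every_steps}"},
}

def resolve(value, root):
    """Resolve a single ${a.b.c}-style interpolation against the config root."""
    m = re.fullmatch(r"\$\{([\w.]+)\}", value) if isinstance(value, str) else None
    if not m:
        return value  # plain values pass through unchanged
    node = root
    for key in m.group(1).split("."):
        node = node[key]
    return node

save_every = resolve(cfg["save"]["save_every_steps"], cfg)
# save_every == 100, inherited from hparams.eval_every_steps
```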

AWS Configuration

AwsConfig

Credentials for S3 access.
train/core/config.py
@dataclass
class AwsConfig:
    profile: Optional[str] = None
    access_key_id: Optional[str] = None
    secret_access_key: Optional[str] = None
    default_region: Optional[str] = None
profile
string
AWS profile name from ~/.aws/credentials
access_key_id
string
AWS access key ID. Can also be set via AWS_ACCESS_KEY_ID environment variable.
secret_access_key
string
AWS secret access key. Can also be set via AWS_SECRET_ACCESS_KEY environment variable.
default_region
string
Default AWS region.

Example

aws:
  profile: my-profile
  default_region: us-west-2
Never commit credentials to version control! Use environment variables or AWS profiles.

WandB Configuration

WandbConfig

Weights & Biases experiment tracking.
train/core/config.py
@dataclass
class WandbConfig:
    entity: str = "ai2-llm"
    project: str = "pdf-qwen2vl"
    wandb_api_key: Optional[str] = None
    mode: str = "online"
    watch: str = "false"
entity
string
default:"ai2-llm"
WandB team/entity name.
project
string
default:"pdf-qwen2vl"
WandB project name.
wandb_api_key
string
WandB API key. Can also be set via WANDB_API_KEY environment variable.
mode
string
default:"online"
Logging mode. Options: online, offline, disabled

Example

wandb:
  entity: my-team
  project: document-ocr
  mode: online

Complete Configuration Example

model:
  name_or_path: Qwen/Qwen2-VL-7B-Instruct
  arch: causal
  use_flash_attn: true

lora:
  rank: 32
  alpha: 32
  dropout: 0.05
  task_type: CAUSAL_LM
  target_modules:
    - q_proj
    - k_proj
    - v_proj
    - o_proj
    - gate_proj
    - up_proj
    - down_proj

wandb:
  project: my-ocr-project
  entity: my-team

aws:
  profile: default
  default_region: us-west-2

generate:
  max_length: 8192

train_data:
  seed: 1337
  cache_location: /data/pdfs
  sources:
    - name: training_set
      response_glob_path: s3://bucket/train/*.json
      target_longest_image_dim: [1024]
      target_anchor_text_len: [6000]

valid_data:
  cache_location: /data/pdfs
  metric_for_best_model: training_set_loss
  sources:
    - name: training_set
      response_glob_path: s3://bucket/eval/*.json
      target_longest_image_dim: [1024]
      target_anchor_text_len: [6000]

hparams:
  batch_size: 1
  gradient_accumulation_steps: 4
  gradient_checkpointing: true
  clip_grad_norm: 1.0
  learning_rate: 1e-4
  max_steps: 10000
  log_every_steps: 10
  eval_every_steps: 100
  optim: adamw_torch
  lr_scheduler: cosine
  weight_decay: 0.01
  warmup_ratio: 0.03

save:
  path: s3://bucket/models/
  save_every_steps: 1000
  limit: 5

max_workers: 10

Command-Line Overrides

You can override any configuration value from the command line:
python -m olmocr.train.train \
  --config config.yaml \
  --hparams.learning_rate 3e-4 \
  --hparams.max_steps 5000 \
  --lora.rank 64 \
  --save.path s3://new-bucket/models/

Next Steps

Training Overview

Learn about the training pipeline

Qwen2-VL Training

Fine-tune Qwen2-VL models

Molmo Training

Fine-tune Molmo models

Data Preparation

Prepare training datasets
