
Overview

Qwen2-VL is a family of efficient vision-language models available in 2B and 7B parameter sizes. olmOCR supports fine-tuning both variants using LoRA for parameter-efficient training.

Model Selection

Qwen2-VL-2B

Faster training and inference, suitable for resource-constrained environments

Qwen2-VL-7B

Better performance on complex documents, recommended for production

Quick Start

Basic Training Command

python -m olmocr.train.train \
  --config olmocr/train/config/qwen2vl-7b-lora.yaml

Configuration File

Create a configuration file qwen2vl-custom.yaml:
model:
  name_or_path: Qwen/Qwen2-VL-7B-Instruct
  arch: causal
  use_flash_attn: true

wandb:
  project: my-ocr-project
  entity: my-team

generate:
  max_length: 8192

train_data:
  seed: 1337
  cache_location: /path/to/pdf/cache
  sources:
    - name: my_training_data
      response_glob_path: s3://my-bucket/train/*.json
      target_longest_image_dim: [1024]
      target_anchor_text_len: [6000]

valid_data:
  cache_location: /path/to/pdf/cache
  metric_for_best_model: my_eval_data_loss
  sources:
    - name: my_eval_data
      response_glob_path: s3://my-bucket/eval/*.json
      target_longest_image_dim: [1024]
      target_anchor_text_len: [6000]

hparams:
  batch_size: 1
  eval_batch_size: 1
  gradient_accumulation_steps: 4
  gradient_checkpointing: true
  clip_grad_norm: 1.0
  learning_rate: 1e-4
  max_steps: 10000
  log_every_steps: 10
  eval_every_steps: 100
  optim: adamw_torch
  lr_scheduler: cosine
  weight_decay: 0.01
  warmup_ratio: 0.03

lora:
  rank: 32
  alpha: 32
  dropout: 0.05
  task_type: causal_lm
  target_modules:
    - q_proj
    - k_proj
    - v_proj
    - o_proj
    - gate_proj
    - up_proj
    - down_proj
    - visual.blocks.[0-9]+.attn.qkv
    - visual.blocks.[0-9]+.attn.proj
    - visual.blocks.[0-9]+.mlp.fc1
    - visual.blocks.[0-9]+.mlp.fc2
    - visual.merger.mlp.0
    - visual.merger.mlp.2

save:
  path: s3://my-bucket/models/
  save_every_steps: 1000

max_workers: 10

Qwen2-VL Specific Configuration

Flash Attention

Qwen2-VL supports flash attention for faster training:
model:
  use_flash_attn: true
train/train.py
if "qwen" in config.model.name_or_path.lower():
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        config.model.name_or_path, 
        torch_dtype=torch.bfloat16, 
        _attn_implementation="flash_attention_2" if config.model.use_flash_attn else None
    )
Flash attention requires specific GPU architectures (Ampere or newer). Set to false if you encounter compatibility issues.
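The Ampere requirement can be checked programmatically before enabling the flag. A minimal sketch (the helper name is our own; with PyTorch you would pass it `torch.cuda.get_device_capability()`):

```python
def supports_flash_attention_2(capability: tuple) -> bool:
    """Flash attention 2 needs compute capability >= 8.0 (Ampere or newer).

    `capability` is a (major, minor) pair, e.g. the value returned by
    torch.cuda.get_device_capability() when PyTorch is available.
    """
    major, _minor = capability
    return major >= 8

# A100 (8, 0) and H100 (9, 0) qualify; T4 (7, 5) and V100 (7, 0) do not.
print(supports_flash_attention_2((8, 0)))  # True
print(supports_flash_attention_2((7, 5)))  # False
```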

Target Modules for LoRA

For Qwen2-VL, olmOCR applies LoRA adapters to both the language and vision components, so target_modules covers both:
- q_proj      # Query projection
- k_proj      # Key projection
- v_proj      # Value projection
- o_proj      # Output projection
- gate_proj   # MLP gate
- up_proj     # MLP up projection
- down_proj   # MLP down projection
- visual.blocks.[0-9]+.attn.qkv   # Vision attention
- visual.blocks.[0-9]+.attn.proj  # Vision projection
- visual.blocks.[0-9]+.mlp.fc1    # Vision MLP layer 1
- visual.blocks.[0-9]+.mlp.fc2    # Vision MLP layer 2
- visual.merger.mlp.0             # Vision merger layer 1
- visual.merger.mlp.2             # Vision merger layer 2
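Entries such as `visual.blocks.[0-9]+.attn.qkv` are regular expressions over module names, while plain entries like `q_proj` match by suffix. A self-contained approximation of that matching logic (the exact rules depend on your PEFT version; `is_lora_target` is illustrative):

```python
import re

# A subset of the target_modules list from the config above. Entries
# containing regex metacharacters are treated as patterns over full
# module names; plain entries match the final component of the name.
TARGET_MODULES = [
    "q_proj", "k_proj", "v_proj", "o_proj",
    "visual.blocks.[0-9]+.attn.qkv",
    "visual.merger.mlp.0",
]

def is_lora_target(module_name: str) -> bool:
    """Approximation of adapter targeting: a module is wrapped if its
    name fully matches a regex entry or ends with a plain entry."""
    for pattern in TARGET_MODULES:
        if re.fullmatch(pattern, module_name) or module_name.endswith("." + pattern):
            return True
    return False

print(is_lora_target("visual.blocks.17.attn.qkv"))        # True
print(is_lora_target("model.layers.3.self_attn.q_proj"))  # True
print(is_lora_target("visual.blocks.17.norm1"))           # False
```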

Image Resolution

Qwen2-VL processes images at configurable resolutions. Higher resolutions capture more detail but increase memory usage:
train_data:
  sources:
    - name: my_training_data
      target_longest_image_dim: [1024]  # single resolution
      # Or list several resolutions for data augmentation:
      # target_longest_image_dim: [768, 1024, 1280]
For documents with small text, use 1024 or higher. For simpler documents, 768 may suffice.
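The resize itself preserves aspect ratio: the longer side is scaled to the target dimension and the shorter side follows. A minimal sketch of that computation (the helper name and the rounding choice are ours, not olmOCR's exact implementation):

```python
def resize_to_longest_dim(width: int, height: int, target: int) -> tuple:
    """Scale (width, height) so the longer side equals `target`,
    preserving aspect ratio and rounding to whole pixels."""
    scale = target / max(width, height)
    return round(width * scale), round(height * scale)

# A US-letter page rendered at 1275x1650 px, targeting 1024:
print(resize_to_longest_dim(1275, 1650, 1024))  # (791, 1024)
```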

Data Format

Input Processing

Qwen2-VL uses a chat template format:
train/dataprep.py
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": base64_page_image},
            {"type": "text", "text": build_finetuning_prompt(anchor_text)},
        ],
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

Label Preparation

Labels are created by masking the input portion:
train/dataprep.py
# Concatenate input_ids and labels
input_ids = np.concatenate([inputs.input_ids[0], labels.input_ids[0]], axis=0)

# Create labels, masking the input portion with -100
labels_full = np.full_like(input_ids, fill_value=-100)
labels_full[len(inputs.input_ids[0]):] = labels.input_ids[0]
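The effect is easy to see with dummy token ids: prompt positions become -100, which the cross-entropy loss ignores, so only the response tokens contribute to training:

```python
import numpy as np

prompt_ids = np.array([101, 7, 8, 9])    # tokenized prompt (dummy ids)
response_ids = np.array([42, 43, 102])   # tokenized target response

input_ids = np.concatenate([prompt_ids, response_ids])

# Mask the prompt portion with -100 so the loss only covers the response.
labels_full = np.full_like(input_ids, fill_value=-100)
labels_full[len(prompt_ids):] = response_ids

print(input_ids.tolist())    # [101, 7, 8, 9, 42, 43, 102]
print(labels_full.tolist())  # [-100, -100, -100, -100, 42, 43, 102]
```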

Collation

Qwen2-VL requires specific tensor formats:
train/utils.py
return {
    "input_ids": truncated_input_ids,
    "attention_mask": truncated_attention_mask,
    "labels": truncated_labels,
    "pixel_values": torch.tensor(batch[0]["pixel_values"]).unsqueeze(0),
    "image_grid_thw": torch.tensor(batch[0]["image_grid_thw"]).unsqueeze(0),
}
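The `truncated_*` tensors come from an earlier padding/truncation pass over the batch. A simplified, framework-free sketch of right-padding variable-length sequences and building the matching attention masks (`pad_batch` and the pad id are illustrative, not olmOCR's actual collator):

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Right-pad token id lists to a common length; 1 in the attention
    mask marks real tokens, 0 marks padding."""
    target = max_length or max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:target]  # truncate sequences over the limit
        pad = [pad_id] * (target - len(seq))
        input_ids.append(seq + pad)
        attention_mask.append([1] * len(seq) + [0] * len(pad))
    return input_ids, attention_mask

ids, mask = pad_batch([[5, 6, 7], [8, 9]])
print(ids)   # [[5, 6, 7], [8, 9, 0]]
print(mask)  # [[1, 1, 1], [1, 1, 0]]
```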

Training Examples

2B Model (Single GPU)

python -m olmocr.train.train \
  --model.name_or_path Qwen/Qwen2-VL-2B-Instruct \
  --model.use_flash_attn true \
  --hparams.batch_size 1 \
  --hparams.gradient_accumulation_steps 4 \
  --hparams.learning_rate 3e-4 \
  --hparams.max_steps 2000 \
  --lora.rank 32 \
  --lora.alpha 32 \
  --train_data.sources.0.response_glob_path s3://bucket/train/*.json \
  --valid_data.sources.0.response_glob_path s3://bucket/eval/*.json

7B Model (Multi-GPU)

torchrun --nproc_per_node=4 -m olmocr.train.train \
  --config olmocr/train/config/qwen2vl-7b-lora.yaml

Beaker Cluster

For distributed training on Beaker:
beaker experiment create \
  --name qwen2vl-7b-training \
  --task-image olmocr:latest \
  --task-command "python -m olmocr.train.train --config /config.yaml" \
  --gpus 8 \
  --priority high

Checkpoint Handling

Automatic Checkpointing

Checkpoints are saved automatically based on your configuration:
save:
  path: s3://my-bucket/models/
  save_every_steps: 1000
train/train.py
class CheckpointUploadCallback(TrainerCallback):
    def on_save(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
        if state.is_local_process_zero:
            latest_checkpoint = get_last_checkpoint(args.output_dir)
            if not latest_checkpoint:
                return
            dir_name = Path(latest_checkpoint).name
            copy_dir(str(latest_checkpoint), f"{self.save_path}/{dir_name}")

Best Model Selection

The best model is selected based on validation loss:
valid_data:
  metric_for_best_model: my_eval_data_loss

Loading Checkpoints

Resume training from a checkpoint:
python -m olmocr.train.train \
  --config config.yaml \
  --resume_from_checkpoint s3://bucket/models/checkpoint-1000

Performance Optimization

Gradient Checkpointing

Enable for large models to reduce memory:
hparams:
  gradient_checkpointing: true

Gradient Accumulation

Simulate larger batch sizes:
hparams:
  batch_size: 1
  gradient_accumulation_steps: 4  # Effective batch size = 4
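Accumulation sums gradients over several micro-batches before each optimizer step, so one update effectively sees `batch_size * gradient_accumulation_steps` examples per GPU. A framework-free sketch of the schedule:

```python
batch_size = 1
gradient_accumulation_steps = 4
# Effective examples per optimizer step, per GPU:
effective_batch_size = batch_size * gradient_accumulation_steps

optimizer_steps = 0
accumulated = 0
for micro_step in range(12):             # 12 micro-batches of data
    accumulated += 1                     # loss.backward() adds into .grad here
    if accumulated == gradient_accumulation_steps:
        optimizer_steps += 1             # optimizer.step(); optimizer.zero_grad()
        accumulated = 0

print(effective_batch_size)  # 4
print(optimizer_steps)       # 3
```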

Mixed Precision

Training uses BFloat16 by default, which offers better numerical stability than FP16 at the same memory cost:
train/train.py
training_args = TrainingArguments(
    bf16=True,
    # ... other args
)

Troubleshooting

Out of Memory

If you hit CUDA out-of-memory errors, try these solutions:
  1. Enable gradient checkpointing
  2. Reduce batch size to 1
  3. Reduce target_longest_image_dim to 768
  4. Reduce max_length to 4096
  5. Use gradient accumulation instead of larger batches

Flash Attention Errors

If flash attention fails to load or crashes, disable it:
model:
  use_flash_attn: false

Slow Data Loading

If data loading is the bottleneck, increase workers and cache PDFs on fast local storage:
max_workers: 10
train_data:
  cache_location: /fast/local/storage

Next Steps

Configuration Reference

Explore all configuration options

Molmo Training

Try training Molmo models

Evaluation

Evaluate your trained models

Cluster Usage

Deploy your fine-tuned models at scale
