
Overview

Qwen2-VL is a family of efficient vision-language models available in 2B and 7B parameter sizes. olmOCR supports fine-tuning both variants using LoRA for parameter-efficient training.

Model Selection

Qwen2-VL-2B

Faster training and inference, suitable for resource-constrained environments

Qwen2-VL-7B

Better performance on complex documents, recommended for production

Quick Start

Basic Training Command

python -m olmocr.train.train \
  --config olmocr/train/config/qwen2vl-7b-lora.yaml

Configuration File

Create a configuration file qwen2vl-custom.yaml:
model:
  name_or_path: Qwen/Qwen2-VL-7B-Instruct
  arch: causal
  use_flash_attn: true

wandb:
  project: my-ocr-project
  entity: my-team

generate:
  max_length: 8192

train_data:
  seed: 1337
  cache_location: /path/to/pdf/cache
  sources:
    - name: my_training_data
      response_glob_path: s3://my-bucket/train/*.json
      target_longest_image_dim: [1024]
      target_anchor_text_len: [6000]

valid_data:
  cache_location: /path/to/pdf/cache
  metric_for_best_model: my_eval_data_loss
  sources:
    - name: my_eval_data
      response_glob_path: s3://my-bucket/eval/*.json
      target_longest_image_dim: [1024]
      target_anchor_text_len: [6000]

hparams:
  batch_size: 1
  eval_batch_size: 1
  gradient_accumulation_steps: 4
  gradient_checkpointing: true
  clip_grad_norm: 1.0
  learning_rate: 1e-4
  max_steps: 10000
  log_every_steps: 10
  eval_every_steps: 100
  optim: adamw_torch
  lr_scheduler: cosine
  weight_decay: 0.01
  warmup_ratio: 0.03

lora:
  rank: 32
  alpha: 32
  dropout: 0.05
  task_type: causal_lm
  target_modules:
    - q_proj
    - k_proj
    - v_proj
    - o_proj
    - gate_proj
    - up_proj
    - down_proj
    - visual.blocks.[0-9]+.attn.qkv
    - visual.blocks.[0-9]+.attn.proj
    - visual.blocks.[0-9]+.mlp.fc1
    - visual.blocks.[0-9]+.mlp.fc2
    - visual.merger.mlp.0
    - visual.merger.mlp.2

save:
  path: s3://my-bucket/models/
  save_every_steps: 1000

max_workers: 10

Qwen2-VL Specific Configuration

Flash Attention

Qwen2-VL supports flash attention for faster training:
model:
  use_flash_attn: true
train/train.py
if "qwen" in config.model.name_or_path.lower():
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        config.model.name_or_path, 
        torch_dtype=torch.bfloat16, 
        _attn_implementation="flash_attention_2" if config.model.use_flash_attn else None
    )
Flash attention requires specific GPU architectures (Ampere or newer). Set to false if you encounter compatibility issues.
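The Ampere requirement can be checked programmatically before enabling the flag. A minimal sketch (the helper name is our own; with PyTorch you would pass it `torch.cuda.get_device_capability()`):

```python
def supports_flash_attention_2(capability: tuple) -> bool:
    """Flash attention 2 needs compute capability >= 8.0 (Ampere or newer).

    `capability` is a (major, minor) pair, e.g. the value returned by
    torch.cuda.get_device_capability() when PyTorch is available.
    """
    major, _minor = capability
    return major >= 8

# A100 (8, 0) and H100 (9, 0) qualify; T4 (7, 5) and V100 (7, 0) do not.
print(supports_flash_attention_2((8, 0)))  # True
print(supports_flash_attention_2((7, 5)))  # False
```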

Target Modules for LoRA

For Qwen2-VL, olmOCR applies LoRA adapters to both the language and vision components, so target_modules covers both:
- q_proj      # Query projection
- k_proj      # Key projection
- v_proj      # Value projection
- o_proj      # Output projection
- gate_proj   # MLP gate
- up_proj     # MLP up projection
- down_proj   # MLP down projection
- visual.blocks.[0-9]+.attn.qkv   # Vision attention
- visual.blocks.[0-9]+.attn.proj  # Vision projection
- visual.blocks.[0-9]+.mlp.fc1    # Vision MLP layer 1
- visual.blocks.[0-9]+.mlp.fc2    # Vision MLP layer 2
- visual.merger.mlp.0             # Vision merger layer 1
- visual.merger.mlp.2             # Vision merger layer 2
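Entries such as `visual.blocks.[0-9]+.attn.qkv` are regular expressions over module names, while plain entries like `q_proj` match by suffix. A self-contained approximation of that matching logic (the exact rules depend on your PEFT version; `is_lora_target` is illustrative):

```python
import re

# A subset of the target_modules list from the config above. Entries
# containing regex metacharacters are treated as patterns over full
# module names; plain entries match the final component of the name.
TARGET_MODULES = [
    "q_proj", "k_proj", "v_proj", "o_proj",
    "visual.blocks.[0-9]+.attn.qkv",
    "visual.merger.mlp.0",
]

def is_lora_target(module_name: str) -> bool:
    """Approximation of adapter targeting: a module is wrapped if its
    name fully matches a regex entry or ends with a plain entry."""
    for pattern in TARGET_MODULES:
        if re.fullmatch(pattern, module_name) or module_name.endswith("." + pattern):
            return True
    return False

print(is_lora_target("visual.blocks.17.attn.qkv"))        # True
print(is_lora_target("model.layers.3.self_attn.q_proj"))  # True
print(is_lora_target("visual.blocks.17.norm1"))           # False
```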

Image Resolution

Qwen2-VL processes images at configurable resolutions. Higher resolutions capture more detail but increase memory usage:
train_data:
  sources:
    - name: my_training_data
      target_longest_image_dim: [1024]  # single resolution
      # Or list several resolutions for data augmentation:
      # target_longest_image_dim: [768, 1024, 1280]
For documents with small text, use 1024 or higher. For simpler documents, 768 may suffice.
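The resize itself preserves aspect ratio: the longer side is scaled to the target dimension and the shorter side follows. A minimal sketch of that computation (the helper name and the rounding choice are ours, not olmOCR's exact implementation):

```python
def resize_to_longest_dim(width: int, height: int, target: int) -> tuple:
    """Scale (width, height) so the longer side equals `target`,
    preserving aspect ratio and rounding to whole pixels."""
    scale = target / max(width, height)
    return round(width * scale), round(height * scale)

# A US-letter page rendered at 1275x1650 px, targeting 1024:
print(resize_to_longest_dim(1275, 1650, 1024))  # (791, 1024)
```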

Data Format

Input Processing

Qwen2-VL uses a chat template format:
train/dataprep.py
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": base64_page_image},
            {"type": "text", "text": build_finetuning_prompt(anchor_text)},
        ],
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

Label Preparation

Labels are created by masking the input portion:
train/dataprep.py
# Concatenate input_ids and labels
input_ids = np.concatenate([inputs.input_ids[0], labels.input_ids[0]], axis=0)

# Create labels, masking the input portion with -100
labels_full = np.full_like(input_ids, fill_value=-100)
labels_full[len(inputs.input_ids[0]):] = labels.input_ids[0]
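The effect is easy to see with dummy token ids: prompt positions become -100, which the cross-entropy loss ignores, so only the response tokens contribute to training:

```python
import numpy as np

prompt_ids = np.array([101, 7, 8, 9])    # tokenized prompt (dummy ids)
response_ids = np.array([42, 43, 102])   # tokenized target response

input_ids = np.concatenate([prompt_ids, response_ids])

# Mask the prompt portion with -100 so the loss only covers the response.
labels_full = np.full_like(input_ids, fill_value=-100)
labels_full[len(prompt_ids):] = response_ids

print(input_ids.tolist())    # [101, 7, 8, 9, 42, 43, 102]
print(labels_full.tolist())  # [-100, -100, -100, -100, 42, 43, 102]
```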

Collation

Qwen2-VL requires specific tensor formats:
train/utils.py
return {
    "input_ids": truncated_input_ids,
    "attention_mask": truncated_attention_mask,
    "labels": truncated_labels,
    "pixel_values": torch.tensor(batch[0]["pixel_values"]).unsqueeze(0),
    "image_grid_thw": torch.tensor(batch[0]["image_grid_thw"]).unsqueeze(0),
}
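The `truncated_*` tensors come from an earlier padding/truncation pass over the batch. A simplified, framework-free sketch of right-padding variable-length sequences and building the matching attention masks (`pad_batch` and the pad id are illustrative, not olmOCR's actual collator):

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Right-pad token id lists to a common length; 1 in the attention
    mask marks real tokens, 0 marks padding."""
    target = max_length or max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:target]  # truncate sequences over the limit
        pad = [pad_id] * (target - len(seq))
        input_ids.append(seq + pad)
        attention_mask.append([1] * len(seq) + [0] * len(pad))
    return input_ids, attention_mask

ids, mask = pad_batch([[5, 6, 7], [8, 9]])
print(ids)   # [[5, 6, 7], [8, 9, 0]]
print(mask)  # [[1, 1, 1], [1, 1, 0]]
```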

Training Examples

2B Model (Single GPU)

python -m olmocr.train.train \
  --model.name_or_path Qwen/Qwen2-VL-2B-Instruct \
  --model.use_flash_attn true \
  --hparams.batch_size 1 \
  --hparams.gradient_accumulation_steps 4 \
  --hparams.learning_rate 3e-4 \
  --hparams.max_steps 2000 \
  --lora.rank 32 \
  --lora.alpha 32 \
  --train_data.sources.0.response_glob_path s3://bucket/train/*.json \
  --valid_data.sources.0.response_glob_path s3://bucket/eval/*.json

7B Model (Multi-GPU)

torchrun --nproc_per_node=4 -m olmocr.train.train \
  --config olmocr/train/config/qwen2vl-7b-lora.yaml

Beaker Cluster

For distributed training on Beaker:
beaker experiment create \
  --name qwen2vl-7b-training \
  --task-image olmocr:latest \
  --task-command "python -m olmocr.train.train --config /config.yaml" \
  --gpus 8 \
  --priority high

Checkpoint Handling

Automatic Checkpointing

Checkpoints are saved automatically based on your configuration:
save:
  path: s3://my-bucket/models/
  save_every_steps: 1000
train/train.py
class CheckpointUploadCallback(TrainerCallback):
    def on_save(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
        if state.is_local_process_zero:
            latest_checkpoint = get_last_checkpoint(args.output_dir)
            if not latest_checkpoint:
                return
            dir_name = Path(latest_checkpoint).name
            copy_dir(str(latest_checkpoint), f"{self.save_path}/{dir_name}")

Best Model Selection

The best model is selected based on validation loss:
valid_data:
  metric_for_best_model: my_eval_data_loss

Loading Checkpoints

Resume training from a checkpoint:
python -m olmocr.train.train \
  --config config.yaml \
  --resume_from_checkpoint s3://bucket/models/checkpoint-1000

Performance Optimization

Gradient Checkpointing

Enable for large models to reduce memory:
hparams:
  gradient_checkpointing: true

Gradient Accumulation

Simulate larger batch sizes:
hparams:
  batch_size: 1
  gradient_accumulation_steps: 4  # Effective batch size = 4
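Accumulation sums gradients over several micro-batches before each optimizer step, so one update effectively sees `batch_size * gradient_accumulation_steps` examples per GPU. A framework-free sketch of the schedule:

```python
batch_size = 1
gradient_accumulation_steps = 4
# Effective examples per optimizer step, per GPU:
effective_batch_size = batch_size * gradient_accumulation_steps

optimizer_steps = 0
accumulated = 0
for micro_step in range(12):             # 12 micro-batches of data
    accumulated += 1                     # loss.backward() adds into .grad here
    if accumulated == gradient_accumulation_steps:
        optimizer_steps += 1             # optimizer.step(); optimizer.zero_grad()
        accumulated = 0

print(effective_batch_size)  # 4
print(optimizer_steps)       # 3
```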

Mixed Precision

Training uses BFloat16 by default, which offers better numerical stability than FP16 at the same memory cost:
train/train.py
training_args = TrainingArguments(
    bf16=True,
    # ... other args
)

Troubleshooting

Out of Memory

If you hit CUDA out-of-memory errors, try these solutions:
  1. Enable gradient checkpointing
  2. Reduce batch size to 1
  3. Reduce target_longest_image_dim to 768
  4. Reduce max_length to 4096
  5. Use gradient accumulation instead of larger batches

Flash Attention Errors

If flash attention fails to load or crashes, disable it:
model:
  use_flash_attn: false

Slow Data Loading

If data loading is the bottleneck, increase workers and cache PDFs on fast local storage:
max_workers: 10
train_data:
  cache_location: /fast/local/storage

Next Steps

Configuration Reference

Explore all configuration options

Molmo Training

Try training Molmo models

Evaluation

Evaluate your trained models

Cluster Usage

Deploy your fine-tuned models at scale
