The training script expects a Hugging Face dataset with two columns:
  • images — a list of PIL images for the sample.
  • messages — a list of message dicts in the chat format your target model expects during inference.
The message format in your dataset must exactly match what the model sees at inference time. Using the wrong format will result in poor training signal and degraded outputs.

Dataset structure

{
  "images": ["<PIL.Image>", "..."],
  "messages": [
    {
      "role": "user",
      "content": "..."
    },
    {
      "role": "assistant",
      "content": "..."
    }
  ]
}
Each row in the dataset represents one training example. A single example can contain multiple images and multiple turns of conversation.
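Before training, it can save time to check that each row has this shape. A minimal sketch of such a check (the `validate_row` helper is illustrative, not part of the training script):

```python
def validate_row(row: dict) -> None:
    """Raise if a dataset row is missing the expected structure."""
    assert "images" in row and "messages" in row, "row needs images and messages"
    assert isinstance(row["images"], list), "images must be a list"
    roles = [m["role"] for m in row["messages"]]
    # A well-formed example starts with a user turn.
    assert roles and roles[0] == "user", "conversation should start with a user turn"

row = {
    "images": ["<PIL.Image>"],
    "messages": [
        {"role": "user", "content": "Describe the image."},
        {"role": "assistant", "content": "A cat on a couch."},
    ],
}
validate_row(row)  # passes silently for a well-formed row
```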

Message formats by model

The content structure inside each message varies by model family. For Qwen2/3-VL-style models, content is a list of typed objects, and images are referenced inline within the user turn:
{
  "images": ["<image1>", "<image2>"],
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image", "image": "<image1>"},
        {"type": "image", "image": "<image2>"},
        {"type": "text", "text": "What is in these images?"}
      ]
    },
    {
      "role": "assistant",
      "content": [
        {"type": "text", "text": "The first image shows a cat sleeping on a couch. The second shows a dog in a garden."}
      ]
    }
  ]
}

Automatic dataset transformation

If your dataset has question and answer columns (plus an image or images column) instead of a messages column, the script can automatically convert it for Qwen and DeepSeek models. For other model families, or to customise the transformation, provide a JSON template with --custom-prompt-format:
python lora.py \
    --model-path mlx-community/Qwen3-VL-2B-Instruct-bf16 \
    --dataset your-dataset-id \
    --custom-prompt-format '{"user": [{"type": "image","image":"{image}"}, {"type": "text","text":"{question}"}], "assistant": [{"type": "text","text":"{answer}"}]}'
The template supports three placeholder variables: {image}, {question}, and {answer}. These are filled from the corresponding dataset columns at runtime.
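Conceptually, the transformation fills those placeholders from each dataset row to build a messages list. A minimal sketch of that substitution, assuming the template shape above (the `apply_prompt_format` helper and its string handling are illustrative, not the script's actual implementation):

```python
import json

def apply_prompt_format(template: str, row: dict) -> list[dict]:
    """Fill {image}/{question}/{answer} placeholders and build a messages list."""
    spec = json.loads(template)

    def fill(part: dict) -> dict:
        filled = dict(part)
        if filled.get("image") == "{image}":
            filled["image"] = row["image"]
        if filled.get("text"):
            filled["text"] = (
                filled["text"]
                .replace("{question}", str(row.get("question", "")))
                .replace("{answer}", str(row.get("answer", "")))
            )
        return filled

    return [
        {"role": "user", "content": [fill(p) for p in spec["user"]]},
        {"role": "assistant", "content": [fill(p) for p in spec["assistant"]]},
    ]

template = '{"user": [{"type": "image","image":"{image}"}, {"type": "text","text":"{question}"}], "assistant": [{"type": "text","text":"{answer}"}]}'
row = {"image": "<PIL.Image>", "question": "What is shown?", "answer": "A cat."}
messages = apply_prompt_format(template, row)
```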

Creating a dataset programmatically

import os

from datasets import Dataset
from PIL import Image

def prepare_qwen3_dataset(image_dir: str, annotations: list[dict]) -> Dataset:
    """
    annotations: list of {"image_path": str, "question": str, "answer": str}

    Each image_path is resolved relative to image_dir.
    """
    dataset_dict = {"images": [], "messages": []}

    for item in annotations:
        image = Image.open(os.path.join(image_dir, item["image_path"]))

        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": image},
                    {"type": "text", "text": item["question"]},
                ],
            },
            {
                "role": "assistant",
                "content": [
                    {"type": "text", "text": item["answer"]},
                ],
            },
        ]

        dataset_dict["images"].append(image)
        dataset_dict["messages"].append(messages)

    dataset = Dataset.from_dict(dataset_dict)
    return dataset


annotations = [
    {
        "image_path": "path/to/image_0.jpg",
        "question": "Describe this image",
        "answer": "Your annotation here",
    },
    # ...
]

dataset = prepare_qwen3_dataset("path/to/images", annotations)

# Push to the Hugging Face Hub
dataset.push_to_hub("your-username/your-dataset-name")

Common issues

If the script reports that it cannot find the required columns, the dataset has neither a messages column nor both question and answer columns. Either restructure your dataset to include a messages column in the correct format, or add question and answer columns and use --custom-prompt-format to define how they map to messages.
If training runs but outputs are poor, the most common cause is a mismatch between the message format in the dataset and what the model expects. Verify your format against the examples for your specific model family. Pay particular attention to role strings: for example, DeepSeek uses "<|User|>" and "<|Assistant|>" rather than "user" and "assistant".
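A quick way to catch role-string mismatches is to compare the roles in a row against what the target family expects. A tiny illustrative helper (the role sets shown are examples; confirm them against your model's chat template):

```python
# Expected role strings per model family (illustrative; verify against your
# model's actual chat template before relying on these).
EXPECTED_ROLES = {
    "qwen": {"user", "assistant"},
    "deepseek": {"<|User|>", "<|Assistant|>"},
}

def roles_match(messages: list[dict], family: str) -> bool:
    """True if every role string in messages is one the family expects."""
    return {m["role"] for m in messages} <= EXPECTED_ROLES[family]

msgs = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "hello"},
]
roles_match(msgs, "qwen")      # True
roles_match(msgs, "deepseek")  # False: DeepSeek expects "<|User|>"/"<|Assistant|>"
```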
For models that support multiple images per turn (Qwen2/3-VL, Mllama), list all images in the top-level images column and include one {"type": "image", "image": <img>} entry per image in the user content list, in the order the model should process them.
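Building such a multi-image row programmatically follows the same pattern as the single-image case. A minimal sketch (the `make_multi_image_row` helper is illustrative, not part of the training script):

```python
def make_multi_image_row(images: list, question: str, answer: str) -> dict:
    """Build one training row with several images in a single user turn."""
    # One image entry per image, in the order the model should process them.
    content = [{"type": "image", "image": img} for img in images]
    content.append({"type": "text", "text": question})
    return {
        "images": list(images),
        "messages": [
            {"role": "user", "content": content},
            {"role": "assistant", "content": [{"type": "text", "text": answer}]},
        ],
    }

row = make_multi_image_row(
    ["<img1>", "<img2>"], "Compare these images.", "Both show cats."
)
```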
