The training script expects a Hugging Face dataset with two columns:
  • images — a list of PIL images for the sample.
  • messages — a list of message dicts in the chat format your target model expects during inference.
The message format in your dataset must exactly match what the model sees at inference time. Using the wrong format will result in poor training signal and degraded outputs.

Dataset structure

{
  "images": ["<PIL.Image>", "..."],
  "messages": [
    {
      "role": "user",
      "content": "..."
    },
    {
      "role": "assistant",
      "content": "..."
    }
  ]
}
Each row in the dataset represents one training example. A single example can contain multiple images and multiple turns of conversation.
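Before training, it can save time to check that each row has this shape. A minimal sketch of such a check (the `validate_row` helper is illustrative, not part of the training script):

```python
def validate_row(row: dict) -> None:
    """Raise if a dataset row is missing the expected structure."""
    assert "images" in row and "messages" in row, "row needs images and messages"
    assert isinstance(row["images"], list), "images must be a list"
    roles = [m["role"] for m in row["messages"]]
    # A well-formed example starts with a user turn.
    assert roles and roles[0] == "user", "conversation should start with a user turn"

row = {
    "images": ["<PIL.Image>"],
    "messages": [
        {"role": "user", "content": "Describe the image."},
        {"role": "assistant", "content": "A cat on a couch."},
    ],
}
validate_row(row)  # passes silently for a well-formed row
```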

Message formats by model

The content structure inside each message varies by model family. For Qwen2/3-VL-style models, content is a list of typed objects, and images are referenced inline within the user turn:
{
  "images": ["<image1>", "<image2>"],
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image", "image": "<image1>"},
        {"type": "image", "image": "<image2>"},
        {"type": "text", "text": "What is in these images?"}
      ]
    },
    {
      "role": "assistant",
      "content": [
        {"type": "text", "text": "The first image shows a cat sleeping on a couch. The second shows a dog in a garden."}
      ]
    }
  ]
}

Automatic dataset transformation

If your dataset has question and answer columns (plus an image or images column) instead of a messages column, the script can automatically convert it for Qwen and DeepSeek models. For other model families, or to customise the transformation, provide a JSON template with --custom-prompt-format:
python lora.py \
    --model-path mlx-community/Qwen3-VL-2B-Instruct-bf16 \
    --dataset your-dataset-id \
    --custom-prompt-format '{"user": [{"type": "image","image":"{image}"}, {"type": "text","text":"{question}"}], "assistant": [{"type": "text","text":"{answer}"}]}'
The template supports three placeholder variables: {image}, {question}, and {answer}. These are filled from the corresponding dataset columns at runtime.
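Conceptually, the transformation fills those placeholders from each dataset row to build a messages list. A minimal sketch of that substitution, assuming the template shape above (the `apply_prompt_format` helper and its string handling are illustrative, not the script's actual implementation):

```python
import json

def apply_prompt_format(template: str, row: dict) -> list[dict]:
    """Fill {image}/{question}/{answer} placeholders and build a messages list."""
    spec = json.loads(template)

    def fill(part: dict) -> dict:
        filled = dict(part)
        if filled.get("image") == "{image}":
            filled["image"] = row["image"]
        if filled.get("text"):
            filled["text"] = (
                filled["text"]
                .replace("{question}", str(row.get("question", "")))
                .replace("{answer}", str(row.get("answer", "")))
            )
        return filled

    return [
        {"role": "user", "content": [fill(p) for p in spec["user"]]},
        {"role": "assistant", "content": [fill(p) for p in spec["assistant"]]},
    ]

template = '{"user": [{"type": "image","image":"{image}"}, {"type": "text","text":"{question}"}], "assistant": [{"type": "text","text":"{answer}"}]}'
row = {"image": "<PIL.Image>", "question": "What is shown?", "answer": "A cat."}
messages = apply_prompt_format(template, row)
```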

Creating a dataset programmatically

import os

from datasets import Dataset
from PIL import Image

def prepare_qwen3_dataset(image_dir: str, annotations: list[dict]) -> Dataset:
    """
    annotations: list of {"image_path": str, "question": str, "answer": str}

    Each image_path is resolved relative to image_dir.
    """
    dataset_dict = {"images": [], "messages": []}

    for item in annotations:
        image = Image.open(os.path.join(image_dir, item["image_path"]))

        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": image},
                    {"type": "text", "text": item["question"]},
                ],
            },
            {
                "role": "assistant",
                "content": [
                    {"type": "text", "text": item["answer"]},
                ],
            },
        ]

        dataset_dict["images"].append(image)
        dataset_dict["messages"].append(messages)

    dataset = Dataset.from_dict(dataset_dict)
    return dataset


annotations = [
    {
        "image_path": "path/to/image_0.jpg",
        "question": "Describe this image",
        "answer": "Your annotation here",
    },
    # ...
]

dataset = prepare_qwen3_dataset("path/to/images", annotations)

# Push to the Hugging Face Hub
dataset.push_to_hub("your-username/your-dataset-name")

Common issues

If the script reports that it cannot find the required columns, the dataset has neither a messages column nor both question and answer columns. Either restructure your dataset to include a messages column in the correct format, or add question and answer columns and use --custom-prompt-format to define how they map to messages.
If training runs but outputs are poor, the most common cause is a mismatch between the message format in the dataset and what the model expects. Verify your format against the examples for your specific model family. Pay particular attention to role strings: for example, DeepSeek uses "<|User|>" and "<|Assistant|>" rather than "user" and "assistant".
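A quick way to catch role-string mismatches is to compare the roles in a row against what the target family expects. A tiny illustrative helper (the role sets shown are examples; confirm them against your model's chat template):

```python
# Expected role strings per model family (illustrative; verify against your
# model's actual chat template before relying on these).
EXPECTED_ROLES = {
    "qwen": {"user", "assistant"},
    "deepseek": {"<|User|>", "<|Assistant|>"},
}

def roles_match(messages: list[dict], family: str) -> bool:
    """True if every role string in messages is one the family expects."""
    return {m["role"] for m in messages} <= EXPECTED_ROLES[family]

msgs = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "hello"},
]
roles_match(msgs, "qwen")      # True
roles_match(msgs, "deepseek")  # False: DeepSeek expects "<|User|>"/"<|Assistant|>"
```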
For models that support multiple images per turn (Qwen2/3-VL, Mllama), list all images in the top-level images column and include one {"type": "image", "image": <img>} entry per image in the user content list, in the order the model should process them.
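Building such a multi-image row programmatically follows the same pattern as the single-image case. A minimal sketch (the `make_multi_image_row` helper is illustrative, not part of the training script):

```python
def make_multi_image_row(images: list, question: str, answer: str) -> dict:
    """Build one training row with several images in a single user turn."""
    # One image entry per image, in the order the model should process them.
    content = [{"type": "image", "image": img} for img in images]
    content.append({"type": "text", "text": question})
    return {
        "images": list(images),
        "messages": [
            {"role": "user", "content": content},
            {"role": "assistant", "content": [{"type": "text", "text": answer}]},
        ],
    }

row = make_multi_image_row(
    ["<img1>", "<img2>"], "Compare these images.", "Both show cats."
)
```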
