Proper data preparation is critical for successful fine-tuning. This guide covers the data format, preprocessing, and best practices for creating high-quality training datasets.

Data Format

Qwen's fine-tuning script expects training data as a JSON array of conversations, which it later renders in the ChatML structure. Each training sample is one conversation consisting of one or more exchanges between user and assistant.

Basic Structure

[
  {
    "id": "unique_identifier",
    "conversations": [
      {
        "from": "user",
        "value": "User message here"
      },
      {
        "from": "assistant",
        "value": "Assistant response here"
      }
    ]
  }
]

Field Descriptions

  • id (string, required): A unique identifier for each conversation. Used for logging and debugging.
  • conversations (array, required): An array of conversation turns. Must contain at least one user-assistant pair.
  • conversations[].from (string, required): The role of the speaker. Must be either "user" or "assistant".
  • conversations[].value (string, required): The actual message content. Can contain multi-line text, code, or any UTF-8 content.

Multi-turn Conversations

You can include multiple user-assistant exchanges in a single conversation:
[
  {
    "id": "multi_turn_example",
    "conversations": [
      {
        "from": "user",
        "value": "What is the capital of France?"
      },
      {
        "from": "assistant",
        "value": "The capital of France is Paris."
      },
      {
        "from": "user",
        "value": "What is its population?"
      },
      {
        "from": "assistant",
        "value": "Paris has a population of approximately 2.2 million people within the city limits, and over 12 million in the metropolitan area."
      }
    ]
  }
]
Multi-turn conversations help the model learn context tracking and coherent dialogue flow.
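
One property worth checking in multi-turn samples is that the turns alternate between user and assistant, starting with the user and ending with the assistant. The following is a minimal sketch under the assumption that strict alternation is what you want; the helper name has_alternating_turns is hypothetical and not part of the training script.

```python
def has_alternating_turns(conversations):
    """Return True if turns go user, assistant, user, assistant, ...

    Assumption: a well-formed sample starts with a user turn and ends
    with an assistant turn, so the number of turns is even.
    """
    if len(conversations) < 2 or len(conversations) % 2 != 0:
        return False
    expected = ("user", "assistant")
    return all(
        turn.get("from") == expected[i % 2]
        for i, turn in enumerate(conversations)
    )


sample = [
    {"from": "user", "value": "What is the capital of France?"},
    {"from": "assistant", "value": "The capital of France is Paris."},
    {"from": "user", "value": "What is its population?"},
    {"from": "assistant", "value": "About 2.2 million within the city limits."},
]
print(has_alternating_turns(sample))      # True
print(has_alternating_turns(sample[:3]))  # False: ends on a user turn
```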

ChatML Format Processing

The training script automatically converts your JSON data into ChatML format during preprocessing. Understanding this helps you prepare better training data.

Automatic Template Application

Your JSON conversations are transformed into this structure:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
User message here<|im_end|>
<|im_start|>assistant
Assistant response here<|im_end|>

System Message

The default system message is "You are a helpful assistant." The preprocessing function in finetune.py:125-176 prepends it to every conversation automatically.
The system message is currently hardcoded in the training script as the default value of the system_message parameter of the preprocess() function. To customize it, change that default.
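
To preview what a sample looks like after the template is applied, the transformation can be sketched at the text level. This is an illustration only: the function name render_chatml is hypothetical, and the actual script builds token IDs rather than a string.

```python
DEFAULT_SYSTEM_MESSAGE = "You are a helpful assistant."


def render_chatml(conversations, system_message=DEFAULT_SYSTEM_MESSAGE):
    """Render a list of {"from", "value"} turns as ChatML text (illustration only)."""
    parts = [f"<|im_start|>system\n{system_message}<|im_end|>"]
    for turn in conversations:
        parts.append(f"<|im_start|>{turn['from']}\n{turn['value']}<|im_end|>")
    return "\n".join(parts)


print(render_chatml([
    {"from": "user", "value": "User message here"},
    {"from": "assistant", "value": "Assistant response here"},
]))
```

Passing a different system_message here is a cheap way to see how a customized system prompt would appear in the rendered sample.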

Token Masking Strategy

The training script masks part of each sequence so that the loss is computed mainly on assistant responses:

What Gets Masked (Not Trained)

  • System message content
  • User message content
  • Role names (system, user, assistant) and the newline that follows them
  • Padding tokens

What Gets Trained

  • The assistant’s response content
  • The structural tokens that the snippet below leaves unmasked: <|im_start|>, <|im_end|>, and the newline after <|im_end|>
This is implemented using IGNORE_TOKEN_ID in the preprocessing function:
if role == '<|im_start|>user':
    _target = [im_start] + [IGNORE_TOKEN_ID] * (len(_input_id)-3) + [im_end] + nl_tokens
elif role == '<|im_start|>assistant':
    _target = [im_start] + [IGNORE_TOKEN_ID] * len(tokenizer(role).input_ids) + \
        _input_id[len(tokenizer(role).input_ids)+1:-2] + [im_end] + nl_tokens
Because positions set to IGNORE_TOKEN_ID are excluded from the loss, gradient updates come from the assistant’s outputs rather than from prompts the model is never asked to generate.
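
The effect of the snippet above is easier to see on a toy example. The sketch below mirrors the same masking pattern with made-up token IDs; the IDs, the build_turn helper, and the value -100 for IGNORE_TOKEN_ID (the default ignore_index of PyTorch's cross-entropy loss) are assumptions for illustration, and a real run uses the tokenizer's IDs.

```python
IGNORE_TOKEN_ID = -100  # assumption: default ignore_index of PyTorch's cross-entropy loss

# Hypothetical token IDs, chosen only for this illustration.
IM_START, IM_END, NL = 1, 2, 3
ROLE_IDS = {"user": 10, "assistant": 11}


def build_turn(role, content_ids):
    """Return (input_ids, labels) for one turn, mirroring the masking pattern above."""
    input_ids = [IM_START, ROLE_IDS[role], NL] + content_ids + [IM_END, NL]
    if role == "user":
        # Keep the three structural tokens; mask the role name, its newline, and the content.
        labels = [IM_START] + [IGNORE_TOKEN_ID] * (len(input_ids) - 3) + [IM_END, NL]
    else:
        # Mask the role name and its newline; keep the response content.
        labels = [IM_START] + [IGNORE_TOKEN_ID] * 2 + content_ids + [IM_END, NL]
    assert len(input_ids) == len(labels)
    return input_ids, labels


user_ids, user_labels = build_turn("user", [101, 102])
asst_ids, asst_labels = build_turn("assistant", [201, 202, 203])
print(user_labels)  # [1, -100, -100, -100, -100, 2, 3]
print(asst_labels)  # [1, -100, -100, 201, 202, 203, 2, 3]
```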

Creating Training Data

Step-by-Step Example

1. Define Your Task

Identify what you want the model to learn. Examples:
  • Customer service responses
  • Code generation in a specific framework
  • Domain-specific question answering
  • Style transfer or tone adaptation
2. Collect Raw Data

Gather examples of high-quality interactions:
# Example: Converting Q&A pairs to conversation format
qa_pairs = [
    {"question": "How do I install Python?", "answer": "You can download..."},
    {"question": "What is a virtual environment?", "answer": "A virtual environment..."}
]
3. Format as JSON

Convert to the required conversation structure:
import json

training_data = []
for idx, qa in enumerate(qa_pairs):
    training_data.append({
        "id": f"qa_{idx}",
        "conversations": [
            {"from": "user", "value": qa["question"]},
            {"from": "assistant", "value": qa["answer"]}
        ]
    })

with open("training_data.json", "w", encoding="utf-8") as f:
    json.dump(training_data, f, ensure_ascii=False, indent=2)
4. Validate Format

Verify your data is correctly formatted:
# Validation script
import json

with open("training_data.json", "r", encoding="utf-8") as f:
    data = json.load(f)

for item in data:
    assert "id" in item, "Missing id field"
    assert "conversations" in item, "Missing conversations field"
    assert len(item["conversations"]) >= 2, "Need at least one user-assistant pair"
    for conv in item["conversations"]:
        # .get() reports a missing "from" key as an invalid role instead of a KeyError
        assert conv.get("from") in ["user", "assistant"], "Invalid role"
        assert "value" in conv, "Missing value field"

print(f"✓ Validated {len(data)} conversations")
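
Since each id is meant to be unique, it can also help to look for repeated identifiers before training. A minimal sketch, assuming the data is already loaded as a list of dicts; find_duplicate_ids is a hypothetical helper name.

```python
from collections import Counter


def find_duplicate_ids(data):
    """Return the sorted list of "id" values that occur more than once."""
    counts = Counter(item["id"] for item in data)
    return sorted(id_ for id_, n in counts.items() if n > 1)


records = [{"id": "qa_0"}, {"id": "qa_1"}, {"id": "qa_0"}]
print(find_duplicate_ids(records))  # ['qa_0']
```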

Data Quality Guidelines

Best Practices

Content quality
  • Consistent tone: Maintain consistent assistant personality
  • Factual accuracy: Verify all information is correct
  • Natural flow: Ensure conversations feel natural, not scripted
  • Diverse examples: Cover various phrasings and edge cases

Length
  • Typical range: 50-500 tokens per response
  • Max length: Configure with --model_max_length (default: 8192)
  • Avoid truncation: Ensure important conversations fit within max length
  • Balance: Mix short and long responses

Dataset size
  • Minimum: 100-500 examples for domain adaptation
  • Recommended: 1,000-10,000 examples for robust fine-tuning
  • Large-scale: 10,000+ examples for significant behavior changes
  • Quality > Quantity: 100 high-quality examples beat 1,000 poor ones

Diversity
  • Topic coverage: Include all relevant domains
  • Query variations: Multiple ways to ask the same thing
  • Response styles: Different levels of detail when appropriate
  • Edge cases: Include challenging or unusual scenarios
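
The length guidelines can be checked with a small report over the dataset. The sketch below is an approximation under stated assumptions: length_report and its count_tokens hook are hypothetical, the whitespace fallback is only a rough proxy for real token counts, and the total ignores the tokens added by the chat template.

```python
def _whitespace_count(text):
    """Rough proxy for a token count: number of whitespace-separated words."""
    return len(text.split())


def length_report(data, count_tokens=None, max_length=8192):
    """Summarize assistant response lengths and flag conversations that may be truncated.

    count_tokens is a hypothetical hook: pass a function mapping text to a
    token count (for example, one backed by your tokenizer).
    """
    if count_tokens is None:
        count_tokens = _whitespace_count

    response_lengths = []
    too_long = []
    for item in data:
        total = 0
        for turn in item["conversations"]:
            n = count_tokens(turn["value"])
            total += n
            if turn["from"] == "assistant":
                response_lengths.append(n)
        if total > max_length:
            too_long.append(item["id"])

    mean = sum(response_lengths) / len(response_lengths) if response_lengths else 0.0
    return {
        "responses": len(response_lengths),
        "min": min(response_lengths, default=0),
        "max": max(response_lengths, default=0),
        "mean": mean,
        "too_long_ids": too_long,
    }
```

Conversations listed under too_long_ids are candidates for splitting or shortening before training.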

Special Use Cases

Code Generation

[
  {
    "id": "code_example_1",
    "conversations": [
      {
        "from": "user",
        "value": "Write a Python function to calculate fibonacci numbers"
      },
      {
        "from": "assistant",
        "value": "Here's a Python function to calculate Fibonacci numbers:\n\n```python\ndef fibonacci(n):\n    if n <= 1:\n        return n\n    return fibonacci(n-1) + fibonacci(n-2)\n```\n\nFor better performance, here's an iterative version:\n\n```python\ndef fibonacci_iterative(n):\n    if n <= 1:\n        return n\n    a, b = 0, 1\n    for _ in range(2, n + 1):\n        a, b = b, a + b\n    return b\n```"
      }
    ]
  }
]

Multilingual Data

[
  {
    "id": "multilingual_1",
    "conversations": [
      {
        "from": "user",
        "value": "你好,请介绍一下自己"
      },
      {
        "from": "assistant",
        "value": "你好!我是通义千问,是阿里云开发的大型语言模型。我可以回答问题、进行对话、协助创作等。"
      },
      {
        "from": "user",
        "value": "Can you also speak English?"
      },
      {
        "from": "assistant",
        "value": "Yes, I can communicate in English as well. I'm a multilingual AI assistant developed by Alibaba Cloud, capable of understanding and responding in multiple languages including Chinese and English."
      }
    ]
  }
]
Qwen models are pretrained on multilingual data with a focus on Chinese and English, so a single conversation can switch between languages, as in the example above.
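
When writing multilingual data to disk, passing ensure_ascii=False to the json module keeps non-ASCII characters human-readable in the file. The short comparison below shows that both forms load back to the same data, so the choice only affects how the file reads in an editor.

```python
import json

turn = {"from": "user", "value": "你好,请介绍一下自己"}

escaped = json.dumps(turn)                       # non-ASCII written as \uXXXX escapes
readable = json.dumps(turn, ensure_ascii=False)  # characters written as-is

print(readable)
print(json.loads(escaped) == json.loads(readable))  # True
```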

Data Preprocessing Options

Lazy Preprocessing

The training script supports two preprocessing modes:
# Preprocesses data on-the-fly during training
python finetune.py \
  --data_path data.json \
  --lazy_preprocess True
Lazy preprocessing (LazySupervisedDataset):
  • Lower initial memory usage
  • Faster training startup
  • Recommended for large datasets
  • Caches processed samples
Eager preprocessing (SupervisedDataset):
  • All data processed upfront
  • Slightly faster iteration
  • Better for smaller datasets
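
The difference between the two modes can be sketched with two small classes. These are simplified stand-ins, not the LazySupervisedDataset and SupervisedDataset classes from the training script; the class names and the process callback are hypothetical.

```python
class EagerDataset:
    """Processes every sample in the constructor."""

    def __init__(self, raw_samples, process):
        self.samples = [process(s) for s in raw_samples]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, i):
        return self.samples[i]


class LazyDataset:
    """Processes a sample on first access and caches the result."""

    def __init__(self, raw_samples, process):
        self.raw_samples = raw_samples
        self.process = process
        self.cache = {}

    def __len__(self):
        return len(self.raw_samples)

    def __getitem__(self, i):
        if i not in self.cache:
            self.cache[i] = self.process(self.raw_samples[i])
        return self.cache[i]


calls = []


def process(sample):
    calls.append(sample)  # record each time real work happens
    return sample.upper()


lazy = LazyDataset(["a", "b", "c"], process)
print(len(calls))  # 0: nothing processed yet
lazy[1]
lazy[1]
print(len(calls))  # 1: processed once, then served from the cache
```

The eager variant pays the full processing cost before the first training step, which is why startup is slower on large datasets.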

Validation Data

You can optionally provide evaluation data to monitor training progress:
python finetune.py \
  --data_path train.json \
  --eval_data_path eval.json \
  --evaluation_strategy "steps" \
  --eval_steps 500
Your eval.json should follow the same format as training data.
Evaluation during training increases overall training time. For production fine-tuning, consider evaluating only at the end or at sparse intervals.
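
If you start from a single file, one way to obtain train.json and eval.json is a seeded random split. This is a minimal sketch; split_train_eval, the 10% default, and the seed are assumptions, not part of the training script.

```python
import random


def split_train_eval(data, eval_fraction=0.1, seed=42):
    """Shuffle a copy of data and return (train, eval) lists."""
    if not 0.0 < eval_fraction < 1.0:
        raise ValueError("eval_fraction must be between 0 and 1")
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


data = [{"id": f"qa_{i}", "conversations": []} for i in range(20)]
train, eval_set = split_train_eval(data, eval_fraction=0.1)
print(len(train), len(eval_set))  # 18 2
```

Each returned list can then be written with json.dump in the same way as the training data.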

Complete Data Preparation Script

Here’s a script for converting several common input formats to the Qwen training format:
import argparse
import json

def convert_to_qwen_format(input_file, output_file, format_type="qa"):
    """Convert various formats to Qwen training format.
    
    Args:
        input_file: Path to input data
        output_file: Path to save formatted data
        format_type: Type of input format (qa, chat, instruction)
    """
    with open(input_file, 'r', encoding='utf-8') as f:
        raw_data = json.load(f)
    
    formatted_data = []
    
    if format_type == "qa":
        # Question-answer pairs
        for idx, item in enumerate(raw_data):
            formatted_data.append({
                "id": f"qa_{idx}",
                "conversations": [
                    {"from": "user", "value": item["question"]},
                    {"from": "assistant", "value": item["answer"]}
                ]
            })
    
    elif format_type == "instruction":
        # Instruction-following format
        for idx, item in enumerate(raw_data):
            user_msg = item.get("instruction", "")
            if "input" in item and item["input"]:
                user_msg += f"\n\n{item['input']}"
            
            formatted_data.append({
                "id": f"inst_{idx}",
                "conversations": [
                    {"from": "user", "value": user_msg},
                    {"from": "assistant", "value": item["output"]}
                ]
            })
    
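    elif format_type == "chat":
        # Already in conversation format, just validate
        formatted_data = raw_data
    
    else:
        raise ValueError(f"Unknown format_type: {format_type!r}")
    
    # Validate and save (explicit checks, since assert is skipped under python -O)
    for item in formatted_data:
        if "id" not in item or "conversations" not in item:
            raise ValueError(f"Sample is missing 'id' or 'conversations': {item!r}")
        for conv in item["conversations"]:
            if conv.get("from") not in ("user", "assistant"):
                raise ValueError(f"Invalid role in sample {item['id']!r}")
            if "value" not in conv:
                raise ValueError(f"Missing value field in sample {item['id']!r}")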
    
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(formatted_data, f, ensure_ascii=False, indent=2)
    
    print(f"✓ Converted {len(formatted_data)} conversations to {output_file}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True, help="Input data file")
    parser.add_argument("--output", required=True, help="Output data file")
    parser.add_argument("--format", choices=["qa", "chat", "instruction"], 
                       default="qa", help="Input format type")
    args = parser.parse_args()
    
    convert_to_qwen_format(args.input, args.output, args.format)
Usage:
python convert_data.py \
  --input raw_qa.json \
  --output training_data.json \
  --format qa
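
For the instruction format, the script reads the keys instruction, input (optional), and output from each record. The sketch below applies the same logic to one record so the expected input shape is visible; instruction_to_conversation is a hypothetical standalone helper and the sample record is made up.

```python
def instruction_to_conversation(item, idx):
    """Mirror of the "instruction" branch above, applied to a single record."""
    user_msg = item.get("instruction", "")
    if item.get("input"):
        # The optional input is appended to the instruction after a blank line.
        user_msg += f"\n\n{item['input']}"
    return {
        "id": f"inst_{idx}",
        "conversations": [
            {"from": "user", "value": user_msg},
            {"from": "assistant", "value": item["output"]},
        ],
    }


record = {
    "instruction": "Translate to French.",
    "input": "Good morning",
    "output": "Bonjour",
}
converted = instruction_to_conversation(record, 0)
print(converted["conversations"][0]["value"])
```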

Next Steps

Full-Parameter Tuning

Start training with full-parameter fine-tuning

LoRA Fine-tuning

Efficient training with LoRA adapters
