Learn how to prepare and format training data for fine-tuning Qwen models
Proper data preparation is critical for successful fine-tuning. This guide covers the data format, preprocessing, and best practices for creating high-quality training datasets.
Qwen uses a standardized JSON conversation format that follows the ChatML structure. Each training sample consists of a conversation with multiple turns between user and assistant.
You can include multiple user-assistant exchanges in a single conversation:
[ { "id": "multi_turn_example", "conversations": [ { "from": "user", "value": "What is the capital of France?" }, { "from": "assistant", "value": "The capital of France is Paris." }, { "from": "user", "value": "What is its population?" }, { "from": "assistant", "value": "Paris has a population of approximately 2.2 million people within the city limits, and over 12 million in the metropolitan area." } ] }]
Multi-turn conversations help the model learn context tracking and coherent dialogue flow.
The training script automatically converts your JSON data into ChatML format during preprocessing. Understanding this helps you prepare better training data.
Your JSON conversations are transformed into this structure:
<|im_start|>systemYou are a helpful assistant.<|im_end|><|im_start|>userUser message here<|im_end|><|im_start|>assistantAssistant response here<|im_end|>
Identify what you want the model to learn. Examples:
Customer service responses
Code generation in a specific framework
Domain-specific question answering
Style transfer or tone adaptation
2
Collect Raw Data
Gather examples of high-quality interactions:
# Example: Converting Q&A pairs to conversation formatqa_pairs = [ {"question": "How do I install Python?", "answer": "You can download..."}, {"question": "What is a virtual environment?", "answer": "A virtual environment..."}]
# Validation scriptwith open("training_data.json", "r") as f: data = json.load(f)for item in data: assert "id" in item, "Missing id field" assert "conversations" in item, "Missing conversations field" for conv in item["conversations"]: assert conv["from"] in ["user", "assistant"], "Invalid role" assert "value" in conv, "Missing value field"print(f"✓ Validated {len(data)} conversations")
[ { "id": "code_example_1", "conversations": [ { "from": "user", "value": "Write a Python function to calculate fibonacci numbers" }, { "from": "assistant", "value": "Here's a Python function to calculate Fibonacci numbers:\n\n```python\ndef fibonacci(n):\n if n <= 1:\n return n\n return fibonacci(n-1) + fibonacci(n-2)\n```\n\nFor better performance, here's an iterative version:\n\n```python\ndef fibonacci_iterative(n):\n if n <= 1:\n return n\n a, b = 0, 1\n for _ in range(2, n + 1):\n a, b = b, a + b\n return b\n```" } ] }]
[ { "id": "multilingual_1", "conversations": [ { "from": "user", "value": "你好,请介绍一下自己" }, { "from": "assistant", "value": "你好!我是通义千问,是阿里云开发的大型语言模型。我可以回答问题、进行对话、协助创作等。" }, { "from": "user", "value": "Can you also speak English?" }, { "from": "assistant", "value": "Yes, I can communicate in English as well. I'm a multilingual AI assistant developed by Alibaba Cloud, capable of understanding and responding in multiple languages including Chinese and English." } ] }]
Qwen models are pretrained on multilingual data with focus on Chinese and English. They can handle code-switching naturally.
Here’s a production-ready script for converting various formats to Qwen training format:
import jsonimport argparsefrom pathlib import Pathdef convert_to_qwen_format(input_file, output_file, format_type="qa"): """Convert various formats to Qwen training format. Args: input_file: Path to input data output_file: Path to save formatted data format_type: Type of input format (qa, chat, instruction) """ with open(input_file, 'r', encoding='utf-8') as f: raw_data = json.load(f) formatted_data = [] if format_type == "qa": # Question-answer pairs for idx, item in enumerate(raw_data): formatted_data.append({ "id": f"qa_{idx}", "conversations": [ {"from": "user", "value": item["question"]}, {"from": "assistant", "value": item["answer"]} ] }) elif format_type == "instruction": # Instruction-following format for idx, item in enumerate(raw_data): user_msg = item.get("instruction", "") if "input" in item and item["input"]: user_msg += f"\n\n{item['input']}" formatted_data.append({ "id": f"inst_{idx}", "conversations": [ {"from": "user", "value": user_msg}, {"from": "assistant", "value": item["output"]} ] }) elif format_type == "chat": # Already in conversation format, just validate formatted_data = raw_data # Validate and save for item in formatted_data: assert "id" in item and "conversations" in item for conv in item["conversations"]: assert conv["from"] in ["user", "assistant"] assert "value" in conv with open(output_file, 'w', encoding='utf-8') as f: json.dump(formatted_data, f, ensure_ascii=False, indent=2) print(f"✓ Converted {len(formatted_data)} conversations to {output_file}")if __name__ == "__main__": parser = argparse.ArgumentParser() parser.add_argument("--input", required=True, help="Input data file") parser.add_argument("--output", required=True, help="Output data file") parser.add_argument("--format", choices=["qa", "chat", "instruction"], default="qa", help="Input format type") args = parser.parse_args() convert_to_qwen_format(args.input, args.output, args.format)