Overview
By default, nanochat is trained on general internet text and doesn't know anything about itself. You can teach it a custom identity and personality by generating synthetic conversation data and mixing it into the supervised fine-tuning (SFT) stage. This guide is based on the "Guide: infusing identity to your nanochat" discussion.
The Identity Pipeline
Step 1: Create Knowledge Base
Create a markdown file with comprehensive facts about your bot's identity:
- Be comprehensive - include everything the bot should know
- Be accurate - only include facts you want the bot to believe
- Include limitations - teach honest self-awareness
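As a concrete sketch, such a file might look like the following (the filename, bot name, and every fact are placeholders to replace with your own):

```markdown
# Identity: MyBot

## Basic facts
- Name: MyBot
- Creator: Jane Doe
- License: MIT

## Capabilities
- Answers general questions in plain English
- Can explain its own architecture at a high level

## Limitations
- Cannot browse the internet
- Does not remember past conversations
```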
Step 2: Generate Synthetic Conversations
nanochat includes a synthetic data generator in dev/gen_synthetic_data.py that uses an external LLM to create training conversations.
Setup
Get an API key from OpenRouter and set it as an environment variable.
Generate Conversations
- --num: Number of conversations to generate (default: 1000)
- --workers: Parallel API requests (default: 4)
- --output: Output file path (default: identity_conversations.jsonl)
- --append: Append to existing file instead of overwriting
- --save-metadata: Include generation metadata for debugging
dev/gen_synthetic_data.py:403-412.
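Assuming the script reads the key from an environment variable called OPENROUTER_API_KEY (the exact name is an assumption; check the script's setup code), setup and a full generation run might look like:

```shell
# Variable name is an assumption; verify against dev/gen_synthetic_data.py
export OPENROUTER_API_KEY="sk-or-..."

# Generate 1000 conversations with 4 parallel workers
python dev/gen_synthetic_data.py \
    --num 1000 \
    --workers 4 \
    --output identity_conversations.jsonl
```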
How It Works
The script ensures diversity through multiple dimensions:
1. Topic Categories
dev/gen_synthetic_data.py:54-131.
2. User Personas
dev/gen_synthetic_data.py:134-147.
3. Conversation Dynamics
dev/gen_synthetic_data.py:150-161.
4. First Message Variety
dev/gen_synthetic_data.py:164-213.
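The sampling logic can be pictured as drawing one element from each dimension per conversation; a schematic sketch (the category names below are illustrative stand-ins, not the script's actual lists):

```python
import random

# Illustrative stand-ins for the script's diversity dimensions
TOPICS = ["identity basics", "architecture", "training cost", "limitations"]
PERSONAS = ["curious beginner", "skeptical engineer", "student"]
DYNAMICS = ["single question", "follow-up probing", "casual chat"]
OPENERS = ["direct question", "greeting first", "typo-laden message"]

def sample_spec(rng: random.Random) -> dict:
    """Pick one value per dimension to seed a synthetic conversation prompt."""
    return {
        "topic": rng.choice(TOPICS),
        "persona": rng.choice(PERSONAS),
        "dynamic": rng.choice(DYNAMICS),
        "opener": rng.choice(OPENERS),
    }

rng = random.Random(42)
specs = [sample_spec(rng) for _ in range(1000)]
# With 4*3*3*3 = 108 combinations, 1000 draws cover most of the space
print(len({tuple(s.values()) for s in specs}))
```

Randomizing all four dimensions per request is what keeps the generated conversations from collapsing into near-duplicates.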
Output Format
Generated conversations are saved in JSONL format (one JSON object per line).
Style Guidelines
The generator follows these principles (from dev/gen_synthetic_data.py:238-246):
- Plain ASCII only - No emojis or special characters
- Natural conversation - Real chat, not formal Q&A
- Accurate facts - Only from knowledge base
- Appropriate depth - Match user persona
- Honest about limitations - Clear about what it can’t do
- Personality - Helpful, clear, slightly enthusiastic
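To make the output format concrete, the snippet below builds one conversation record and serializes it as a single JSONL line; the field names follow the common messages/role/content convention and are an assumption, so check the generator's output against your loader:

```python
import json

# One training conversation (field names assume the messages/role/content convention)
record = {
    "messages": [
        {"role": "user", "content": "Who are you?"},
        {"role": "assistant", "content": "I'm MyBot, a small open-source chat model."},
    ]
}

# JSONL: one JSON object per line, appended to the output file.
# ensure_ascii=True matches the plain-ASCII style guideline above.
line = json.dumps(record, ensure_ascii=True)
print(line)
```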
Step 3: Train with Custom Data
Mix your synthetic data into the SFT training using the CustomJSON task.
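The repo's actual wiring lives in tasks/customjson.py and scripts/chat_sft.py; as a standalone illustration of the mixing idea, the sketch below interleaves identity conversations into a general SFT set at a target fraction (all function and variable names here are illustrative, not nanochat's API):

```python
import random

def mix_datasets(general, identity, identity_frac=0.2, seed=0):
    """Interleave two lists of conversations so that roughly
    identity_frac of the mixed stream is identity data."""
    rng = random.Random(seed)
    # How many identity examples to add so they form identity_frac of the total
    n_identity = int(len(general) * identity_frac / (1 - identity_frac))
    # Oversample with replacement if there are fewer identity examples than needed
    picks = [rng.choice(identity) for _ in range(n_identity)]
    mixed = general + picks
    rng.shuffle(mixed)
    return mixed

# Toy data: 80 general conversations, 5 identity conversations
general = [{"messages": [{"role": "user", "content": f"q{i}"}]} for i in range(80)]
identity = [{"messages": [{"role": "user", "content": "Who are you?"}]} for _ in range(5)]

mixed = mix_datasets(general, identity, identity_frac=0.2)
print(len(mixed))  # 100 examples, 20 of them identity
```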
Mixing Ratios
Balance general capability vs. identity data when constructing the training mix.
Step 4: Verify Identity
Chat with your model to verify it learned the identity:
- Basic identity: "Who are you?" / "What's your name?"
- Creator: “Who made you?” / “Who’s your creator?”
- Capabilities: “What can you help me with?”
- Limitations: “Can you browse the internet?” / “Do you remember our past chats?”
- Technical: “How many parameters do you have?” / “What GPU were you trained on?”
- Personality: Does it sound like your intended personality?
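One way to run these checks is an interactive chat with the fine-tuned model; nanochat provides chat entry points under scripts/ (the exact script name below is an assumption and may differ by version):

```shell
# Script name is an assumption; check the scripts/ directory of your checkout
python -m scripts.chat_cli
```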
If answers are wrong or off-persona, iterate:
- Update knowledge base with corrections
- Generate new synthetic data
- Retrain SFT with updated mix
Advanced Techniques
Multi-Language Identity
Add multilingual greetings to support non-English users.
Domain-Specific Identity
Customize for specific domains.
Personality Tuning
Adjust personality through system prompts and examples. To change user styles, customize the persona list at dev/gen_synthetic_data.py:134-147.
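For example, you might extend the persona list with domain-flavored users; a schematic sketch (these entries are illustrative, not the script's actual contents):

```python
# Illustrative persona descriptions steering the generator's simulated users
PERSONAS = [
    "curious beginner who asks simple, direct questions",
    "skeptical ML engineer who probes for technical details",
    "non-native English speaker who writes short sentences",
    # Domain-flavored additions:
    "medical student asking about study help",
    "hobbyist game developer looking for coding tips",
]

for p in PERSONAS:
    print(p)
```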
Quality Control
Validate generated conversations:
dev/gen_synthetic_data.py:383-396.
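The script has its own checks at the cited lines; a standalone validator along the same lines might verify JSON structure, role alternation, and the ASCII-only style rule (field names assume the messages/role/content convention):

```python
import json

def validate_line(line: str) -> list[str]:
    """Return a list of problems found in one JSONL conversation line."""
    problems = []
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    messages = record.get("messages", [])
    if not messages:
        problems.append("no messages")
    for i, msg in enumerate(messages):
        # Conversations should alternate user/assistant, starting with user
        expected = "user" if i % 2 == 0 else "assistant"
        if msg.get("role") != expected:
            problems.append(f"message {i}: expected role {expected}")
        # Enforce the plain-ASCII style guideline
        if not msg.get("content", "").isascii():
            problems.append(f"message {i}: non-ASCII content")
    return problems

good = '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}]}'
bad = '{"messages": [{"role": "assistant", "content": "Hey \\u2728"}]}'
print(validate_line(good))  # []
print(validate_line(bad))
```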
Cost Estimates
Generating 1000 conversations:
| Model | Cost per 1K conversations | Quality |
|---|---|---|
| GPT-4 Turbo | ~$5-10 | Excellent |
| GPT-3.5 Turbo | ~$0.50-1 | Good |
| Gemini Flash | ~$0.10-0.25 | Good (used in nanochat) |
| Claude Sonnet | ~$3-6 | Excellent |
nanochat uses google/gemini-3-flash-preview by default for cost-effectiveness; the default model is set in dev/gen_synthetic_data.py:301-306.
Troubleshooting
"API key not found"
Set your OpenRouter API key as an environment variable before running the script (see Setup above).
Repetitive Conversations
Increase diversity:
- Add more topics to the topic list
- Add more personas
- Increase temperature (default: 1.0, try 1.2)
- Generate multiple batches with different seeds
Bot Doesn’t Remember Identity
Increase the proportion of identity data in your SFT mix.
Bot Hallucinates Facts
Ensure your knowledge base is accurate and regenerate the synthetic data.
API Rate Limits
Reduce parallel workers (e.g. --workers 2).
Example: nanochat's Identity
nanochat itself uses this technique! From the README.md (lines 89-91):
> To customize your nanochat, see Guide: infusing identity to your nanochat in Discussions, which describes how you can tune your nanochat's personality through synthetic data generation and mixing that data into the SFT stage.
The example script in
dev/gen_synthetic_data.py shows how to teach nanochat about:
- Its name and creator (Andrej Karpathy)
- Architecture details (transformer, RoPE, Flash Attention)
- Training cost ($72 for GPT-2 capability)
- Capabilities and limitations
- Open source nature (MIT license)
Further Reading
- dev/gen_synthetic_data.py - Full synthetic data generation script
- tasks/customjson.py - Load custom JSONL conversations for training
- Guide: infusing identity to your nanochat - Original discussion
- OpenRouter - LLM API marketplace
- scripts/chat_sft.py - Supervised fine-tuning script