# LLM Training: Phi-3 and GenAI

Learn to fine-tune modern large language models using parameter-efficient techniques like LoRA. This guide focuses on Microsoft's Phi-3, a compact yet powerful LLM.
## Overview

The generative example demonstrates:

- Fine-tuning small LLMs (Phi-3)
- LoRA (Low-Rank Adaptation) for efficient training
- Instruction-following dataset preparation
- Supervised Fine-Tuning (SFT) with TRL
- Model deployment and inference

**Example model:** microsoft/Phi-3-mini-4k-instruct
**Task:** text-to-SQL generation with instruction following
**Technique:** LoRA with `SFTTrainer` from the TRL library
## Why Phi-3?

- **Compact size:** 3.8B parameters, runs on consumer GPUs
- **Strong performance:** competitive with much larger models
- **Instruction following:** pre-trained for chat and task completion
- **Open license:** MIT license, freely usable
## Quick Start

```bash
# Navigate to the generative example
cd module-3/generative-example

# Build container
make build

# Run with GPU
make run_dev_gpu

# Inside container:
export PYTHONPATH=.
export WANDB_PROJECT=ml-in-production-practice
export WANDB_API_KEY=your_key

# Train with LoRA
python generative_example/cli.py train ./conf/example.json
```
## Project Structure

```
generative-example/
├── generative_example/
│   ├── __init__.py
│   ├── cli.py        # Command-line interface
│   ├── config.py     # LoRA configuration
│   ├── data.py       # Dataset preparation
│   ├── train.py      # SFT training with LoRA
│   ├── predictor.py  # Inference wrapper
│   └── utils.py
├── conf/
│   └── example.json  # Training configuration
├── tests/
└── requirements.txt
```
## Configuration

LoRA-specific configuration:

```python
from dataclasses import dataclass

@dataclass
class ModelArguments:
    model_id: str        # HuggingFace model ID
    lora_r: int          # LoRA rank (16)
    lora_alpha: int      # LoRA scaling (16)
    lora_dropout: float  # Dropout rate (0.05)

@dataclass
class DataTrainingArguments:
    train_file: str  # JSONL training data
    test_file: str   # JSONL test data
```
### Training Configuration

```json
{
  "train_file": "./data/train.json",
  "test_file": "./data/test.json",
  "model_id": "microsoft/Phi-3-mini-4k-instruct",
  "lora_r": 16,
  "lora_alpha": 16,
  "lora_dropout": 0.05,
  "output_dir": "./phi-3-mini-lora-text2sql",
  "per_device_train_batch_size": 8,
  "per_device_eval_batch_size": 8,
  "gradient_accumulation_steps": 4,
  "learning_rate": 0.0001,
  "num_train_epochs": 3,
  "warmup_ratio": 0.1,
  "eval_strategy": "steps",
  "eval_steps": 100,
  "logging_steps": 100,
  "save_steps": 500,
  "bf16": true,
  "report_to": ["wandb"],
  "seed": 42
}
```
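This single JSON file feeds several argument dataclasses. A minimal sketch of how that split might work with only the standard library (the `from_dict` helper is hypothetical; the project more likely uses `transformers.HfArgumentParser`):

```python
import json
from dataclasses import dataclass, fields

@dataclass
class ModelArguments:
    model_id: str
    lora_r: int
    lora_alpha: int
    lora_dropout: float

def from_dict(cls, cfg: dict):
    """Keep only the keys this dataclass declares; ignore the rest."""
    names = {f.name for f in fields(cls)}
    return cls(**{k: v for k, v in cfg.items() if k in names})

cfg = json.loads(
    '{"model_id": "microsoft/Phi-3-mini-4k-instruct", '
    '"lora_r": 16, "lora_alpha": 16, "lora_dropout": 0.05, '
    '"learning_rate": 0.0001}'
)
args = from_dict(ModelArguments, cfg)
print(args.lora_r)  # 16
```

Each dataclass picks out its own fields, so training-only keys such as `learning_rate` are simply ignored here.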
## Dataset Preparation

Format data for instruction following:

```python
from functools import partial

from datasets import Dataset, DatasetDict
from transformers import AutoTokenizer

def create_message_column(row):
    """Convert data to chat format."""
    messages = []

    # User message with context and question
    user = {
        "content": f"{row['context']}\nInput: {row['question']}",
        "role": "user",
    }
    messages.append(user)

    # Assistant response
    assistant = {
        "content": f"{row['answer']}",
        "role": "assistant",
    }
    messages.append(assistant)

    return {"messages": messages}

def format_dataset_chatml(row, tokenizer):
    """Apply chat template."""
    return {
        "text": tokenizer.apply_chat_template(
            row["messages"],
            add_generation_prompt=False,
            tokenize=False,
        )
    }

def process_dataset(model_id: str, train_file: str, test_file: str):
    """Load and format datasets."""
    dataset = DatasetDict({
        "train": Dataset.from_json(train_file),
        "test": Dataset.from_json(test_file),
    })

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.padding_side = "right"

    # Convert to chat format
    dataset_chatml = dataset.map(create_message_column)
    dataset_chatml = dataset_chatml.map(
        partial(format_dataset_chatml, tokenizer=tokenizer)
    )
    return dataset_chatml
```
**Input data format** (JSONL, one record per line):

```json
{"context": "Database schema: users(id, name, email)", "question": "Find all user emails", "answer": "SELECT email FROM users"}
```
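Applied to the sample record above, `create_message_column` produces a two-turn conversation. The helper is restated here so the snippet is self-contained:

```python
def create_message_column(row):
    """Convert a raw record into chat-format messages."""
    user = {"content": f"{row['context']}\nInput: {row['question']}", "role": "user"}
    assistant = {"content": row["answer"], "role": "assistant"}
    return {"messages": [user, assistant]}

row = {
    "context": "Database schema: users(id, name, email)",
    "question": "Find all user emails",
    "answer": "SELECT email FROM users",
}
msgs = create_message_column(row)["messages"]
print(msgs[1]["content"])  # SELECT email FROM users
```

The tokenizer's chat template then serializes these messages into the model's expected prompt format.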
## Model Loading

Load Phi-3 with optimizations:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def get_model(model_id: str, device_map):
    """Load Phi-3 model with optimizations."""
    # Determine compute dtype and attention implementation
    if torch.cuda.is_bf16_supported():
        compute_dtype = torch.bfloat16
        attn_implementation = "flash_attention_2"
    else:
        compute_dtype = torch.float16
        attn_implementation = "sdpa"

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
        model_id,
        trust_remote_code=True,
        add_eos_token=True,
        use_fast=True,
    )
    tokenizer.pad_token = tokenizer.unk_token
    tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
    tokenizer.padding_side = "left"

    # Load model
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=compute_dtype,
        trust_remote_code=True,
        device_map=device_map,
        attn_implementation=attn_implementation,
    )
    return tokenizer, model
```
## LoRA Configuration

Configure parameter-efficient fine-tuning:

```python
from peft import LoraConfig, TaskType

def setup_lora(model_args):
    """Configure LoRA for Phi-3."""
    target_modules = [
        "k_proj",     # Key projection
        "q_proj",     # Query projection
        "v_proj",     # Value projection
        "o_proj",     # Output projection
        "gate_proj",  # Gate projection (MLP)
        "down_proj",  # Down projection (MLP)
        "up_proj",    # Up projection (MLP)
    ]
    peft_config = LoraConfig(
        r=model_args.lora_r,                   # Rank (16)
        lora_alpha=model_args.lora_alpha,      # Scaling (16)
        lora_dropout=model_args.lora_dropout,  # Dropout (0.05)
        task_type=TaskType.CAUSAL_LM,
        target_modules=target_modules,
    )
    return peft_config
```
LoRA reduces trainable parameters by ~99% while maintaining performance. For Phi-3 (3.8B params), LoRA trains only ~38M parameters.
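The parameter count can be sanity-checked with back-of-the-envelope arithmetic: for each adapted weight matrix of shape `(d_out, d_in)`, LoRA adds `r * (d_in + d_out)` trainable parameters. The layer shapes below are illustrative assumptions, not Phi-3's exact dimensions:

```python
def lora_trainable_params(shapes, r):
    """LoRA adds an r x d_in matrix A and a d_out x r matrix B per target."""
    return sum(r * (d_in + d_out) for (d_out, d_in) in shapes)

# Hypothetical attention projections for one layer: hidden size 3072, r=16
per_layer = lora_trainable_params([(3072, 3072)] * 4, r=16)
print(per_layer)  # 393216
```

Summing across all target modules and layers gives a total on the order of tens of millions, consistent with the ~38M figure above.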
## Training Loop

Use TRL's SFTTrainer:

```python
from pathlib import Path

from transformers import TrainingArguments, set_seed
from trl import SFTTrainer

def train(config_path: Path):
    """Train Phi-3 with LoRA."""
    # Load config
    model_args, data_args, training_args = get_config(config_path)

    # Set seed
    set_seed(training_args.seed)

    # Prepare dataset
    dataset_chatml = process_dataset(
        model_id=model_args.model_id,
        train_file=data_args.train_file,
        test_file=data_args.test_file,
    )

    # Load model
    device_map = {"": 0}  # Single GPU
    tokenizer, model = get_model(
        model_id=model_args.model_id,
        device_map=device_map,
    )

    # Configure LoRA
    peft_config = setup_lora(model_args)

    # Create SFT trainer
    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset_chatml["train"],
        eval_dataset=dataset_chatml["test"],
        peft_config=peft_config,
        dataset_text_field="text",
        max_seq_length=512,
        tokenizer=tokenizer,
        args=training_args,
    )

    # Train and save
    trainer.train()
    trainer.save_model()
    trainer.create_model_card()
```
## Inference

Run inference with the fine-tuned model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_model(model_path: str):
    """Load fine-tuned model."""
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        device_map="auto",
        torch_dtype=torch.bfloat16,
    )
    return tokenizer, model

def generate(prompt: str, tokenizer, model, max_length=256):
    """Generate a completion."""
    messages = [{"role": "user", "content": prompt}]

    # Format with chat template
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

    # Tokenize
    inputs = tokenizer(text, return_tensors="pt").to(model.device)

    # Generate
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_length,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )

    # Decode only the newly generated tokens
    response = tokenizer.decode(
        outputs[0][inputs.input_ids.shape[1]:],
        skip_special_tokens=True,
    )
    return response
```
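The slice `outputs[0][inputs.input_ids.shape[1]:]` is needed because `generate` returns the prompt tokens followed by the new tokens; dropping the first `prompt_len` IDs leaves only the completion. Illustrated with made-up token IDs:

```python
prompt_ids = [1, 42, 7, 9]            # tokenized prompt (made-up IDs)
output_ids = prompt_ids + [15, 3, 2]  # generate() echoes the prompt first
completion = output_ids[len(prompt_ids):]
print(completion)  # [15, 3, 2]
```

Without this slice, the decoded response would repeat the user's prompt verbatim before the answer.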
## LLM API Examples

The module includes API examples for different LLM backends.

### generative-api/pipeline_phi3.py

```python
import torch
from transformers import pipeline

# Load model
pipe = pipeline(
    "text-generation",
    model="microsoft/Phi-3-mini-4k-instruct",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Generate
messages = [
    {"role": "user", "content": "Write a SQL query"}
]
outputs = pipe(
    messages,
    max_new_tokens=256,
    temperature=0.7,
)
print(outputs[0]["generated_text"])
```
### generative-api/pipeline_api.py

```python
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "user", "content": "Write a SQL query"}
    ],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)
```
Run the API examples:

```bash
# Local Phi-3
python generative-api/pipeline_phi3.py ./data/test.json

# OpenAI API
export OPENAI_API_KEY=your_key
python generative-api/pipeline_api.py ./data/test.json
```
## LLM Testing

Test LLM outputs with specialized tools:

- **DeepEval:** LLM evaluation framework with metrics for hallucination, bias, and toxicity
- **Promptfoo:** test and evaluate LLM prompts and outputs
- **Ragas:** RAG evaluation framework
- **UpTrain:** evaluate and improve LLM applications
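Before adopting a full evaluation framework, a normalized exact-match check is often enough to smoke-test text-to-SQL outputs. The helper below is illustrative only; real SQL equivalence needs execution-based evaluation, since semantically identical queries can differ textually:

```python
def sql_exact_match(pred: str, ref: str) -> bool:
    """Compare queries after lowercasing, collapsing whitespace,
    and stripping a trailing semicolon (illustrative heuristic)."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split()).rstrip(";")
    return norm(pred) == norm(ref)

print(sql_exact_match("SELECT email FROM users;", "select  email from users"))  # True
```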
## Distributed Training

Scale to multiple GPUs:

### PyTorch DDP

```bash
# Multi-GPU training
torchrun --nproc_per_node=4 \
    generative_example/cli.py train ./conf/example.json
```

### Accelerate

```bash
# Configure accelerate
accelerate config

# Launch training
accelerate launch \
    generative_example/cli.py train ./conf/example.json
```

### DeepSpeed

```json
{
  "train_batch_size": 32,
  "gradient_accumulation_steps": 4,
  "fp16": {"enabled": true},
  "zero_optimization": {
    "stage": 2
  }
}
```

```bash
deepspeed --num_gpus=4 \
    generative_example/cli.py train ./conf/example.json
```
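DeepSpeed enforces the invariant `train_batch_size == micro_batch_per_gpu * gradient_accumulation_steps * world_size`. With the config above on 4 GPUs, the implied per-GPU micro-batch can be checked with simple arithmetic:

```python
# DeepSpeed invariant: train_batch_size == micro_batch * grad_accum * world_size
train_batch_size, grad_accum, world_size = 32, 4, 4
micro_batch = train_batch_size // (grad_accum * world_size)
assert micro_batch * grad_accum * world_size == train_batch_size
print(micro_batch)  # 2
```

If these numbers do not multiply out exactly, DeepSpeed rejects the configuration at startup, so it is worth verifying before launching a long run.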
## Popular LLM Models

- **Phi-3:** 3.8B params, MIT license, up to 128k context
- **Llama 3:** 8B params, strong performance
- **Mistral 7B:** 7B params, Apache 2.0 license
- **Gemma 2:** 9B params, commercial use allowed

Browse the Open LLM Leaderboard for more models.
## Resources

- **Phi-3 Cookbook:** official recipes and examples
- **PEFT Documentation:** parameter-efficient fine-tuning methods
- **TRL Library:** Transformer Reinforcement Learning
- **Distributed Training Guide:** multi-GPU training strategies

## Next Steps

- **Practice Exercises:** complete hands-on homework assignments