Chat Completion

Overview

Chat completion uses fine-tuned Llama 2 models trained for dialogue applications. These models follow a specific formatting with [INST] and <<SYS>> tags to handle multi-turn conversations with system prompts, user messages, and assistant responses.

Basic Usage

Build a Llama instance and call chat_completion() with your dialogs:

from llama import Llama, Dialog
from typing import List

generator = Llama.build(
    ckpt_dir="llama-2-7b-chat/",
    tokenizer_path="tokenizer.model",
    max_seq_len=512,
    max_batch_size=8,
)

dialogs: List[Dialog] = [
    [{"role": "user", "content": "what is the recipe of mayonnaise?"}],
]

results = generator.chat_completion(
    dialogs,
    max_gen_len=None,
    temperature=0.6,
    top_p=0.9,
)

for dialog, result in zip(dialogs, results):
    for msg in dialog:
        print(f"{msg['role'].capitalize()}: {msg['content']}")
    print(f"> {result['generation']['role'].capitalize()}: {result['generation']['content']}")

Parameters

Build Parameters

ckpt_dir

str

required

Path to the directory containing checkpoint files for the chat model. Use a -chat model variant (e.g., llama-2-7b-chat/).

tokenizer_path

str

required

Path to the tokenizer model used for text encoding/decoding.

max_seq_len

int

required

Maximum sequence length for input prompts. Defaults to 512 for chat completion. All models support up to 4096 tokens.

max_batch_size

int

required

Maximum batch size for generating sequences. Defaults to 8.

Generation Parameters

dialogs

List[Dialog]

required

List of conversational dialogs, where each dialog is a list of messages. Each message is a dictionary with role and content keys.

temperature

float

default:"0.6"

Temperature value for controlling randomness in generation. Higher values increase creativity, lower values make responses more focused.

top_p

float

default:"0.9"

Top-p probability threshold for nucleus sampling. Controls response diversity.

max_gen_len

int

default:"None"

Maximum length of generated response. If None, it will be set to the model’s max sequence length.

logprobs

bool

default:"false"

Whether to compute and return token log probabilities.

Dialog Format

Each dialog is a list of messages with roles:

system: Optional system prompt that guides the assistant’s behavior
user: User messages
assistant: Assistant responses (for multi-turn conversations)

The last message in a dialog must always be from the user role.

Simple Query

dialogs = [
    [{"role": "user", "content": "what is the recipe of mayonnaise?"}],
]

Multi-Turn Conversation

dialogs = [
    [
        {"role": "user", "content": "I am going to Paris, what should I see?"},
        {
            "role": "assistant",
            "content": """Paris, the capital of France, is known for its stunning architecture, art museums, historical landmarks, and romantic atmosphere. Here are some of the top attractions to see in Paris:

1. The Eiffel Tower: The iconic Eiffel Tower is one of the most recognizable landmarks in the world and offers breathtaking views of the city.
2. The Louvre Museum: The Louvre is one of the world's largest and most famous museums, housing an impressive collection of art and artifacts, including the Mona Lisa.
3. Notre-Dame Cathedral: This beautiful cathedral is one of the most famous landmarks in Paris and is known for its Gothic architecture and stunning stained glass windows.

These are just a few of the many attractions that Paris has to offer. With so much to see and do, it's no wonder that Paris is one of the most popular tourist destinations in the world.""",
        },
        {"role": "user", "content": "What is so great about #1?"},
    ],
]

System Prompt

dialogs = [
    [
        {"role": "system", "content": "Always answer with Haiku"},
        {"role": "user", "content": "I am going to Paris, what should I see?"},
    ],
]

Custom System Instructions

dialogs = [
    [
        {
            "role": "system",
            "content": """You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.""",
        },
        {"role": "user", "content": "Write a brief birthday message to John"},
    ],
]

Creative System Prompts

dialogs = [
    [
        {"role": "system", "content": "Always answer with emojis"},
        {"role": "user", "content": "How to go from Beijing to NY?"},
    ],
]

Response Format

The chat_completion() method returns a list of ChatPrediction dictionaries:

[
    {
        "generation": {
            "role": "assistant",
            "content": str  # The generated response
        },
        "tokens": List[str],  # Optional: decoded tokens (if logprobs=True)
        "logprobs": List[float]  # Optional: log probabilities (if logprobs=True)
    }
]

Safety

The chat completion method includes safety checks:

Special tags ([INST], [/INST], <<SYS>>, <</SYS>>) are not allowed in user content
Unsafe requests return: "Error: special tags are not allowed as part of the prompt."

For production use, consider deploying additional classifiers for filtering unsafe inputs and outputs.

Running from Command Line

Run the example script with a chat model:

torchrun --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir llama-2-7b-chat/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 512 --max_batch_size 6

See Model Parallel Configuration for the correct nproc_per_node value for your model size.

Pretrained vs Chat Models

Pretrained models: Use text_completion() for natural text continuation
Chat models (e.g., llama-2-7b-chat): Use chat_completion() for dialogue applications with proper formatting

Chat models require specific formatting with [INST] tags and follow a strict role alternation pattern (user/assistant/user/…). The chat_completion() method handles this formatting automatically.

Get Started

Model Usage

Core Concepts

Model Variants

Chat Completion

Overview

Basic Usage

Parameters

Build Parameters

Generation Parameters

Dialog Format

Simple Query

Multi-Turn Conversation

System Prompt

Custom System Instructions

Creative System Prompts

Response Format

Safety

Running from Command Line

Pretrained vs Chat Models

Build docs developers (and LLMs) love

Get Started

Model Usage

Core Concepts

Model Variants

​Overview

​Basic Usage

​Parameters

​Build Parameters

​Generation Parameters

​Dialog Format

​Simple Query

​Multi-Turn Conversation

​System Prompt

​Custom System Instructions

​Creative System Prompts

​Response Format

​Safety

​Running from Command Line

​Pretrained vs Chat Models

Build docs developers (and LLMs) love

Overview

Basic Usage

Parameters

Build Parameters

Generation Parameters

Dialog Format

Simple Query

Multi-Turn Conversation

System Prompt

Custom System Instructions

Creative System Prompts

Response Format

Safety

Running from Command Line

Pretrained vs Chat Models