Llama 2 models range from 7B to 70B parameters and require sufficient GPU memory. Make sure you have the appropriate hardware for your chosen model size.
Prerequisites
Before you begin, ensure you have:
- A conda environment with PyTorch and CUDA installed
- wget and md5sum utilities (for model downloads)
- An approved download request from Meta (see "Request Model Access" below)
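A quick sanity check for these prerequisites can be run from a shell. This is a sketch; the `python -c` line assumes your conda environment is already activated:

```shell
# Report any missing download utilities
for tool in wget md5sum; do
  command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
done

# Confirm PyTorch sees a CUDA device (requires the conda env to be active)
python -c "import torch; print('CUDA available:', torch.cuda.is_available())" 2>/dev/null

echo "prerequisite check complete"
```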
Quick Start Steps
Request Model Access
Visit the Meta website and register to download the Llama 2 models. Once approved, you'll receive an email with a signed download URL. Keep this URL handy for the download step.
Clone and Install
Clone the Llama 2 repository and install the package. This installs the following dependencies:
- torch - PyTorch framework
- fairscale - Model parallelism utilities
- fire - CLI argument parsing
- sentencepiece - Tokenization
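Concretely, the clone-and-install step might look like the following sketch. The repository URL is an assumption based on Meta's public Llama 2 repo; adjust it if the project has moved:

```shell
# Clone the Llama 2 repository (URL assumed)
git clone https://github.com/meta-llama/llama.git
cd llama

# Install the package and its dependencies in editable mode
pip install -e .
```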
Download Models
Run the download script to fetch the model weights and tokenizer. When prompted:
- Enter the URL from your email (copy it manually; don't use "Copy Link")
- Select the models to download from 7B, 13B, 70B, 7B-chat, 13B-chat, 70B-chat:
  - Press Enter to download all models
  - Or specify a comma-separated list, e.g. 7B,7B-chat

For each selected model, the script downloads:
- Model weights (consolidated.*.pth files)
- Tokenizer (tokenizer.model)
- Configuration (params.json)
- License and usage policy files
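Putting the download step together, a minimal invocation might look like this (a sketch; the script name `download.sh` in the repository root is an assumption):

```shell
# Make the download script executable and run it from the repo root
chmod +x download.sh
./download.sh
# When prompted, paste the signed URL from your email,
# then choose models, e.g.: 7B,7B-chat
```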
Run Your First Chat Completion
Execute the chat completion example with the 7B chat model:
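As an illustration, assuming the 7B chat weights were downloaded to `llama-2-7b-chat/` next to `tokenizer.model` in the repository root, the command might look like:

```shell
# Chat example on a single GPU (7B models use model parallel value 1)
torchrun --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir llama-2-7b-chat/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 512 --max_batch_size 6
```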
Understanding the Parameters
- --nproc_per_node: Number of GPUs (the model parallel value)
  - 7B models: 1
  - 13B models: 2
  - 70B models: 8
- --ckpt_dir: Path to your downloaded model directory
- --tokenizer_path: Path to the tokenizer model
- --max_seq_len: Maximum sequence length (models support up to 4096)
- --max_batch_size: Batch size (adjust based on GPU memory)
Example Output
The chat completion example includes several pre-configured dialogs. Here's what you can expect:
- Single-turn conversations
- Multi-turn conversations with context
- System prompts for behavior control
Running Text Completion
For pretrained (non-chat) models, use text completion instead. Pretrained models are not fine-tuned for chat, so phrase your prompt so that the expected answer is its natural continuation.
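A sketch of the text completion command, assuming the pretrained 7B weights were downloaded to `llama-2-7b/`:

```shell
# Text completion with the pretrained (non-chat) 7B model
torchrun --nproc_per_node 1 example_text_completion.py \
    --ckpt_dir llama-2-7b/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 128 --max_batch_size 4
```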
Text Completion Examples
The example includes various prompt types.

Customizing Generation
Both examples support tuning generation parameters:

| Parameter | Default | Description |
|---|---|---|
| temperature | 0.6 | Controls randomness (0.0 = deterministic, 1.0 = creative) |
| top_p | 0.9 | Nucleus sampling threshold for diversity |
| max_gen_len | Model default | Maximum number of tokens to generate |
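For example, lowering the temperature makes output more deterministic. This is a sketch; the flag names assume the example scripts expose these parameters as CLI arguments:

```shell
# More deterministic chat output: lower temperature, tighter nucleus
torchrun --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir llama-2-7b-chat/ \
    --tokenizer_path tokenizer.model \
    --temperature 0.2 --top_p 0.9 \
    --max_seq_len 512 --max_batch_size 4
```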
Next Steps
- Installation Guide - Detailed setup instructions, prerequisites, and troubleshooting
- Llama Cookbook - Advanced examples with Hugging Face integration
- Model Card - Model specifications and performance benchmarks
- Responsible Use - Guidelines for safe and ethical AI deployment
Troubleshooting
403 Forbidden error during download
Your download link has expired. Links are valid for 24 hours and for a limited number of downloads. Request a new URL from the Meta website.
Out of memory error
Reduce the max_seq_len and max_batch_size parameters. The cache is pre-allocated based on these values, so adjust them according to your GPU memory.
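For example, halving both values roughly quarters the pre-allocated cache. The values below are illustrative, not recommendations:

```shell
# Smaller sequence length and batch size -> smaller pre-allocated cache
torchrun --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir llama-2-7b-chat/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 256 --max_batch_size 2
```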
Wrong nproc_per_node value
Ensure --nproc_per_node matches the model parallel (MP) requirements:
- 7B models: --nproc_per_node 1
- 13B models: --nproc_per_node 2
- 70B models: --nproc_per_node 8