This guide will help you quickly get started with Llama 2, from setup to running your first chat completion.
Llama 2 models range from 7B to 70B parameters and require sufficient GPU memory. Make sure you have the appropriate hardware for your chosen model size.

Prerequisites

Before you begin, ensure you have:
  • A conda environment with PyTorch and CUDA installed
  • wget and md5sum utilities (for model downloads)
  • An approved download request from Meta (see Step 1)
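Before starting, it can help to confirm the command-line prerequisites are actually on your PATH. A small illustrative check (this helper is not part of the repository):

```python
import shutil

def check_prereqs(tools=("wget", "md5sum")):
    """Map each required command-line tool to whether it is on PATH."""
    return {tool: shutil.which(tool) is not None for tool in tools}

print(check_prereqs())  # e.g. {'wget': True, 'md5sum': True}
```

If either tool is missing, install it with your system package manager before running the download script.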

Quick Start Steps

Step 1: Request Model Access

Visit the Meta website and register to download the Llama 2 models. Once approved, you’ll receive an email with a signed download URL. Keep this URL handy for Step 4.
Download links expire after 24 hours and have a download limit. If you encounter 403: Forbidden errors, request a new link.
Step 2: Clone and Install

Clone the Llama 2 repository and install the package:
git clone https://github.com/facebookresearch/llama.git
cd llama
pip install -e .
This installs the following dependencies:
  • torch - PyTorch framework
  • fairscale - Model parallelism utilities
  • fire - CLI argument parsing
  • sentencepiece - Tokenization
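To confirm the dependencies installed cleanly, you can probe for them without importing anything heavy. An illustrative snippet (not part of the repository):

```python
import importlib.util

def check_install(mods=("torch", "fairscale", "fire", "sentencepiece")):
    """Map each expected dependency to whether it is importable."""
    return {mod: importlib.util.find_spec(mod) is not None for mod in mods}

missing = [mod for mod, ok in check_install().items() if not ok]
if missing:
    print("Missing dependencies:", ", ".join(missing))
```

An empty `missing` list means `pip install -e .` pulled in everything listed above.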
Step 3: Download Models

Run the download script to fetch model weights and tokenizer:
chmod +x download.sh
./download.sh
When prompted:
  1. Enter the URL from your email (manually copy, don’t use “Copy Link”)
  2. Select models to download: Choose from 7B, 13B, 70B, 7B-chat, 13B-chat, 70B-chat
    • Press Enter to download all models
    • Or specify models as comma-separated list: 7B,7B-chat
The script will download:
  • Model weights (consolidated.*.pth files)
  • Tokenizer (tokenizer.model)
  • Configuration (params.json)
  • License and usage policy files
Step 4: Run Your First Chat Completion

Execute the chat completion example with the 7B chat model:
torchrun --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir llama-2-7b-chat/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 512 \
    --max_batch_size 6
  • --nproc_per_node: Number of GPUs (model parallel value)
    • 7B models: 1
    • 13B models: 2
    • 70B models: 8
  • --ckpt_dir: Path to your downloaded model directory
  • --tokenizer_path: Path to the tokenizer model
  • --max_seq_len: Maximum sequence length (models support up to 4096)
  • --max_batch_size: Batch size (adjust based on GPU memory)
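The size-to-GPU mapping above can be captured in a small helper, handy when scripting launches (illustrative only; the values come from the list above):

```python
# Model-parallel (MP) degree required per model size.
MP_DEGREE = {"7B": 1, "13B": 2, "70B": 8}

def nproc_for(model: str) -> int:
    """Return the --nproc_per_node value for a name like '7B' or '70B-chat'."""
    size = model.split("-")[0].upper()
    return MP_DEGREE[size]

print(nproc_for("7B-chat"))  # → 1
```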

Example Output

The chat completion example includes several pre-configured dialogs. Here’s what you can expect:
dialogs = [
    [{"role": "user", "content": "what is the recipe of mayonnaise?"}],
    [
        {"role": "user", "content": "I am going to Paris, what should I see?"},
        {"role": "assistant", "content": "Paris has many attractions..."},
        {"role": "user", "content": "What is so great about #1?"}
    ],
    [
        {"role": "system", "content": "Always answer with Haiku"},
        {"role": "user", "content": "I am going to Paris, what should I see?"}
    ]
]
Each dialog demonstrates different capabilities:
  • Single-turn conversations
  • Multi-turn conversations with context
  • System prompts for behavior control
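If you edit example_chat_completion.py to add your own dialogs, they must follow the same shape: an optional leading system message, then user/assistant turns that alternate and end with a user turn. A small validator sketch for that constraint (this helper is not part of the repository):

```python
def validate_dialog(dialog):
    """Check the chat format: an optional leading system message, then
    user/assistant turns that strictly alternate and end with a user turn."""
    turns = dialog[1:] if dialog and dialog[0]["role"] == "system" else dialog
    if not turns or turns[-1]["role"] != "user":
        return False
    return all(t["role"] == ("user", "assistant")[i % 2]
               for i, t in enumerate(turns))

print(validate_dialog([{"role": "user", "content": "hello"}]))  # → True
```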

Running Text Completion

For pretrained (non-chat) models, use text completion instead:
torchrun --nproc_per_node 1 example_text_completion.py \
    --ckpt_dir llama-2-7b/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 128 \
    --max_batch_size 4
Pretrained models are not fine-tuned for chat. Phrase your prompt so that the expected answer is a natural continuation of it.

Text Completion Examples

The example includes various prompt types:
prompts = [
    "I believe the meaning of life is",
    "Simply put, the theory of relativity states that "
]
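Because pretrained models simply continue text, few-shot prompts work well: show the pattern a few times, then stop exactly where the model should pick up. One illustrative prompt (not taken from the example file):

```python
# Show the pattern, then stop where the model should continue.
prompt = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "cheese =>"
)
```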

Customizing Generation

Both examples support tuning generation parameters:
  • temperature (default 0.6) - Controls randomness (0.0 = deterministic, 1.0 = creative)
  • top_p (default 0.9) - Nucleus sampling threshold for diversity
  • max_gen_len (default: model-specific) - Maximum tokens to generate
Example with custom parameters:
torchrun --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir llama-2-7b-chat/ \
    --tokenizer_path tokenizer.model \
    --temperature 0.8 \
    --top_p 0.95 \
    --max_seq_len 512 \
    --max_batch_size 6
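To build intuition for how these two knobs interact, here is a self-contained sketch of temperature scaling plus nucleus (top-p) sampling over a toy logit vector. It is illustrative only, not the repository's implementation:

```python
import math
import random

def sample_top_p(logits, temperature=0.6, top_p=0.9, rng=random):
    """Pick a token index via temperature + nucleus (top-p) sampling."""
    if temperature == 0.0:
        # Zero temperature degenerates to greedy decoding.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Softmax with temperature (numerically stabilized).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of tokens whose cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Sample from the kept set, renormalized to its own mass.
    r = rng.random() * sum(probs[i] for i in kept)
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

Lower temperature sharpens the distribution before the top-p cutoff is applied; lower top_p shrinks the candidate set, so both push generation toward the most likely tokens.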

Next Steps

  • Installation Guide - Detailed setup instructions, prerequisites, and troubleshooting
  • Llama Cookbook - Advanced examples with Hugging Face integration
  • Model Card - Model specifications and performance benchmarks
  • Responsible Use - Guidelines for safe and ethical AI deployment

Troubleshooting

Download link expired (403: Forbidden): Links are valid for 24 hours and allow a limited number of downloads. Request a new URL from the Meta website.
Out of GPU memory: Reduce the max_seq_len and max_batch_size parameters. The cache is pre-allocated based on these values, so adjust according to your GPU memory:
# For limited GPU memory
torchrun --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir llama-2-7b-chat/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 256 \
    --max_batch_size 2
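To see why these two flags dominate memory use, a rough back-of-the-envelope estimate of the pre-allocated key/value cache (the layer, head, and dtype defaults below are typical of a 7B-class model and are assumptions, not values read from the checkpoint):

```python
def kv_cache_bytes(max_batch_size, max_seq_len,
                   n_layers=32, n_heads=32, head_dim=128, bytes_per_elem=2):
    """Rough size of the pre-allocated key/value cache.
    Defaults approximate a 7B-class model in fp16; illustrative only."""
    # Leading 2x accounts for the separate key and value tensors.
    return (2 * n_layers * max_batch_size * max_seq_len
            * n_heads * head_dim * bytes_per_elem)

print(f"{kv_cache_bytes(6, 512) / 2**30:.2f} GiB")  # → 1.50 GiB
```

The cache grows linearly in both max_batch_size and max_seq_len, which is why halving either flag is an effective first response to out-of-memory errors.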
Wrong GPU count: Ensure --nproc_per_node matches the model parallel (MP) requirements:
  • 7B models: --nproc_per_node 1
  • 13B models: --nproc_per_node 2
  • 70B models: --nproc_per_node 8
