Deprecation Notice: This repository contains the original Llama 2 implementation. For newer models and the Llama Stack, please use Meta's newer Llama repositories instead.

Llama 2 Documentation

Run inference with Meta’s powerful open-source large language models. From 7B to 70B parameters, with optimized transformer architecture and multi-GPU support.

Quick Start

Get up and running with Llama 2 in minutes

1. Install dependencies

Install the package with pip in a conda environment with PyTorch and CUDA available.
pip install -e .
2. Download model weights

Visit the Meta website to register and download model weights. You’ll receive a signed URL via email.
./download.sh
When prompted, paste the URL from your email. Links expire after 24 hours.
3. Run your first inference

Use torchrun to execute chat completion with the downloaded model. Set --nproc_per_node to the model's model-parallel (MP) size: 1 for 7B, 2 for 13B, and 8 for 70B.
torchrun --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir llama-2-7b-chat/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 512 --max_batch_size 6
User: what is the recipe of mayonnaise?

> Assistant: Mayonnaise is a thick, creamy condiment made from egg yolks, oil, and acid (such as vinegar or lemon juice)...
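Each dialog passed to chat completion follows a strict role convention: an optional leading system message, then user and assistant messages strictly alternating, ending with a user message. A minimal sketch of a checker for that convention (the helper name `validate_dialog` is ours, not part of the library):

```python
def validate_dialog(dialog):
    """Check the chat role convention: an optional 'system' message first,
    then strictly alternating user/assistant turns, ending with 'user'."""
    msgs = list(dialog)
    # A system prompt, if present, must be the very first message.
    if msgs and msgs[0]["role"] == "system":
        msgs = msgs[1:]
    # The dialog must end with a user turn for the model to answer.
    if not msgs or msgs[-1]["role"] != "user":
        return False
    expected = "user"
    for m in msgs:
        if m["role"] != expected:
            return False
        expected = "assistant" if expected == "user" else "user"
    return True

dialog = [{"role": "user", "content": "what is the recipe of mayonnaise?"}]
```

The library raises an error for dialogs that break this pattern, so it is worth checking inputs before launching an expensive multi-GPU run.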

Explore by Topic

Everything you need to work with Llama 2 models

Text Completion

Generate natural continuations of text prompts with pre-trained models

Chat Completion

Build conversational AI with fine-tuned Llama-2-Chat models

Model Parallelism

Scale to larger models with multi-GPU inference using FairScale

Architecture

Understand the optimized transformer architecture and GQA

Tokenization

Learn how SentencePiece encoding and special tokens work

Generation

Control randomness with temperature, top-p sampling, and more
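Temperature divides the logits before the softmax (lower values sharpen the distribution), and top-p (nucleus) sampling then draws only from the smallest set of tokens whose cumulative probability reaches p. A self-contained sketch of the idea in plain Python (not the repo's implementation, which operates on torch tensors):

```python
import math
import random

def sample_top_p(logits, temperature=0.6, top_p=0.9, rng=None):
    """Temperature-scale logits, softmax, then sample from the smallest
    set of tokens whose cumulative probability reaches top_p."""
    rng = rng or random.Random(0)
    # Temperature scaling: lower temperature -> sharper distribution.
    scaled = [l / temperature for l in logits]
    # Numerically stable softmax.
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Build the nucleus: most-probable tokens until cumulative mass >= top_p.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    cum, nucleus = 0.0, []
    for i in order:
        nucleus.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Renormalize within the nucleus and sample one token index.
    mass = sum(probs[i] for i in nucleus)
    r = rng.random() * mass
    for i in nucleus:
        r -= probs[i]
        if r <= 0:
            return i
    return nucleus[-1]
```

With a dominant logit and the default top_p of 0.9, the nucleus collapses to a single token and sampling becomes effectively greedy.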

Model Variants

Choose the right model for your use case

Pre-trained Models

7B, 13B, and 70B parameter models for text completion tasks

Chat Models

Fine-tuned models optimized for dialogue and assistant use cases

API Reference

Complete documentation of classes, methods, and types

Llama Class

Build, generate, and complete text with the main inference class

Tokenizer

Encode and decode text using SentencePiece tokenization
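The repo's Tokenizer wraps a SentencePiece model loaded from tokenizer.model, and its encode method takes bos/eos flags that prepend or append the special token ids. A toy stand-in illustrating that interface (the word-level vocabulary and ids here are made up; real ids come from the SentencePiece model):

```python
class ToyTokenizer:
    """Toy stand-in for the SentencePiece-backed Tokenizer: maps whole
    words to ids and handles bos/eos flags on encode."""
    def __init__(self, vocab):
        self.bos_id, self.eos_id = 1, 2  # made-up special token ids
        self._tok2id = {w: i + 3 for i, w in enumerate(vocab)}
        self._id2tok = {i: w for w, i in self._tok2id.items()}

    def encode(self, s, bos=False, eos=False):
        ids = [self._tok2id[w] for w in s.split()]
        if bos:
            ids = [self.bos_id] + ids
        if eos:
            ids = ids + [self.eos_id]
        return ids

    def decode(self, ids):
        # Skip special tokens, keep only vocabulary ids.
        return " ".join(self._id2tok[i] for i in ids if i in self._id2tok)

tok = ToyTokenizer(["hello", "world"])
```

Real SentencePiece tokenization is subword-level, not word-level, but the bos/eos handling follows the same shape.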

Transformer

Core transformer model with attention and feedforward layers

Types

TypedDict definitions for messages, predictions, and dialogs
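A sketch of what these TypedDicts look like, consistent with the documented message/prediction API (field names follow the chat-completion output shown in the Quick Start; treat exact optional fields as an assumption and check llama/generation.py for the authoritative definitions):

```python
from typing import List, Literal, TypedDict

# A message role is one of three fixed strings.
Role = Literal["system", "user", "assistant"]

class Message(TypedDict):
    role: Role
    content: str

# A dialog is an ordered list of messages.
Dialog = List[Message]

class CompletionPrediction(TypedDict, total=False):
    generation: str          # the completed text
    tokens: List[str]        # present when logprobs are requested
    logprobs: List[float]

class ChatPrediction(TypedDict, total=False):
    generation: Message      # the assistant's reply as a Message
    tokens: List[str]
    logprobs: List[float]
```

At runtime a TypedDict is just a dict, so these types add static checking (mypy/pyright) without any overhead in the inference path.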

Ready to get started?

Follow our quickstart guide to download model weights and run your first inference in minutes.

Get Started