This guide provides comprehensive instructions for installing Llama 2 and all its dependencies.

System Requirements

Hardware Requirements

Llama 2 models have varying hardware requirements based on size:
| Model Size | Model Parallel (MP) | Minimum GPU Memory | Recommended GPUs |
| --- | --- | --- | --- |
| 7B | 1 | 16 GB | 1x A100 or V100 |
| 13B | 2 | 32 GB | 2x A100 or V100 |
| 70B | 8 | 128 GB | 8x A100 or V100 |
All models support sequence lengths up to 4096 tokens, but memory is pre-allocated based on max_seq_len and max_batch_size parameters.
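The relationship between these parameters and pre-allocated memory can be sketched with a back-of-the-envelope estimate. This is an approximation, assuming fp16 key/value caches of shape (batch, seq, heads, head_dim) per layer as in the reference implementation; the helper name and the 7B architecture constants (32 layers, 32 heads, head dim 128) are illustrative:

```python
def kv_cache_bytes(n_layers, n_heads, head_dim,
                   max_batch_size, max_seq_len, bytes_per_elem=2):
    """Approximate memory pre-allocated for the key/value caches (fp16 = 2 bytes).
    Two cache tensors (K and V) of shape (batch, seq, heads, head_dim) per layer."""
    per_layer = 2 * max_batch_size * max_seq_len * n_heads * head_dim * bytes_per_elem
    return n_layers * per_layer

# Llama 2 7B: 32 layers, 32 attention heads, head dim 128
gb = kv_cache_bytes(32, 32, 128, max_batch_size=4, max_seq_len=4096) / 1024**3
print(f"{gb:.1f} GiB")  # prints "8.0 GiB"
```

Halving max_seq_len or max_batch_size halves this cache footprint, which is why reducing those flags is the first fix for out-of-memory errors.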

Software Requirements

  • Operating System: Linux (Ubuntu 18.04+, CentOS 7+) or macOS
  • Python: 3.8 or higher
  • CUDA: 11.0 or higher (for GPU acceleration)
  • conda: Anaconda or Miniconda
  • Utilities: wget, md5sum (or md5 on macOS)

Installation Steps

Step 1: Set Up Conda Environment

Create a new conda environment with Python 3.8+:
conda create -n llama python=3.8
conda activate llama
This isolates Llama 2 dependencies from other projects.
Step 2: Install PyTorch with CUDA

Install PyTorch with CUDA support for GPU acceleration:
conda install pytorch pytorch-cuda=11.8 -c pytorch -c nvidia
CPU-only installation is not recommended for production use. Inference will be significantly slower.
Verify PyTorch installation:
python -c "import torch; print(f'PyTorch: {torch.__version__}'); print(f'CUDA Available: {torch.cuda.is_available()}')"
Expected output (versions may vary):
PyTorch: 2.0.0
CUDA Available: True
Step 3: Clone Llama Repository

Clone the official Llama 2 repository:
git clone https://github.com/facebookresearch/llama.git
cd llama
Step 4: Install Llama Package

Install the Llama package in editable mode:
pip install -e .
This installs the following dependencies from requirements.txt:
| Package | Purpose |
| --- | --- |
| torch | PyTorch deep learning framework |
| fairscale | Model parallelism and memory optimization |
| fire | Command-line interface generation |
| sentencepiece | Tokenization library |
The -e flag installs in editable mode, allowing you to modify the source code.
Step 5: Verify Installation

Verify the installation by importing the Llama module:
python -c "from llama import Llama; print('Llama package installed successfully')"
If successful, you should see:
Llama package installed successfully

Model Download Process

Request Access

Step 1: Register for Access

Visit the Meta Llama Downloads page and complete the registration form. You'll need to provide:
  • Name and email
  • Organization (optional)
  • Country
  • Intended use case
Step 2: Accept License

Review and accept the Llama 2 Community License Agreement.
Ensure you understand the license terms, including acceptable use policies and restrictions.
Step 3: Receive Download URL

After approval (typically within hours), you’ll receive an email with a unique, signed download URL.
  • URLs expire after 24 hours
  • URLs have download limits
  • You can request new URLs if needed

Download Models

The download.sh script automates model and tokenizer downloads:
chmod +x download.sh
./download.sh

Script Workflow

Step 1: Enter Download URL

When prompted, paste the URL from your email:
Enter the URL from email: https://download.llamameta.net/*?Policy=...
Manually copy-paste the URL. Do not use browser “Copy Link” functionality.
Step 2: Select Models

Choose which models to download:
Enter the list of models to download without spaces (7B,13B,70B,7B-chat,13B-chat,70B-chat), or press Enter for all:
Options:
  • Pretrained models: 7B, 13B, 70B
  • Chat models: 7B-chat, 13B-chat, 70B-chat
  • Download all: Press Enter without input
  • Download specific: 7B,7B-chat (no spaces)
Pretrained Models (7B, 13B, 70B)
  • Base models trained on text completion
  • Use for tasks where the answer is a natural continuation
  • Example: "The theory of relativity states that" → model completes
Chat Models (7B-chat, 13B-chat, 70B-chat)
  • Fine-tuned for dialogue and instruction-following
  • Require specific formatting with INST and <<SYS>> tags
  • Better for conversational AI and Q&A
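The chat-model tag layout described above can be sketched for a single turn. This is a minimal illustration (the helper name is mine); the system prompt is wrapped in <<SYS>> tags and embedded inside the first [INST] block:

```python
# Tag constants matching the Llama 2 chat format
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

def format_chat_prompt(system_prompt, user_message):
    """Build a single-turn Llama 2 chat prompt.
    The system prompt rides inside the first user turn's [INST] block."""
    return f"{B_INST} {B_SYS}{system_prompt}{E_SYS}{user_message} {E_INST}"

prompt = format_chat_prompt("You are a helpful assistant.",
                            "What is model parallelism?")
print(prompt)
```

The chat models were fine-tuned on this exact structure, so free-form prompts without the tags tend to produce noticeably worse dialogue behavior.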
Step 3: Download and Verify

The script will:
  1. Download LICENSE and usage policy
  2. Download tokenizer and verify checksums
  3. For each model:
    • Download model shards (consolidated.*.pth)
    • Download configuration (params.json)
    • Verify file integrity with checksums
Downloading LICENSE and Acceptable Usage Policy
Downloading tokenizer
Downloading llama-2-7b-chat
Checking checksums
Model downloads are large:
  • 7B: ~13 GB
  • 13B: ~26 GB
  • 70B: ~138 GB
Ensure you have sufficient disk space.
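The script's checksum step can also be reproduced by hand if you want to re-verify later. A sketch, assuming checklist.chk contains md5sum-style lines of the form "<hex digest>  <filename>":

```python
import hashlib
from pathlib import Path

def verify_checklist(model_dir):
    """Check every entry in checklist.chk against the file's actual MD5.
    Returns {filename: True/False} for each listed file."""
    model_dir = Path(model_dir)
    results = {}
    for line in (model_dir / "checklist.chk").read_text().splitlines():
        expected, name = line.split()
        actual = hashlib.md5((model_dir / name).read_bytes()).hexdigest()
        results[name] = (actual == expected)
    return results
```

On Linux this is equivalent to running md5sum -c checklist.chk inside the model directory (md5 on macOS).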

Understanding Downloaded Files

After downloading, your directory structure will look like:
llama/
├── llama-2-7b-chat/
│   ├── consolidated.00.pth      # Model weights
│   ├── params.json               # Model configuration
│   └── checklist.chk             # Checksum file
├── llama-2-7b/
│   ├── consolidated.00.pth
│   ├── params.json
│   └── checklist.chk
├── tokenizer.model               # Shared tokenizer
├── tokenizer_checklist.chk       # Tokenizer checksum
├── LICENSE                       # License agreement
└── USE_POLICY.md                # Acceptable use policy
Larger models are split into multiple shard files:
| Model | Shards | Files |
| --- | --- | --- |
| 7B | 1 | consolidated.00.pth |
| 13B | 2 | consolidated.00.pth, consolidated.01.pth |
| 70B | 8 | consolidated.00.pth through consolidated.07.pth |
Sharding enables distribution across multiple GPUs for model parallelism.
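The shard naming follows directly from the model-parallel (MP) degree: rank N of the torchrun job loads consolidated.<N>.pth. A small illustrative helper:

```python
def shard_files(model_parallel_size):
    """List checkpoint shard filenames for a given MP degree.
    Each model-parallel rank loads exactly one shard."""
    return [f"consolidated.{rank:02d}.pth" for rank in range(model_parallel_size)]

print(shard_files(2))  # prints "['consolidated.00.pth', 'consolidated.01.pth']"
```

This is why --nproc_per_node must match the MP value in the hardware table above: one process per shard.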

Alternative: Hugging Face Downloads

You can also access Llama 2 models through Hugging Face:
Step 1: Request Hugging Face Access

Visit a Llama 2 model repository on Hugging Face (e.g., meta-llama/Llama-2-7b-chat-hf). Acknowledge the license and fill out the access form.
Step 2: Install Hugging Face Libraries

pip install transformers accelerate
Step 3: Download and Use Models

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16
)
The official Llama repository provides more control and lower-level access, while Hugging Face offers easier integration with the transformers ecosystem.

Verification and Testing

After installation, verify everything works:

Quick Verification

# Test with chat model
torchrun --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir llama-2-7b-chat/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 512 \
    --max_batch_size 6
If successful, you’ll see chat completions for pre-configured dialogs.

Test Text Completion

# Test with pretrained model
torchrun --nproc_per_node 1 example_text_completion.py \
    --ckpt_dir llama-2-7b/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 128 \
    --max_batch_size 4

Troubleshooting

If the llama package cannot be imported, ensure you're in the correct conda environment and that the package is installed:
conda activate llama
pip install -e .
If you hit GPU out-of-memory errors, reduce the memory allocation parameters:
torchrun --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir llama-2-7b-chat/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 256 \
    --max_batch_size 2
Or use a smaller model (7B instead of 13B/70B).
If the download script can no longer fetch files, your download URL may have expired. Request a new URL from the Meta website.
If checksum verification fails, the downloaded files may be corrupted. Delete the affected model directory and re-run the download script:
rm -rf llama-2-7b-chat/
./download.sh
If fairscale fails to install as a dependency, some systems require manual installation:
pip install fairscale --no-build-isolation
Or install from source:
git clone https://github.com/facebookresearch/fairscale.git
cd fairscale
pip install .

Next Steps

  • Quickstart Guide: Run your first inference in minutes
  • Llama Cookbook: Advanced examples and integrations
  • Model Card: Detailed model specifications
  • FAQ: Frequently asked questions
