Llama 2 models range from 7B to 70B parameters and require sufficient GPU memory. Make sure you have the appropriate hardware for your chosen model size.
Prerequisites
Before you begin, ensure you have:
- A conda environment with PyTorch and CUDA installed
- wget and md5sum utilities (for model downloads)
- An approved download request from Meta (see "Request Model Access" below)
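A quick sanity check for these prerequisites can be run from a shell. This is a sketch; the `python -c` line assumes your conda environment is already activated:

```shell
# Report any missing download utilities
for tool in wget md5sum; do
  command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
done

# Confirm PyTorch sees a CUDA device (requires the conda env to be active)
python -c "import torch; print('CUDA available:', torch.cuda.is_available())" 2>/dev/null

echo "prerequisite check complete"
```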
Quick Start Steps
Request Model Access
Visit the Meta website and register to download the Llama 2 models. Once approved, you'll receive an email with a signed download URL. Keep this URL handy for the download step.
Clone and Install
Clone the Llama 2 repository and install the package. This installs the following dependencies:
- torch - PyTorch framework
- fairscale - Model parallelism utilities
- fire - CLI argument parsing
- sentencepiece - Tokenization
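Concretely, the clone-and-install step might look like the following sketch. The repository URL is an assumption based on Meta's public Llama 2 repo; adjust it if the project has moved:

```shell
# Clone the Llama 2 repository (URL assumed)
git clone https://github.com/meta-llama/llama.git
cd llama

# Install the package and its dependencies in editable mode
pip install -e .
```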
Download Models
Run the download script to fetch the model weights and tokenizer. When prompted:
- Enter the URL from your email (copy it manually; don't use "Copy Link")
- Select the models to download from 7B, 13B, 70B, 7B-chat, 13B-chat, 70B-chat:
  - Press Enter to download all models
  - Or specify a comma-separated list, e.g. 7B,7B-chat

For each selected model, the script downloads:
- Model weights (consolidated.*.pth files)
- Tokenizer (tokenizer.model)
- Configuration (params.json)
- License and usage policy files
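Putting the download step together, a minimal invocation might look like this (a sketch; the script name `download.sh` in the repository root is an assumption):

```shell
# Make the download script executable and run it from the repo root
chmod +x download.sh
./download.sh
# When prompted, paste the signed URL from your email,
# then choose models, e.g.: 7B,7B-chat
```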
Run Your First Chat Completion
Execute the chat completion example with the 7B chat model:
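As an illustration, assuming the 7B chat weights were downloaded to `llama-2-7b-chat/` next to `tokenizer.model` in the repository root, the command might look like:

```shell
# Chat example on a single GPU (7B models use model parallel value 1)
torchrun --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir llama-2-7b-chat/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 512 --max_batch_size 6
```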
Understanding the Parameters
- --nproc_per_node: Number of GPUs (the model parallel value)
  - 7B models: 1
  - 13B models: 2
  - 70B models: 8
- --ckpt_dir: Path to your downloaded model directory
- --tokenizer_path: Path to the tokenizer model
- --max_seq_len: Maximum sequence length (models support up to 4096)
- --max_batch_size: Batch size (adjust based on GPU memory)
Example Output
The chat completion example includes several pre-configured dialogs. Here's what you can expect:
- Single-turn conversations
- Multi-turn conversations with context
- System prompts for behavior control
Running Text Completion
For pretrained (non-chat) models, use text completion instead. Pretrained models are not fine-tuned for chat, so phrase your prompt so that the expected answer is its natural continuation.
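A sketch of the text completion command, assuming the pretrained 7B weights were downloaded to `llama-2-7b/`:

```shell
# Text completion with the pretrained (non-chat) 7B model
torchrun --nproc_per_node 1 example_text_completion.py \
    --ckpt_dir llama-2-7b/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 128 --max_batch_size 4
```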
Text Completion Examples
The example includes various prompt types.

Customizing Generation
Both examples support tuning generation parameters:

| Parameter | Default | Description |
|---|---|---|
| temperature | 0.6 | Controls randomness (0.0 = deterministic, 1.0 = creative) |
| top_p | 0.9 | Nucleus sampling threshold for diversity |
| max_gen_len | Model default | Maximum number of tokens to generate |
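For example, lowering the temperature makes output more deterministic. This is a sketch; the flag names assume the example scripts expose these parameters as CLI arguments:

```shell
# More deterministic chat output: lower temperature, tighter nucleus
torchrun --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir llama-2-7b-chat/ \
    --tokenizer_path tokenizer.model \
    --temperature 0.2 --top_p 0.9 \
    --max_seq_len 512 --max_batch_size 4
```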
Next Steps
- Installation Guide - Detailed setup instructions, prerequisites, and troubleshooting
- Llama Cookbook - Advanced examples with Hugging Face integration
- Model Card - Model specifications and performance benchmarks
- Responsible Use - Guidelines for safe and ethical AI deployment
Troubleshooting
403 Forbidden error during download
Your download link has expired. Links are valid for 24 hours and for a limited number of downloads. Request a new URL from the Meta website.
Out of memory error
Reduce the max_seq_len and max_batch_size parameters. The cache is pre-allocated based on these values, so adjust them according to your GPU memory.
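For example, halving both values roughly quarters the pre-allocated cache. The values below are illustrative, not recommendations:

```shell
# Smaller sequence length and batch size -> smaller pre-allocated cache
torchrun --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir llama-2-7b-chat/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 256 --max_batch_size 2
```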
Wrong nproc_per_node value
Ensure --nproc_per_node matches the model parallel (MP) requirements:
- 7B models: --nproc_per_node 1
- 13B models: --nproc_per_node 2
- 70B models: --nproc_per_node 8