Llama 2 Documentation
Run inference with Meta’s openly licensed Llama 2 large language models, from 7B to 70B parameters, with an optimized transformer architecture and multi-GPU support.
Quick Start
Get up and running with Llama 2 in minutes
Install dependencies
Install the package with pip inside a conda environment that already has PyTorch and CUDA available.
Download model weights
Visit the Meta website to register and download model weights. You’ll receive a signed URL via email.
When prompted, paste the URL from your email. Links expire after 24 hours.
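Concretely, the two steps above might look like the following. The repository URL and the download.sh helper follow Meta's public reference repo; the environment name and Python version are illustrative, so adjust them to your setup.

```shell
# Create a conda environment with a recent Python (name and version illustrative).
conda create -n llama2 python=3.10 -y
conda activate llama2

# Clone the reference inference repo and install it with pip
# (this pulls in dependencies such as torch and fairscale).
git clone https://github.com/facebookresearch/llama.git
cd llama
pip install -e .

# Fetch the weights: run the helper and paste the signed URL
# from your email when prompted. Remember the links expire after 24 hours.
./download.sh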
Explore by Topic
Everything you need to work with Llama 2 models
Text Completion
Generate natural continuations of text prompts with pre-trained models
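Under the hood, text completion is an autoregressive loop: score the current context, sample one next token, append it, and repeat. A minimal sketch of that loop; the `toy_logits` "model" is invented for illustration, whereas the real library scores tokens with the transformer:

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def complete(prompt_tokens, logits_fn, max_gen_len, eos_id, temperature=0.6, seed=0):
    """Autoregressive completion loop: score the context, sample one token,
    append it, and repeat until EOS or the length budget runs out."""
    tokens = list(prompt_tokens)
    rng = random.Random(seed)
    for _ in range(max_gen_len):
        probs = softmax([l / temperature for l in logits_fn(tokens)])
        next_id = rng.choices(range(len(probs)), weights=probs)[0]
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens

# Toy stand-in for the model: strongly prefers (last token + 1) mod 10,
# over a vocabulary of 10 token ids, with id 9 acting as EOS.
def toy_logits(tokens):
    logits = [0.0] * 10
    logits[(tokens[-1] + 1) % 10] = 10.0
    return logits

print(complete([0], toy_logits, max_gen_len=5, eos_id=9))
```

The real API hides this loop behind a single call; the sketch only shows why temperature and the generation-length cap appear as parameters.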
Chat Completion
Build conversational AI with fine-tuned Llama-2-Chat models
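Llama-2-Chat models expect dialogs rendered into a specific prompt layout: `[INST]`/`[/INST]` turn markers, with any system prompt wrapped in `<<SYS>>` tags and folded into the first user message. A string-level sketch of that convention; the actual implementation emits BOS/EOS token ids rather than literal `<s>`/`</s>` text, and exact whitespace may differ:

```python
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

def format_dialog(dialog):
    """Render [{"role": ..., "content": ...}] messages into the
    Llama-2-Chat prompt layout (string-level sketch only)."""
    if dialog[0]["role"] == "system":
        # The system prompt is folded into the first user message.
        merged = B_SYS + dialog[0]["content"] + E_SYS + dialog[1]["content"]
        dialog = [{"role": "user", "content": merged}] + list(dialog[2:])
    parts = []
    # Each completed (user, assistant) exchange becomes one <s>...</s> segment.
    for user, answer in zip(dialog[::2], dialog[1::2]):
        parts.append(
            f"<s>{B_INST} {user['content'].strip()} {E_INST} "
            f"{answer['content'].strip()} </s>"
        )
    # The final user message is left open for the model to continue.
    parts.append(f"<s>{B_INST} {dialog[-1]['content'].strip()} {E_INST}")
    return "".join(parts)

print(format_dialog([{"role": "user", "content": "Hello!"}]))
```

The chat API accepts the message list directly and applies this formatting for you; getting the template wrong when building prompts by hand is a common source of degraded output.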
Model Parallelism
Scale to larger models with multi-GPU inference using FairScale
Architecture
Understand the optimized transformer architecture and GQA
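Grouped-query attention (GQA), used by the larger Llama 2 models, shares each key/value head across a group of query heads, shrinking the KV cache without giving each query head its own keys and values. A minimal NumPy sketch; shapes and names are illustrative, not the library's API:

```python
import numpy as np

def gqa(q, k, v):
    """Minimal grouped-query attention.

    q: (seq, n_heads, head_dim); k, v: (seq, n_kv_heads, head_dim),
    where n_heads is a multiple of n_kv_heads. Each KV head is
    duplicated so one key/value head serves a whole group of query heads.
    """
    n_heads, n_kv_heads, d = q.shape[1], k.shape[1], q.shape[-1]
    n_rep = n_heads // n_kv_heads
    k = np.repeat(k, n_rep, axis=1)  # duplicate each KV head across its query group
    v = np.repeat(v, n_rep, axis=1)
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(d)  # per-head scaled dot products
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                     # softmax over keys
    return np.einsum("hqk,khd->qhd", w, v)
```

With `n_kv_heads == n_heads` this reduces to ordinary multi-head attention; with fewer KV heads, the cached K/V tensors are smaller by the same factor, which is the point of GQA for 70B-scale inference.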
Tokenization
Learn how SentencePiece encoding and special tokens work
Generation
Control randomness with temperature, top-p sampling, and more
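Temperature rescales the logits before the softmax (lower values sharpen the distribution toward greedy decoding), while top-p (nucleus) sampling restricts choices to the smallest set of tokens whose cumulative probability exceeds p. A self-contained sketch of the combination; the function name and defaults here are illustrative:

```python
import numpy as np

def sample_top_p(logits, temperature=0.6, top_p=0.9, rng=None):
    """Temperature scaling followed by nucleus (top-p) sampling.

    Keeps the smallest set of tokens whose cumulative probability
    exceeds top_p, renormalizes over that set, and samples from it.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()                   # softmax of temperature-scaled logits
    order = np.argsort(probs)[::-1]        # token ids, most probable first
    cum = np.cumsum(probs[order])
    # Keep a token if the mass accumulated *before* it is still within top_p.
    keep = (cum - probs[order]) <= top_p
    kept = order[keep]
    p = probs[kept] / probs[kept].sum()
    return int(rng.choice(kept, p=p))
```

Lowering temperature or top_p makes output more repeatable; raising them increases diversity at the cost of coherence.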
Model Variants
Choose the right model for your use case
Pre-trained Models
7B, 13B, and 70B parameter models for text completion tasks
Chat Models
Fine-tuned models optimized for dialogue and assistant use cases
API Reference
Complete documentation of classes, methods, and types
Llama Class
Build, generate, and complete text with the main inference class
Tokenizer
Encode and decode text using SentencePiece tokenization
Transformer
Core transformer model with attention and feedforward layers
Types
TypedDict definitions for messages, predictions, and dialogs
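As a sketch of what those definitions typically look like: the names below follow common conventions for this API (`Message`, `CompletionPrediction`, `Dialog`), but check them against the actual module before relying on them.

```python
from typing import List, Literal, TypedDict

Role = Literal["system", "user", "assistant"]

class Message(TypedDict):
    role: Role
    content: str

# total=False: the token/logprob fields appear only when logprobs are requested.
class CompletionPrediction(TypedDict, total=False):
    generation: str
    tokens: List[str]
    logprobs: List[float]

class ChatPrediction(TypedDict, total=False):
    generation: Message
    tokens: List[str]
    logprobs: List[float]

# A dialog is simply an ordered list of messages.
Dialog = List[Message]

dialog: Dialog = [
    {"role": "system", "content": "Answer concisely."},
    {"role": "user", "content": "What is GQA?"},
]
```

Because these are TypedDicts, plain dict literals type-check against them, so dialogs can be written as ordinary JSON-like data.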
Ready to get started?
Follow our quickstart guide to download model weights and run your first inference in minutes.
Get Started