Llama 2 pretrained models are base models that generate text by naturally continuing the input prompt. These models are trained on 2 trillion tokens but are not fine-tuned for chat or question-answering.

Use Cases

Pretrained models excel at:
  • Text completion: Continuing a sentence or paragraph naturally
  • Few-shot learning: Following patterns from examples in the prompt
  • Creative writing: Generating stories, articles, or content
  • Code generation: Completing code snippets
  • Translation: When provided with example patterns
  • Custom fine-tuning: Serving as a base for domain-specific fine-tuning

Natural Continuation Prompts

The key to using pretrained models is crafting prompts where the expected answer is the natural continuation of the prompt.

Basic Examples

# Philosophical completion
"I believe the meaning of life is"

# Scientific explanation
"Simply put, the theory of relativity states that "

# Professional writing
"""A brief message congratulating the team on the launch:

Hi everyone,

I just """

Few-Shot Learning

Provide examples in the prompt to guide the model:
"""Translate English to French:

sea otter => loutre de mer
peppermint => menthe poivrée
plush girafe => girafe peluche
cheese =>"""

Running Text Completion

Command Line

torchrun --nproc_per_node 1 example_text_completion.py \
    --ckpt_dir llama-2-7b/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 128 --max_batch_size 4
  • Set --nproc_per_node to the Model Parallel (MP) value for your model size (7B=1, 13B=2, 70B=8)
  • Adjust max_seq_len and max_batch_size based on available GPU memory

Python Code

from llama import Llama
from typing import List

# Initialize the model
generator = Llama.build(
    ckpt_dir="llama-2-7b/",
    tokenizer_path="tokenizer.model",
    max_seq_len=128,
    max_batch_size=4,
)

# Define prompts
prompts: List[str] = [
    "I believe the meaning of life is",
    "Simply put, the theory of relativity states that ",
]

# Generate completions
results = generator.text_completion(
    prompts,
    max_gen_len=64,
    temperature=0.6,
    top_p=0.9,
)

# Print results
for prompt, result in zip(prompts, results):
    print(prompt)
    print(f"> {result['generation']}")
    print("\n" + "="*34 + "\n")

Parameters

Generation Parameters

| Parameter | Default | Description |
|---|---|---|
| `temperature` | 0.6 | Controls randomness (0.0 = deterministic, 1.0 = very random) |
| `top_p` | 0.9 | Nucleus sampling threshold for diversity |
| `max_gen_len` | 64 | Maximum number of tokens to generate |
| `max_seq_len` | 128 | Maximum sequence length for input prompts |
| `max_batch_size` | 4 | Maximum number of prompts to process simultaneously |
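Llama implements sampling internally; the sketch below is only an illustration of what `temperature` and `top_p` do to a toy next-token distribution, not the library's actual code. Lower temperature sharpens the distribution (0 means greedy decoding), and `top_p` restricts sampling to the smallest set of tokens whose cumulative probability reaches the threshold.

```python
import math
import random

def sample_next_token(logits, temperature=0.6, top_p=0.9, rng=random):
    """Illustrative temperature scaling + nucleus (top-p) sampling."""
    if temperature == 0:
        return max(logits, key=logits.get)  # greedy: fully deterministic
    # Temperature scaling, then softmax (shifted by the max for stability).
    scaled = {tok: l / temperature for tok, l in logits.items()}
    m = max(scaled.values())
    probs = {tok: math.exp(l - m) for tok, l in scaled.items()}
    z = sum(probs.values())
    probs = {tok: p / z for tok, p in probs.items()}
    # Nucleus: keep the smallest set of tokens whose mass reaches top_p.
    kept, mass = [], 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept.append((tok, p))
        mass += p
        if mass >= top_p:
            break
    # Renormalize over the nucleus and draw one token.
    total = sum(p for _, p in kept)
    r = rng.random() * total
    for tok, p in kept:
        r -= p
        if r <= 0:
            return tok
    return kept[-1][0]

logits = {"the": 3.0, "a": 2.0, "banana": -1.0}
print(sample_next_token(logits, temperature=0))
```

With `temperature=0` the call always returns the highest-logit token, which is why low temperatures give more focused, repeatable outputs.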

Model Loading Parameters

| Parameter | Description |
|---|---|
| `ckpt_dir` | Directory containing checkpoint files (e.g., `llama-2-7b/`) |
| `tokenizer_path` | Path to the tokenizer model file |
| `max_seq_len` | Must be ≤ 4096 tokens |
| `max_batch_size` | Adjust based on available GPU memory |

Performance Benchmarks

Performance of the Llama 2 pretrained models:

| Benchmark Category | 7B | 13B | 70B |
|---|---|---|---|
| Code (HumanEval, MBPP) | 16.8 | 24.5 | 37.5 |
| Commonsense Reasoning | 63.9 | 66.9 | 71.9 |
| World Knowledge | 48.9 | 55.4 | 63.6 |
| Reading Comprehension | 61.3 | 65.8 | 69.4 |
| Math (GSM8K, MATH) | 14.6 | 28.7 | 35.2 |
| MMLU | 45.3 | 54.8 | 68.9 |

Best Practices

  • Write prompts that naturally lead to the desired completion
  • Use few-shot examples to establish patterns
  • Keep context within the 4096 token limit
  • Call strip() on prompts: a trailing space can produce double spaces in the completion
  • Set max_seq_len to the minimum needed for your use case
  • Reduce max_batch_size if encountering out-of-memory errors
  • Use appropriate model size: 7B for most tasks, 70B for maximum quality
  • Lower temperature (e.g., 0.2) for more focused outputs
  • 7B model: Single GPU (model parallel = 1)
  • 13B model: 2 GPUs (model parallel = 2)
  • 70B model: 8 GPUs (model parallel = 8)
  • Memory is pre-allocated based on max_seq_len and max_batch_size
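As a rough illustration of why `max_seq_len` and `max_batch_size` drive pre-allocated memory, the sketch below estimates the KV-cache size using the Llama 2 7B architecture (32 layers, hidden size 4096, fp16). This is a back-of-the-envelope estimate only; model weights and activations consume additional memory on top of this.

```python
# Back-of-the-envelope KV-cache size. Architecture numbers are for
# Llama 2 7B; fp16 (2 bytes per element) assumed.
n_layers, hidden_dim, bytes_per_el = 32, 4096, 2

def kv_cache_bytes(max_seq_len, max_batch_size):
    # Keys and values are each cached per layer, per position: hence the 2x.
    return 2 * n_layers * hidden_dim * bytes_per_el * max_seq_len * max_batch_size

mib = kv_cache_bytes(max_seq_len=128, max_batch_size=4) / 2**20
print(f"{mib:.0f} MiB")
```

The cache grows linearly with both settings, which is why shrinking either is the first remedy for out-of-memory errors.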

Next Steps

Chat Models

Learn about fine-tuned chat models for dialogue applications

Model Overview

Compare all model variants and sizes
