
Overview

Text completion with Llama 2 uses pretrained models to generate natural continuations of prompts. These models are not fine-tuned for chat or Q&A; instead, write prompts so that the expected answer reads as the natural continuation of the prompt.

Basic Usage

First, build a Llama instance and then call text_completion() with your prompts:
from llama import Llama
from typing import List

generator = Llama.build(
    ckpt_dir="llama-2-7b/",
    tokenizer_path="tokenizer.model",
    max_seq_len=128,
    max_batch_size=4,
)

prompts: List[str] = [
    "I believe the meaning of life is",
    "Simply put, the theory of relativity states that ",
]

results = generator.text_completion(
    prompts,
    max_gen_len=64,
    temperature=0.6,
    top_p=0.9,
)

for prompt, result in zip(prompts, results):
    print(prompt)
    print(f"> {result['generation']}")

Parameters

Build Parameters

ckpt_dir (str, required)
Path to the directory containing checkpoint files for the pretrained model.

tokenizer_path (str, required)
Path to the tokenizer model used for text encoding/decoding.

max_seq_len (int, required)
Maximum sequence length for input prompts. The text completion example uses 128. All Llama 2 models support sequences up to 4096 tokens, but the KV cache is pre-allocated based on this value, so set it no larger than needed.

max_batch_size (int, required)
Maximum batch size for generating sequences. The text completion example uses 4.
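The cache note above can be made concrete: pre-allocated KV-cache memory scales with max_seq_len × max_batch_size. The sketch below is a back-of-the-envelope estimate, not the library's allocation code; the 7B dimensions (32 layers, 32 heads, head dim 128) and fp16 storage are assumptions.

```python
def kv_cache_bytes(n_layers: int, max_batch_size: int, max_seq_len: int,
                   n_heads: int, head_dim: int, bytes_per_elem: int = 2) -> int:
    """Rough KV-cache estimate: keys and values for every layer, in fp16."""
    return 2 * n_layers * max_batch_size * max_seq_len * n_heads * head_dim * bytes_per_elem

# Assumed Llama 2 7B dimensions: 32 layers, 32 heads, head dim 128
size = kv_cache_bytes(32, 4, 128, 32, 128)
print(f"{size / 2**20:.0f} MiB")  # → 256 MiB
```

Doubling either max_seq_len or max_batch_size doubles this figure, which is why the build defaults are kept small for the examples.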

Generation Parameters

prompts (List[str], required)
List of text prompts for completion.

temperature (float, default: 0.6)
Temperature value controlling randomness in generation. Higher values (e.g., 1.0) make output more random; lower values (e.g., 0.1) make it more deterministic.

top_p (float, default: 0.9)
Top-p probability threshold for nucleus sampling. Controls diversity by sampling from the smallest set of tokens whose cumulative probability exceeds this threshold.

max_gen_len (int, default: 64)
Maximum length of generated sequences. If not provided, it is set to the model's maximum sequence length minus 1.

logprobs (bool, default: False)
Whether to compute and return token log probabilities.

echo (bool, default: False)
Whether to include prompt tokens in the generated output.
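To build intuition for how temperature and top_p interact, here is a minimal pure-Python sketch of temperature-scaled nucleus sampling over raw logits. This is an illustration only, not the library's implementation, which operates on GPU tensors:

```python
import math
import random

def sample(logits: list[float], temperature: float = 0.6, top_p: float = 0.9) -> int:
    """Temperature-scaled nucleus (top-p) sampling; returns a token index."""
    if temperature == 0:
        # Zero temperature degenerates to greedy decoding
        return max(range(len(logits)), key=lambda i: logits[i])
    # Softmax with temperature scaling (higher temperature -> flatter distribution)
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of tokens whose cumulative probability exceeds top_p
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum > top_p:
            break
    # Renormalize over the kept set and draw one token
    mass = sum(probs[i] for i in kept)
    r = random.random() * mass
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

Lowering temperature sharpens the distribution before the top-p cutoff is applied, so the two parameters compound: a low temperature with a low top_p can collapse sampling to a single token.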

Example Prompts

Natural Continuation

prompts = [
    "I believe the meaning of life is",
    "Simply put, the theory of relativity states that ",
]

Few-Shot Translation

prompts = [
    """Translate English to French:
    
    sea otter => loutre de mer
    peppermint => menthe poivrée
    plush girafe => girafe peluche
    cheese =>""",
]
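Few-shot prompts of this shape are easy to assemble programmatically. The helper below is hypothetical (not part of the llama package); it simply joins worked examples using the same `src => tgt` pattern shown above, leaving the final arrow open for the model to complete:

```python
def few_shot_prompt(task: str, examples: list[tuple[str, str]], query: str) -> str:
    """Build a few-shot prompt: task description, worked examples, open query."""
    lines = [f"{task}:", ""]
    lines += [f"{src} => {tgt}" for src, tgt in examples]
    lines.append(f"{query} =>")
    return "\n".join(lines)

prompt = few_shot_prompt(
    "Translate English to French",
    [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")],
    "cheese",
)
```

The resulting string can be passed straight into the prompts list for text_completion().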

Message Completion

prompts = [
    """A brief message congratulating the team on the launch:

    Hi everyone,
    
    I just """,
]

Response Format

The text_completion() method returns a list of CompletionPrediction dictionaries:
[
    {
        "generation": str,  # The generated text
        "tokens": List[str],  # Optional: decoded tokens (if logprobs=True)
        "logprobs": List[float]  # Optional: log probabilities (if logprobs=True)
    }
]
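When logprobs=True, the extra fields can be post-processed directly; for example, the average token log probability is a rough confidence signal for a completion. The result list below is hand-written for illustration (the ▁-prefixed tokens mimic SentencePiece output), not actual model output:

```python
import math

# Hand-written example of the structure returned when logprobs=True
results = [
    {
        "generation": " to be happy.",
        "tokens": ["▁to", "▁be", "▁happy", "."],
        "logprobs": [-0.5, -0.3, -1.2, -0.1],
    }
]

for result in results:
    # Mean log probability per token; exponentiating its negation gives perplexity
    avg_lp = sum(result["logprobs"]) / len(result["logprobs"])
    print(f"{result['generation']!r}: avg logprob {avg_lp:.3f}, "
          f"perplexity {math.exp(-avg_lp):.2f}")
# → ' to be happy.': avg logprob -0.525, perplexity 1.69
```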

Running from Command Line

Run the example script with the appropriate model parallel value:
torchrun --nproc_per_node 1 example_text_completion.py \
    --ckpt_dir llama-2-7b/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 128 --max_batch_size 4
See Model Parallel Configuration for the correct nproc_per_node value for your model size.
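For example, the 13B checkpoints are sharded for a model parallel size of 2 (7B uses 1, 70B uses 8), so the same script runs as:

```shell
torchrun --nproc_per_node 2 example_text_completion.py \
    --ckpt_dir llama-2-13b/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 128 --max_batch_size 4
```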
