Text Generation
Llama 2 uses autoregressive generation with nucleus (top-p) sampling to produce high-quality text. The generation process supports various parameters for controlling randomness, output format, and token probability computation.

Core Generation Method
The generate method is the foundation for all text generation.
Parameters
- prompt_tokens: List of tokenized prompts (batch of token sequences)
- max_gen_len: Maximum number of tokens to generate
- temperature: Controls randomness (higher = more random, lower = more deterministic)
- top_p: Nucleus sampling threshold (0.0-1.0)
- logprobs: Whether to return log probabilities for each token
- echo: Whether to include prompt tokens in the output
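The parameters above can be summarized as a signature sketch. The defaults shown (temperature 0.6, top_p 0.9) follow Meta's llama repository; the return type and comments are illustrative, not a definitive implementation:

```python
from typing import List, Optional, Tuple

def generate(
    prompt_tokens: List[List[int]],  # batch of tokenized prompts
    max_gen_len: int,                # cap on newly generated tokens
    temperature: float = 0.6,        # logit scaling before softmax
    top_p: float = 0.9,              # nucleus sampling threshold
    logprobs: bool = False,          # also return per-token log probabilities
    echo: bool = False,              # include the prompt tokens in the output
) -> Tuple[List[List[int]], Optional[List[List[float]]]]:
    raise NotImplementedError("signature sketch only")
```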
Temperature Sampling
Temperature controls the randomness of predictions by scaling the logits before applying softmax:

- temperature = 0: Greedy decoding (always pick the highest probability token)
- temperature < 1 (e.g., 0.6): Sharper distribution, more focused and deterministic
- temperature = 1: Use raw probabilities
- temperature > 1: Flatter distribution, more random and creative
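The scaling described above can be sketched in a few lines of NumPy (the logit values here are illustrative):

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    # temperature = 0 degenerates to greedy decoding: all mass on the argmax.
    if temperature == 0:
        probs = np.zeros(len(logits))
        probs[np.argmax(logits)] = 1.0
        return probs
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()              # subtract max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [2.0, 1.0, 0.5]
sharp = softmax_with_temperature(logits, 0.6)  # sharper than raw softmax
flat = softmax_with_temperature(logits, 1.5)   # flatter than raw softmax
```

Lower temperature concentrates probability mass on the top token; higher temperature spreads it out.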
Top-p (Nucleus) Sampling
Nucleus sampling selects from the smallest set of tokens whose cumulative probability exceeds p.
How Nucleus Sampling Works
- Sort tokens by probability in descending order
- Compute cumulative sum of probabilities
- Create a mask for tokens where the cumulative probability exceeds p
- Zero out probabilities of tokens outside the nucleus
- Renormalize the remaining probabilities
- Sample from the renormalized distribution
Example: given the probabilities [0.4, 0.3, 0.2, 0.05, 0.03, 0.02] with top_p = 0.9:
- Probabilities are already sorted
- Cumulative: [0.4, 0.7, 0.9, 0.95, 0.98, 1.0]
- Nucleus includes the first 3 tokens (0.4 + 0.3 + 0.2 = 0.9)
- Renormalize: [0.44, 0.33, 0.22, 0, 0, 0]
- Sample from these 3 tokens only
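The steps above can be sketched in NumPy. This follows the masking rule in Llama 2's sample_top_p, which drops a token once the cumulative mass before it exceeds p; exact boundary handling differs slightly across implementations, so treat this as a sketch rather than the reference code:

```python
import numpy as np

def top_p_filter(probs, top_p):
    order = np.argsort(probs)[::-1]          # 1. sort descending
    sorted_probs = probs[order]
    cumulative = np.cumsum(sorted_probs)     # 2. cumulative sum
    # 3. drop a token once the cumulative mass *before* it exceeds top_p
    drop = cumulative - sorted_probs > top_p
    sorted_probs[drop] = 0.0                 # 4. zero out tokens outside the nucleus
    sorted_probs /= sorted_probs.sum()       # 5. renormalize the nucleus
    filtered = np.zeros_like(probs)
    filtered[order] = sorted_probs           # undo the sort
    return filtered

probs = np.array([0.4, 0.3, 0.2, 0.05, 0.03, 0.02])
filtered = top_p_filter(probs, top_p=0.9)
next_token = np.random.choice(len(probs), p=filtered)   # 6. sample
```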
Top-p vs Top-k
- Top-p (nucleus): Dynamic vocabulary size based on probability distribution
- Top-k: Fixed vocabulary size (always use top k tokens)
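The "dynamic vocabulary size" point can be made concrete: for a peaked distribution the nucleus may be a single token, while for a flat one it can span the whole vocabulary. A small sketch (the distributions are illustrative):

```python
import numpy as np

def nucleus_size(probs, top_p):
    # Size of the smallest prefix of the sorted distribution reaching top_p.
    cumulative = np.cumsum(np.sort(probs)[::-1])
    return int(np.searchsorted(cumulative, top_p) + 1)

peaked = np.array([0.90, 0.05, 0.03, 0.01, 0.01])
flat = np.array([0.22, 0.21, 0.20, 0.19, 0.18])
# A fixed top-k (say k = 3) keeps 3 tokens for both distributions;
# top-p keeps 1 token for the peaked one and all 5 for the flat one.
```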
Log Probabilities
When logprobs=True, the model returns the negative log probability for each generated token. This is useful for:
- Model confidence assessment: Lower negative log prob = higher confidence
- Beam search: Ranking multiple generation paths
- Fine-tuning: Computing training objectives
- Debugging: Understanding model behavior
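A sketch of how per-token log probabilities can be computed from the logits: a log-softmax over the vocabulary, then indexing the token actually generated at each position (negating the result gives the negative log probability quoted above). The logit values and token ids here are illustrative:

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 0.5, 3.0]])   # one row of logits per generated position
tokens = np.array([0, 2])              # token generated at each position
lp = log_softmax(logits)[np.arange(len(tokens)), tokens]
# Values closer to 0 indicate higher model confidence in that token.
```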
Echo Parameter
The echo parameter controls whether prompt tokens are included in the output:
- echo=True: Output includes both prompt and generated tokens
- echo=False: Output includes only newly generated tokens
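In terms of token sequences, the two modes amount to the following (the token ids are illustrative placeholders):

```python
prompt = [1, 306, 4658]       # tokenized prompt (illustrative ids)
generated = [278, 1900, 982]  # newly generated tokens (illustrative ids)

def output_tokens(prompt, generated, echo):
    # echo=True returns prompt + generation; echo=False only the generation.
    return prompt + generated if echo else generated
```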
Generation Loop
The autoregressive generation loop processes one token at a time.

Key steps:
- Forward pass: Compute logits for the current position
- Sampling: Apply temperature and top-p to select next token
- Masking: Preserve original prompt tokens using input_text_mask
- EOS detection: Stop generation when all sequences reach EOS
- KV cache update: prev_pos tracks the cache position for efficiency
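The loop structure can be sketched with a toy stand-in for the model's forward pass. This omits batching, input_text_mask handling, and the KV cache, and the "model" is just a deterministic function that prefers last token + 1, so only the forward-pass / sample / EOS-check skeleton is real:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, EOS = 10, 9

def toy_model(tokens):
    # Stand-in for the transformer forward pass: logits for the next position.
    logits = np.zeros(VOCAB)
    logits[(tokens[-1] + 1) % VOCAB] = 5.0   # strongly prefer "last + 1"
    return logits

def generate(prompt, max_gen_len, temperature=0.0):
    tokens = list(prompt)
    for _ in range(max_gen_len):
        logits = toy_model(tokens)                 # forward pass
        if temperature == 0:
            next_token = int(np.argmax(logits))    # greedy decoding
        else:
            probs = np.exp(logits / temperature)   # temperature sampling
            probs /= probs.sum()
            next_token = int(rng.choice(VOCAB, p=probs))
        tokens.append(next_token)
        if next_token == EOS:                      # EOS detection
            break
    return tokens
```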
Text Completion Interface
The text_completion method provides a high-level interface.
Example Usage
Basic generation
Deterministic generation (greedy)
Creative generation
With log probabilities
Echo prompt with generation
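The labeled usage patterns above correspond to parameter combinations like the following. The keyword arguments mirror text_completion in Meta's llama repository; the generator here is a stub standing in for a real Llama instance, since loading actual weights is out of scope for this sketch:

```python
class StubGenerator:
    # Stub with the same call shape as a real generator's text_completion;
    # a real one returns a dict per prompt with a "generation" string
    # (plus token/log-prob fields when logprobs=True).
    def text_completion(self, prompts, temperature=0.6, top_p=0.9,
                        max_gen_len=64, logprobs=False, echo=False):
        out = [{"generation": f"<completion of {p!r}>"} for p in prompts]
        if logprobs:
            for d in out:
                d["tokens"], d["logprobs"] = [], []
        return out

generator = StubGenerator()
prompts = ["The capital of France is"]

basic = generator.text_completion(prompts)                                  # defaults
greedy = generator.text_completion(prompts, temperature=0.0)                # deterministic
creative = generator.text_completion(prompts, temperature=0.9, top_p=0.95)  # more diverse
scored = generator.text_completion(prompts, logprobs=True)                  # with log probs
echoed = generator.text_completion(prompts, echo=True)                      # prompt + output
```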
Best Practices
For factual/precise tasks:
- Use low temperature (0.0 - 0.3)
- Use lower top_p (0.7 - 0.85)
For creative/diverse tasks:
- Use higher temperature (0.7 - 1.0)
- Use higher top_p (0.9 - 0.95)
For production systems:
- Enable logprobs to monitor model confidence
- Set max_gen_len to prevent excessive generation
- Use temperature=0 for deterministic outputs when needed