Overview

The sampling API provides flexible token selection strategies for text generation. Samplers can be chained together to create complex sampling pipelines.

Sampler Chain

A sampler chain applies multiple sampling strategies in sequence:
// Initialize sampler chain
LLAMA_API struct llama_sampler * llama_sampler_chain_init(
    struct llama_sampler_chain_params params
);

// Add sampler to chain (takes ownership)
LLAMA_API void llama_sampler_chain_add(
    struct llama_sampler * chain,
    struct llama_sampler * smpl
);

// Get sampler at index (-1 returns the chain itself)
LLAMA_API struct llama_sampler * llama_sampler_chain_get(
    struct llama_sampler * chain,
    int32_t i
);

// Get number of samplers
LLAMA_API int llama_sampler_chain_n(const struct llama_sampler * chain);

// Remove sampler (returns ownership to caller)
LLAMA_API struct llama_sampler * llama_sampler_chain_remove(
    struct llama_sampler * chain,
    int32_t i
);
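Conceptually, a chain is just an ordered list of samplers, each of which transforms a shared candidate-token array in the order it was added. A minimal sketch of that pattern (toy types invented for illustration, not the llama.cpp implementation):

```cpp
#include <functional>
#include <utility>
#include <vector>

// Toy candidate set: just the surviving logits.
struct ToyCandidates {
    std::vector<float> logits;
};

// A "sampler" here is any transform over the candidates.
using ToySampler = std::function<void(ToyCandidates &)>;

struct ToyChain {
    std::vector<ToySampler> samplers;

    void add(ToySampler s) { samplers.push_back(std::move(s)); }

    void apply(ToyCandidates &cur) {
        for (auto &s : samplers) s(cur);  // applied in order of addition
    }
};
```

The real chain works the same way over a `llama_token_data_array`, which is why ordering matters: a truncation sampler added first limits what every later sampler sees.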

Chain Parameters

typedef struct llama_sampler_chain_params {
    bool no_perf;  // Disable performance timing
} llama_sampler_chain_params;

LLAMA_API struct llama_sampler_chain_params llama_sampler_chain_default_params(void);

Basic Usage

// Create sampler chain
auto sparams = llama_sampler_chain_default_params();
sparams.no_perf = false;

llama_sampler * smpl = llama_sampler_chain_init(sparams);

// Add greedy sampler (always picks highest probability token)
llama_sampler_chain_add(smpl, llama_sampler_init_greedy());

// Use in generation loop
while (...) {
    llama_decode(ctx, batch);
    llama_token token = llama_sampler_sample(smpl, ctx, -1);
    // ...
}

llama_sampler_free(smpl);

Sampling Function

LLAMA_API llama_token llama_sampler_sample(
    struct llama_sampler * smpl,
    struct llama_context * ctx,
    int32_t idx
);
smpl (llama_sampler *): Sampler chain to use
ctx (llama_context *): Context containing logits from the latest decode
idx (int32_t): Index of the token to sample from (use -1 for the last token; negative indexing is supported)
Returns (llama_token): The sampled token ID
This function is shorthand for getting logits, applying the sampler chain, and accepting the selected token.

Available Samplers

Greedy Sampling

Always selects the token with highest probability:
LLAMA_API struct llama_sampler * llama_sampler_init_greedy(void);
Use for deterministic, focused output.

Distribution Sampling

Samples from the probability distribution:
LLAMA_API struct llama_sampler * llama_sampler_init_dist(uint32_t seed);
seed (uint32_t): Random seed (pass LLAMA_DEFAULT_SEED for a randomly chosen seed)
Must be the last sampler in the chain (like greedy).
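Under the hood, distribution sampling amounts to softmaxing whatever logits survived the earlier stages and drawing from the result with a seeded RNG. An illustrative toy version (names invented; not the library's code):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

// Toy sketch of seeded distribution sampling: softmax the logits,
// then draw an index from the resulting categorical distribution.
int toy_dist_sample(const std::vector<float> &logits, uint32_t seed) {
    double max_l = *std::max_element(logits.begin(), logits.end());
    std::vector<double> probs;
    double sum = 0.0;
    for (float l : logits) {
        probs.push_back(std::exp((double) l - max_l));  // subtract max for numerical stability
        sum += probs.back();
    }
    for (double &p : probs) p /= sum;

    std::mt19937 rng(seed);
    std::discrete_distribution<int> dist(probs.begin(), probs.end());
    return dist(rng);
}
```

Because the RNG is seeded, the same seed over the same logits reproduces the same draw, which is why a fixed seed gives reproducible generations.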

Top-K Sampling

Keeps only the top K most likely tokens:
LLAMA_API struct llama_sampler * llama_sampler_init_top_k(int32_t k);
k (int32_t): Number of top tokens to keep (<= 0 disables)
Reference: “The Curious Case of Neural Text Degeneration”
// Keep top 40 tokens
llama_sampler * top_k = llama_sampler_init_top_k(40);
llama_sampler_chain_add(chain, top_k);
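The effect of top-k can be sketched in a few lines: sort the candidates by logit and truncate to the k best. A toy version over a bare logit array (illustrative only; the real sampler operates on a `llama_token_data_array` and avoids a full sort where possible):

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Toy sketch of top-k filtering: keep the k highest logits, drop the rest.
std::vector<float> toy_top_k(std::vector<float> logits, int k) {
    if (k <= 0 || (size_t) k >= logits.size()) return logits;  // <= 0 disables the filter
    std::sort(logits.begin(), logits.end(), std::greater<float>());
    logits.resize((size_t) k);
    return logits;
}
```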

Top-P (Nucleus) Sampling

Keeps the smallest set of tokens whose cumulative probability is at least p:
LLAMA_API struct llama_sampler * llama_sampler_init_top_p(
    float p,
    size_t min_keep
);
p (float): Cumulative probability threshold (0.0 to 1.0)
min_keep (size_t): Minimum number of tokens to keep
Reference: “The Curious Case of Neural Text Degeneration”
// Keep tokens until 90% cumulative probability
llama_sampler * top_p = llama_sampler_init_top_p(0.9, 1);
llama_sampler_chain_add(chain, top_p);
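The nucleus cut-off can be sketched as: sort probabilities in descending order, accumulate until the running total reaches p, and keep at least min_keep tokens. A toy version over pre-normalized probabilities (illustrative; not the library implementation):

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Toy sketch of nucleus (top-p) filtering: sort descending and keep
// tokens until cumulative probability reaches p (at least min_keep).
std::vector<float> toy_top_p(std::vector<float> probs, float p, size_t min_keep) {
    std::sort(probs.begin(), probs.end(), std::greater<float>());
    float cum = 0.0f;
    size_t keep = probs.size();
    for (size_t i = 0; i < probs.size(); i++) {
        cum += probs[i];
        if (cum >= p && i + 1 >= min_keep) {
            keep = i + 1;
            break;
        }
    }
    probs.resize(keep);
    return probs;
}
```

For probabilities {0.5, 0.3, 0.15, 0.05} and p = 0.9, the first three tokens survive (0.5 + 0.3 + 0.15 >= 0.9).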

Min-P Sampling

Keeps tokens with probability >= p * max_probability:
LLAMA_API struct llama_sampler * llama_sampler_init_min_p(
    float p,
    size_t min_keep
);
p (float): Minimum probability threshold (relative to the maximum probability)
min_keep (size_t): Minimum number of tokens to keep
Reference: https://github.com/ggml-org/llama.cpp/pull/3841
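Unlike top-p, the min-p cutoff is relative to the most likely token: anything below p times the top probability is dropped. A toy sketch over pre-normalized probabilities (illustrative names; not the library code):

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Toy sketch of min-p filtering: drop tokens whose probability falls
// below p * (highest probability), keeping at least min_keep tokens.
std::vector<float> toy_min_p(std::vector<float> probs, float p, size_t min_keep) {
    if (probs.empty()) return probs;
    std::sort(probs.begin(), probs.end(), std::greater<float>());
    const float cutoff = p * probs[0];
    size_t keep = probs.size();
    for (size_t i = 0; i < probs.size(); i++) {
        if (probs[i] < cutoff) {       // sorted descending, so all later tokens fail too
            keep = std::max(i, min_keep);
            break;
        }
    }
    probs.resize(std::min(keep, probs.size()));
    return probs;
}
```

Because the threshold scales with the top probability, min-p keeps many candidates when the model is uncertain and few when it is confident.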

Temperature Sampling

Scales logits by temperature (higher = more random):
LLAMA_API struct llama_sampler * llama_sampler_init_temp(float t);
t (float): Temperature value (t <= 0.0 keeps only the maximum logit; the rest are set to -inf)
Formula: logit' = logit / temperature
// Conservative sampling
llama_sampler_init_temp(0.2);  // Very focused

// Balanced sampling
llama_sampler_init_temp(0.8);  // Standard

// Creative sampling
llama_sampler_init_temp(1.2);  // More random
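The formula above is easy to verify numerically: dividing logits by a small t widens the gaps between them, so the softmax concentrates on the top token; a large t narrows the gaps and flattens the distribution. A toy sketch (invented helper names, not the library API):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Toy temperature scaling: logit' = logit / t.
std::vector<float> toy_apply_temp(std::vector<float> logits, float t) {
    for (float &l : logits) l /= t;
    return logits;
}

// Softmax probability of the highest-logit token, to observe the effect.
float toy_top_prob(const std::vector<float> &logits) {
    float max_l = *std::max_element(logits.begin(), logits.end());
    float sum = 0.0f;
    for (float l : logits) sum += std::exp(l - max_l);
    return 1.0f / sum;  // exp(max - max) / sum
}
```

For logits {2, 1, 0}, the top token's probability is higher at t = 0.5 than at t = 2.0, matching the "lower temperature = more focused" intuition.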

Dynamic Temperature

Adaptive temperature based on entropy:
LLAMA_API struct llama_sampler * llama_sampler_init_temp_ext(
    float t,
    float delta,
    float exponent
);
t (float): Base temperature
delta (float): Temperature adjustment range
exponent (float): Entropy scaling exponent
Reference: https://arxiv.org/abs/2309.02772

Typical Sampling

Samples locally typical tokens:
LLAMA_API struct llama_sampler * llama_sampler_init_typical(
    float p,
    size_t min_keep
);
Reference: https://arxiv.org/abs/2202.00666

Mirostat Sampling

Adaptive sampling that targets a specific perplexity:
LLAMA_API struct llama_sampler * llama_sampler_init_mirostat(
    int32_t n_vocab,
    uint32_t seed,
    float tau,      // Target cross-entropy
    float eta,      // Learning rate
    int32_t m       // Tokens for estimation
);

// Example
int32_t n_vocab = llama_vocab_n_tokens(vocab);
llama_sampler * miro = llama_sampler_init_mirostat(
    n_vocab,
    LLAMA_DEFAULT_SEED,
    5.0,   // tau
    0.1,   // eta
    100    // m
);
Reference: https://arxiv.org/abs/2007.14966
Mirostat samplers select the final token, so they should be last in the chain (like greedy or dist).
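The feedback loop at the heart of Mirostat can be sketched in one step: after each sampled token, compare its observed surprise (-log2 of its probability) against the target tau and nudge the internal threshold mu accordingly. A simplified sketch of that update rule, based on the paper (hedged: this is not the library's implementation):

```cpp
#include <cmath>

// Simplified sketch of the Mirostat feedback idea: nudge the threshold mu
// so that observed surprise (-log2 p, in bits) tracks the target tau.
float toy_mirostat_update(float mu, float observed_p, float tau, float eta) {
    float surprise = -std::log2(observed_p);  // surprise of the sampled token
    return mu - eta * (surprise - tau);       // eta controls how fast mu adapts
}
```

A very surprising token (low probability) pushes mu down, restricting the next step; a very predictable one pushes mu up, loosening it.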

Penalty Samplers

Penalize repeated tokens:
LLAMA_API struct llama_sampler * llama_sampler_init_penalties(
    int32_t penalty_last_n,    // Last n tokens to consider (0 = disable, -1 = ctx size)
    float penalty_repeat,      // Repetition penalty (1.0 = disabled)
    float penalty_freq,        // Frequency penalty (0.0 = disabled)
    float penalty_present      // Presence penalty (0.0 = disabled)
);
penalty_last_n (int32_t): Number of recent tokens to penalize (0 = disabled, -1 = full context)
penalty_repeat (float): Repetition penalty multiplier (1.0 = no penalty, > 1.0 = penalize)
penalty_freq (float): Frequency penalty (0.0 = disabled)
penalty_present (float): Presence penalty (0.0 = disabled)
// Penalize repetition in last 64 tokens
llama_sampler * penalties = llama_sampler_init_penalties(
    64,    // last_n
    1.1,   // repeat penalty
    0.0,   // freq penalty
    0.0    // presence penalty
);
llama_sampler_chain_add(chain, penalties);
Avoid applying penalties over the full vocabulary, since scanning every candidate can be slow; apply top-k/top-p first to shrink the candidate set.
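The three penalties combine in the conventional way: the repetition penalty rescales a repeated token's logit, the frequency penalty subtracts an amount proportional to how often the token appeared, and the presence penalty subtracts a flat offset once. A toy sketch of these standard formulas for a single token (illustrative; not the library source):

```cpp
// Toy sketch: apply the three penalty styles to one token's logit,
// given how many times that token appeared in the recent window.
float toy_penalize(float logit, int count,
                   float penalty_repeat, float penalty_freq, float penalty_present) {
    if (count > 0) {
        // Repetition penalty: shrink positive logits, push negative ones further down.
        logit = logit > 0.0f ? logit / penalty_repeat : logit * penalty_repeat;
        // Frequency penalty grows with the count; presence penalty is a flat offset.
        logit -= penalty_freq * (float) count + penalty_present;
    }
    return logit;
}
```

Note the asymmetry in the repetition penalty: dividing a negative logit would make it *more* likely, so negative logits are multiplied instead.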

DRY Sampler

“Don’t Repeat Yourself” sampler:
LLAMA_API struct llama_sampler * llama_sampler_init_dry(
    const struct llama_vocab * vocab,
    int32_t n_ctx_train,
    float dry_multiplier,
    float dry_base,
    int32_t dry_allowed_length,
    int32_t dry_penalty_last_n,
    const char ** seq_breakers,
    size_t num_breakers
);
Reference: https://github.com/oobabooga/text-generation-webui/pull/5677

Adaptive-P Sampler

Maintains target probability over time:
LLAMA_API struct llama_sampler * llama_sampler_init_adaptive_p(
    float target,     // Target probability (0.0-1.0, negative = disabled)
    float decay,      // EMA decay (0.0-0.99)
    uint32_t seed     // Random seed
);
Adaptive-P selects the final token and should be last in the chain. Use mild truncation (e.g., min-p) before this sampler.
Reference: https://github.com/ggml-org/llama.cpp/pull/17927

XTC Sampler

LLAMA_API struct llama_sampler * llama_sampler_init_xtc(
    float p,
    float t,
    size_t min_keep,
    uint32_t seed
);
Reference: https://github.com/oobabooga/text-generation-webui/pull/6335

Top-nσ Sampler

LLAMA_API struct llama_sampler * llama_sampler_init_top_n_sigma(float n);
Reference: https://arxiv.org/pdf/2411.07641

Grammar Sampler

Constrain output to match a GBNF grammar:
LLAMA_API struct llama_sampler * llama_sampler_init_grammar(
    const struct llama_vocab * vocab,
    const char * grammar_str,
    const char * grammar_root
);
vocab (const llama_vocab *): Vocabulary for tokenization
grammar_str (const char *): GBNF grammar production rules
grammar_root (const char *): Start symbol name
See grammars/README.md for grammar syntax.

Logit Bias

Manually bias specific tokens:
typedef struct llama_logit_bias {
    llama_token token;
    float bias;
} llama_logit_bias;

LLAMA_API struct llama_sampler * llama_sampler_init_logit_bias(
    int32_t n_vocab,
    int32_t n_logit_bias,
    const llama_logit_bias * logit_bias
);
// Increase probability of token 123, decrease 456
llama_logit_bias biases[] = {
    {.token = 123, .bias = 1.5},   // Boost
    {.token = 456, .bias = -1.5},  // Suppress
};

llama_sampler * bias = llama_sampler_init_logit_bias(
    llama_vocab_n_tokens(vocab),
    2,
    biases
);
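Conceptually, logit bias is the simplest sampler of all: add a fixed offset to each listed token's logit before anything else runs. A toy sketch (invented names; not the library code):

```cpp
#include <cstddef>
#include <vector>

// Toy mirror of llama_logit_bias: token index plus an additive offset.
struct ToyBias { int token; float bias; };

// Add each bias to the corresponding token's logit.
void toy_apply_bias(std::vector<float> &logits, const std::vector<ToyBias> &biases) {
    for (const ToyBias &b : biases) {
        if (b.token >= 0 && (size_t) b.token < logits.size()) {
            logits[b.token] += b.bias;
        }
    }
}
```

A large negative bias (e.g. -inf) effectively bans a token; a large positive bias forces it.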

Infill Sampler

For fill-in-the-middle tasks:
LLAMA_API struct llama_sampler * llama_sampler_init_infill(
    const struct llama_vocab * vocab
);
Use after top-k/top-p. Combines prefix probabilities and handles EOG tokens specially.

Sampler Management

Clone and Free

// Clone a sampler
LLAMA_API struct llama_sampler * llama_sampler_clone(
    const struct llama_sampler * smpl
);

// Free a sampler (don't free if added to chain)
LLAMA_API void llama_sampler_free(struct llama_sampler * smpl);
Do not manually free samplers that have been added to a chain. The chain takes ownership and will free them automatically.

Manual Application

// Apply sampler to token data array
LLAMA_API void llama_sampler_apply(
    struct llama_sampler * smpl,
    llama_token_data_array * cur_p
);

// Accept a selected token (update sampler state)
LLAMA_API void llama_sampler_accept(
    struct llama_sampler * smpl,
    llama_token token
);

// Reset sampler state
LLAMA_API void llama_sampler_reset(struct llama_sampler * smpl);

// Get sampler name
LLAMA_API const char * llama_sampler_name(const struct llama_sampler * smpl);

Get Seed

LLAMA_API uint32_t llama_sampler_get_seed(const struct llama_sampler * smpl);
Returns the seed used by the sampler, or LLAMA_DEFAULT_SEED if not applicable.

Performance Monitoring

struct llama_perf_sampler_data {
    double t_sample_ms;  // Time spent sampling (milliseconds)
    int32_t n_sample;    // Number of tokens sampled
};

LLAMA_API struct llama_perf_sampler_data llama_perf_sampler(
    const struct llama_sampler * chain
);

LLAMA_API void llama_perf_sampler_print(const struct llama_sampler * chain);
LLAMA_API void llama_perf_sampler_reset(struct llama_sampler * chain);
Performance functions only work with sampler chains created via llama_sampler_chain_init.

Common Sampling Configurations

// Deterministic output: greedy decoding only
llama_sampler * smpl = llama_sampler_chain_init(
    llama_sampler_chain_default_params()
);
llama_sampler_chain_add(smpl, llama_sampler_init_greedy());

Complete Sampling Example

#include "llama.h"
#include <stdio.h>

int main() {
    // ... (model and context setup) ...
    
    // Create comprehensive sampling chain
    llama_sampler * smpl = llama_sampler_chain_init(
        llama_sampler_chain_default_params()
    );
    
    // 1. Remove unlikely tokens (top-k)
    llama_sampler_chain_add(smpl, llama_sampler_init_top_k(40));
    
    // 2. Penalize repetition
    llama_sampler_chain_add(smpl, 
        llama_sampler_init_penalties(
            64,    // last 64 tokens
            1.1,   // repeat penalty
            0.0,   // freq penalty
            0.0    // presence penalty
        )
    );
    
    // 3. Apply nucleus sampling
    llama_sampler_chain_add(smpl, llama_sampler_init_top_p(0.9, 1));
    
    // 4. Scale by temperature
    llama_sampler_chain_add(smpl, llama_sampler_init_temp(0.8));
    
    // 5. Sample from distribution
    llama_sampler_chain_add(smpl, 
        llama_sampler_init_dist(LLAMA_DEFAULT_SEED)
    );
    
    // Generation loop
    for (int i = 0; i < n_predict; i++) {
        // Decode current batch
        if (llama_decode(ctx, batch) != 0) {
            fprintf(stderr, "Decode failed\n");
            break;
        }
        
        // Sample next token
        llama_token token = llama_sampler_sample(smpl, ctx, -1);
        
        if (llama_vocab_is_eog(vocab, token)) {
            break;
        }
        
        // Print and continue
        char buf[128];
        int n = llama_token_to_piece(vocab, token, buf, sizeof(buf), 0, true);
        printf("%.*s", n, buf);
        fflush(stdout);
        
        batch = llama_batch_get_one(&token, 1);
    }
    
    // Print performance stats
    llama_perf_sampler_print(smpl);
    
    // Cleanup
    llama_sampler_free(smpl);
    // ... (free context and model) ...
    
    return 0;
}

Token Data Array (Advanced)

For manual sampling without llama_sampler_sample:
typedef struct llama_token_data {
    llama_token id;   // Token ID
    float logit;      // Log-odds
    float p;          // Probability
} llama_token_data;

typedef struct llama_token_data_array {
    llama_token_data * data;
    size_t size;
    int64_t selected;  // Index of the selected token in `data` (not a token ID)
    bool sorted;       // Whether entries are sorted in descending order; check, don't assume
} llama_token_data_array;
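Manual sampling means filling such an array from the logits, applying samplers with llama_sampler_apply, and recording the chosen candidate's index in `selected`. A toy mirror of that last step, greedy selection, using invented stand-in types (not the real structs):

```cpp
#include <cstddef>
#include <vector>

// Toy stand-ins for llama_token_data / llama_token_data_array.
struct toy_token_data { int id; float logit; float p; };

struct toy_token_data_array {
    std::vector<toy_token_data> data;
    long long selected = -1;  // index into data, not a token ID
    bool sorted = false;
};

// Pick the highest-logit candidate and record its index in `selected`.
void toy_greedy_select(toy_token_data_array &cur) {
    if (cur.data.empty()) { cur.selected = -1; return; }
    size_t best = 0;
    for (size_t i = 1; i < cur.data.size(); i++) {
        if (cur.data[i].logit > cur.data[best].logit) best = i;
    }
    cur.selected = (long long) best;
}
```

After selection, the chosen token ID is `data[selected].id`; with the real API you would then pass it to llama_sampler_accept so stateful samplers (penalties, grammar) can update.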

Next Steps

Inference

Learn about batching and decoding

libllama Overview

Return to API overview