Overview

The sampling API provides flexible token selection strategies for text generation. Samplers can be chained together to create complex sampling pipelines.

Sampler Chain

A sampler chain applies multiple sampling strategies in sequence:
// Initialize sampler chain
LLAMA_API struct llama_sampler * llama_sampler_chain_init(
    struct llama_sampler_chain_params params
);

// Add sampler to chain (takes ownership)
LLAMA_API void llama_sampler_chain_add(
    struct llama_sampler * chain,
    struct llama_sampler * smpl
);

// Get sampler at index (-1 returns the chain itself)
LLAMA_API struct llama_sampler * llama_sampler_chain_get(
    struct llama_sampler * chain,
    int32_t i
);

// Get number of samplers
LLAMA_API int llama_sampler_chain_n(const struct llama_sampler * chain);

// Remove sampler (returns ownership to caller)
LLAMA_API struct llama_sampler * llama_sampler_chain_remove(
    struct llama_sampler * chain,
    int32_t i
);
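Conceptually, a chain is just an ordered list of samplers, each of which transforms a shared candidate-token array in the order it was added. A minimal sketch of that pattern (toy types invented for illustration, not the llama.cpp implementation):

```cpp
#include <functional>
#include <utility>
#include <vector>

// Toy candidate set: just the surviving logits.
struct ToyCandidates {
    std::vector<float> logits;
};

// A "sampler" here is any transform over the candidates.
using ToySampler = std::function<void(ToyCandidates &)>;

struct ToyChain {
    std::vector<ToySampler> samplers;

    void add(ToySampler s) { samplers.push_back(std::move(s)); }

    void apply(ToyCandidates &cur) {
        for (auto &s : samplers) s(cur);  // applied in order of addition
    }
};
```

The real chain works the same way over a `llama_token_data_array`, which is why ordering matters: a truncation sampler added first limits what every later sampler sees.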

Chain Parameters

typedef struct llama_sampler_chain_params {
    bool no_perf;  // Disable performance timing
} llama_sampler_chain_params;

LLAMA_API struct llama_sampler_chain_params llama_sampler_chain_default_params(void);

Basic Usage

// Create sampler chain
auto sparams = llama_sampler_chain_default_params();
sparams.no_perf = false;

llama_sampler * smpl = llama_sampler_chain_init(sparams);

// Add greedy sampler (always picks highest probability token)
llama_sampler_chain_add(smpl, llama_sampler_init_greedy());

// Use in generation loop
while (...) {
    llama_decode(ctx, batch);
    llama_token token = llama_sampler_sample(smpl, ctx, -1);
    // ...
}

llama_sampler_free(smpl);

Sampling Function

LLAMA_API llama_token llama_sampler_sample(
    struct llama_sampler * smpl,
    struct llama_context * ctx,
    int32_t idx
);
smpl (llama_sampler *): Sampler chain to use
ctx (llama_context *): Context containing logits from the latest decode
idx (int32_t): Index of the token to sample from (use -1 for the last token; negative indexing is supported)
Returns (llama_token): The sampled token ID
This function is shorthand for getting logits, applying the sampler chain, and accepting the selected token.

Available Samplers

Greedy Sampling

Always selects the token with highest probability:
LLAMA_API struct llama_sampler * llama_sampler_init_greedy(void);
Use for deterministic, focused output.

Distribution Sampling

Samples from the probability distribution:
LLAMA_API struct llama_sampler * llama_sampler_init_dist(uint32_t seed);
seed (uint32_t): Random seed (pass LLAMA_DEFAULT_SEED for a randomly chosen seed)
Must be the last sampler in the chain (like greedy).
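Under the hood, distribution sampling amounts to softmaxing whatever logits survived the earlier stages and drawing from the result with a seeded RNG. An illustrative toy version (names invented; not the library's code):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

// Toy sketch of seeded distribution sampling: softmax the logits,
// then draw an index from the resulting categorical distribution.
int toy_dist_sample(const std::vector<float> &logits, uint32_t seed) {
    double max_l = *std::max_element(logits.begin(), logits.end());
    std::vector<double> probs;
    double sum = 0.0;
    for (float l : logits) {
        probs.push_back(std::exp((double) l - max_l));  // subtract max for numerical stability
        sum += probs.back();
    }
    for (double &p : probs) p /= sum;

    std::mt19937 rng(seed);
    std::discrete_distribution<int> dist(probs.begin(), probs.end());
    return dist(rng);
}
```

Because the RNG is seeded, the same seed over the same logits reproduces the same draw, which is why a fixed seed gives reproducible generations.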

Top-K Sampling

Keeps only the top K most likely tokens:
LLAMA_API struct llama_sampler * llama_sampler_init_top_k(int32_t k);
k (int32_t): Number of top tokens to keep (<= 0 disables)
Reference: “The Curious Case of Neural Text Degeneration”
// Keep top 40 tokens
llama_sampler * top_k = llama_sampler_init_top_k(40);
llama_sampler_chain_add(chain, top_k);
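The effect of top-k can be sketched in a few lines: sort the candidates by logit and truncate to the k best. A toy version over a bare logit array (illustrative only; the real sampler operates on a `llama_token_data_array` and avoids a full sort where possible):

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Toy sketch of top-k filtering: keep the k highest logits, drop the rest.
std::vector<float> toy_top_k(std::vector<float> logits, int k) {
    if (k <= 0 || (size_t) k >= logits.size()) return logits;  // <= 0 disables the filter
    std::sort(logits.begin(), logits.end(), std::greater<float>());
    logits.resize((size_t) k);
    return logits;
}
```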

Top-P (Nucleus) Sampling

Keeps the smallest set of tokens whose cumulative probability is at least p:
LLAMA_API struct llama_sampler * llama_sampler_init_top_p(
    float p,
    size_t min_keep
);
p (float): Cumulative probability threshold (0.0 to 1.0)
min_keep (size_t): Minimum number of tokens to keep
Reference: “The Curious Case of Neural Text Degeneration”
// Keep tokens until 90% cumulative probability
llama_sampler * top_p = llama_sampler_init_top_p(0.9, 1);
llama_sampler_chain_add(chain, top_p);
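The nucleus cut-off can be sketched as: sort probabilities in descending order, accumulate until the running total reaches p, and keep at least min_keep tokens. A toy version over pre-normalized probabilities (illustrative; not the library implementation):

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Toy sketch of nucleus (top-p) filtering: sort descending and keep
// tokens until cumulative probability reaches p (at least min_keep).
std::vector<float> toy_top_p(std::vector<float> probs, float p, size_t min_keep) {
    std::sort(probs.begin(), probs.end(), std::greater<float>());
    float cum = 0.0f;
    size_t keep = probs.size();
    for (size_t i = 0; i < probs.size(); i++) {
        cum += probs[i];
        if (cum >= p && i + 1 >= min_keep) {
            keep = i + 1;
            break;
        }
    }
    probs.resize(keep);
    return probs;
}
```

For probabilities {0.5, 0.3, 0.15, 0.05} and p = 0.9, the first three tokens survive (0.5 + 0.3 + 0.15 >= 0.9).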

Min-P Sampling

Keeps tokens with probability >= p * max_probability:
LLAMA_API struct llama_sampler * llama_sampler_init_min_p(
    float p,
    size_t min_keep
);
p (float): Minimum probability threshold (relative to the maximum probability)
min_keep (size_t): Minimum number of tokens to keep
Reference: https://github.com/ggml-org/llama.cpp/pull/3841
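Unlike top-p, the min-p cutoff is relative to the most likely token: anything below p times the top probability is dropped. A toy sketch over pre-normalized probabilities (illustrative names; not the library code):

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Toy sketch of min-p filtering: drop tokens whose probability falls
// below p * (highest probability), keeping at least min_keep tokens.
std::vector<float> toy_min_p(std::vector<float> probs, float p, size_t min_keep) {
    if (probs.empty()) return probs;
    std::sort(probs.begin(), probs.end(), std::greater<float>());
    const float cutoff = p * probs[0];
    size_t keep = probs.size();
    for (size_t i = 0; i < probs.size(); i++) {
        if (probs[i] < cutoff) {       // sorted descending, so all later tokens fail too
            keep = std::max(i, min_keep);
            break;
        }
    }
    probs.resize(std::min(keep, probs.size()));
    return probs;
}
```

Because the threshold scales with the top probability, min-p keeps many candidates when the model is uncertain and few when it is confident.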

Temperature Sampling

Scales logits by temperature (higher = more random):
LLAMA_API struct llama_sampler * llama_sampler_init_temp(float t);
t (float): Temperature value (t <= 0.0 keeps only the maximum logit; the rest are set to -inf)
Formula: logit' = logit / temperature
// Conservative sampling
llama_sampler_init_temp(0.2);  // Very focused

// Balanced sampling
llama_sampler_init_temp(0.8);  // Standard

// Creative sampling
llama_sampler_init_temp(1.2);  // More random
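The formula above is easy to verify numerically: dividing logits by a small t widens the gaps between them, so the softmax concentrates on the top token; a large t narrows the gaps and flattens the distribution. A toy sketch (invented helper names, not the library API):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Toy temperature scaling: logit' = logit / t.
std::vector<float> toy_apply_temp(std::vector<float> logits, float t) {
    for (float &l : logits) l /= t;
    return logits;
}

// Softmax probability of the highest-logit token, to observe the effect.
float toy_top_prob(const std::vector<float> &logits) {
    float max_l = *std::max_element(logits.begin(), logits.end());
    float sum = 0.0f;
    for (float l : logits) sum += std::exp(l - max_l);
    return 1.0f / sum;  // exp(max - max) / sum
}
```

For logits {2, 1, 0}, the top token's probability is higher at t = 0.5 than at t = 2.0, matching the "lower temperature = more focused" intuition.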

Dynamic Temperature

Adaptive temperature based on entropy:
LLAMA_API struct llama_sampler * llama_sampler_init_temp_ext(
    float t,
    float delta,
    float exponent
);
t (float): Base temperature
delta (float): Temperature adjustment range
exponent (float): Entropy scaling exponent
Reference: https://arxiv.org/abs/2309.02772

Typical Sampling

Samples locally typical tokens:
LLAMA_API struct llama_sampler * llama_sampler_init_typical(
    float p,
    size_t min_keep
);
Reference: https://arxiv.org/abs/2202.00666

Mirostat Sampling

Adaptive sampling that targets a specific perplexity:
LLAMA_API struct llama_sampler * llama_sampler_init_mirostat(
    int32_t n_vocab,
    uint32_t seed,
    float tau,      // Target cross-entropy
    float eta,      // Learning rate
    int32_t m       // Tokens for estimation
);

// Example
int32_t n_vocab = llama_vocab_n_tokens(vocab);
llama_sampler * miro = llama_sampler_init_mirostat(
    n_vocab,
    LLAMA_DEFAULT_SEED,
    5.0,   // tau
    0.1,   // eta
    100    // m
);
Reference: https://arxiv.org/abs/2007.14966
Mirostat samplers select the final token, so they should be last in the chain (like greedy or dist).
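The feedback loop at the heart of Mirostat can be sketched in one step: after each sampled token, compare its observed surprise (-log2 of its probability) against the target tau and nudge the internal threshold mu accordingly. A simplified sketch of that update rule, based on the paper (hedged: this is not the library's implementation):

```cpp
#include <cmath>

// Simplified sketch of the Mirostat feedback idea: nudge the threshold mu
// so that observed surprise (-log2 p, in bits) tracks the target tau.
float toy_mirostat_update(float mu, float observed_p, float tau, float eta) {
    float surprise = -std::log2(observed_p);  // surprise of the sampled token
    return mu - eta * (surprise - tau);       // eta controls how fast mu adapts
}
```

A very surprising token (low probability) pushes mu down, restricting the next step; a very predictable one pushes mu up, loosening it.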

Penalty Samplers

Penalize repeated tokens:
LLAMA_API struct llama_sampler * llama_sampler_init_penalties(
    int32_t penalty_last_n,    // Last n tokens to consider (0 = disable, -1 = ctx size)
    float penalty_repeat,      // Repetition penalty (1.0 = disabled)
    float penalty_freq,        // Frequency penalty (0.0 = disabled)
    float penalty_present      // Presence penalty (0.0 = disabled)
);
penalty_last_n (int32_t): Number of recent tokens to penalize (0 = disabled, -1 = full context)
penalty_repeat (float): Repetition penalty multiplier (1.0 = no penalty, > 1.0 = penalize)
penalty_freq (float): Frequency penalty (0.0 = disabled)
penalty_present (float): Presence penalty (0.0 = disabled)
// Penalize repetition in last 64 tokens
llama_sampler * penalties = llama_sampler_init_penalties(
    64,    // last_n
    1.1,   // repeat penalty
    0.0,   // freq penalty
    0.0    // presence penalty
);
llama_sampler_chain_add(chain, penalties);
Avoid applying penalties over the full vocabulary, since scanning every candidate can be slow; apply top-k/top-p first to shrink the candidate set.
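The three penalties combine in the conventional way: the repetition penalty rescales a repeated token's logit, the frequency penalty subtracts an amount proportional to how often the token appeared, and the presence penalty subtracts a flat offset once. A toy sketch of these standard formulas for a single token (illustrative; not the library source):

```cpp
// Toy sketch: apply the three penalty styles to one token's logit,
// given how many times that token appeared in the recent window.
float toy_penalize(float logit, int count,
                   float penalty_repeat, float penalty_freq, float penalty_present) {
    if (count > 0) {
        // Repetition penalty: shrink positive logits, push negative ones further down.
        logit = logit > 0.0f ? logit / penalty_repeat : logit * penalty_repeat;
        // Frequency penalty grows with the count; presence penalty is a flat offset.
        logit -= penalty_freq * (float) count + penalty_present;
    }
    return logit;
}
```

Note the asymmetry in the repetition penalty: dividing a negative logit would make it *more* likely, so negative logits are multiplied instead.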

DRY Sampler

“Don’t Repeat Yourself” sampler:
LLAMA_API struct llama_sampler * llama_sampler_init_dry(
    const struct llama_vocab * vocab,
    int32_t n_ctx_train,
    float dry_multiplier,
    float dry_base,
    int32_t dry_allowed_length,
    int32_t dry_penalty_last_n,
    const char ** seq_breakers,
    size_t num_breakers
);
Reference: https://github.com/oobabooga/text-generation-webui/pull/5677

Adaptive-P Sampler

Maintains target probability over time:
LLAMA_API struct llama_sampler * llama_sampler_init_adaptive_p(
    float target,     // Target probability (0.0-1.0, negative = disabled)
    float decay,      // EMA decay (0.0-0.99)
    uint32_t seed     // Random seed
);
Adaptive-P selects the final token and should be last in the chain. Use mild truncation (e.g., min-p) before this sampler.
Reference: https://github.com/ggml-org/llama.cpp/pull/17927

XTC Sampler

LLAMA_API struct llama_sampler * llama_sampler_init_xtc(
    float p,
    float t,
    size_t min_keep,
    uint32_t seed
);
Reference: https://github.com/oobabooga/text-generation-webui/pull/6335

Top-nσ Sampler

LLAMA_API struct llama_sampler * llama_sampler_init_top_n_sigma(float n);
Reference: https://arxiv.org/pdf/2411.07641

Grammar Sampler

Constrain output to match a GBNF grammar:
LLAMA_API struct llama_sampler * llama_sampler_init_grammar(
    const struct llama_vocab * vocab,
    const char * grammar_str,
    const char * grammar_root
);
vocab (const llama_vocab *): Vocabulary for tokenization
grammar_str (const char *): GBNF grammar production rules
grammar_root (const char *): Start symbol name
See grammars/README.md for grammar syntax.

Logit Bias

Manually bias specific tokens:
typedef struct llama_logit_bias {
    llama_token token;
    float bias;
} llama_logit_bias;

LLAMA_API struct llama_sampler * llama_sampler_init_logit_bias(
    int32_t n_vocab,
    int32_t n_logit_bias,
    const llama_logit_bias * logit_bias
);
// Increase probability of token 123, decrease 456
llama_logit_bias biases[] = {
    {.token = 123, .bias = 1.5},   // Boost
    {.token = 456, .bias = -1.5},  // Suppress
};

llama_sampler * bias = llama_sampler_init_logit_bias(
    llama_vocab_n_tokens(vocab),
    2,
    biases
);
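Conceptually, logit bias is the simplest sampler of all: add a fixed offset to each listed token's logit before anything else runs. A toy sketch (invented names; not the library code):

```cpp
#include <cstddef>
#include <vector>

// Toy mirror of llama_logit_bias: token index plus an additive offset.
struct ToyBias { int token; float bias; };

// Add each bias to the corresponding token's logit.
void toy_apply_bias(std::vector<float> &logits, const std::vector<ToyBias> &biases) {
    for (const ToyBias &b : biases) {
        if (b.token >= 0 && (size_t) b.token < logits.size()) {
            logits[b.token] += b.bias;
        }
    }
}
```

A large negative bias (e.g. -inf) effectively bans a token; a large positive bias forces it.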

Infill Sampler

For fill-in-the-middle tasks:
LLAMA_API struct llama_sampler * llama_sampler_init_infill(
    const struct llama_vocab * vocab
);
Use after top-k/top-p. Combines prefix probabilities and handles EOG tokens specially.

Sampler Management

Clone and Free

// Clone a sampler
LLAMA_API struct llama_sampler * llama_sampler_clone(
    const struct llama_sampler * smpl
);

// Free a sampler (don't free if added to chain)
LLAMA_API void llama_sampler_free(struct llama_sampler * smpl);
Do not manually free samplers that have been added to a chain. The chain takes ownership and will free them automatically.

Manual Application

// Apply sampler to token data array
LLAMA_API void llama_sampler_apply(
    struct llama_sampler * smpl,
    llama_token_data_array * cur_p
);

// Accept a selected token (update sampler state)
LLAMA_API void llama_sampler_accept(
    struct llama_sampler * smpl,
    llama_token token
);

// Reset sampler state
LLAMA_API void llama_sampler_reset(struct llama_sampler * smpl);

// Get sampler name
LLAMA_API const char * llama_sampler_name(const struct llama_sampler * smpl);

Get Seed

LLAMA_API uint32_t llama_sampler_get_seed(const struct llama_sampler * smpl);
Returns the seed used by the sampler, or LLAMA_DEFAULT_SEED if not applicable.

Performance Monitoring

struct llama_perf_sampler_data {
    double t_sample_ms;  // Time spent sampling (milliseconds)
    int32_t n_sample;    // Number of tokens sampled
};

LLAMA_API struct llama_perf_sampler_data llama_perf_sampler(
    const struct llama_sampler * chain
);

LLAMA_API void llama_perf_sampler_print(const struct llama_sampler * chain);
LLAMA_API void llama_perf_sampler_reset(struct llama_sampler * chain);
Performance functions only work with sampler chains created via llama_sampler_chain_init.

Common Sampling Configurations

// Deterministic output: greedy decoding only
llama_sampler * smpl = llama_sampler_chain_init(
    llama_sampler_chain_default_params()
);
llama_sampler_chain_add(smpl, llama_sampler_init_greedy());

Complete Sampling Example

#include "llama.h"
#include <stdio.h>

int main() {
    // ... (model and context setup) ...
    
    // Create comprehensive sampling chain
    llama_sampler * smpl = llama_sampler_chain_init(
        llama_sampler_chain_default_params()
    );
    
    // 1. Remove unlikely tokens (top-k)
    llama_sampler_chain_add(smpl, llama_sampler_init_top_k(40));
    
    // 2. Penalize repetition
    llama_sampler_chain_add(smpl, 
        llama_sampler_init_penalties(
            64,    // last 64 tokens
            1.1,   // repeat penalty
            0.0,   // freq penalty
            0.0    // presence penalty
        )
    );
    
    // 3. Apply nucleus sampling
    llama_sampler_chain_add(smpl, llama_sampler_init_top_p(0.9, 1));
    
    // 4. Scale by temperature
    llama_sampler_chain_add(smpl, llama_sampler_init_temp(0.8));
    
    // 5. Sample from distribution
    llama_sampler_chain_add(smpl, 
        llama_sampler_init_dist(LLAMA_DEFAULT_SEED)
    );
    
    // Generation loop
    for (int i = 0; i < n_predict; i++) {
        // Decode current batch
        if (llama_decode(ctx, batch) != 0) {
            fprintf(stderr, "Decode failed\n");
            break;
        }
        
        // Sample next token
        llama_token token = llama_sampler_sample(smpl, ctx, -1);
        
        if (llama_vocab_is_eog(vocab, token)) {
            break;
        }
        
        // Print and continue
        char buf[128];
        int n = llama_token_to_piece(vocab, token, buf, sizeof(buf), 0, true);
        printf("%.*s", n, buf);
        fflush(stdout);
        
        batch = llama_batch_get_one(&token, 1);
    }
    
    // Print performance stats
    llama_perf_sampler_print(smpl);
    
    // Cleanup
    llama_sampler_free(smpl);
    // ... (free context and model) ...
    
    return 0;
}

Token Data Array (Advanced)

For manual sampling without llama_sampler_sample:
typedef struct llama_token_data {
    llama_token id;   // Token ID
    float logit;      // Log-odds
    float p;          // Probability
} llama_token_data;

typedef struct llama_token_data_array {
    llama_token_data * data;
    size_t size;
    int64_t selected;  // Index of the selected token in `data` (not a token ID)
    bool sorted;       // Whether entries are sorted in descending order; check, don't assume
} llama_token_data_array;
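Manual sampling means filling such an array from the logits, applying samplers with llama_sampler_apply, and recording the chosen candidate's index in `selected`. A toy mirror of that last step, greedy selection, using invented stand-in types (not the real structs):

```cpp
#include <cstddef>
#include <vector>

// Toy stand-ins for llama_token_data / llama_token_data_array.
struct toy_token_data { int id; float logit; float p; };

struct toy_token_data_array {
    std::vector<toy_token_data> data;
    long long selected = -1;  // index into data, not a token ID
    bool sorted = false;
};

// Pick the highest-logit candidate and record its index in `selected`.
void toy_greedy_select(toy_token_data_array &cur) {
    if (cur.data.empty()) { cur.selected = -1; return; }
    size_t best = 0;
    for (size_t i = 1; i < cur.data.size(); i++) {
        if (cur.data[i].logit > cur.data[best].logit) best = i;
    }
    cur.selected = (long long) best;
}
```

After selection, the chosen token ID is `data[selected].id`; with the real API you would then pass it to llama_sampler_accept so stateful samplers (penalties, grammar) can update.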

Next Steps

Inference

Learn about batching and decoding

libllama Overview

Return to API overview