This guide demonstrates basic usage of the ONNX Runtime GenAI C++ API for question-answering tasks with streaming output.

Overview

The basic example shows how to:
  • Create a model and tokenizer
  • Set up a generator with custom parameters
  • Process user input and generate responses
  • Stream output tokens in real-time

Prerequisites

Before running the examples, you need to:
  1. Install ONNX Runtime GenAI headers and libraries
  2. Download a compatible model
  3. Set up your build environment with CMake
See the installation guide for detailed setup instructions.

Simple Question-Answering Example

This example demonstrates streaming text generation with the C++ API:
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

#include <nlohmann/json.hpp>

#include "ort_genai.h"
#include "common.h"

void CXX_API(
    GeneratorParamsArgs& generator_params_args,
    const std::string& model_path,
    const std::string& ep,
    const std::string& ep_path,
    const std::string& system_prompt,
    const std::string& user_prompt,
    bool verbose) {

  // Register the execution provider (ep_path locates the provider library)
  RegisterEP(ep, ep_path);

  // Create configuration
  std::unordered_map<std::string, std::string> ep_options;
  auto config = GetConfig(model_path, ep, ep_options, generator_params_args);

  // Create model
  auto model = OgaModel::Create(*config);

  // Create tokenizer and stream
  auto tokenizer = OgaTokenizer::Create(*model);
  auto stream = OgaTokenizerStream::Create(*tokenizer);

  // Create running list of messages
  std::vector<nlohmann::ordered_json> input_list;
  nlohmann::ordered_json system_message = {
    {"role", "system"}, 
    {"content", system_prompt}
  };
  input_list.push_back(system_message);

  // Add user message
  nlohmann::ordered_json user_message = {
    {"role", "user"}, 
    {"content", user_prompt}
  };
  input_list.push_back(user_message);
  nlohmann::ordered_json j = input_list;
  std::string messages = j.dump();

  // Initialize generator params
  auto params = OgaGeneratorParams::Create(*model);
  SetSearchOptions(*params, generator_params_args, verbose);

  // Create generator
  auto generator = OgaGenerator::Create(*model, *params);

  // Apply chat template
  bool add_generation_prompt = true;
  std::string prompt = ApplyChatTemplate(
    model_path, *tokenizer, messages, add_generation_prompt
  );

  // Encode prompt and append tokens
  auto sequences = OgaSequences::Create();
  tokenizer->Encode(prompt.c_str(), *sequences);
  generator->AppendTokenSequences(*sequences);

  // Run generation loop with streaming output
  std::cout << "Output: ";
  while (!generator->IsDone()) {
    generator->GenerateNextToken();
    
    const auto new_token = generator->GetNextTokens()[0];
    std::cout << stream->Decode(new_token) << std::flush;
  }
  std::cout << std::endl;
}

Key Components

Model Initialization

The example starts by creating the core components:
// Create configuration for the model
auto config = GetConfig(model_path, ep, ep_options, generator_params_args);

// Create model instance
auto model = OgaModel::Create(*config);

// Create tokenizer for encoding/decoding text
auto tokenizer = OgaTokenizer::Create(*model);

Generator Setup

Set up the generator with custom parameters:
// Create generator parameters
auto params = OgaGeneratorParams::Create(*model);
SetSearchOptions(*params, generator_params_args, verbose);

// Create the generator
auto generator = OgaGenerator::Create(*model, *params);

Streaming Output

The generation loop streams tokens as they’re generated:
// Create tokenizer stream for decoding
auto stream = OgaTokenizerStream::Create(*tokenizer);

// Generate and stream tokens
while (!generator->IsDone()) {
  generator->GenerateNextToken();
  const auto new_token = generator->GetNextTokens()[0];
  std::cout << stream->Decode(new_token) << std::flush;
}

Building the Example

Use CMake to build the example (the generator below targets Visual Studio 2022 on Windows):
cd examples/c
cmake -G "Visual Studio 17 2022" -S . -B build -DMODEL_QA=ON
cmake --build build --parallel --config Debug

Running the Example

Run the compiled example with your model:
cd build/Debug
.\model_qa.exe -m {path to model folder} -e {execution provider}

Command-Line Options

  • -m, --model_path: Path to the model folder containing GenAI config
  • -e, --execution_provider: Execution provider (cpu, cuda, dml, etc.)
  • -s, --system_prompt: System prompt for the model (default: “You are a helpful AI assistant.”)
  • -u, --user_prompt: User prompt (default: “What color is the sky?”)
  • -v, --verbose: Enable verbose logging
  • --interactive: Run in interactive mode for multi-turn conversations

Example Output

--------------------------
Hello, ORT GenAI Model-QA!
--------------------------
Model path: ./models/phi-3-mini
Execution provider: cuda
System prompt: You are a helpful AI assistant.
User prompt: What color is the sky?
--------------------------

Output: The sky appears blue during the day due to Rayleigh scattering, 
where shorter blue wavelengths of sunlight scatter more in Earth's 
atmosphere than longer wavelengths like red.

Prompt tokens: 28, New tokens: 45
Time to first token: 0.123s
Tokens per second: 365.85
