This guide demonstrates basic usage of the ONNX Runtime GenAI C++ API for question-answering tasks with streaming output.

Overview

The basic example shows how to:
  • Create a model and tokenizer
  • Set up a generator with custom parameters
  • Process user input and generate responses
  • Stream output tokens in real-time

Prerequisites

Before running the examples, you need to:
  1. Install ONNX Runtime GenAI headers and libraries
  2. Download a compatible model
  3. Set up your build environment with CMake
See the installation guide for detailed setup instructions.

Simple Question-Answering Example

This example demonstrates streaming text generation with the C++ API:
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

#include <nlohmann/json.hpp>

#include "ort_genai.h"
#include "common.h"

void CXX_API(
    GeneratorParamsArgs& generator_params_args,
    const std::string& model_path,
    const std::string& ep,
    const std::string& ep_path,
    const std::string& system_prompt,
    const std::string& user_prompt,
    bool verbose) {

  // Register the execution provider (ep_path locates the provider library)
  RegisterEP(ep, ep_path);

  // Create configuration
  std::unordered_map<std::string, std::string> ep_options;
  auto config = GetConfig(model_path, ep, ep_options, generator_params_args);

  // Create model
  auto model = OgaModel::Create(*config);

  // Create tokenizer and stream
  auto tokenizer = OgaTokenizer::Create(*model);
  auto stream = OgaTokenizerStream::Create(*tokenizer);

  // Create running list of messages
  std::vector<nlohmann::ordered_json> input_list;
  nlohmann::ordered_json system_message = {
    {"role", "system"}, 
    {"content", system_prompt}
  };
  input_list.push_back(system_message);

  // Add user message
  nlohmann::ordered_json user_message = {
    {"role", "user"}, 
    {"content", user_prompt}
  };
  input_list.push_back(user_message);
  nlohmann::ordered_json j = input_list;
  std::string messages = j.dump();

  // Initialize generator params
  auto params = OgaGeneratorParams::Create(*model);
  SetSearchOptions(*params, generator_params_args, verbose);

  // Create generator
  auto generator = OgaGenerator::Create(*model, *params);

  // Apply chat template
  bool add_generation_prompt = true;
  std::string prompt = ApplyChatTemplate(
    model_path, *tokenizer, messages, add_generation_prompt
  );

  // Encode prompt and append tokens
  auto sequences = OgaSequences::Create();
  tokenizer->Encode(prompt.c_str(), *sequences);
  generator->AppendTokenSequences(*sequences);

  // Run generation loop with streaming output
  std::cout << "Output: ";
  while (!generator->IsDone()) {
    generator->GenerateNextToken();
    
    const auto new_token = generator->GetNextTokens()[0];
    std::cout << stream->Decode(new_token) << std::flush;
  }
  std::cout << std::endl;
}

Key Components

Model Initialization

The example starts by creating the core components:
// Create configuration for the model
auto config = GetConfig(model_path, ep, ep_options, generator_params_args);

// Create model instance
auto model = OgaModel::Create(*config);

// Create tokenizer for encoding/decoding text
auto tokenizer = OgaTokenizer::Create(*model);

Generator Setup

Set up the generator with custom parameters:
// Create generator parameters
auto params = OgaGeneratorParams::Create(*model);
SetSearchOptions(*params, generator_params_args, verbose);

// Create the generator
auto generator = OgaGenerator::Create(*model, *params);

Streaming Output

The generation loop streams tokens as they’re generated:
// Create tokenizer stream for decoding
auto stream = OgaTokenizerStream::Create(*tokenizer);

// Generate and stream tokens
while (!generator->IsDone()) {
  generator->GenerateNextToken();
  const auto new_token = generator->GetNextTokens()[0];
  std::cout << stream->Decode(new_token) << std::flush;
}

Building the Example

Use CMake to build the example (the generator below targets Visual Studio 2022 on Windows):
cd examples/c
cmake -G "Visual Studio 17 2022" -S . -B build -DMODEL_QA=ON
cmake --build build --parallel --config Debug

Running the Example

Run the compiled example with your model:
cd build/Debug
.\model_qa.exe -m {path to model folder} -e {execution provider}

Command-Line Options

  • -m, --model_path: Path to the model folder containing GenAI config
  • -e, --execution_provider: Execution provider (cpu, cuda, dml, etc.)
  • -s, --system_prompt: System prompt for the model (default: “You are a helpful AI assistant.”)
  • -u, --user_prompt: User prompt (default: “What color is the sky?”)
  • -v, --verbose: Enable verbose logging
  • --interactive: Run in interactive mode for multi-turn conversations

Example Output

--------------------------
Hello, ORT GenAI Model-QA!
--------------------------
Model path: ./models/phi-3-mini
Execution provider: cuda
System prompt: You are a helpful AI assistant.
User prompt: What color is the sky?
--------------------------

Output: The sky appears blue during the day due to Rayleigh scattering, 
where shorter blue wavelengths of sunlight scatter more in Earth's 
atmosphere than longer wavelengths like red.

Prompt tokens: 28, New tokens: 45
Time to first token: 0.123s
Tokens per second: 365.85
