Architecture Overview
NeMo Guardrails uses an event-driven runtime architecture to process conversations through multiple stages of guardrails. Understanding this architecture helps you build more effective and efficient guardrails.
High-Level Architecture
The NeMo Guardrails library acts as an intermediary layer between your application code and LLM requests/responses:
- The application sends a user message to Guardrails
- Guardrails applies input rails, dialog rails, and potentially retrieval/execution rails
- Guardrails calls the LLM when needed
- Guardrails applies output rails to the response
- Guardrails returns the validated response to the application
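The flow above can be pictured with a few stand-in functions. This is a conceptual sketch, not the library's real internals; the rail logic shown is illustrative.

```python
from typing import Optional

def apply_input_rails(message: str) -> Optional[str]:
    # e.g. block an obvious jailbreak attempt; None means "blocked"
    if "ignore previous instructions" in message.lower():
        return None
    return message

def call_llm(message: str) -> str:
    # placeholder for the real LLM call
    return f"You said: {message}"

def apply_output_rails(response: str) -> str:
    # e.g. redact a sensitive term in the response
    return response.replace("secret", "[redacted]")

def guarded_generate(message: str) -> str:
    checked = apply_input_rails(message)
    if checked is None:
        return "I'm sorry, I can't respond to that."
    return apply_output_rails(call_llm(checked))
```

The key point is that the application only ever talks to the guarded entry point; the LLM is never called directly.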
Core Components
RailsConfig
The RailsConfig class is the central configuration object that defines:
- Models (the LLM and embedding model configurations)
- Rails
- Flows
- Actions
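For illustration, a minimal config.yml might look like this (the engine and model names are placeholders, and the flow names reference built-in self-check rails):

```yaml
models:
  - type: main
    engine: openai
    model: gpt-4o
  - type: embeddings
    engine: FastEmbed
    model: all-MiniLM-L6-v2

rails:
  input:
    flows:
      - self check input
  output:
    flows:
      - self check output
```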
LLMRails
The LLMRails class is the main entry point for using guardrails. It:
- Initializes the runtime based on the Colang version (1.0 or 2.x)
- Loads and registers all actions
- Manages the conversation state
- Orchestrates the guardrails processing pipeline
Key Methods
generate() / generate_async()
Main method for getting LLM responses with guardrails applied:
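A hedged usage sketch, assuming a ./config directory containing a valid guardrails configuration and model credentials (the import is kept inside the function so the sketch stands alone):

```python
def ask(question: str) -> str:
    # requires the nemoguardrails package and a configured model
    from nemoguardrails import RailsConfig, LLMRails

    config = RailsConfig.from_path("./config")
    rails = LLMRails(config)
    response = rails.generate(
        messages=[{"role": "user", "content": question}]
    )
    return response["content"]

# The message format mirrors the OpenAI chat format:
example_messages = [{"role": "user", "content": "Hello!"}]
```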
generate_events() / generate_events_async()
Lower-level method that returns the full event stream:
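A sketch of the event-level API; the event names follow Colang 1.0 conventions in recent releases, and exact payloads vary by version:

```python
# A user utterance expressed as an input event (illustrative payload)
input_events = [
    {"type": "UtteranceUserActionFinished", "final_transcript": "Hello"},
]

def run_events(rails, events):
    # rails is an LLMRails instance; requires nemoguardrails at runtime.
    # Returns the full stream of events generated while processing.
    return rails.generate_events(events=events)
```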
register_action()
Register custom actions dynamically:
Runtime (Event-Driven Engine)
The runtime is the core event processing engine. There are two implementations:
RuntimeV1_0
Runtime for Colang 1.0:
- Flows are active by default
- Uses pattern matching for user/bot messages
- Simpler, more implicit behavior
RuntimeV2_x
Runtime for Colang 2.0:
- Explicit flow activation
- More control over event handling
- Supports advanced features like the `...` operator
Both runtimes:
- Process events in an async event loop
- Execute actions and flows
- Generate LLM prompts and parse responses
- Maintain conversation state
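The responsibilities above can be sketched as a simple event loop, shown synchronously for brevity (the real runtime is async, and the event and handler names here are illustrative):

```python
from collections import deque

def run_loop(initial_events, handlers):
    # Handlers consume an event and may emit new ones; the loop runs
    # until the queue drains.
    queue = deque(initial_events)
    history = []
    while queue:
        event = queue.popleft()
        history.append(event)
        for new_event in handlers.get(event["type"], lambda e: [])(event):
            queue.append(new_event)
    return history

handlers = {
    "UserIntent": lambda e: [{"type": "BotIntent", "intent": "express greeting"}],
    "BotIntent": lambda e: [{"type": "StartUtteranceBotAction", "script": "Hello!"}],
}
history = run_loop([{"type": "UserIntent", "intent": "express greeting"}], handlers)
```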
The Guardrails Processing Pipeline
Here's what happens when a user message is processed.
Stage 1: Generate Canonical User Message
Generate User Intent
The generate_user_intent action:
- Performs a vector search on user message examples
- Includes the top 5 matches in the prompt
- Asks the LLM to generate the canonical form
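Conceptually, the first two steps look like this, using string similarity as a stand-in for the library's embedding search (the example messages and canonical forms are illustrative):

```python
from difflib import SequenceMatcher

# Example user messages mapped to canonical forms
examples = {
    "hi there": "express greeting",
    "hello": "express greeting",
    "goodbye": "express goodbye",
    "what can you do": "ask capabilities",
    "tell me a joke": "request joke",
    "how are you": "ask wellbeing",
}

def top_matches(user_message: str, k: int = 5):
    # Rank examples by similarity and keep the top k for the prompt
    scored = sorted(
        examples.items(),
        key=lambda kv: SequenceMatcher(None, user_message.lower(), kv[0]).ratio(),
        reverse=True,
    )
    return scored[:k]

matches = top_matches("hello!")
```

The top matches are then embedded in the prompt that asks the LLM for the canonical form.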
Stage 2: Decide Next Steps
Once the UserIntent event exists, the runtime determines what happens next via one of two paths:
- Path 1: Predefined Flow. If a flow matches the intent, it executes directly.
- Path 2: LLM-Generated Step. If no flow matches, the LLM predicts the next step.
Either path produces one of two outcomes:
- Bot Message (BotIntent event) → Generate utterance
- Action Call (StartInternalSystemAction event) → Execute action
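Once the next step is known, the runtime dispatches on the event type. A minimal sketch (field names such as action_name follow Colang 1.0 conventions and may vary by release):

```python
def dispatch(event: dict):
    # Decide what the runtime does next based on the event type
    if event["type"] == "BotIntent":
        return ("generate_utterance", event["intent"])
    if event["type"] == "StartInternalSystemAction":
        return ("execute_action", event["action_name"])
    return ("ignore", None)
```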
Stage 3: Execute Actions (if needed)
When an action is triggered, the runtime executes it and feeds the result back into the event stream.
Stage 4: Generate Bot Utterance
When a BotIntent event is generated, the runtime produces the final utterance in two steps.
Retrieve Context (RAG)
If a knowledge base is configured, the retrieve_relevant_chunks action:
- Searches the knowledge base
- Applies retrieval rails to filter chunks
- Adds relevant chunks to the prompt context
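A toy sketch of this step, using keyword overlap as a stand-in for embedding search (the filtering shown is a simplification of retrieval rails):

```python
def retrieve_relevant_chunks(query: str, chunks: list, k: int = 3):
    # Score chunks by overlap with the query's longer words
    q_words = {w for w in query.lower().split() if len(w) > 3}
    scored = [(len(q_words & set(c.lower().split())), c) for c in chunks]
    scored = [sc for sc in scored if sc[0] > 0]  # drop irrelevant chunks
    scored.sort(key=lambda sc: sc[0], reverse=True)
    return [c for _, c in scored[:k]]

chunks = [
    "Our refund policy allows returns within 30 days.",
    "The office is closed on public holidays.",
    "Refunds are issued to the original payment method.",
]
context = retrieve_relevant_chunks("what is the refund policy", chunks)
```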
Generate Utterance
The generate_bot_message action:
- Performs a vector search on bot message examples
- Includes the top 5 matches in the prompt
- Includes retrieved chunks (if any)
- Asks the LLM to generate the response
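The prompt assembly can be sketched as follows; the LLM call itself is omitted, and the example strings are illustrative:

```python
def build_prompt(bot_intent: str, examples: list, chunks: list) -> str:
    # Combine intent, retrieved context, and top examples into one prompt
    parts = [f"Bot intent: {bot_intent}"]
    if chunks:
        parts.append("Relevant context:\n" + "\n".join(chunks))
    parts.append("Examples:\n" + "\n".join(examples[:5]))  # top 5 only
    parts.append("Write the bot response:")
    return "\n\n".join(parts)

prompt = build_prompt(
    "inform refund policy",
    ['bot inform refund policy: "Returns are accepted within 30 days."'],
    ["Our refund policy allows returns within 30 days."],
)
```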
Complete Event Stream Example
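An illustrative stream, expressed as a list of event dicts; the event names follow Colang 1.0 conventions and exact fields vary by release:

```python
events = [
    # user says "Hello"
    {"type": "UtteranceUserActionFinished", "final_transcript": "Hello"},
    # canonical user intent
    {"type": "UserIntent", "intent": "express greeting"},
    # next step decided by a flow or the LLM
    {"type": "BotIntent", "intent": "express greeting"},
    # final utterance sent back to the application
    {"type": "StartUtteranceBotAction", "script": "Hello! How can I help you?"},
]
```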
Processing a simple "Hello" produces an event for each pipeline stage: the user utterance, the canonical user intent, the bot intent, and the final bot utterance.
Async-First Design
NeMo Guardrails is built with async/await from the ground up.
Why Async?
Better Concurrency
Multiple users can be served simultaneously. While one request waits for an LLM response, others continue processing.
Non-Blocking I/O
LLM calls, API requests, and database queries don’t block the event loop.
Efficient Resource Usage
Better CPU and memory utilization during I/O-bound operations.
Dual API
Both sync and async methods are available for compatibility.
Sync vs Async Usage
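The dual API can be pictured with stand-in functions (these are not the real LLMRails methods): the sync variant drives the async one with an event loop.

```python
import asyncio

async def generate_async(messages):
    # stands in for non-blocking LLM I/O
    await asyncio.sleep(0)
    return {"role": "assistant", "content": "Hello!"}

def generate(messages):
    # sync wrapper: runs the async version to completion
    return asyncio.run(generate_async(messages))

reply = generate([{"role": "user", "content": "Hello!"}])
```

Use the async variant inside an existing event loop (e.g. a web server); use the sync variant from plain scripts.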
Custom Async Actions
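An I/O-bound action can await its work instead of blocking the event loop; the function name and payload below are illustrative:

```python
import asyncio

# Hypothetical async action: awaits I/O rather than blocking the runtime
async def fetch_user_profile(user_id: str) -> dict:
    await asyncio.sleep(0)  # stands in for an HTTP or database call
    return {"user_id": user_id, "tier": "standard"}
```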
Actions should be async for better performance.
Caching and Performance
NeMo Guardrails includes several caching mechanisms.
Model Output Caching
Cache LLM responses to avoid redundant calls.
Embeddings Caching
Vector embeddings are cached automatically for:
- User message examples
- Bot message examples
- Flow definitions
- Knowledge base chunks
History Cache
The events history for user message sequences is cached to maintain state across turns.
Extending the Architecture
You can extend NeMo Guardrails in several ways:
- Custom Actions
- Custom LLM Providers
- Custom Embedding Providers
- LangChain Integration
Add new Python functions:
Configuration Loading
The configuration loading process reads config.yml (models, rails, and general options), parses the Colang (.co) flow files, and registers built-in and custom actions before the runtime starts.
Next Steps
- Build Your First Config: create your first guardrails configuration
- Custom Actions: learn how to write custom Python actions
- Advanced Flows: master complex Colang flow patterns
- Performance Tuning: optimize your guardrails for production