Architecture Overview
NeMo Guardrails uses an event-driven runtime architecture to process conversations through multiple stages of guardrails. Understanding this architecture helps you build more effective and efficient guardrails.
High-Level Architecture
The NeMo Guardrails library acts as an intermediary layer between your application code and LLM requests/responses:
- The application sends a user message to Guardrails
- Guardrails applies input rails, dialog rails, and potentially retrieval/execution rails
- Guardrails calls the LLM when needed
- Guardrails applies output rails to the response
- Guardrails returns the validated response to the application
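The flow above can be pictured with a few stand-in functions. This is a conceptual sketch, not the library's real internals; the rail logic shown is illustrative.

```python
from typing import Optional

def apply_input_rails(message: str) -> Optional[str]:
    # e.g. block an obvious jailbreak attempt; None means "blocked"
    if "ignore previous instructions" in message.lower():
        return None
    return message

def call_llm(message: str) -> str:
    # placeholder for the real LLM call
    return f"You said: {message}"

def apply_output_rails(response: str) -> str:
    # e.g. redact a sensitive term in the response
    return response.replace("secret", "[redacted]")

def guarded_generate(message: str) -> str:
    checked = apply_input_rails(message)
    if checked is None:
        return "I'm sorry, I can't respond to that."
    return apply_output_rails(call_llm(checked))
```

The key point is that the application only ever talks to the guarded entry point; the LLM is never called directly.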
Core Components
RailsConfig
The RailsConfig class is the central configuration object that defines:
- Models (the LLM and embedding model configurations)
- Rails
- Flows
- Actions
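For illustration, a minimal config.yml might look like this (the engine and model names are placeholders, and the flow names reference built-in self-check rails):

```yaml
models:
  - type: main
    engine: openai
    model: gpt-4o
  - type: embeddings
    engine: FastEmbed
    model: all-MiniLM-L6-v2

rails:
  input:
    flows:
      - self check input
  output:
    flows:
      - self check output
```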
LLMRails
The LLMRails class is the main entry point for using guardrails. It:
- Initializes the runtime based on the Colang version (1.0 or 2.x)
- Loads and registers all actions
- Manages the conversation state
- Orchestrates the guardrails processing pipeline
Key Methods
generate() / generate_async()
Main method for getting LLM responses with guardrails applied:
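A hedged usage sketch, assuming a ./config directory containing a valid guardrails configuration and model credentials (the import is kept inside the function so the sketch stands alone):

```python
def ask(question: str) -> str:
    # requires the nemoguardrails package and a configured model
    from nemoguardrails import RailsConfig, LLMRails

    config = RailsConfig.from_path("./config")
    rails = LLMRails(config)
    response = rails.generate(
        messages=[{"role": "user", "content": question}]
    )
    return response["content"]

# The message format mirrors the OpenAI chat format:
example_messages = [{"role": "user", "content": "Hello!"}]
```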
generate_events() / generate_events_async()
Lower-level method that returns the full event stream:
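A sketch of the event-level API; the event names follow Colang 1.0 conventions in recent releases, and exact payloads vary by version:

```python
# A user utterance expressed as an input event (illustrative payload)
input_events = [
    {"type": "UtteranceUserActionFinished", "final_transcript": "Hello"},
]

def run_events(rails, events):
    # rails is an LLMRails instance; requires nemoguardrails at runtime.
    # Returns the full stream of events generated while processing.
    return rails.generate_events(events=events)
```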
register_action()
Register custom actions dynamically:
Runtime (Event-Driven Engine)
The runtime is the core event processing engine. There are two implementations:
RuntimeV1_0
Runtime for Colang 1.0:
- Flows are active by default
- Uses pattern matching for user/bot messages
- Simpler, more implicit behavior
RuntimeV2_x
Runtime for Colang 2.0:
- Explicit flow activation
- More control over event handling
- Supports advanced features like the `...` operator
Both runtimes:
- Process events in an async event loop
- Execute actions and flows
- Generate LLM prompts and parse responses
- Maintain conversation state
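The responsibilities above can be sketched as a simple event loop, shown synchronously for brevity (the real runtime is async, and the event and handler names here are illustrative):

```python
from collections import deque

def run_loop(initial_events, handlers):
    # Handlers consume an event and may emit new ones; the loop runs
    # until the queue drains.
    queue = deque(initial_events)
    history = []
    while queue:
        event = queue.popleft()
        history.append(event)
        for new_event in handlers.get(event["type"], lambda e: [])(event):
            queue.append(new_event)
    return history

handlers = {
    "UserIntent": lambda e: [{"type": "BotIntent", "intent": "express greeting"}],
    "BotIntent": lambda e: [{"type": "StartUtteranceBotAction", "script": "Hello!"}],
}
history = run_loop([{"type": "UserIntent", "intent": "express greeting"}], handlers)
```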
The Guardrails Processing Pipeline
Here's what happens when a user message is processed.
Stage 1: Generate Canonical User Message
Generate User Intent
The generate_user_intent action:
- Performs a vector search on user message examples
- Includes the top 5 matches in the prompt
- Asks the LLM to generate the canonical form
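Conceptually, the first two steps look like this, using string similarity as a stand-in for the library's embedding search (the example messages and canonical forms are illustrative):

```python
from difflib import SequenceMatcher

# Example user messages mapped to canonical forms
examples = {
    "hi there": "express greeting",
    "hello": "express greeting",
    "goodbye": "express goodbye",
    "what can you do": "ask capabilities",
    "tell me a joke": "request joke",
    "how are you": "ask wellbeing",
}

def top_matches(user_message: str, k: int = 5):
    # Rank examples by similarity and keep the top k for the prompt
    scored = sorted(
        examples.items(),
        key=lambda kv: SequenceMatcher(None, user_message.lower(), kv[0]).ratio(),
        reverse=True,
    )
    return scored[:k]

matches = top_matches("hello!")
```

The top matches are then embedded in the prompt that asks the LLM for the canonical form.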
Stage 2: Decide Next Steps
Once the UserIntent event exists, the runtime determines what happens next via one of two paths:
- Path 1: Predefined Flow. If a flow matches the intent, it executes directly.
- Path 2: LLM-Generated Step. If no flow matches, the LLM predicts the next step.
Either path produces one of two outcomes:
- Bot Message (BotIntent event) → Generate utterance
- Action Call (StartInternalSystemAction event) → Execute action
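Once the next step is known, the runtime dispatches on the event type. A minimal sketch (field names such as action_name follow Colang 1.0 conventions and may vary by release):

```python
def dispatch(event: dict):
    # Decide what the runtime does next based on the event type
    if event["type"] == "BotIntent":
        return ("generate_utterance", event["intent"])
    if event["type"] == "StartInternalSystemAction":
        return ("execute_action", event["action_name"])
    return ("ignore", None)
```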
Stage 3: Execute Actions (if needed)
When an action is triggered, the runtime executes it and feeds the result back into the event stream.
Stage 4: Generate Bot Utterance
When a BotIntent event is generated, the runtime produces the final utterance in two steps.
Retrieve Context (RAG)
If a knowledge base is configured, the retrieve_relevant_chunks action:
- Searches the knowledge base
- Applies retrieval rails to filter chunks
- Adds relevant chunks to the prompt context
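A toy sketch of this step, using keyword overlap as a stand-in for embedding search (the filtering shown is a simplification of retrieval rails):

```python
def retrieve_relevant_chunks(query: str, chunks: list, k: int = 3):
    # Score chunks by overlap with the query's longer words
    q_words = {w for w in query.lower().split() if len(w) > 3}
    scored = [(len(q_words & set(c.lower().split())), c) for c in chunks]
    scored = [sc for sc in scored if sc[0] > 0]  # drop irrelevant chunks
    scored.sort(key=lambda sc: sc[0], reverse=True)
    return [c for _, c in scored[:k]]

chunks = [
    "Our refund policy allows returns within 30 days.",
    "The office is closed on public holidays.",
    "Refunds are issued to the original payment method.",
]
context = retrieve_relevant_chunks("what is the refund policy", chunks)
```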
Generate Utterance
The generate_bot_message action:
- Performs a vector search on bot message examples
- Includes the top 5 matches in the prompt
- Includes retrieved chunks (if any)
- Asks the LLM to generate the response
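The prompt assembly can be sketched as follows; the LLM call itself is omitted, and the example strings are illustrative:

```python
def build_prompt(bot_intent: str, examples: list, chunks: list) -> str:
    # Combine intent, retrieved context, and top examples into one prompt
    parts = [f"Bot intent: {bot_intent}"]
    if chunks:
        parts.append("Relevant context:\n" + "\n".join(chunks))
    parts.append("Examples:\n" + "\n".join(examples[:5]))  # top 5 only
    parts.append("Write the bot response:")
    return "\n\n".join(parts)

prompt = build_prompt(
    "inform refund policy",
    ['bot inform refund policy: "Returns are accepted within 30 days."'],
    ["Our refund policy allows returns within 30 days."],
)
```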
Complete Event Stream Example
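An illustrative stream, expressed as a list of event dicts; the event names follow Colang 1.0 conventions and exact fields vary by release:

```python
events = [
    # user says "Hello"
    {"type": "UtteranceUserActionFinished", "final_transcript": "Hello"},
    # canonical user intent
    {"type": "UserIntent", "intent": "express greeting"},
    # next step decided by a flow or the LLM
    {"type": "BotIntent", "intent": "express greeting"},
    # final utterance sent back to the application
    {"type": "StartUtteranceBotAction", "script": "Hello! How can I help you?"},
]
```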
Processing a simple "Hello" produces an event for each pipeline stage: the user utterance, the canonical user intent, the bot intent, and the final bot utterance.
Async-First Design
NeMo Guardrails is built with async/await from the ground up.
Why Async?
Better Concurrency
Multiple users can be served simultaneously. While one request waits for an LLM response, others continue processing.
Non-Blocking I/O
LLM calls, API requests, and database queries don’t block the event loop.
Efficient Resource Usage
Better CPU and memory utilization during I/O-bound operations.
Dual API
Both sync and async methods are available for compatibility.
Sync vs Async Usage
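The dual API can be pictured with stand-in functions (these are not the real LLMRails methods): the sync variant drives the async one with an event loop.

```python
import asyncio

async def generate_async(messages):
    # stands in for non-blocking LLM I/O
    await asyncio.sleep(0)
    return {"role": "assistant", "content": "Hello!"}

def generate(messages):
    # sync wrapper: runs the async version to completion
    return asyncio.run(generate_async(messages))

reply = generate([{"role": "user", "content": "Hello!"}])
```

Use the async variant inside an existing event loop (e.g. a web server); use the sync variant from plain scripts.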
Custom Async Actions
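An I/O-bound action can await its work instead of blocking the event loop; the function name and payload below are illustrative:

```python
import asyncio

# Hypothetical async action: awaits I/O rather than blocking the runtime
async def fetch_user_profile(user_id: str) -> dict:
    await asyncio.sleep(0)  # stands in for an HTTP or database call
    return {"user_id": user_id, "tier": "standard"}
```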
Actions should be async for better performance.
Caching and Performance
NeMo Guardrails includes several caching mechanisms.
Model Output Caching
Cache LLM responses to avoid redundant calls.
Embeddings Caching
Vector embeddings are cached automatically for:
- User message examples
- Bot message examples
- Flow definitions
- Knowledge base chunks
History Cache
The events history for user message sequences is cached to maintain state across turns.
Extending the Architecture
You can extend NeMo Guardrails in several ways:
- Custom Actions
- Custom LLM Providers
- Custom Embedding Providers
- LangChain Integration
Add new Python functions:
Configuration Loading
The configuration loading process reads config.yml (models, rails, and general options), parses the Colang (.co) flow files, and registers built-in and custom actions before the runtime starts.
Next Steps
- Build Your First Config: create your first guardrails configuration
- Custom Actions: learn how to write custom Python actions
- Advanced Flows: master complex Colang flow patterns
- Performance Tuning: optimize your guardrails for production