
Overview

The PromptGuardrail class provides security validation for user prompts before they’re processed by the agent. It filters inappropriate, out-of-scope, or potentially malicious queries using keyword denylists, semantic similarity, and pattern matching.
Security Policy: Fail Closed
All guardrails fail CLOSED: if an error occurs during validation, the prompt is rejected. This prevents potentially malicious prompts from bypassing security checks due to errors.

Guardrail Class

Implementation in src/copilot/guardrails/prompt_guardrails.py:29:
class PromptGuardrail:
    """Guardrail that rejects prompts containing denylist keywords or programming requests.
    
    Behavior:
    - Treats single-word entries as token matches (word boundaries).
    - Only rejects when a denylist phrase or token is present (case-insensitive).
    - Fails CLOSED on errors: rejects prompts if validation cannot complete.
    
    This intentionally allows vague prompts and misspellings; only exact denylist
    entries trigger rejection.
    """
    
    def __init__(
        self,
        deny_words: Optional[str] = None,
        denylist_path: Optional[str] = None
    ) -> None:
        """Initialize the guardrail with denylist configuration.
        
        Args:
            deny_words: Comma-separated string of words to deny
            denylist_path: Path to JSON file containing denylist
        """
        self.deny_words = deny_words
        if denylist_path:
            self.denylist_path = Path(denylist_path)
        else:
            self.denylist_path = (
                Path(__file__).resolve().parents[3] / "data" / "denylist.json"
            )
        self.semantic_threshold = 0.7
        self.semantic_model_name = "all-MiniLM-L6-v2"
        self._semantic_model: Optional[SentenceTransformer] = None
        self._denylist_embeddings: Optional[np.ndarray] = None
        self._entries: list = []
        
        self._load_denylist()
        if SEMANTIC_AVAILABLE:
            try:
                self._init_semantic_model()
            except Exception as e:
                logger.error(f"Failed to initialize semantic model: {e}")
                self._semantic_model = None

Configuration

Parameters:
  • deny_words: Comma-separated string of additional words to deny (merged with file-based denylist)
  • denylist_path: Custom path to JSON denylist file (defaults to data/denylist.json)
Semantic Model Settings:
  • semantic_threshold: Cosine similarity threshold for semantic matching (default: 0.7)
  • semantic_model_name: Sentence transformer model (default: all-MiniLM-L6-v2)

Validation Methods

1. Keyword Denylist

Exact keyword and phrase matching (case-insensitive). Implementation (src/copilot/guardrails/prompt_guardrails.py:101):
def contains_denylist_keyword(self, text: str) -> bool:
    """Check if text contains any denylist keywords.
    
    Args:
        text: The text to check
    
    Returns:
        True if text contains denylist keywords, False otherwise
    """
    if not text:
        return False
    lowered = text.lower()
    
    # Check phrases (multi-word entries)
    for phrase in self.phrases:
        if phrase in lowered:
            return True
    
    # Check words (single-word entries with token boundaries)
    tokens = re.findall(r"\w+", lowered)
    token_set = set(tokens)
    
    if any(word in token_set for word in self.words):
        return True
    
    return False
Denylist Structure:
{
  "denylist": [
    "politics",
    "election",
    "violent content",
    "explicit material"
  ]
}
Matching Logic:
  • Phrases (multi-word): Substring match (e.g., “violent content” matches “create violent content”)
  • Words (single-word): Token match with word boundaries (e.g., “politics” matches “politics” but not “geopolitics”)
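The matching rules above can be exercised with a minimal standalone sketch (simplified, not the actual `PromptGuardrail` class; the denylist entries here are illustrative):

```python
import re

# Simplified sketch of the matching logic (not the actual class).
phrases = ["violent content"]   # multi-word entries: substring match
words = {"politics"}            # single-word entries: token match

def contains_denylist_keyword(text: str) -> bool:
    lowered = text.lower()
    # Phrases: plain substring match
    if any(phrase in lowered for phrase in phrases):
        return True
    # Words: match whole tokens only, so "geopolitics" does not hit "politics"
    tokens = set(re.findall(r"\w+", lowered))
    return any(word in tokens for word in words)

print(contains_denylist_keyword("create violent content"))   # True (phrase substring)
print(contains_denylist_keyword("talk about geopolitics"))   # False (no token match)
```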

2. Programming Pattern Detection

Rejects requests for code generation or programming help. Implementation (src/copilot/guardrails/prompt_guardrails.py:126):
def _matches_programming_regex(self, text: str) -> bool:
    """Check if text matches programming/code-generation patterns.
    
    Args:
        text: The text to check
    
    Returns:
        True if text appears to be a programming request
    """
    if not text:
        return False
    lowered = text.lower()
    patterns = [
        r"\bwrite\s+(a|an|the)?\s*.*\b(program|script|application|example|code)\b",
        r"\b(example|sample|show|provide)\b.*\b(code|program|script)\b",
        r"\b(c\+\+|cpp|java|python|golang|go|rust|javascript|ts|typescript)\b",
        r"\bhow to implement\b",
        r"\b(read|open|write)\s+(a\s+)?file\b",
    ]
    for p in patterns:
        if re.search(p, lowered):
            return True
    return False
Blocked Patterns:
  • “write a program/script/application”
  • “show me example code”
  • Programming language names (Python, Java, C++, etc.)
  • “how to implement”
  • File I/O operations (“read a file”, “open a file”, “write a file”)
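The regex list from the implementation can be exercised standalone to see which prompts it catches:

```python
import re

# The pattern list from _matches_programming_regex, exercised standalone.
patterns = [
    r"\bwrite\s+(a|an|the)?\s*.*\b(program|script|application|example|code)\b",
    r"\b(example|sample|show|provide)\b.*\b(code|program|script)\b",
    r"\b(c\+\+|cpp|java|python|golang|go|rust|javascript|ts|typescript)\b",
    r"\bhow to implement\b",
    r"\b(read|open|write)\s+(a\s+)?file\b",
]

def is_programming_request(text: str) -> bool:
    lowered = text.lower()
    return any(re.search(p, lowered) for p in patterns)

print(is_programming_request("Write a Python script for me"))   # True
print(is_programming_request("Summarize yesterday's incidents"))  # False
```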
Custom Rejection Message:
PROGRAMMING_REJECTION_MESSAGE = "I cannot help with programming, maybe I can help you with a query regarding incidents"

3. Semantic Similarity Filter

Detects prompts semantically similar to denylist entries, catching paraphrases and variations. Implementation (src/copilot/guardrails/prompt_guardrails.py:164):
def _semantic_check(self, text: str) -> bool:
    """Perform semantic similarity check against denylist entries.
    
    Args:
        text: The text to check
    
    Returns:
        True if text is semantically similar to denylist entries (should reject)
    """
    if not SEMANTIC_AVAILABLE:
        return False
    
    try:
        if self._semantic_model is None:
            self._init_semantic_model()
        if self._semantic_model is None:
            # Model initialization failed - fail closed
            logger.warning("Semantic model unavailable, failing closed")
            return True
        
        # Encode the text
        emb = self._semantic_model.encode([text], convert_to_numpy=True)
        norm = np.linalg.norm(emb, axis=1, keepdims=True)
        norm[norm == 0] = 1.0
        emb = emb / norm
        
        # Compare with denylist embeddings
        if getattr(self, "_denylist_embeddings", None) is not None:
            sims = np.dot(self._denylist_embeddings, emb.T).squeeze()
            max_sim = float(np.max(sims)) if sims.size else 0.0
            if max_sim >= self.semantic_threshold:
                return True
        
    except Exception as e:
        # Fail closed: reject on error to prevent bypass
        logger.error(f"Semantic check failed, rejecting prompt: {e}")
        return True
    
    return False
How it works:
  1. Initialization: Precomputes embeddings for all denylist entries
  2. Runtime: Embeds user prompt using all-MiniLM-L6-v2
  3. Comparison: Computes cosine similarity against denylist embeddings
  4. Threshold: Rejects if similarity ≥ 0.7
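Steps 3–4 reduce to a dot product once both sides are L2-normalized. A toy illustration (hand-made vectors, not real sentence embeddings):

```python
import numpy as np

# Toy illustration of the cosine comparison; the denylist embeddings are
# assumed to be L2-normalized already, as _init_semantic_model guarantees.
def max_cosine_similarity(prompt_emb: np.ndarray, denylist_embeddings: np.ndarray) -> float:
    norm = np.linalg.norm(prompt_emb)
    emb = prompt_emb / (norm if norm else 1.0)  # normalize the prompt embedding
    sims = denylist_embeddings @ emb            # cosine similarities via dot product
    return float(np.max(sims))

denylist = np.array([[1.0, 0.0], [0.0, 1.0]])  # two pre-normalized entries
print(max_cosine_similarity(np.array([2.0, 0.0]), denylist))  # 1.0
```

With `semantic_threshold = 0.7`, any prompt whose maximum similarity reaches 0.7 is rejected.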
Model Initialization:
def _init_semantic_model(self) -> None:
    """Load the sentence-transformers model and precompute denylist embeddings."""
    if not SEMANTIC_AVAILABLE:
        return
    if self._semantic_model is None:
        self._semantic_model = SentenceTransformer(self.semantic_model_name)
    
    self._entries = list(self.phrases) + list(self.words)
    if self._entries:
        embeds = self._semantic_model.encode(self._entries, convert_to_numpy=True)
        norms = np.linalg.norm(embeds, axis=1, keepdims=True)
        norms[norms == 0] = 1.0
        self._denylist_embeddings = embeds / norms  # Normalize for cosine similarity

Main Validation Method

The primary entry point for prompt validation:
def validate_or_reject(
    self,
    prompt: str,
    is_context: bool = False
) -> Tuple[bool, str]:
    """Validate a prompt and return approval status.
    
    Args:
        prompt: The prompt text to validate
        is_context: If True, skip semantic check (for context validation)
    
    Returns:
        Tuple of (is_allowed, rejection_message).
        is_allowed is True with empty message if prompt is allowed.
        is_allowed is False with message if prompt should be rejected.
    """
    # 1. Check programming patterns
    try:
        if self._matches_programming_regex(prompt):
            return False, PROGRAMMING_REJECTION_MESSAGE
    except Exception as e:
        # Fail closed: reject on error
        logger.error(f"Programming regex check failed, rejecting prompt: {e}")
        return False, REJECTION_MESSAGE
    
    # 2. Check keyword denylist
    if self.contains_denylist_keyword(prompt):
        return False, REJECTION_MESSAGE
    
    # 3. Skip semantic check for context validation
    if is_context:
        return True, ""
    
    # 4. Check semantic similarity
    try:
        if self._semantic_check(prompt):
            return False, REJECTION_MESSAGE
    except Exception as e:
        # Fail closed: reject on error
        logger.error(f"Semantic check failed, rejecting prompt: {e}")
        return False, REJECTION_MESSAGE
    
    return True, ""
Validation Flow:
  1. Programming Pattern Check → If match, reject with programming-specific message
  2. Keyword Denylist Check → If match, reject with standard message
  3. Context Mode Skip → If is_context=True, skip semantic check (for validating tool results)
  4. Semantic Similarity Check → If similarity ≥ threshold, reject with standard message
  5. Pass → Return (True, "") if all checks pass

Rejection Messages

# Standard rejection message
REJECTION_MESSAGE = "I cannot help with that, maybe I can help you with a query regarding incidents"

# Programming-specific rejection
PROGRAMMING_REJECTION_MESSAGE = "I cannot help with programming, maybe I can help you with a query regarding incidents"

Usage Example

from src.copilot.guardrails.prompt_guardrails import PromptGuardrail

# Initialize guardrail
guardrail = PromptGuardrail(
    deny_words="politics,religion",
    denylist_path="/path/to/custom_denylist.json"
)

# Validate user prompt
user_prompt = "How do I fix the payment gateway timeout?"
is_allowed, rejection_msg = guardrail.validate_or_reject(user_prompt)

if is_allowed:
    # Process the prompt
    response = agent.invoke(user_prompt)
else:
    # Return rejection message
    print(rejection_msg)

Error Handling

All validation methods follow the fail-closed principle:
try:
    if self._semantic_check(prompt):
        return False, REJECTION_MESSAGE
except Exception as e:
    # Fail closed: reject on error to prevent bypass
    logger.error(f"Semantic check failed, rejecting prompt: {e}")
    return False, REJECTION_MESSAGE
Why fail closed? If the semantic model fails to load, a regex fails to compile, or any validation step throws an exception, the prompt is rejected. This prevents attackers from exploiting errors to bypass security checks.
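The pattern generalizes to a small helper (hypothetical, for illustration only; not part of the guardrail API): any exception raised by a check is treated as a rejection, never as a pass.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical helper illustrating fail-closed semantics: an exception in a
# check is treated as "reject", never as "allow".
def run_check_fail_closed(check, text: str) -> bool:
    """Return True if the prompt should be rejected."""
    try:
        return bool(check(text))
    except Exception as e:
        logger.error(f"Check failed, rejecting prompt: {e}")
        return True  # fail CLOSED

def broken_check(text: str) -> bool:
    raise RuntimeError("model unavailable")

print(run_check_fail_closed(broken_check, "hello"))     # True -> rejected
print(run_check_fail_closed(lambda t: False, "hello"))  # False -> allowed
```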

Denylist Management

Loading Denylist

def _load_denylist(self) -> None:
    """Load denylist from file and settings."""
    raw = []
    
    # Load from JSON file
    try:
        with open(self.denylist_path, "r", encoding="utf-8") as fh:
            data = json.load(fh)
            if isinstance(data, dict) and "denylist" in data:
                raw = data.get("denylist", [])
            elif isinstance(data, list):
                raw = data
    except FileNotFoundError:
        logger.warning(f"Denylist file not found: {self.denylist_path}")
    except json.JSONDecodeError as e:
        logger.error(f"Invalid JSON in denylist file: {e}")
    
    # Merge with settings-based deny_words
    if self.deny_words:
        settings_words = [
            word.strip() for word in self.deny_words.split(",") if word.strip()
        ]
        raw.extend(settings_words)
    
    # Split into phrases and words
    entries = [str(x).strip().lower() for x in raw if x]
    self.phrases = [p for p in entries if " " in p]  # Multi-word
    self.words = set(p for p in entries if " " not in p)  # Single-word
Supported Formats:
// Format 1: Object with denylist key
{
  "denylist": ["word1", "phrase two", "word3"]
}

// Format 2: Array
["word1", "phrase two", "word3"]

Default Denylist Location

self.denylist_path = (
    Path(__file__).resolve().parents[3] / "data" / "denylist.json"
)
Relative to src/copilot/guardrails/prompt_guardrails.py, this resolves to data/denylist.json at the project root.
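The `parents[3]` arithmetic can be checked with a throwaway path (the `/project` root below is hypothetical):

```python
from pathlib import PurePosixPath

# Hypothetical layout: <root>/src/copilot/guardrails/prompt_guardrails.py
# parents[0]=guardrails, parents[1]=copilot, parents[2]=src, parents[3]=<root>
module = PurePosixPath("/project/src/copilot/guardrails/prompt_guardrails.py")
denylist_path = module.parents[3] / "data" / "denylist.json"
print(denylist_path)  # /project/data/denylist.json
```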

Performance Considerations

Lazy Loading

Semantic model is loaded lazily on first use:
if self._semantic_model is None:
    self._init_semantic_model()

Precomputed Embeddings

Denylist embeddings are computed once during initialization:
self._denylist_embeddings = embeds / norms  # Precomputed normalized embeddings
Runtime validation only needs to:
  1. Encode the user prompt (fast)
  2. Compute dot product with precomputed embeddings (fast)

Context Mode Optimization

When validating tool results (not user input), skip expensive semantic checks:
guardrail.validate_or_reject(tool_result, is_context=True)
