Overview
The PromptGuardrail class provides security validation for user prompts before they’re processed by the agent. It filters inappropriate, out-of-scope, or potentially malicious queries using keyword denylists, semantic similarity, and pattern matching.
Security Policy: Fail Closed
All guardrails fail CLOSED: if an error occurs during validation, the prompt is rejected. This prevents potentially malicious prompts from bypassing security checks due to errors.
Guardrail Class
Implementation in src/copilot/guardrails/prompt_guardrails.py:29:
class PromptGuardrail:
    """Guardrail that rejects prompts containing denylist keywords or programming requests.

    Behavior:
    - Treats single-word entries as token matches (word boundaries).
    - Only rejects when a denylist phrase or token is present (case-insensitive).
    - Fails CLOSED on errors: rejects prompts if validation cannot complete.

    This intentionally allows vague prompts and misspellings; only exact denylist
    entries trigger rejection.
    """

    def __init__(
        self,
        deny_words: Optional[str] = None,
        denylist_path: Optional[str] = None,
    ) -> None:
        """Initialize the guardrail with denylist configuration.

        Args:
            deny_words: Comma-separated string of words to deny
            denylist_path: Path to JSON file containing denylist
        """
        self.deny_words = deny_words
        if denylist_path:
            self.denylist_path = Path(denylist_path)
        else:
            self.denylist_path = (
                Path(__file__).resolve().parents[3] / "data" / "denylist.json"
            )
        self.semantic_threshold = 0.7
        self.semantic_model_name = "all-MiniLM-L6-v2"
        self._semantic_model: Optional[SentenceTransformer] = None
        self._denylist_embeddings: Optional[np.ndarray] = None
        self._entries: list = []
        self._load_denylist()
        if SEMANTIC_AVAILABLE:
            try:
                self._init_semantic_model()
            except Exception as e:
                logger.error(f"Failed to initialize semantic model: {e}")
                self._semantic_model = None
Configuration
Parameters:
deny_words: Comma-separated string of additional words to deny (merged with file-based denylist)
denylist_path: Custom path to JSON denylist file (defaults to data/denylist.json)
Semantic Model Settings:
semantic_threshold: Cosine similarity threshold for semantic matching (default: 0.7)
semantic_model_name: Sentence transformer model (default: all-MiniLM-L6-v2)
Validation Methods
1. Keyword Denylist
Exact keyword and phrase matching (case-insensitive).
Implementation (src/copilot/guardrails/prompt_guardrails.py:101):
def contains_denylist_keyword(self, text: str) -> bool:
    """Check if text contains any denylist keywords.

    Args:
        text: The text to check

    Returns:
        True if text contains denylist keywords, False otherwise
    """
    if not text:
        return False
    lowered = text.lower()
    # Check phrases (multi-word entries)
    for phrase in self.phrases:
        if phrase in lowered:
            return True
    # Check words (single-word entries with token boundaries)
    tokens = re.findall(r"\w+", lowered)
    token_set = set(tokens)
    if any(word in token_set for word in self.words):
        return True
    return False
Denylist Structure:
{
  "denylist": [
    "politics",
    "election",
    "violent content",
    "explicit material"
  ]
}
Matching Logic:
- Phrases (multi-word): Substring match (e.g., “violent content” matches “create violent content”)
- Words (single-word): Token match with word boundaries (e.g., “politics” matches “politics” but not “geopolitics”)
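The two matching modes can be illustrated with a minimal, self-contained sketch of the same logic (the `phrases`/`words` split and the token check mirror the method above; the sample entries are hypothetical):

```python
import re

phrases = ["violent content"]  # multi-word entries: substring match
words = {"politics"}           # single-word entries: token match

def contains_denylist_keyword(text: str) -> bool:
    lowered = text.lower()
    # Multi-word phrases match anywhere in the text
    if any(p in lowered for p in phrases):
        return True
    # Single words must match a whole token, respecting word boundaries
    tokens = set(re.findall(r"\w+", lowered))
    return any(w in tokens for w in words)

print(contains_denylist_keyword("create Violent Content now"))  # True (phrase substring)
print(contains_denylist_keyword("talk about politics"))         # True (token match)
print(contains_denylist_keyword("geopolitics of shipping"))     # False (no token match)
```

Note that "geopolitics" is allowed: the token set contains `geopolitics`, not `politics`, so the word-level check does not fire.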
2. Programming Pattern Detection
Rejects requests for code generation or programming help.
Implementation (src/copilot/guardrails/prompt_guardrails.py:126):
def _matches_programming_regex(self, text: str) -> bool:
    """Check if text matches programming/code-generation patterns.

    Args:
        text: The text to check

    Returns:
        True if text appears to be a programming request
    """
    if not text:
        return False
    lowered = text.lower()
    patterns = [
        r"\bwrite\s+(a|an|the)?\s*.*\b(program|script|application|example|code)\b",
        r"\b(example|sample|show|provide)\b.*\b(code|program|script)\b",
        # "c++" is matched as a plain substring: \b cannot follow "+",
        # so placing it inside \b(...)\b would never match.
        r"c\+\+|\b(cpp|java|python|golang|go|rust|javascript|ts|typescript)\b",
        r"\bhow to implement\b",
        r"\b(read|open|write)\s+(a\s+)?file\b",
    ]
    for p in patterns:
        if re.search(p, lowered):
            return True
    return False
Blocked Patterns:
- “write a program/script/application”
- “show me example code”
- Programming language names (Python, Java, C++, etc.)
- “how to implement”
- File I/O operations (“read a file”, “write to file”)
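As a quick illustration, two of the patterns quoted above can be exercised standalone (the helper `matches` is a hypothetical wrapper, not part of the class):

```python
import re

# Two of the documented programming-detection patterns
patterns = [
    r"\bwrite\s+(a|an|the)?\s*.*\b(program|script|application|example|code)\b",
    r"\bhow to implement\b",
]

def matches(text: str) -> bool:
    lowered = text.lower()
    return any(re.search(p, lowered) for p in patterns)

print(matches("Write a Python script to parse logs"))  # True
print(matches("How do I fix the payment gateway?"))    # False
```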
Custom Rejection Message:
PROGRAMMING_REJECTION_MESSAGE = "I cannot help with programming, maybe I can help you with a query regarding incidents"
3. Semantic Similarity Filter
Detects prompts semantically similar to denylist entries, catching paraphrases and variations.
Implementation (src/copilot/guardrails/prompt_guardrails.py:164):
def _semantic_check(self, text: str) -> bool:
    """Perform semantic similarity check against denylist entries.

    Args:
        text: The text to check

    Returns:
        True if text is semantically similar to denylist entries (should reject)
    """
    if not SEMANTIC_AVAILABLE:
        return False
    try:
        if self._semantic_model is None:
            self._init_semantic_model()
        if self._semantic_model is None:
            # Model initialization failed - fail closed
            logger.warning("Semantic model unavailable, failing closed")
            return True
        # Encode the text
        emb = self._semantic_model.encode([text], convert_to_numpy=True)
        norm = np.linalg.norm(emb, axis=1, keepdims=True)
        norm[norm == 0] = 1.0
        emb = emb / norm
        # Compare with denylist embeddings
        if getattr(self, "_denylist_embeddings", None) is not None:
            sims = np.dot(self._denylist_embeddings, emb.T).squeeze()
            max_sim = float(np.max(sims)) if sims.size else 0.0
            if max_sim >= self.semantic_threshold:
                return True
    except Exception as e:
        # Fail closed: reject on error to prevent bypass
        logger.error(f"Semantic check failed, rejecting prompt: {e}")
        return True
    return False
How it works:
- Initialization: Precomputes embeddings for all denylist entries
- Runtime: Embeds the user prompt using all-MiniLM-L6-v2
- Comparison: Computes cosine similarity against denylist embeddings
- Threshold: Rejects if similarity ≥ 0.7
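The similarity step itself reduces to a dot product of unit vectors. A minimal sketch, using stand-in 4-dimensional vectors in place of real all-MiniLM-L6-v2 embeddings (which are 384-dimensional):

```python
import numpy as np

def normalize(m: np.ndarray) -> np.ndarray:
    # L2-normalize each row, guarding against zero norms (as the class does)
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return m / norms

# Stand-ins for precomputed denylist embeddings and a prompt embedding
denylist_emb = normalize(np.array([[1.0, 0.0, 0.0, 0.0],
                                   [0.0, 1.0, 0.0, 0.0]]))
prompt_emb = normalize(np.array([[0.9, 0.1, 0.0, 0.0]]))

# Rows are unit vectors, so the dot product IS the cosine similarity
sims = denylist_emb @ prompt_emb.T  # shape (2, 1)
max_sim = float(sims.max())
print(max_sim >= 0.7)  # True: the prompt is close to the first entry
```

Because both sides are normalized up front, no division is needed at comparison time; the threshold check is a single `max` over the dot products.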
Model Initialization:
def _init_semantic_model(self) -> None:
    """Load the sentence-transformers model and precompute denylist embeddings."""
    if not SEMANTIC_AVAILABLE:
        return
    if self._semantic_model is None:
        self._semantic_model = SentenceTransformer(self.semantic_model_name)
    self._entries = list(self.phrases) + list(self.words)
    if self._entries:
        embeds = self._semantic_model.encode(self._entries, convert_to_numpy=True)
        norms = np.linalg.norm(embeds, axis=1, keepdims=True)
        norms[norms == 0] = 1.0
        self._denylist_embeddings = embeds / norms  # Normalize for cosine similarity
Main Validation Method
The primary entry point for prompt validation:
def validate_or_reject(
    self,
    prompt: str,
    is_context: bool = False,
) -> Tuple[bool, str]:
    """Validate a prompt and return approval status.

    Args:
        prompt: The prompt text to validate
        is_context: If True, skip semantic check (for context validation)

    Returns:
        Tuple of (is_allowed, rejection_message).
        is_allowed is True with empty message if prompt is allowed.
        is_allowed is False with message if prompt should be rejected.
    """
    # 1. Check programming patterns
    try:
        if self._matches_programming_regex(prompt):
            return False, PROGRAMMING_REJECTION_MESSAGE
    except Exception as e:
        # Fail closed: reject on error
        logger.error(f"Programming regex check failed, rejecting prompt: {e}")
        return False, REJECTION_MESSAGE
    # 2. Check keyword denylist
    if self.contains_denylist_keyword(prompt):
        return False, REJECTION_MESSAGE
    # 3. Skip semantic check for context validation
    if is_context:
        return True, ""
    # 4. Check semantic similarity
    try:
        if self._semantic_check(prompt):
            return False, REJECTION_MESSAGE
    except Exception as e:
        # Fail closed: reject on error
        logger.error(f"Semantic check failed, rejecting prompt: {e}")
        return False, REJECTION_MESSAGE
    return True, ""
Validation Flow:
- Programming Pattern Check → If match, reject with programming-specific message
- Keyword Denylist Check → If match, reject with standard message
- Context Mode Skip → If is_context=True, skip semantic check (for validating tool results)
- Semantic Similarity Check → If similarity ≥ threshold, reject with standard message
- Pass → Return (True, "") if all checks pass
Rejection Messages
# Standard rejection message
REJECTION_MESSAGE = "I cannot help with that, maybe I can help you with a query regarding incidents"
# Programming-specific rejection
PROGRAMMING_REJECTION_MESSAGE = "I cannot help with programming, maybe I can help you with a query regarding incidents"
Usage Example
from src.copilot.guardrails.prompt_guardrails import PromptGuardrail

# Initialize guardrail
guardrail = PromptGuardrail(
    deny_words="politics,religion",
    denylist_path="/path/to/custom_denylist.json",
)

# Validate user prompt
user_prompt = "How do I fix the payment gateway timeout?"
is_allowed, rejection_msg = guardrail.validate_or_reject(user_prompt)

if is_allowed:
    # Process the prompt
    response = agent.invoke(user_prompt)
else:
    # Return rejection message
    print(rejection_msg)
Error Handling
All validation methods follow the fail-closed principle:
try:
    if self._semantic_check(prompt):
        return False, REJECTION_MESSAGE
except Exception as e:
    # Fail closed: reject on error to prevent bypass
    logger.error(f"Semantic check failed, rejecting prompt: {e}")
    return False, REJECTION_MESSAGE
Why fail closed?
If semantic model loading fails, regex compilation errors, or any validation step throws an exception, the prompt is rejected. This prevents attackers from exploiting errors to bypass security checks.
Denylist Management
Loading Denylist
def _load_denylist(self) -> None:
    """Load denylist from file and settings."""
    raw = []
    # Load from JSON file
    try:
        with open(self.denylist_path, "r", encoding="utf-8") as fh:
            data = json.load(fh)
            if isinstance(data, dict) and "denylist" in data:
                raw = data.get("denylist", [])
            elif isinstance(data, list):
                raw = data
    except FileNotFoundError:
        logger.warning(f"Denylist file not found: {self.denylist_path}")
    except json.JSONDecodeError as e:
        logger.error(f"Invalid JSON in denylist file: {e}")
    # Merge with settings-based deny_words
    if self.deny_words:
        settings_words = [
            word.strip() for word in self.deny_words.split(",") if word.strip()
        ]
        raw.extend(settings_words)
    # Split into phrases and words
    entries = [str(x).strip().lower() for x in raw if x]
    self.phrases = [p for p in entries if " " in p]        # Multi-word
    self.words = set(p for p in entries if " " not in p)   # Single-word
Supported Formats:
// Format 1: Object with denylist key
{
  "denylist": ["word1", "phrase two", "word3"]
}

// Format 2: Array
["word1", "phrase two", "word3"]
Default Denylist Location
self.denylist_path = (
    Path(__file__).resolve().parents[3] / "data" / "denylist.json"
)
Relative to src/copilot/guardrails/prompt_guardrails.py, this resolves to data/denylist.json at the project root.
Lazy Loading
Semantic model is loaded lazily on first use:
if self._semantic_model is None:
    self._init_semantic_model()
Precomputed Embeddings
Denylist embeddings are computed once during initialization:
self._denylist_embeddings = embeds / norms # Precomputed normalized embeddings
Runtime validation only needs to:
- Encode the user prompt (fast)
- Compute dot product with precomputed embeddings (fast)
Context Mode Optimization
When validating tool results (not user input), skip expensive semantic checks:
guardrail.validate_or_reject(tool_result, is_context=True)