Taxonomy
Direct prompt injection
The user directly inputs malicious instructions into the model’s input field.
Indirect (third-party) injection
Attacker instructions are embedded in external content (web pages, documents, emails) that an agent processes.
Prompt leaking
Tricks the model into revealing its system prompt or confidential instructions.
Jailbreak
Bypasses safety training to produce content the model would normally refuse.
Direct injection techniques
Authority assertion / rule override
Context switching / storytelling
Dual persona (DAN and variants)
Encoding and obfuscation
Payload splitting
Breaking a disallowed request across multiple messages:Translation trick
Indirect prompt injection
When an LLM agent can browse the web, read files, or use plugins, attackers can place injected prompts in content the agent will process.Web page injection
Common hiding techniques in HTML/CSS
| Technique | Example |
|---|---|
| Zero-size text | font-size: 0 / line-height: 0 |
| Invisible colour | color: white on white background |
| Off-screen | position: absolute; left: -9999px |
| Hidden elements | display: none / visibility: hidden |
| SVG CDATA | Instructions in <![CDATA[...]]> blocks |
data-* attributes | Instructions in HTML attributes |
| Runtime assembly | Base64 payload decoded by JavaScript after page load |
GitHub Copilot injection (picture tag smuggling)
Trail of Bits demonstrated that Copilot’s coding agent processes GitHub Issue text verbatim. An attacker can hide a prompt inside a<picture> tag (which GitHub renders as empty) and inject instructions to backdoor the generated code:
- Add a legitimate-looking dependency
- Modify the lock file to point to an attacker-controlled package
- Include a backdoor that triggers on a specific HTTP header
Prompt leaking
- Proprietary business logic in the system prompt
- API keys or credentials embedded in instructions
- Information about connected tools and capabilities
Agentic browsing attack chain
For LLMs with web browsing capabilities:Seed injection in indexed content
Place attacker instructions in a user-generated area of a trusted domain (e.g., blog comments). When the agent summarises the article, it reads the comments and may execute the injection.
Link exfiltration via trusted redirectors
Some platforms (e.g., Bing) use trusted URL redirectors. Pre-index attacker pages, one per alphabet character, then exfiltrate secrets by having the agent render sequences of redirector links.
Defences
Privilege separation
Privilege separation
Ensure system prompts are truly privileged and cannot be overridden by user input. Use model providers that enforce system/user prompt separation at the API level.
Input sanitisation for agentic pipelines
Input sanitisation for agentic pipelines
Strip or neutralise known injection patterns from any external content before it is added to the model context. Treat all external data as untrusted.
Output filtering
Output filtering
Check model outputs for unexpected patterns: encoded data, exfiltration URLs, code that was not requested, or disclosure of internal state.
Prompt Guard / classifier models
Prompt Guard / classifier models
Use a dedicated safety classifier (e.g., Llama Prompt Guard 2) to screen user inputs before they reach the main model. These models are trained to detect injection attempts.
Least-privilege for agentic tools
Least-privilege for agentic tools
Scope tool permissions tightly. An agent that only needs to read documents should not have write or network access.
Monitor for anomalous outputs
Monitor for anomalous outputs
Alert if the model’s output contains unexpected phrases, exfiltration-style URLs, or content inconsistent with the user’s query.
Tools
| Tool | Purpose |
|---|---|
| promptmap | Automated prompt injection testing |
| garak | LLM vulnerability scanner |
| PyRIT | Microsoft’s Python Risk Identification Toolkit for generative AI |
| Adversarial Robustness Toolbox | IBM ART — covers adversarial ML broadly |