Production Best Practices - Agent Identity Protocol

This guide covers operational best practices for running AIP in production. Whether you’re deploying locally or in Kubernetes, these recommendations will help you harden security, maintain auditability, and ensure high availability.

Security Hardening

Policy File Protection

The policy file is the root of trust for AIP. Compromising it means bypassing all authorization checks.

Critical: AIP automatically protects the policy file from agent access, but you must secure it at the filesystem and deployment level.

Local Deployment

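# Set restrictive permissions (owner read-only)
# If AIP runs as a non-root user (e.g. User=aip under systemd), use chown root:aip and chmod 440 instead so the proxy can still read the file
chmod 400 /etc/aip/agent.yaml
chown root:root /etc/aip/agent.yaml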

# Verify
ls -la /etc/aip/agent.yaml
# Expected: -r-------- 1 root root 1234 Mar 03 10:00 agent.yaml

Kubernetes Deployment

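Store policies in Secrets instead of ConfigMaps. Secrets are encrypted at rest only when encryption at rest is enabled for the cluster, and access to them can be restricted separately with RBAC: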

apiVersion: v1
kind: Secret
metadata:
  name: agent-policy
  namespace: default
type: Opaque
stringData:
  policy.yaml: |
    apiVersion: aip.io/v1alpha1
    kind: AgentPolicy
    metadata:
      name: production-agent
    spec:
      mode: enforce
      allowed_tools:
        - read_file

Mount as read-only volume:

volumeMounts:
  - name: aip-policy
    mountPath: /etc/aip
    readOnly: true  # Prevent writes
volumes:
  - name: aip-policy
    secret:
      secretName: agent-policy
      defaultMode: 0400  # Read-only for owner

Policy Signing (v1alpha2)

Sign policies to prevent tampering:

# Generate signing key
openssl genpkey -algorithm Ed25519 -out policy-signing-key.pem

# Sign policy
aip sign-policy --key policy-signing-key.pem --policy agent.yaml > agent.signed.yaml

The signed policy includes a cryptographic signature in metadata:

metadata:
  name: production-agent
  signature: "ed25519:YWJjZGVm..."

AIP verifies the signature on load. Unsigned or tampered policies are rejected.

Store signing keys in a secrets manager (AWS Secrets Manager, HashiCorp Vault, Kubernetes Secrets with encryption at rest).

Principle of Least Privilege

Start with minimal permissions and expand only as needed:

Deploy with an empty allowlist

spec:
  mode: monitor  # Don't block yet
  allowed_tools: []  # Deny all by default

Run for 24-48 hours and collect audit logs

# Identify all tools the agent attempted to use
cat aip-audit.jsonl | jq -r '.tool' | sort | uniq

Add only the necessary tools to allowed_tools

spec:
  mode: enforce  # Now block violations
  allowed_tools:
    - read_file
    - list_directory
    - github_get_repo  # Only what's needed

Review logs monthly and prune unused tools

# List tools used since 2026-02-01; any entry in allowed_tools missing from this list is a pruning candidate
jq -r 'select(.timestamp > "2026-02-01") | .tool' aip-audit.jsonl | sort -u
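
The pruning step can also be automated by comparing the allowlist against the tools seen in the audit log. The following is a minimal sketch, assuming each audit line is a JSON object with the `timestamp` and `tool` fields used in the jq queries above, and that timestamps are ISO 8601 in UTC with a trailing `Z`; the function name `unused_tools` is hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone


def unused_tools(allowed_tools, audit_lines, now, days=30):
    """Return allowed tools with no audit entry in the last `days` days."""
    cutoff = now - timedelta(days=days)
    used = set()
    for line in audit_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        # Assumption: timestamps look like 2026-03-03T10:00:00Z (UTC).
        ts = datetime.fromisoformat(entry["timestamp"].replace("Z", "+00:00"))
        if ts >= cutoff:
            used.add(entry["tool"])
    # Anything allowed but never seen in the window is a pruning candidate.
    return sorted(set(allowed_tools) - used)
```

Pass the `allowed_tools` list from the policy and the lines of `aip-audit.jsonl`; `now` must be a timezone-aware datetime such as `datetime.now(timezone.utc)`.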

DLP Pattern Hardening

Use comprehensive DLP patterns to prevent exfiltration:

spec:
  dlp:
    enabled: true
    patterns:
      # API Keys and Tokens
      - name: "AWS Key"
        regex: "(A3T[A-Z0-9]|AKIA|AGPA|AIDA|AROA|AIPA|ANPA|ANVA|ASIA)[A-Z0-9]{16}"
      
      - name: "GitHub Token"
        regex: "gh[pousr]_[a-zA-Z0-9]{36,255}"
      
      - name: "OpenAI API Key"
        regex: "sk-[a-zA-Z0-9]{48}"
      
      - name: "Stripe Key"
        regex: "sk_(test|live)_[a-zA-Z0-9]{24,99}"
      
      # PII
      - name: "Email Address"
        regex: "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"
      
      - name: "SSN"
        regex: "\\b\\d{3}-\\d{2}-\\d{4}\\b"
      
      - name: "Credit Card"
        regex: "\\b(?:\\d{4}[- ]?){3}\\d{4}\\b"
      
      # Private Keys
      - name: "Private Key"
        regex: "-----BEGIN (RSA |EC |DSA |OPENSSH )?PRIVATE KEY-----"
      
      # Internal URLs
      - name: "Internal IP"
        regex: "\\b(10\\.|172\\.(1[6-9]|2[0-9]|3[01])\\.|192\\.168\\.)\\d{1,3}\\.\\d{1,3}\\b"

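DLP patterns must use RE2 syntax (not PCRE), so lookarounds and backreferences are unavailable. Test patterns before deployment:

# Use grep -E, not grep -P: PCRE accepts constructs (lookarounds, backreferences) that RE2 rejects
echo "AKIAIOSFODNN7EXAMPLE" | grep -E "AKIA[A-Z0-9]{16}"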
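
For more than one or two patterns, a small test harness is easier to maintain than individual grep commands. The sketch below copies three patterns from the policy above (with the YAML escaping removed) and reports which ones match a sample string. Note the assumption: Python's `re` module is a backtracking engine, not RE2, so this checks sample matches only and does not prove RE2 compatibility. The helper name `matching_rules` is hypothetical.

```python
import re

# Patterns copied from the DLP policy above, YAML escaping removed.
PATTERNS = {
    "AWS Key": r"(A3T[A-Z0-9]|AKIA|AGPA|AIDA|AROA|AIPA|ANPA|ANVA|ASIA)[A-Z0-9]{16}",
    "GitHub Token": r"gh[pousr]_[a-zA-Z0-9]{36,255}",
    "SSN": r"\b\d{3}-\d{2}-\d{4}\b",
}


def matching_rules(text, patterns=PATTERNS):
    """Return the names of all patterns that match anywhere in `text`."""
    return [name for name, rx in patterns.items() if re.search(rx, text)]
```

Keeping a list of known-positive and known-negative samples next to the policy lets CI catch a pattern that stops matching after an edit.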

Protected Paths

Block access to sensitive files and directories:

spec:
  protected_paths:
    # SSH keys
    - ~/.ssh
    - /root/.ssh
    
    # Cloud credentials
    - ~/.aws/credentials
    - ~/.config/gcloud
    - ~/.azure
    
    # Environment files
    - .env
    - .env.local
    - .env.production
    
    # Package manager credentials
    - ~/.npmrc
    - ~/.pypirc
    - ~/.docker/config.json
    
    # Database credentials
    - /etc/postgresql
    - /var/lib/mysql

AIP automatically blocks any tool argument containing these paths.
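
To reason about which arguments a given list will block, it can help to model the check. This is a rough sketch only: it assumes plain substring matching (based on "containing these paths") and that a leading `~` expands to the agent's home directory. AIP's actual matching logic is not described here and may differ; `touches_protected_path` is a hypothetical name.

```python
def touches_protected_path(argument, protected_paths, home):
    """Return True if `argument` contains any protected path.

    A leading "~" in the argument or in a protected path is replaced
    with `home`. Matching is by substring (an assumption).
    """
    def expand(path):
        return home + path[1:] if path.startswith("~") else path

    expanded_argument = expand(argument)
    return any(expand(p) in expanded_argument for p in protected_paths)
```

Under substring matching, an entry such as `.env` also matches unrelated names like `config.environment`, so review blocked requests for false positives after adding short entries.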

Policy Management

Versioning and Change Control

Treat policies as infrastructure-as-code:

metadata:
  name: production-agent
  version: "2.1.0"  # Semantic versioning
  owner: [email protected]
spec:
  # ...

Recommended workflow:

Store policies in Git
Require pull requests for changes
Run conformance tests in CI/CD
Tag releases (e.g., v2.1.0)
Deploy via GitOps (ArgoCD, Flux)

Policy Review Cadence

Frequency	Review Type	Action
Weekly	Audit log review	Check for unexpected tool usage
Monthly	Permission pruning	Remove unused tools from `allowed_tools`
Quarterly	DLP pattern updates	Add new secret patterns (e.g., new API key formats)
Annually	Full security audit	External review of policy logic

Environment-Specific Policies

Use separate policies for dev, staging, and production:

# Development (monitor mode)
policies/dev/agent.yaml

# Staging (enforce with logging)
policies/staging/agent.yaml

# Production (enforce + DLP + rate limits)
policies/production/agent.yaml

Example production policy (stricter than dev):

# policies/production/agent.yaml
apiVersion: aip.io/v1alpha1
kind: AgentPolicy
metadata:
  name: production-agent
  version: "2.0.0"
spec:
  mode: enforce  # Never use monitor in production
  allowed_tools:
    - read_file
    - github_get_repo
  tool_rules:
    - tool: github_create_pull
      action: ask  # Human approval required
      rate_limit: "5/hour"
  dlp:
    enabled: true
    patterns:
      - name: "Production Database URL"
        regex: "postgres://.*@prod-db\\."

Monitoring and Alerting

Audit Log Monitoring

Critical Events to Alert On

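# Count blocked requests in the last 5 minutes; alert if the count exceeds 10
# (assumes ISO 8601 UTC timestamps and GNU date)
since=$(date -u -d '5 minutes ago' +%Y-%m-%dT%H:%M:%SZ)
jq -c --arg since "$since" 'select(.decision == "BLOCK" and .timestamp >= $since)' aip-audit.jsonl | wc -l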
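
The same threshold check can be scripted for use in a cron job or a custom exporter. This is a minimal sketch, assuming audit entries carry the `timestamp` and `decision` fields used elsewhere in this guide and that timestamps are ISO 8601 in UTC with a trailing `Z`; the function names are hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone


def blocked_in_window(audit_lines, now, window=timedelta(minutes=5)):
    """Count BLOCK decisions whose timestamp falls within `window` of `now`."""
    cutoff = now - window
    count = 0
    for line in audit_lines:
        if not line.strip():
            continue
        entry = json.loads(line)
        if entry.get("decision") != "BLOCK":
            continue
        # Assumption: timestamps look like 2026-03-03T10:00:00Z (UTC).
        ts = datetime.fromisoformat(entry["timestamp"].replace("Z", "+00:00"))
        if cutoff <= ts <= now:
            count += 1
    return count


def should_alert(audit_lines, now, threshold=10):
    """Return True when more than `threshold` requests were blocked."""
    return blocked_in_window(audit_lines, now) > threshold
```

Counting parsed entries rather than output lines avoids miscounts when a tool pretty-prints each JSON object across several lines.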

Prometheus Alerting Rules

For Kubernetes deployments:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: aip-alerts
  namespace: aip-system
spec:
  groups:
    - name: aip_policy_violations
      interval: 1m
      rules:
        - alert: AIPHighBlockRate
          expr: rate(aip_policy_decisions{decision="BLOCK"}[5m]) > 0.1
          for: 5m
          annotations:
            summary: "High rate of blocked agent requests"
            description: "Agent {{ $labels.agent }} has {{ $value }} blocked requests/sec"
        
        - alert: AIPDLPTriggered
          expr: increase(aip_dlp_redactions_total[5m]) > 0
          for: 1m
          annotations:
            summary: "DLP pattern matched in agent response"
            description: "Rule {{ $labels.rule }} triggered in agent {{ $labels.agent }}"
        
        - alert: AIPPolicyLoadFailed
          expr: increase(aip_policy_load_errors_total[5m]) > 0
          for: 1m
          annotations:
            summary: "AIP sidecar failed to load policy"
            description: "Check policy syntax and signature"

Centralized Logging

Forward audit logs to a SIEM or log aggregator:

Splunk

# Wrap each JSONL entry in an HEC event envelope and forward it to Splunk
jq -c '{event: .}' aip-audit.jsonl | \
  curl -X POST https://splunk.company.com:8088/services/collector \
    -H "Authorization: Splunk <HEC-TOKEN>" \
    --data-binary @-

Elasticsearch

# Bulk import to Elasticsearch
cat aip-audit.jsonl | jq -c '{"index": {"_index": "aip-audit"}}, .' | \
  curl -X POST http://elasticsearch:9200/_bulk \
    -H 'Content-Type: application/x-ndjson' \
    --data-binary @-

AWS CloudWatch

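# Convert JSONL to the event array that put-log-events expects, then upload
# (assumes .timestamp is ISO 8601 UTC, e.g. 2026-03-03T10:00:00Z, and entries are in chronological order)
jq -c -s 'map({timestamp: ((.timestamp | fromdateiso8601) * 1000), message: tojson})' \
  aip-audit.jsonl > aip-audit-events.json
aws logs put-log-events \
  --log-group-name /aip/audit \
  --log-stream-name "$(hostname)" \
  --log-events file://aip-audit-events.json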

Compliance Reporting

Generate compliance reports from audit logs:

# SOC 2: Who accessed what, when?
cat aip-audit.jsonl | jq -r '[.timestamp, .tool, .decision] | @csv'

# GDPR: All actions on customer data
cat aip-audit.jsonl | jq 'select(.args.user_id != null)'

# HIPAA: All access to PHI
cat aip-audit.jsonl | jq 'select(.tool == "read_patient_record")'

High Availability

Local Deployment HA

For critical agents, run AIP in a supervisor that restarts on failure:

# systemd unit file: /etc/systemd/system/aip-proxy.service
[Unit]
Description=AIP Proxy for Production Agent
After=network.target

[Service]
Type=simple
User=aip
ExecStart=/usr/local/bin/aip --target "python /opt/agent/server.py" --policy /etc/aip/agent.yaml
Restart=always
RestartSec=5s

[Install]
WantedBy=multi-user.target

Enable:

sudo systemctl enable aip-proxy
sudo systemctl start aip-proxy

Kubernetes HA

Sidecars inherit pod-level HA from Kubernetes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent
spec:
  replicas: 3  # Multiple instances for HA
  template:
    spec:
      containers:
        - name: aip-proxy
          livenessProbe:
            httpGet:
              path: /healthz
              port: 9091
            initialDelaySeconds: 5
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /readyz
              port: 9091
            initialDelaySeconds: 3
            periodSeconds: 5

Failure Modes

Configure fail-closed behavior:

spec:
  mode: enforce
  failover:
    on_policy_load_error: block  # Block all if policy invalid
    on_regex_error: block        # Block if regex patterns fail to compile
    on_dlp_error: allow          # Allow if DLP scanner crashes (performance vs security)

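Never set a failover option to allow in production without a documented reason. The on_dlp_error: allow setting above trades security for availability; set it to block if a DLP scanner failure must not let unscanned responses through. Everything else should fail closed to prevent bypass.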
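
The fail-closed idea can be sketched as a wrapper around the evaluation step: any error during evaluation becomes a block unless the setting is explicitly `allow`. This is an illustration of the principle, not AIP's implementation; `decide` and `evaluate` are hypothetical names.

```python
def decide(request, evaluate, on_error="block"):
    """Run `evaluate(request)` and apply the failover setting if it raises."""
    try:
        return evaluate(request)
    except Exception:
        # Fail closed: anything other than an explicit "allow" blocks,
        # so a typo in the setting cannot open the gate.
        return "ALLOW" if on_error == "allow" else "BLOCK"
```

Treating every unrecognized value as `block` means a misconfigured failover setting degrades availability rather than security.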

Key Rotation and Credential Management

Policy Signing Key Rotation

Generate new signing key

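openssl genpkey -algorithm Ed25519 -out policy-signing-key-v2.pem
openssl pkey -in policy-signing-key-v2.pem -pubout -out key-v2.pub  # public key for --trusted-keys (assumes PEM public keys are accepted)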

Re-sign all policies with new key

for policy in policies/*.yaml; do
  # Quote paths so file names containing spaces are handled correctly
  aip sign-policy --key policy-signing-key-v2.pem --policy "$policy" > "${policy}.signed"
done

Update AIP proxy to trust both keys (grace period)

aip --policy agent.yaml --trusted-keys key-v1.pub,key-v2.pub

After 30 days, remove old key

aip --policy agent.yaml --trusted-keys key-v2.pub

MCP Server Credential Rotation

When rotating credentials for MCP servers (e.g., GitHub tokens):

Update DLP patterns to detect the old token format
Rotate the secret in your secrets manager
Restart agents to pick up new credentials
Monitor audit logs for old token usage (should be zero)

# Add old token format to DLP to detect leaks
dlp:
  patterns:
    - name: "GitHub Token (OLD - REVOKED)"
      regex: "ghp_OldTokenPattern123456"

Performance Optimization

Policy Evaluation Latency

AIP policy evaluation is designed to take less than 1 ms per request. To reduce latency further:

Reduce Regex Complexity

# Slow and overly permissive (unanchored, optional groups, trailing .*)
allow_args:
  url: "(https?://)?(www\\.)?(github\\.com|gitlab\\.com|bitbucket\\.org)/.*"

# Fast and strict (anchored, single host, bounded character classes)
allow_args:
  url: "^https://github\\.com/[a-zA-Z0-9_-]+/[a-zA-Z0-9_-]+$"
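
Beyond speed, the two patterns above differ in what they accept. The sketch below compares them on a few URLs, assuming unanchored search semantics (AIP may anchor `allow_args` patterns implicitly, which is not stated here) and using Python's `re` module rather than RE2; the constant names and `accepts` helper are hypothetical.

```python
import re

# The two patterns from above, with YAML escaping removed.
LOOSE = r"(https?://)?(www\.)?(github\.com|gitlab\.com|bitbucket\.org)/.*"
STRICT = r"^https://github\.com/[a-zA-Z0-9_-]+/[a-zA-Z0-9_-]+$"


def accepts(pattern, url):
    """Return True if `pattern` matches anywhere in `url`."""
    return re.search(pattern, url) is not None
```

Under these assumptions, the loose pattern accepts a URL such as `https://evil.example/redirect?to=github.com/org/repo` because the host name appears somewhere in the string, while the anchored pattern accepts only `https://github.com/<owner>/<repo>`.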

Cache Policy Evaluation Results

For Kubernetes deployments with high request volume:

sidecar:
  cache:
    enabled: true
    ttl: "5m"  # Cache allow/block decisions
    max_entries: 10000

Only enable caching for deterministic policies (no action: ask or time-based rules).
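
The `ttl` and `max_entries` settings describe a bounded cache with expiry. The following sketch shows one way such a cache could behave; it is not the sidecar's implementation, and the eviction policy (oldest insertion first) and the class name `DecisionCache` are assumptions.

```python
import time
from collections import OrderedDict


class DecisionCache:
    """Bounded cache of allow/block decisions with a per-entry TTL."""

    def __init__(self, ttl_seconds=300, max_entries=10000, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.max_entries = max_entries
        self.clock = clock
        self._entries = OrderedDict()  # key -> (expires_at, decision)

    def get(self, key):
        """Return the cached decision, or None if absent or expired."""
        item = self._entries.get(key)
        if item is None:
            return None
        expires_at, decision = item
        if self.clock() >= expires_at:
            del self._entries[key]
            return None
        return decision

    def put(self, key, decision):
        """Store a decision, evicting the oldest entry when full."""
        self._entries.pop(key, None)
        if len(self._entries) >= self.max_entries:
            self._entries.popitem(last=False)
        self._entries[key] = (self.clock() + self.ttl, decision)
```

The key must capture everything the decision depends on, for example the tool name plus a canonical serialization of its arguments; otherwise a cached allow for one argument could be returned for another.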

DLP Scanning Performance

DLP scanning can add latency for large responses:

dlp:
  enabled: true
  max_scan_size: "1MB"  # Skip DLP for responses >1MB
  timeout: "100ms"      # Fail-open if scan takes >100ms

Audit Log Performance

Rotate logs frequently to prevent I/O bottlenecks:

# Rotate daily
0 0 * * * mv /var/log/aip/audit.jsonl /var/log/aip/audit-$(date +\%Y\%m\%d).jsonl

Use async log shipping to avoid blocking requests:

audit:
  async: true  # Write logs in background thread
  buffer_size: 1000  # Buffer up to 1000 entries
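
Conceptually, asynchronous audit logging places a bounded buffer between the request path and the disk. The sketch below models that with a queue and a writer thread. It is an illustration under assumptions, not AIP's implementation: in particular, what AIP does when the buffer is full is not stated, and this version blocks the caller rather than dropping entries. The class name `AsyncAuditLog` is hypothetical.

```python
import json
import queue
import threading


class AsyncAuditLog:
    """Append JSONL audit entries from a background thread."""

    def __init__(self, path, buffer_size=1000):
        self._queue = queue.Queue(maxsize=buffer_size)
        self._file = open(path, "a", encoding="utf-8")
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def log(self, entry):
        """Queue an entry; blocks only while the buffer is full."""
        self._queue.put(entry)

    def _drain(self):
        # Write entries in arrival order until the sentinel is received.
        while True:
            entry = self._queue.get()
            if entry is None:
                break
            self._file.write(json.dumps(entry) + "\n")

    def close(self):
        """Flush remaining entries, stop the writer, and close the file."""
        self._queue.put(None)
        self._thread.join()
        self._file.close()
```

Blocking on a full buffer applies backpressure and keeps the audit trail complete; dropping entries instead would protect latency at the cost of gaps in the log.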

Disaster Recovery

Backup Policies

Store policy backups in version control and object storage:

# Automated daily backup: export policies, then upload only if the export succeeded
0 2 * * * f=/backups/aip-policies-$(date +\%Y\%m\%d).yaml; kubectl get agentpolicies -A -o yaml > "$f" && aws s3 cp "$f" s3://backups/aip/

Audit Log Retention

Comply with regulatory requirements:

Regulation	Minimum Retention	Recommended Storage
SOC 2	1 year	S3 Glacier
GDPR	No fixed minimum (retain only as long as necessary)	Encrypted EBS/S3
HIPAA	6 years	WORM storage (AWS S3 Object Lock)
SOX	7 years	Immutable storage

Example S3 lifecycle policy:

{
  "Rules": [
    {
      "Id": "archive-aip-logs",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Transitions": [
        {
          "Days": 90,
          "StorageClass": "GLACIER"
        },
        {
          "Days": 365,
          "StorageClass": "DEEP_ARCHIVE"
        }
      ],
      "Expiration": {
        "Days": 2555
      }
    }
  ]
}

Recovery Testing

Quarterly DR drills:

Delete all policies in staging environment
Restore from backup within SLA (e.g., 15 minutes)
Verify audit logs are intact and queryable
Test agent functionality after restore

Security Incident Response

Suspected Policy Bypass

Immediately switch to monitor mode

kubectl patch agentpolicy production-agent --type=merge -p '{"spec":{"mode":"monitor"}}'

This logs all requests without blocking (preserves forensic evidence).

Export audit logs for analysis

kubectl logs -l app=agent -c aip-proxy --since=24h > incident-logs.jsonl

Identify the bypass vector

# Look for unexpected tools being allowed
cat incident-logs.jsonl | jq 'select(.decision == "ALLOW" and (.tool | IN("exec_command", "delete_file")))'

Patch the policy and redeploy

# Add explicit block rule
kubectl edit agentpolicy production-agent
# Switch back to enforce mode
kubectl patch agentpolicy production-agent --type=merge -p '{"spec":{"mode":"enforce"}}'

DLP Leak Detection

If DLP detects a secret in agent output:

Revoke the secret immediately (GitHub token, API key, etc.)
Audit all requests from that agent in the past 7 days
Check for exfiltration (did the secret appear in logs, external services?)
Root cause analysis: Why did the agent access the secret?

Operational Checklists

Pre-Deployment Checklist

Policy file has restrictive permissions (chmod 400)
Policy is signed with Ed25519 key
DLP patterns cover all known secret formats
Audit logging is enabled and forwarded to SIEM
Alerts configured for policy violations and DLP triggers
mode: enforce is set (not monitor)
Protected paths include SSH keys, cloud credentials, .env files
Backup and recovery procedures tested

Monthly Review Checklist

Review blocked requests for false positives
Prune unused tools from allowed_tools
Update DLP patterns for new secret formats
Verify audit log retention meets compliance requirements
Check for policy drift (prod vs staging)
Review and rotate policy signing keys if >90 days old

Quarterly Security Audit Checklist

Penetration test: Attempt policy bypass
Review all action: ask approvals (were they legitimate?)
Analyze DLP redaction patterns (new threat vectors?)
Validate NetworkPolicy rules still align with AIP policy
Disaster recovery drill (restore policies and logs from backup)
External security review of policy logic

Next Steps

Local Deployment

Run AIP on your local machine

Kubernetes Deployment

Deploy AIP in Kubernetes with sidecar pattern

Policy Reference

Complete policy schema and examples

Security Policy

Report vulnerabilities responsibly
