Overview
The Secure MCP Gateway exports comprehensive metrics via OpenTelemetry to Prometheus. These metrics provide insights into:
- Operation Performance: Request rates, latencies, success/failure rates
- Cache Efficiency: Hit/miss ratios, cache sizes
- Security Events: Guardrail violations, blocked requests, PII redactions
- Resource Usage: Active sessions, users, timeout operations
- System Health: Authentication success/failure, error rates
Metrics Architecture
Available Metrics
Operation Metrics
Tool Call Counters
- Total number of tool invocations across all servers.
  - Labels: server_name, tool_name, project_id
  - Usage: Track overall tool usage and identify popular tools
- Total number of successful tool calls.
  - Labels: server_name, tool_name
  - Usage: Calculate success rates, identify reliable tools
- Total number of failed tool calls (e.g., server errors, timeouts).
  - Labels: server_name, tool_name, error_type
  - Usage: Monitor error rates, set up alerts for high failure rates
- Total number of tool call errors (exceptions, crashes).
  - Labels: server_name, error_type
  - Usage: Track critical errors requiring immediate attention

Tool Call Latency

- Duration of tool calls in seconds. Includes percentiles (p50, p95, p99).
  - Labels: server_name, tool_name
  - Buckets: 0.1, 0.5, 1.0, 2.0, 5.0, 10.0, +Inf
  - Usage: Identify slow tools, set SLOs, detect performance degradation

Server Discovery

- Number of times enkrypt_list_all_servers was called.
  - Labels: user_id, project_id
  - Usage: Track server listing frequency
- Total number of servers discovered with tools.
  - Labels: mcp_config_id
  - Usage: Monitor server discovery operations

Cache Metrics

- Total number of cache hits (data found in cache).
  - Labels: cache_type (tools, gateway_config, server_config)
  - Usage: Monitor cache effectiveness
- Total number of cache misses (data not in cache, fetch required).
  - Labels: cache_type
  - Usage: Identify cache tuning opportunities

Security Metrics
Guardrail Violations
- Total number of guardrail violations detected.
  - Labels: violation_type, server_name, detector
  - Usage: Monitor security posture, detect attack patterns
- Guardrail violations on input (before sending to server).
  - Labels: violation_type, server_name
  - Usage: Track input validation issues
- Guardrail violations on output (after receiving from server).
  - Labels: violation_type, server_name
  - Usage: Monitor response quality and safety
- Relevancy check violations (response not relevant to input).
  - Labels: server_name, tool_name
- Adherence check violations (response doesn’t follow instructions).
  - Labels: server_name, tool_name
- Hallucination detection violations.
  - Labels: server_name, tool_name

Blocked Requests

- Total number of tool calls blocked by guardrails.
  - Labels: server_name, tool_name, reason
  - Usage: Monitor security enforcement, identify attack attempts
- Total number of PII redaction operations performed.
  - Labels: server_name, pii_type
  - Usage: Track PII protection, ensure compliance

Guardrail API Performance

- Total number of API requests to the guardrail service.
  - Labels: endpoint, status_code
  - Usage: Monitor guardrail service usage
- Duration of guardrail API requests in seconds.
  - Labels: endpoint
  - Buckets: 0.1, 0.5, 1.0, 2.0, 5.0, 10.0, +Inf
  - Usage: Monitor guardrail latency, set alerts for slow checks

Authentication Metrics
- Total number of successful authentications.
  - Labels: auth_provider, project_id
  - Usage: Monitor authentication activity
- Total number of failed authentication attempts.
  - Labels: auth_provider, failure_reason
  - Usage: Detect unauthorized access attempts, brute-force attacks
- Current number of active sessions.
  - Usage: Monitor concurrent connections
- Current number of active users.
  - Usage: Track user concurrency
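The failed-authentication counter pairs naturally with rate-based brute-force detection. Below is a minimal sketch of the sliding-window logic; the class, window, and threshold are illustrative, not part of the gateway.

```python
from collections import deque
import time

class FailureRateMonitor:
    """Track authentication failures and flag bursts within a sliding window."""

    def __init__(self, window_seconds=60, threshold=10):
        self.window = window_seconds
        self.threshold = threshold
        self.failures = deque()  # timestamps of recent failures

    def record_failure(self, now=None):
        now = time.monotonic() if now is None else now
        self.failures.append(now)

    def is_suspicious(self, now=None):
        now = time.monotonic() if now is None else now
        # Drop failures that have fallen out of the window.
        while self.failures and now - self.failures[0] > self.window:
            self.failures.popleft()
        return len(self.failures) >= self.threshold

monitor = FailureRateMonitor(window_seconds=60, threshold=10)
for t in range(12):
    monitor.record_failure(now=float(t))  # 12 failures in 12 seconds
print(monitor.is_suspicious(now=12.0))  # True: above 10 failures per minute
```

In production the same decision is better expressed as a Prometheus alert over the failure counter, so the gateway itself stays stateless about alerting.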
Timeout Management Metrics
- Total number of timeout operations tracked.
  - Labels: operation_type
  - Usage: Monitor timeout tracking coverage
- Number of operations completed successfully before timeout.
  - Labels: operation_type
- Number of operations that exceeded the timeout threshold.
  - Labels: operation_type
  - Usage: Set alerts for high timeout rates
- Number of operations that were cancelled.
  - Labels: operation_type

Timeout Escalations

- Number of timeout escalation warnings (>80% of timeout).
  - Labels: operation_type
  - Usage: Early warning for slow operations
- Number of operations that reached the timeout threshold.
  - Labels: operation_type
- Number of operations that exceeded the timeout and failed.
  - Labels: operation_type

Timeout Performance

- Duration of timeout-managed operations in seconds.
  - Labels: operation_type
  - Buckets: 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0, +Inf
- Number of currently active timeout operations.
  - Usage: Monitor concurrent operations
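The escalation and threshold counters above follow simple timing rules. A minimal sketch of how an operation's duration might be classified for metric recording (the class and method names are illustrative, not the gateway's implementation):

```python
class TimeoutTracker:
    """Classify a timeout-managed operation's outcome for metric recording.

    Mirrors the escalation semantics described above: a warning fires once
    more than 80% of the timeout budget is consumed.
    """

    ESCALATION_FRACTION = 0.8  # warn when >80% of the timeout is used

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds

    def classify(self, duration_seconds):
        if duration_seconds >= self.timeout:
            return "exceeded"            # would increment the exceeded counter
        if duration_seconds > self.timeout * self.ESCALATION_FRACTION:
            return "escalation_warning"  # early warning; operation still completed
        return "completed"

tracker = TimeoutTracker(timeout_seconds=10.0)
print(tracker.classify(3.0))   # completed
print(tracker.classify(8.5))   # escalation_warning
print(tracker.classify(12.0))  # exceeded
```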
Metrics Implementation
Creating Metrics
Metrics are created during OpenTelemetry provider initialization.

Location: src/secure_mcp_gateway/plugins/telemetry/opentelemetry_provider.py:373
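Conceptually, initialization creates each instrument once, and recording code then updates it with label values. The real code uses the OpenTelemetry SDK meter API; the pure-Python stand-in below only illustrates the create-once/record-many pattern and why each label combination becomes its own series (all names here are illustrative):

```python
from collections import defaultdict

class Meter:
    """Toy stand-in for an OpenTelemetry meter: create instruments once, record many times."""

    def __init__(self):
        self.counters = defaultdict(int)

    def create_counter(self, name):
        def add(value=1, **labels):
            # Each unique label combination becomes a separate time series.
            key = (name, tuple(sorted(labels.items())))
            self.counters[key] += value
        return add

meter = Meter()
tool_calls = meter.create_counter("tool_calls")
tool_calls(server_name="github", tool_name="search")
tool_calls(server_name="github", tool_name="search")
tool_calls(server_name="jira", tool_name="create_issue")
print(len(meter.counters))  # 2 distinct series
```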
Recording Metrics
Metrics are recorded throughout the gateway:

Prometheus Configuration
Scrape Configuration
Location: infra/prometheus/prometheus.yml
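The scrape configuration will look roughly like the following sketch; the job name, target, and interval are assumptions, so consult the file above for the actual values:

```yaml
scrape_configs:
  - job_name: "otel-collector"            # illustrative job name
    scrape_interval: 15s
    static_configs:
      - targets: ["otel-collector:8889"]  # collector's Prometheus exporter endpoint (assumed)
```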
Accessing Prometheus
- UI: http://localhost:9090
- API: http://localhost:9090/api/v1/query
- Targets: http://localhost:9090/targets
- Config: http://localhost:9090/config
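The query API endpoint above can also be called programmatically. A minimal standard-library sketch (base URL taken from the list above; the helper names are illustrative):

```python
import json
import urllib.parse
import urllib.request

def build_query_url(expr, base="http://localhost:9090"):
    """Build a Prometheus instant-query URL for the given PromQL expression."""
    return f"{base}/api/v1/query?" + urllib.parse.urlencode({"query": expr})

url = build_query_url("up")
print(url)  # http://localhost:9090/api/v1/query?query=up

def run_query(expr):
    # Requires a running Prometheus; returns the decoded JSON response.
    with urllib.request.urlopen(build_query_url(expr)) as resp:
        return json.load(resp)
```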
Example Queries
In Prometheus UI (Expression Browser):

Grafana Dashboards
Pre-built Dashboards
The gateway includes two pre-configured Grafana dashboards:

1. Gateway Metrics Dashboard
Location: infra/grafana/provisioning/dashboards/gateway-metrics.json
Panels:
- Tool Call Rate (graph)
- Tool Call Success Rate (gauge)
- Tool Call Latency p95 (graph)
- Cache Hit Ratio (gauge)
- Guardrail Violations (graph)
- Active Sessions (gauge)
- Error Rate (graph)
2. OpenTelemetry Gateway Metrics Dashboard
Location: infra/grafana/provisioning/dashboards/OpenTelemetry Gateway Metrics.json
Panels:
- Request Volume
- Response Times (percentiles)
- Error Rates by Type
- Throughput
- System Health
Accessing Grafana
- Open http://localhost:3000
- No login required (anonymous admin mode)
- Navigate to Dashboards → Browse
- Select “Gateway Metrics” or “OpenTelemetry Gateway Metrics”
Creating Custom Dashboards
Example Panel (Tool Call Rate):

Alerting
Prometheus Alerts
Create alert rules in Prometheus in alerting_rules.yml:

Grafana Alerts
Create alerts in Grafana dashboards:
- Edit panel → Alert tab
- Create alert rule
- Configure notification channels (Slack, PagerDuty, email)
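On the Prometheus side, a rule in the alerting_rules.yml mentioned above might look like the following sketch. The metric names are placeholders; substitute the gateway's actual exported names as seen on the /metrics endpoint:

```yaml
groups:
  - name: gateway-alerts
    rules:
      - alert: HighToolCallFailureRate
        # Placeholder metric names -- check /metrics for the real ones.
        expr: >
          rate(tool_call_failures_total[5m])
            / rate(tool_calls_total[5m]) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Tool call failure rate above 1% for 5 minutes"
```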
Best Practices
Monitor Key SLIs
Focus on Service Level Indicators:
- Availability: Success rate > 99.9%
- Latency: p95 < 500ms, p99 < 1s
- Error Rate: < 0.1%
- Cache Hit Ratio: > 80%
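These targets are easy to audit mechanically from raw counts. A small illustrative helper (the function and its thresholds simply mirror the list above; nothing here is gateway API):

```python
def check_slis(success, total, p95_ms, p99_ms, cache_hits, cache_total):
    """Return the list of SLI names that are out of bounds."""
    breaches = []
    if total and success / total < 0.999:          # availability > 99.9%
        breaches.append("availability")
    if p95_ms >= 500 or p99_ms >= 1000:            # p95 < 500ms, p99 < 1s
        breaches.append("latency")
    if total and (total - success) / total > 0.001:  # error rate < 0.1%
        breaches.append("error_rate")
    if cache_total and cache_hits / cache_total < 0.80:  # hit ratio > 80%
        breaches.append("cache_hit_ratio")
    return breaches

print(check_slis(success=9995, total=10000, p95_ms=420, p99_ms=900,
                 cache_hits=850, cache_total=1000))  # [] -- all SLIs met
```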
Set Up Alerts
Configure alerts for:
- High error rates
- Slow operations (p95 > threshold)
- Security events (guardrail violations)
- Resource exhaustion (high active sessions)
Use Labels Wisely
Avoid high-cardinality labels:
- ✅ Good: server_name, tool_name, project_id
- ❌ Bad: user_id, request_id, timestamp
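The difference is easy to quantify: the number of time series is the product of the distinct values of each label. The counts below are illustrative:

```python
# Good: bounded labels -> a fixed series count, independent of traffic.
servers, tools, projects = 20, 50, 10   # illustrative fleet sizes
bounded_series = servers * tools * projects
print(bounded_series)  # 10000 series, regardless of request volume

# Bad: request_id is unique per request, so every request creates a
# brand-new series and Prometheus memory grows with traffic.
requests_per_day = 1_000_000
unbounded_series_per_day = requests_per_day
print(unbounded_series_per_day)  # 1000000 new series per day
```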
Analyze Trends
Use histograms for distribution analysis:
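For example, Prometheus's histogram_quantile() estimates a quantile from cumulative bucket counts by linear interpolation within the bucket containing the target rank. A simplified pure-Python sketch using the tool-call latency buckets listed earlier (the counts are made up):

```python
def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) histogram buckets.

    Interpolates linearly inside the bucket containing the target rank,
    as Prometheus's histogram_quantile() does.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Cumulative counts per latency bucket upper bound (illustrative data).
latency_buckets = [(0.1, 60), (0.5, 85), (1.0, 94), (2.0, 98), (5.0, 100)]
print(estimate_quantile(0.95, latency_buckets))  # 1.25 -- p95 latency in seconds
```

Because only bucket boundaries are stored, the estimate's accuracy depends on how well the configured buckets bracket the distribution, which is why the gateway's bucket lists matter.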
Monitor Security Metrics
Track security events:
- Guardrail violations by type
- Blocked requests over time
- PII redaction frequency
- Authentication failures
Troubleshooting
Metrics Not Appearing in Prometheus
- Check the collector metrics endpoint:
- Verify Prometheus scrape targets: http://localhost:9090/targets
- Check the collector logs:
High Cardinality Issues
Symptom: Prometheus using excessive memory.

Solution: Reduce label cardinality.

Dashboards Not Loading
- Check the Grafana logs:
- Verify the datasource connection: Grafana → Connections → Data sources → Prometheus → Test
- Check the dashboard JSON:
Next Steps
- Logging: Configure structured logging and log aggregation
- OpenTelemetry Setup: Set up OTLP export and distributed tracing
- Overview: Return to the observability overview
- API Reference: Explore the monitoring API