Observability
Scope: Conditional (applies when observability patterns are used)
Rule ID: hatch3r-observability
Defines observability standards including structured logging, distributed tracing with OpenTelemetry, metrics collection, SLO/SLI definitions, alerting, and error reporting.
Structured Logging
Use Structured JSON Logging
Log Levels
| Level | When to Use |
|---|---|
| error | Failures that require immediate attention |
| warn | Degraded state or unexpected conditions |
| info | State changes, business events |
| debug | Development-only detailed info |
Required Fields
Every log entry must include:
Never Log Sensitive Data
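One way to keep sensitive data out of logs is to redact known-sensitive keys before an entry is serialized. The following is a minimal sketch, not the rule's prescribed implementation: the deny-list contents, the `[REDACTED]` marker, and the `redact` helper name are all assumptions chosen for illustration.

```typescript
// Hypothetical deny-list: these key names are assumptions, not defined by this rule.
const SENSITIVE_KEYS = new Set(["password", "token", "authorization", "ssn", "creditcard"]);

type LogValue = string | number | boolean | null | LogValue[] | { [key: string]: LogValue };

// Recursively copies a log payload, replacing the value of any sensitive key
// (matched case-insensitively) with a fixed marker.
function redact(value: LogValue): LogValue {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: LogValue } = {};
    for (const key of Object.keys(value)) {
      out[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? "[REDACTED]" : redact(value[key]);
    }
    return out;
  }
  return value;
}
```

Calling `redact` on the payload just before serialization keeps the check in one place instead of relying on every call site.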
Client-Side Logging
Performance Budget for Logging
Logging must not add > 10ms latency to hot paths. Use async logging and batching:
Log Sampling
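Sampling of high-volume debug logs can be sketched as a predicate that always keeps non-debug entries and keeps only a fraction of debug entries. This is an illustrative sketch: the `makeSampler` name, the `debugRate` parameter, and the injectable `random` source are assumptions, not part of this rule.

```typescript
type Level = "error" | "warn" | "info" | "debug";

// Returns a predicate that keeps every non-debug entry and roughly
// `debugRate` (0..1) of debug entries. `random` is injectable so the
// behaviour can be tested deterministically.
function makeSampler(debugRate: number, random: () => number = Math.random) {
  return (level: Level): boolean => level !== "debug" || random() < debugRate;
}
```

A logger would call the predicate before formatting the entry, so dropped debug logs cost almost nothing on hot paths.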
For high-volume debug logs in production:
Distributed Tracing
OpenTelemetry SDK
Use OpenTelemetry for all tracing instrumentation:
W3C Trace Context
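W3C Trace Context carries the trace identity in a `traceparent` header of the form `version-traceid-parentid-flags`. In practice the OpenTelemetry propagators handle this header; the dependency-free sketch below only illustrates its shape. The helper names are hypothetical, and it handles version `00` only.

```typescript
interface TraceContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string;  // 16 lowercase hex characters
  sampled: boolean;
}

const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Parses a version-00 traceparent header; returns null for malformed input
// or for the all-zero trace/span IDs, which the spec treats as invalid.
function parseTraceparent(header: string): TraceContext | null {
  const m = TRACEPARENT.exec(header.trim());
  if (!m) return null;
  const traceId = m[1];
  const spanId = m[2];
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(m[3], 16) & 1) === 1 };
}

// Serializes a context back into a header for an outgoing request.
function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? "01" : "00"}`;
}
```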
Propagate trace context across all service boundaries:
Span Naming Conventions
| Span Type | Pattern | Example |
|---|---|---|
| HTTP server | HTTP {method} {route} | HTTP GET /api/users/:id |
| HTTP client | HTTP {method} {host}{path} | HTTP POST api.stripe.com/ |
| DB query | {db.system} {operation} | firestore getDoc |
| Queue | {queue} {operation} | tasks-queue publish |
| Internal | {module}.{function} | auth.verifyToken |
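The patterns in the table above can be centralized in small builder functions so that span names stay consistent across a codebase. This is a sketch only; the `spanName` object and its method names are hypothetical.

```typescript
// One builder per span type from the naming table.
const spanName = {
  httpServer: (method: string, route: string): string =>
    `HTTP ${method.toUpperCase()} ${route}`,
  httpClient: (method: string, host: string, path: string): string =>
    `HTTP ${method.toUpperCase()} ${host}${path}`,
  dbQuery: (system: string, operation: string): string => `${system} ${operation}`,
  queue: (queue: string, operation: string): string => `${queue} ${operation}`,
  internal: (moduleName: string, fn: string): string => `${moduleName}.${fn}`,
};
```

Passing the route template (`/api/users/:id`) rather than the concrete URL keeps the number of distinct span names bounded.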
Required Span Attributes
Sampling Strategies
Metrics
OpenTelemetry Metrics SDK
Metric Naming
Format: {service}.{domain}.{metric}_{unit} in snake_case
Example: api.auth.login_duration_ms
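The naming format can be enforced by building names from their parts instead of writing them by hand. The sketch below assumes a hypothetical `metricName` helper and a strict reading of snake_case (lowercase letters, digits, single underscores).

```typescript
const SNAKE_CASE = /^[a-z][a-z0-9]*(_[a-z0-9]+)*$/;

// Builds {service}.{domain}.{metric}_{unit}, rejecting any part that is not snake_case.
function metricName(service: string, domain: string, metric: string, unit: string): string {
  for (const part of [service, domain, metric, unit]) {
    if (!SNAKE_CASE.test(part)) {
      throw new Error(`Metric name part is not snake_case: ${part}`);
    }
  }
  return `${service}.${domain}.${metric}_${unit}`;
}
```

For example, `metricName("api", "auth", "login_duration", "ms")` yields the name shown above.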
Instrument Types
| Instrument | Use Case | Example |
|---|---|---|
| Counter | Monotonically increasing totals | http.requests_total |
| Histogram | Distributions (latency, size) | http.request_duration_ms |
| Gauge | Point-in-time values | db.connection_pool_active |
| UpDownCounter | Values that increase and decrease | queue.messages_pending |
Example Metrics
Histogram Buckets for Latency
Cardinality Management
Never use unbounded values as labels:
SLO / SLI Definitions
Service Level Indicators (SLIs)
Define as ratios of good events to total events:
| SLI | Definition | Measurement Source |
|---|---|---|
| Availability | Requests returning non-5xx / total requests | Load balancer logs |
| Latency | Requests completing < threshold / total | Tracing p99 |
| Error rate | Failed operations / total operations | Application metrics |
| Freshness | Data updated within SLA / total records | Background job metrics |
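Each SLI in the table reduces to the same good/total ratio. A minimal sketch, assuming hypothetical helper names and treating a window with no traffic as fully healthy (a policy choice, not something this rule specifies):

```typescript
// Generic SLI: fraction of good events. An empty window is reported as 1
// (an assumption; some teams prefer to report "no data" instead).
function sli(good: number, total: number): number {
  return total === 0 ? 1 : good / total;
}

// Availability SLI from a list of HTTP status codes: non-5xx / total.
function availability(statusCodes: number[]): number {
  const good = statusCodes.filter((code) => code < 500).length;
  return sli(good, statusCodes.length);
}
```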
Service Level Objectives (SLOs)
Typical starting points:
- Availability: 99.9% (43 min/month error budget)
- Latency: p99 < 500ms
Error Budgets
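An error budget is the complement of the SLO target over a window: a 99.9% availability target over 30 days allows (1 - 0.999) × 30 × 24 × 60 ≈ 43.2 minutes of unavailability, which is where the 43 min/month figure comes from. The sketch below shows the arithmetic; the function names and the 30-day default window are assumptions.

```typescript
// Allowed "bad" minutes for a time-based SLO over a window (default: 30 days).
function errorBudgetMinutes(sloTarget: number, windowDays: number = 30): number {
  return (1 - sloTarget) * windowDays * 24 * 60;
}

// Fraction of the budget still unspent for a request-based SLO.
// Requires sloTarget < 1; a negative result means the budget is exhausted.
function budgetRemaining(sloTarget: number, good: number, total: number): number {
  if (total === 0) return 1;
  const allowedBad = (1 - sloTarget) * total;
  return 1 - (total - good) / allowedBad;
}
```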
Burn Rate Alerts
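Burn rate expresses how fast the error budget is being consumed: an observed error rate divided by the rate the SLO allows. A multi-window check fires only when both a long and a short window exceed the threshold, so alerts stop soon after the problem does. This is a sketch under stated assumptions: the function names are hypothetical, and the threshold is left as a parameter because this rule does not specify one (14.4 over a 1-hour window is a commonly cited value, equal to spending 2% of a 30-day budget in one hour).

```typescript
// Burn rate of 1 means the budget lasts exactly the SLO window.
function burnRate(errorRate: number, sloTarget: number): number {
  return errorRate / (1 - sloTarget);
}

// Multi-window condition: both windows must exceed the threshold.
function shouldAlert(
  longWindowErrorRate: number,
  shortWindowErrorRate: number,
  sloTarget: number,
  threshold: number,
): boolean {
  return (
    burnRate(longWindowErrorRate, sloTarget) >= threshold &&
    burnRate(shortWindowErrorRate, sloTarget) >= threshold
  );
}
```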
Multi-window approach:
Alerting
Severity Levels
| Severity | Criteria | Response Time | Notification |
|---|---|---|---|
| P1 | Service down, data loss risk | 15 min | Page on-call + Slack |
| P2 | Degraded performance, SLO at risk | 1 hour | Page on-call |
| P3 | Non-critical issue, workaround exists | Next business day | Slack channel |
| P4 | Cosmetic / low-impact | Sprint backlog | Ticket only |
Runbooks
Every alert must link to a runbook with:
- Symptoms
- Likely causes
- Diagnostic steps
- Remediation actions
Alert Fatigue Prevention
- Tune thresholds to < 5 actionable alerts per on-call shift
- Suppress duplicate alerts within 10-minute dedup window
- Review alert quality monthly: snooze/delete alerts with < 20% action rate
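The 10-minute dedup window can be sketched as a small in-memory suppressor keyed by alert fingerprint. Alerting tools usually provide this natively; the `AlertDeduper` class below is a hypothetical illustration of the behaviour, with time passed in explicitly so it can be tested.

```typescript
class AlertDeduper {
  private lastSent = new Map<string, number>();
  private readonly windowMs: number;

  constructor(windowMs: number = 10 * 60 * 1000) {
    this.windowMs = windowMs;
  }

  // Returns true if the alert should be delivered. Suppressed duplicates
  // do not extend the window; it is measured from the last delivered alert.
  shouldNotify(fingerprint: string, nowMs: number): boolean {
    const last = this.lastSent.get(fingerprint);
    if (last !== undefined && nowMs - last < this.windowMs) return false;
    this.lastSent.set(fingerprint, nowMs);
    return true;
  }
}
```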
Escalation Policies
Structured Error Reporting
Sentry Integration
Breadcrumbs
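Keeping only the most recent 50 user actions amounts to a bounded buffer that drops the oldest entry on overflow. Sentry SDKs can cap this themselves through the `maxBreadcrumbs` init option; the SDK-independent sketch below just illustrates the behaviour, and the `Breadcrumb` shape and `BreadcrumbBuffer` class are assumptions.

```typescript
interface Breadcrumb {
  timestamp: number;
  category: string;
  message: string;
}

class BreadcrumbBuffer {
  private items: Breadcrumb[] = [];
  private readonly capacity: number;

  constructor(capacity: number = 50) {
    this.capacity = capacity;
  }

  // Appends a breadcrumb, evicting the oldest one once capacity is exceeded.
  add(crumb: Breadcrumb): void {
    this.items.push(crumb);
    if (this.items.length > this.capacity) this.items.shift();
  }

  // Returns a copy, oldest first, suitable for attaching to an error report.
  snapshot(): Breadcrumb[] {
    return this.items.slice();
  }
}
```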
Capture the last 50 user actions:
Custom Fingerprints
OpenTelemetry Semantic Conventions
Follow OpenTelemetry Semantic Conventions v1.29+ for consistent attribute naming.
Standard Attribute Namespaces
| Namespace | Scope | Key Attributes |
|---|---|---|
| http.* | HTTP spans | http.request.method, http.response.status_code, http.route, url.full |
| db.* | Database spans | db.system, db.operation.name, db.collection.name, db.query.text (sanitized) |
| rpc.* | RPC spans | rpc.system, rpc.service, rpc.method, rpc.grpc.status_code |
| messaging.* | Message queue spans | messaging.system, messaging.operation.type, messaging.destination.name |
| faas.* | Serverless invocations | faas.trigger, faas.invoked_name, faas.coldstart |
| cloud.* | Cloud provider context | cloud.provider, cloud.region, cloud.account.id |
Resource Semantic Conventions
Every service must declare resource attributes at startup:
| Attribute | Requirement | Description |
|---|---|---|
| service.name | Required | Logical service name (e.g., api-gateway) |
| service.version | Recommended | Semantic version (e.g., 1.4.2) |
| deployment.environment.name | Recommended | Environment (e.g., production) |
| service.instance.id | Recommended | Unique instance ID (e.g., pod name) |
Span Status Codes
| Code | When to Set |
|---|---|
| UNSET | Default (operation completed without indicating error) |
| OK | Explicitly successful (use sparingly) |
| ERROR | Operation failed (exception caught, 5xx response) |
Set ERROR for:
- Server-side errors (5xx)
- Unhandled exceptions
Do not set ERROR for client errors (4xx); those are valid responses.
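For an HTTP server span, the status rules above reduce to a small decision function. This is an illustrative sketch with a hypothetical helper name; real instrumentation would apply the result through the OpenTelemetry span API.

```typescript
type SpanStatus = "UNSET" | "OK" | "ERROR";

// Server-side view: 5xx responses and unhandled exceptions are errors;
// everything else, including 4xx, leaves the status at its UNSET default.
function statusForHttpServerSpan(httpStatus: number, threw: boolean = false): SpanStatus {
  if (threw || httpStatus >= 500) return "ERROR";
  return "UNSET";
}
```

The function never returns `OK`, in line with the guidance to set it sparingly and only when success is asserted explicitly.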
Dashboard Standards
Required Dashboards Per Service
| Dashboard | Contents |
|---|---|
| Service Health | Request rate, error rate, latency p50/p95/p99, saturation |
| Business Metrics | Key domain counters, conversion funnels, feature adoption |
| Dependencies | Upstream/downstream latency, error rates, circuit breaker state |
| Infrastructure | CPU, memory, disk, connection pools, queue depth |
Dashboard-as-Code
Define dashboards in version-controlled JSON/YAML (Grafana provisioning, Terraform). No manual dashboard creation in production.
Panel Requirements
- Descriptive title
- Unit labels
- Threshold lines for SLO targets
- Link to relevant runbook or alert
Enforcement
CI gates:
- Logs use structured format (lint check)
- No console.log in production code
- OpenTelemetry instrumentation present
- Structured logging with required fields
- No sensitive data in logs
- Correlation ID propagated
- Spans named according to conventions
- Metrics use OpenTelemetry SDK
- Alert runbooks linked
Related Rules
- Error Handling — Error logging and correlation
- Security Patterns — Log sanitization
- API Design — Tracing API requests

