
Observability

Scope: Conditional (applies when observability patterns are used)
Rule ID: hatch3r-observability
Defines observability standards including structured logging, distributed tracing with OpenTelemetry, metrics collection, SLO/SLI definitions, alerting, and error reporting.

Structured Logging

Use Structured JSON Logging

import { logger } from './logger';

// ✅ Structured logging
logger.info('User login', {
  userId: user.id,
  email: user.email,
  correlationId: req.correlationId,
  service: 'auth-service',
  environment: process.env.NODE_ENV,
  version: process.env.APP_VERSION,
});

// ❌ No console.log in production code
console.log('User logged in:', user.id);

Log Levels

| Level | When to Use |
| --- | --- |
| error | Failures that require immediate attention |
| warn | Degraded state or unexpected conditions |
| info | State changes, business events |
| debug | Development-only detailed info |

Required Fields

Every log entry must include:
interface LogEntry {
  level: 'error' | 'warn' | 'info' | 'debug';
  message: string;
  correlationId: string;      // Trace requests across services
  userId?: string;            // If available
  service: string;            // Service name
  environment: string;        // dev/staging/production
  version: string;            // App version (git SHA or semver)
  timestamp: string;          // ISO 8601
  [key: string]: unknown;     // Additional context
}
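The required fields can be stamped by a small factory so call sites only supply the message, correlation ID, and extra context. A minimal sketch; `makeLogEntry` and the constant values are illustrative, not part of the rule:

```typescript
// Illustrative values; wire these to real config in practice.
const SERVICE = 'auth-service';
const ENVIRONMENT = 'production';
const VERSION = '1.2.3';

interface RequiredLogFields {
  level: 'error' | 'warn' | 'info' | 'debug';
  message: string;
  correlationId: string;
  service: string;
  environment: string;
  version: string;
  timestamp: string;
  [key: string]: unknown;
}

function makeLogEntry(
  level: RequiredLogFields['level'],
  message: string,
  correlationId: string,
  context: Record<string, unknown> = {},
): RequiredLogFields {
  return {
    ...context, // spread first so required fields cannot be clobbered
    level,
    message,
    correlationId,
    service: SERVICE,
    environment: ENVIRONMENT,
    version: VERSION,
    timestamp: new Date().toISOString(), // ISO 8601
  };
}
```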

Never Log Sensitive Data

// Shallow redaction: recurse if sensitive fields can be nested
const sanitize = (obj: any) => {
  const sensitive = ['password', 'token', 'apiKey', 'secret', 'ssn', 'creditCard'];
  const sanitized = { ...obj };
  for (const field of sensitive) {
    if (field in sanitized) {
      sanitized[field] = '[REDACTED]';
    }
  }
  return sanitized;
};

logger.info('User created', sanitize({ email, password }));

Client-Side Logging

// Send errors to reporting service, not just console
window.addEventListener('error', (event) => {
  errorReporter.log({
    message: event.message,
    stack: event.error?.stack,
    url: window.location.href,
    correlationId: getCorrelationId(),
  });
});

Performance Budget for Logging

Logging must not add > 10ms latency to hot paths. Use async logging and batching:
const logger = pino({
  level: 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  transport: {
    target: 'pino/file',  // Async file transport
    options: { destination: '/var/log/app.log' },
  },
});

Log Sampling

For high-volume debug logs in production:
const shouldLog = Math.random() < 0.01; // 1% sample rate

if (shouldLog) {
  logger.debug('High-volume debug message', { data });
}
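Random sampling can keep some logs for a request while dropping others. One alternative, sketched here, is to key the decision on the correlation ID so each request is logged all-or-nothing; this uses a 32-bit FNV-1a hash, and `shouldSample` is an illustrative name:

```typescript
// Deterministic sampling keyed on the correlation ID: the same request
// always gets the same decision, so its logs are kept or dropped together.
function shouldSample(correlationId: string, rate: number): boolean {
  let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < correlationId.length; i++) {
    hash ^= correlationId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // FNV prime
  }
  return hash / 0xffffffff < rate;
}
```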

Distributed Tracing

OpenTelemetry SDK

Use OpenTelemetry for all tracing instrumentation:
import { trace } from '@opentelemetry/api';
import { Resource } from '@opentelemetry/resources';
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node';
import { registerInstrumentations } from '@opentelemetry/instrumentation';
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';
import { ExpressInstrumentation } from '@opentelemetry/instrumentation-express';

// Initialize once at startup
const provider = new NodeTracerProvider({
  resource: new Resource({
    'service.name': 'api-gateway',
    'service.version': '1.2.3',
    'deployment.environment': 'production',
  }),
});

provider.register();

registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
  ],
});

W3C Trace Context

Propagate trace context across all service boundaries:
// Outgoing HTTP request
const response = await fetch('https://api.example.com/data', {
  headers: {
    'traceparent': req.headers.traceparent,  // Propagate trace context
    'tracestate': req.headers.tracestate,
  },
});
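Blindly forwarding inbound headers propagates malformed values. One defensive option, sketched here with the illustrative helper `isValidTraceparent`, is to validate the header against the W3C format before reuse:

```typescript
// W3C traceparent: {version}-{trace-id}-{parent-id}-{trace-flags}
const TRACEPARENT = /^00-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$/;

function isValidTraceparent(header: string | undefined): header is string {
  // All-zero trace IDs and span IDs are invalid per the W3C spec.
  return (
    typeof header === 'string' &&
    TRACEPARENT.test(header) &&
    !header.includes('-00000000000000000000000000000000-') &&
    !/-0000000000000000-[0-9a-f]{2}$/.test(header)
  );
}
```

If the header fails validation, omit it and let the tracing SDK start a new trace rather than propagating garbage.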

Span Naming Conventions

| Span Type | Pattern | Example |
| --- | --- | --- |
| HTTP server | HTTP {method} {route} | HTTP GET /api/users/:id |
| HTTP client | HTTP {method} {host}{path} | HTTP POST api.stripe.com/ |
| DB query | {db.system} {operation} | firestore getDoc |
| Queue | {queue} {operation} | tasks-queue publish |
| Internal | {module}.{function} | auth.verifyToken |

Required Span Attributes

const span = trace.getActiveSpan();
span?.setAttributes({
  'service.name': 'api-gateway',
  'service.version': '1.2.3',
  'deployment.environment': 'production',
  'user.id': user.id,  // Domain-specific
  'tenant.id': tenant.id,
});

Sampling Strategies

import {
  ParentBasedSampler,
  TraceIdRatioBasedSampler,
  Sampler,
  SamplingDecision,
} from '@opentelemetry/sdk-trace-base';

// 10% sample rate in production
const sampler = new ParentBasedSampler({
  root: new TraceIdRatioBasedSampler(0.1),
});

// Always sample errors and slow requests
class CustomSampler implements Sampler {
  shouldSample(context, traceId, name, spanKind, attributes, links) {
    // Always sample errors
    if (attributes['http.status_code'] >= 500) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }

    // Always sample slow requests (> p95)
    if (attributes['http.duration'] > 1000) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }

    // Otherwise, 10% sample
    return new TraceIdRatioBasedSampler(0.1).shouldSample(
      context, traceId, name, spanKind, attributes, links
    );
  }
}

Metrics

OpenTelemetry Metrics SDK

import { MeterProvider } from '@opentelemetry/sdk-metrics';
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';
import { Resource } from '@opentelemetry/resources';

const meterProvider = new MeterProvider({
  resource: new Resource({
    'service.name': 'api-gateway',
  }),
});

const prometheusExporter = new PrometheusExporter({ port: 9464 });
meterProvider.addMetricReader(prometheusExporter);

const meter = meterProvider.getMeter('api-gateway');

Metric Naming

Format: {service}.{domain}.{metric}_{unit}, in snake_case.
Example: api.auth.login_duration_ms
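The format can be enforced in CI with a lint-style check. The regex below is one interpretation of the rule (three or more lowercase dot-separated segments), not an official definition:

```typescript
// Matches e.g. api.auth.login_duration_ms; rejects camelCase and
// names with fewer than three dot-separated segments.
const METRIC_NAME = /^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*){2,}$/;

function isValidMetricName(name: string): boolean {
  return METRIC_NAME.test(name);
}
```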

Instrument Types

| Instrument | Use Case | Example |
| --- | --- | --- |
| Counter | Monotonically increasing totals | http.requests_total |
| Histogram | Distributions (latency, size) | http.request_duration_ms |
| Gauge | Point-in-time values | db.connection_pool_active |
| UpDownCounter | Values that increase and decrease | queue.messages_pending |

Example Metrics

// Counter
const requestCounter = meter.createCounter('http.requests_total', {
  description: 'Total HTTP requests',
});

requestCounter.add(1, {
  method: req.method,
  route: req.route.path,
  status: res.statusCode,
});

// Histogram
const durationHistogram = meter.createHistogram('http.request_duration_ms', {
  description: 'HTTP request duration',
  unit: 'ms',
});

durationHistogram.record(duration, {
  method: req.method,
  route: req.route.path,
});

// Gauge
const activeConnections = meter.createObservableGauge('db.connections_active', {
  description: 'Active database connections',
});

activeConnections.addCallback((result) => {
  result.observe(pool.activeConnections);
});

Histogram Buckets for Latency

const buckets = [5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000]; // ms
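For reference, this is how an observation maps onto those cumulative (le) buckets; `bucketIndex` is an illustrative helper, with the implicit +Inf bucket at the end:

```typescript
const BUCKETS_MS = [5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000];

// Returns the index of the first bucket whose upper bound (le) contains
// the value; values beyond the last bound fall into the implicit +Inf bucket.
function bucketIndex(valueMs: number, buckets: number[] = BUCKETS_MS): number {
  const i = buckets.findIndex((le) => valueMs <= le);
  return i === -1 ? buckets.length : i;
}
```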

Cardinality Management

Never use unbounded values as labels:
// ❌ Unbounded cardinality (user IDs)
requestCounter.add(1, { userId: req.user.id });

// ✅ Bounded cardinality
requestCounter.add(1, { userRole: req.user.role });
Cap label cardinality to < 100 unique values per metric.
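Where label values come from user input, the cap can be enforced defensively by folding values beyond the limit into a catch-all. A sketch; `makeLabelLimiter` is an illustrative name:

```typescript
// Tracks unique label values and folds anything beyond the limit into
// "other", keeping per-metric cardinality bounded.
function makeLabelLimiter(limit = 100) {
  const seen = new Set<string>();
  return (value: string): string => {
    if (seen.has(value)) return value;
    if (seen.size < limit) {
      seen.add(value);
      return value;
    }
    return 'other';
  };
}
```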

SLO / SLI Definitions

Service Level Indicators (SLIs)

Define as ratios of good events to total events:
| SLI | Definition | Measurement Source |
| --- | --- | --- |
| Availability | Requests returning non-5xx / total requests | Load balancer logs |
| Latency | Requests completing < threshold / total | Tracing p99 |
| Error rate | Failed operations / total operations | Application metrics |
| Freshness | Data updated within SLA / total records | Background job metrics |

Service Level Objectives (SLOs)

Typical starting points:
  • Availability: 99.9% (43 min/month error budget)
  • Latency: p99 < 500ms

Error Budgets

const errorBudget = 1 - sloTarget; // e.g., 1 - 0.999 = 0.001 (0.1%)

// Track on rolling 30-day window
const budgetRemaining = errorBudget - actualErrorRate;

if (budgetRemaining < 0) {
  // SLO violated, slow down releases
}
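The budget arithmetic generalizes to any window; this sketch reproduces the 43 min/month figure quoted above (`errorBudgetMinutes` is an illustrative helper):

```typescript
// Error budget in minutes over a rolling window:
// (1 - SLO target) × window length.
function errorBudgetMinutes(sloTarget: number, windowDays = 30): number {
  return (1 - sloTarget) * windowDays * 24 * 60;
}
```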

Burn Rate Alerts

Multi-window approach:
// Fast-burn alert: 2% budget consumed in 1 hour
if (errorRateLast1h > 0.02 && errorRateLast5m > 0.02) {
  alert('Fast SLO burn');
}

// Slow-burn alert: 5% consumed in 6 hours
if (errorRateLast6h > 0.05 && errorRateLast30m > 0.05) {
  alert('Slow SLO burn');
}
Alert only when both windows confirm.
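The same check can be phrased in terms of burn rate (observed error rate divided by the error budget), the formulation common in SRE practice. A sketch with illustrative names; the threshold of 14 mirrors a fast-burn policy, not a value from this rule:

```typescript
// Burn rate: how many times faster than "exactly on budget" the
// error budget is being consumed. 1 = on budget, 10 = 10x too fast.
function burnRate(errorRate: number, errorBudget: number): number {
  return errorRate / errorBudget;
}

// Multi-window confirmation: both the long and short windows must
// exceed the threshold before alerting, to avoid flapping.
function shouldAlert(
  longWindowRate: number,
  shortWindowRate: number,
  errorBudget: number,
  burnThreshold: number,
): boolean {
  return (
    burnRate(longWindowRate, errorBudget) > burnThreshold &&
    burnRate(shortWindowRate, errorBudget) > burnThreshold
  );
}
```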

Alerting

Severity Levels

| Severity | Criteria | Response Time | Notification |
| --- | --- | --- | --- |
| P1 | Service down, data loss risk | 15 min | Page on-call + Slack |
| P2 | Degraded performance, SLO at risk | 1 hour | Page on-call |
| P3 | Non-critical issue, workaround exists | Next business day | Slack channel |
| P4 | Cosmetic / low-impact | Sprint backlog | Ticket only |

Runbooks

Every alert must link to a runbook with:
  • Symptoms
  • Likely causes
  • Diagnostic steps
  • Remediation actions

Alert Fatigue Prevention

  • Tune thresholds to < 5 actionable alerts per on-call shift
  • Suppress duplicate alerts within 10-minute dedup window
  • Review alert quality monthly: snooze/delete alerts with < 20% action rate
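The dedup window is straightforward to implement; injecting the clock keeps the logic testable. A sketch; `makeDeduper` is an illustrative name:

```typescript
// Suppresses repeat firings of the same alert key inside the dedup window.
// Returns true when the alert should be delivered, false when suppressed.
function makeDeduper(windowMs = 10 * 60 * 1000, now: () => number = Date.now) {
  const lastFired = new Map<string, number>();
  return (alertKey: string): boolean => {
    const t = now();
    const prev = lastFired.get(alertKey);
    if (prev !== undefined && t - prev < windowMs) return false; // suppressed
    lastFired.set(alertKey, t);
    return true; // deliver
  };
}
```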

Escalation Policies

escalation:
  - level: 1
    notify: on-call-primary
    timeout: 15m
  - level: 2
    notify: on-call-secondary
    timeout: 15m
  - level: 3
    notify: engineering-lead

Structured Error Reporting

Sentry Integration

import * as Sentry from '@sentry/node';

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.NODE_ENV,
  release: process.env.APP_VERSION,  // git SHA or semver
  tracesSampleRate: 0.1,  // 10% for transactions
  sampleRate: 1.0,        // 100% for errors
});

// Capture error with context
Sentry.captureException(error, {
  tags: {
    correlationId: req.correlationId,
  },
  user: {
    id: user.id,
    email: user.email,  // Never include PII beyond what's necessary
  },
  extra: {
    requestPath: req.path,
    requestMethod: req.method,
  },
});

Breadcrumbs

Capture the last 50 user actions as breadcrumbs (set maxBreadcrumbs: 50 in Sentry.init; the SDK default is 100):
Sentry.addBreadcrumb({
  category: 'user-action',
  message: 'User clicked checkout button',
  level: 'info',
  data: {
    cartItems: cart.items.length,
    total: cart.total,
  },
});

Custom Fingerprints

Sentry.captureException(error, {
  fingerprint: ['{{ default }}', error.code],  // Group by error code
});

OpenTelemetry Semantic Conventions

Follow OpenTelemetry Semantic Conventions v1.29+ for consistent attribute naming.

Standard Attribute Namespaces

| Namespace | Scope | Key Attributes |
| --- | --- | --- |
| http.* | HTTP spans | http.request.method, http.response.status_code, http.route, url.full |
| db.* | Database spans | db.system, db.operation.name, db.collection.name, db.query.text (sanitized) |
| rpc.* | RPC spans | rpc.system, rpc.service, rpc.method, rpc.grpc.status_code |
| messaging.* | Message queue spans | messaging.system, messaging.operation.type, messaging.destination.name |
| faas.* | Serverless invocations | faas.trigger, faas.invoked_name, faas.coldstart |
| cloud.* | Cloud provider context | cloud.provider, cloud.region, cloud.account.id |

Resource Semantic Conventions

Every service must declare resource attributes at startup:
| Attribute | Requirement | Description |
| --- | --- | --- |
| service.name | Required | Logical service name (e.g., api-gateway) |
| service.version | Recommended | Semantic version (e.g., 1.4.2) |
| deployment.environment.name | Recommended | Environment (e.g., production) |
| service.instance.id | Recommended | Unique instance ID (e.g., pod name) |

Span Status Codes

| Code | When to Set |
| --- | --- |
| UNSET | Default (operation completed without indicating error) |
| OK | Explicitly successful (use sparingly) |
| ERROR | Operation failed (exception caught, 5xx response) |
Set to ERROR for:
  • Server-side errors (5xx)
  • Unhandled exceptions
Do not set ERROR for client errors (4xx) — those are valid responses.
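The rule reduces to a one-line mapping. A sketch (`spanStatusForHttp` is an illustrative name) using string literals in place of the SDK's SpanStatusCode enum:

```typescript
type SpanStatus = 'UNSET' | 'ERROR';

// 5xx responses mark the span as ERROR; 4xx and success leave it UNSET,
// since client errors are valid responses from the server's perspective.
function spanStatusForHttp(statusCode: number): SpanStatus {
  return statusCode >= 500 ? 'ERROR' : 'UNSET';
}
```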

Dashboard Standards

Required Dashboards Per Service

| Dashboard | Contents |
| --- | --- |
| Service Health | Request rate, error rate, latency p50/p95/p99, saturation |
| Business Metrics | Key domain counters, conversion funnels, feature adoption |
| Dependencies | Upstream/downstream latency, error rates, circuit breaker state |
| Infrastructure | CPU, memory, disk, connection pools, queue depth |

Dashboard-as-Code

Define dashboards in version-controlled JSON/YAML (Grafana provisioning, Terraform). No manual dashboard creation in production.

Panel Requirements

  • Descriptive title
  • Unit labels
  • Threshold lines for SLO targets
  • Link to relevant runbook or alert

Enforcement

CI gates:
  • Logs use structured format (lint check)
  • No console.log in production code
  • OpenTelemetry instrumentation present
Code review checklist:
  • Structured logging with required fields
  • No sensitive data in logs
  • Correlation ID propagated
  • Spans named according to conventions
  • Metrics use OpenTelemetry SDK
  • Alert runbooks linked
