
OpenTelemetry Collector

The OpenTelemetry Collector is a vendor-agnostic implementation for receiving, processing, and exporting telemetry data. This infrastructure uses a custom-built OTel Collector with specific receivers, processors, and exporters tailored for the observability stack.

Purpose

The OTel Collector serves as the central aggregation point for all observability data:
  • Traces → Tempo (via OTLP/gRPC)
  • Metrics → Prometheus (via Remote Write)
  • Logs → Loki (via OTLP/HTTP)
This architecture allows services to use a single OTLP endpoint for all telemetry, simplifying instrumentation.
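The single-endpoint convention works because OTLP/HTTP uses fixed per-signal paths under one base URL. A minimal sketch of that derivation (the `/v1/<signal>` paths come from the OTLP/HTTP specification; the base address is this deployment's collector service):

```python
# Per-signal OTLP/HTTP URLs are derived from one base endpoint; the
# /v1/<signal> paths are defined by the OTLP/HTTP specification.
BASE = "http://otel-collector.observability:4318"

def otlp_http_url(base: str, signal: str) -> str:
    """Return the OTLP/HTTP URL for 'traces', 'metrics', or 'logs'."""
    if signal not in ("traces", "metrics", "logs"):
        raise ValueError(f"unknown signal: {signal}")
    return f"{base.rstrip('/')}/v1/{signal}"

print(otlp_http_url(BASE, "traces"))
# http://otel-collector.observability:4318/v1/traces
```

This is why the SDKs only need `OTEL_EXPORTER_OTLP_ENDPOINT` to be set once.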

Custom Build

The OTel Collector is built using Nix to include only the required components, reducing image size and attack surface.

Build Configuration

From flake.nix:67-86:
otel-collector = otelPkgs.buildOtelCollector {
  pname = "otel-collector";
  version = "0.147.0";
  config = {
    receivers = [
      { gomod = "go.opentelemetry.io/collector/receiver/otlpreceiver v0.147.0"; }
    ];
    processors = [
      { gomod = "go.opentelemetry.io/collector/processor/batchprocessor v0.147.0"; }
    ];
    exporters = [
      { gomod = "go.opentelemetry.io/collector/exporter/otlpexporter v0.147.0"; }
      { gomod = "go.opentelemetry.io/collector/exporter/otlphttpexporter v0.147.0"; }
      {
        gomod = "github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter v0.147.0";
      }
    ];
  };
  vendorHash = "sha256-NtieNKEtGgdKK1K4JWGzk/z5SME9fuhqE7vXZEdrRcs=";
};

Components

  • Receivers: OTLP (gRPC on :4317, HTTP on :4318)
  • Processors: Batch processor, which groups telemetry into batches before export to cut per-request overhead
  • Exporters:
    • otlp - For traces to Tempo
    • otlphttp - For logs to Loki
    • prometheusremotewrite - For metrics to Prometheus
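The batch processor's behavior can be illustrated with a toy batching loop, sketched here in Python (not the collector's actual Go implementation; the size and timeout thresholds are hypothetical): items accumulate and are flushed downstream once either a size limit or a time limit is reached.

```python
import time

class Batcher:
    """Toy model of a batch processor: flush on size or timeout."""

    def __init__(self, export, max_size=512, timeout_s=0.2):
        self.export = export          # downstream exporter callback
        self.max_size = max_size
        self.timeout_s = timeout_s
        self.items = []
        self.last_flush = time.monotonic()

    def add(self, item):
        self.items.append(item)
        now = time.monotonic()
        if len(self.items) >= self.max_size or now - self.last_flush >= self.timeout_s:
            self.flush()

    def flush(self):
        if self.items:
            self.export(self.items)   # one export call for many items
        self.items = []
        self.last_flush = time.monotonic()

batches = []
b = Batcher(batches.append, max_size=3)
for span in range(7):
    b.add(span)
b.flush()                             # drain the partial final batch
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Fewer, larger export calls are what makes batching a performance win for the downstream backends.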

Deployment Configuration

From nixidy/env/local/otel-collector.nix:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: {}

exporters:
  prometheusremotewrite:
    endpoint: http://kube-prometheus-stack-prometheus.observability:9090/api/v1/write
  otlp:
    endpoint: tempo.observability:4317
    tls:
      insecure: true
  otlphttp:
    endpoint: http://loki.observability:3100/otlp

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
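The three pipelines above implement a simple fan-out by signal type: every signal arrives over OTLP, is batched, and is routed to exactly one exporter. That routing can be modeled as a lookup table mirroring the YAML:

```python
# Mirror of service.pipelines: one receiver and processor chain per
# signal, each routed to a single backend-specific exporter.
PIPELINES = {
    "traces":  {"receivers": ["otlp"], "processors": ["batch"], "exporters": ["otlp"]},
    "metrics": {"receivers": ["otlp"], "processors": ["batch"], "exporters": ["prometheusremotewrite"]},
    "logs":    {"receivers": ["otlp"], "processors": ["batch"], "exporters": ["otlphttp"]},
}

def exporter_for(signal: str) -> str:
    """Return the exporter a signal type is routed to."""
    return PIPELINES[signal]["exporters"][0]

print(exporter_for("metrics"))  # prometheusremotewrite
```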

Endpoints

Protocol   Port   Purpose
gRPC       4317   OTLP/gRPC endpoint for all telemetry
HTTP       4318   OTLP/HTTP endpoint (alternative)
Access within cluster:
# gRPC endpoint
otel-collector.observability.svc.cluster.local:4317

# HTTP endpoint
http://otel-collector.observability.svc.cluster.local:4318
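From inside the cluster, a quick sanity check is a plain TCP connect to each listener; a minimal sketch (this only confirms the port is open, not that OTLP requests succeed):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# In-cluster service addresses for the collector's OTLP listeners.
for port in (4317, 4318):
    ok = port_open("otel-collector.observability.svc.cluster.local", port)
    print(f"port {port}: {'open' if ok else 'closed'}")
```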

Resource Limits

requests:
  cpu: 100m
  memory: 128Mi
limits:
  cpu: 500m
  memory: 512Mi

Image Build and Caching

The collector image is built with Nix and cached in Cloudflare R2 for fast bootstrap times.

Build Process

# Build locally (used during development)
load-otel-collector-image build

# Smart mode: Try R2 cache → local cache → build
load-otel-collector-image smart

# Full mode: smart + load into kind
load-otel-collector-image full

R2 Cache

CI automatically builds and uploads the OTel Collector image to R2 when flake.nix or flake.lock changes:
  • URL: $R2_BUCKET_URL/{arch}/{hash}.tar
  • Architectures: x86_64-linux, aarch64-linux
  • Hash: Based on flake.nix + flake.lock content
This eliminates the ~3-5 minute build time on subsequent bootstrap runs.
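The cache key can be reproduced locally. Here is a sketch of one way to derive a content hash over the two flake files (the exact scheme CI uses may differ; this illustrates the idea that any change to either file yields a new key):

```python
import hashlib
from pathlib import Path

def flake_cache_key(*paths: str) -> str:
    """SHA-256 over the concatenated file contents, in argument order."""
    h = hashlib.sha256()
    for p in paths:
        h.update(Path(p).read_bytes())
    return h.hexdigest()

# e.g. flake_cache_key("flake.nix", "flake.lock") would produce the
# {hash} component of the R2 object URL under this scheme.
```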

Integration with Services

Istio

Istio sidecars (or ztunnel in ambient mode) send traces to the OTel Collector:
# istio/telemetry.yaml
tracing:
  - providers:
    - name: otel
  customTags:
    cluster_name:
      literal:
        value: "microservice-infra"

Traefik

Traefik forwards traces via OTLP:
tracing:
  otlp:
    grpc:
      endpoint: "otel-collector.observability.svc.cluster.local:4317"

Application Code

Applications can send telemetry directly to the OTel Collector:
# Environment variable for OTLP endpoint
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector.observability:4318
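In application code you would normally use the OpenTelemetry SDK, which reads `OTEL_EXPORTER_OTLP_ENDPOINT` automatically. As an illustration of what actually travels over OTLP/HTTP, here is a stdlib-only sketch that builds a minimal JSON-encoded trace export request (field names follow the OTLP JSON mapping; the service and span names are hypothetical, and the ids are random):

```python
import json
import os
import secrets
import time
import urllib.request

def build_trace_payload(service: str, span_name: str) -> dict:
    """Minimal OTLP/JSON trace export request containing a single span."""
    now = time.time_ns()
    return {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service}},
            ]},
            "scopeSpans": [{"spans": [{
                "traceId": secrets.token_hex(16),   # 16 random bytes, hex-encoded
                "spanId": secrets.token_hex(8),     # 8 random bytes, hex-encoded
                "name": span_name,
                "kind": 1,                          # SPAN_KIND_INTERNAL
                "startTimeUnixNano": str(now),
                "endTimeUnixNano": str(now + 1_000_000),
            }]}],
        }],
    }

def send_trace(payload: dict) -> None:
    """POST the payload to the collector's OTLP/HTTP traces endpoint."""
    base = os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT",
                          "http://otel-collector.observability:4318")
    req = urllib.request.Request(
        f"{base}/v1/traces",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

# To actually export (requires in-cluster network access):
# send_trace(build_trace_payload("demo-app", "manual-test-span"))
```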

Data Flow

┌─────────────┐
│   Istio     │──┐
└─────────────┘  │
                 │
┌─────────────┐  │    ┌──────────────────┐
│   Traefik   │──┼───▶│ OTel Collector   │
└─────────────┘  │    │  (batch, route)  │
                 │    └──────────────────┘
┌─────────────┐  │            │
│   Apps      │──┘            │
└─────────────┘               │
                              ├──▶ Tempo (traces)
                              ├──▶ Prometheus (metrics)
                              └──▶ Loki (logs)

Related Pages

  • Tempo — distributed tracing backend
  • Prometheus — metrics storage and query engine
  • Loki — log aggregation system
  • Observability Architecture — complete observability stack

Troubleshooting

Check OTel Collector Status

kubectl get pods -n observability -l app.kubernetes.io/name=otel-collector
kubectl logs -n observability -l app.kubernetes.io/name=otel-collector

Test OTLP Endpoint

# Port-forward to test locally
kubectl port-forward -n observability svc/otel-collector 4317:4317

# Send test span (requires grpcurl)
grpcurl -plaintext -d '{...}' localhost:4317 \
  opentelemetry.proto.collector.trace.v1.TraceService/Export

Verify Data Flow

# Check Tempo for traces
kubectl port-forward -n observability svc/tempo 3100:3100
curl http://localhost:3100/api/traces/{trace-id}

# Check Prometheus for metrics
kubectl port-forward -n observability svc/kube-prometheus-stack-prometheus 9090:9090
# Visit http://localhost:9090

# Check Loki for logs
kubectl port-forward -n observability svc/loki 3100:3100
curl -G http://localhost:3100/loki/api/v1/query_range \
  --data-urlencode 'query={job="otel-collector"}'

Rebuild Image

If the image build fails or needs updating:
# Clean rebuild
load-otel-collector-image build

# Load into kind
kind load docker-image otel-collector:latest --name microservice-infra
