
Distributed Tracing Overview

Distributed tracing is essential for understanding the behavior of complex distributed systems. It allows you to track requests as they flow through multiple services and identify performance bottlenecks and failures.

Foundational Paper

Dapper: Google’s Distributed Systems Tracing Infrastructure

Dapper is Google’s large-scale distributed-systems tracing infrastructure. The paper describing it laid the foundation for modern distributed tracing systems and influenced the design of numerous open source projects.

Dapper Paper

Google’s approach to large-scale distributed systems tracing, and essential reading for understanding modern tracing architectures.

The Dapper paper introduces key concepts such as trace trees, spans, and sampling strategies that are now standard in distributed tracing systems.
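
To make the trace tree idea concrete, here is a minimal sketch of spans linked by parent IDs and rendered as a tree. The `Span` fields, service names, and timings are hypothetical and chosen for illustration; they are not taken from the Dapper paper or from any particular tracing library.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    """One unit of work. Field names are illustrative, not a real library's API."""
    trace_id: str             # shared by every span in the same trace
    span_id: str              # unique within the trace
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_us: int             # start time in microseconds
    end_us: int               # end time in microseconds

    @property
    def duration_us(self) -> int:
        return self.end_us - self.start_us


def build_tree(spans: List[Span]) -> Dict[Optional[str], List[Span]]:
    """Index spans by parent_id so the tree can be walked from the root."""
    children: Dict[Optional[str], List[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)
    return children


def render(children, parent_id=None, depth=0) -> List[str]:
    """Depth-first walk that indents each span under its parent."""
    lines = []
    for span in sorted(children.get(parent_id, []), key=lambda s: s.start_us):
        lines.append(f"{'  ' * depth}{span.name} ({span.duration_us} us)")
        lines.extend(render(children, span.span_id, depth + 1))
    return lines


# Hypothetical trace: one request fanning out to two services and a database.
spans = [
    Span("t1", "a", None, "frontend /checkout", 0, 900),
    Span("t1", "b", "a", "cart-service GetCart", 100, 300),
    Span("t1", "c", "a", "payment-service Charge", 350, 850),
    Span("t1", "d", "c", "db INSERT payment", 400, 700),
]
tree_lines = render(build_tree(spans))
print("\n".join(tree_lines))
```

Reading the indented output top to bottom shows where the request spent its time: in this made-up trace, most of the root span's 900 microseconds falls under the payment call.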

Open Source Tracing Projects

The following open source projects were directly inspired by Dapper’s design and provide production-ready distributed tracing capabilities:

  • Zipkin: A distributed tracing system that helps gather timing data for microservice architectures
  • Apache SkyWalking: An application performance monitoring system for distributed systems, designed especially for microservices, cloud native, and container-based architectures
  • Pinpoint: An APM (Application Performance Management) tool for large-scale distributed systems written in Java
  • Apache HTrace: A tracing framework for use with distributed systems written in Java

Zipkin

Zipkin is one of the most widely adopted distributed tracing systems. It helps gather timing data needed to troubleshoot latency problems in microservice architectures. It manages both the collection and lookup of this data.

Apache SkyWalking

Apache SkyWalking is an application performance monitoring system designed for microservices, cloud native, and container-based architectures. It provides distributed tracing, service mesh telemetry analysis, and metric aggregation.

Pinpoint

Pinpoint is an APM tool developed by Naver for large-scale distributed systems. It is particularly well suited to Java-based applications and provides detailed insights into application performance with minimal overhead.

Apache HTrace

HTrace is a tracing framework designed for distributed systems written in Java. It integrates with various Hadoop ecosystem components and provides flexible tracing capabilities.

All of these projects follow the core principles established by Dapper: low overhead, application-level transparency, and scalability to handle the volume of data generated by large distributed systems.

Key Tracing Concepts

When implementing distributed tracing, choose the sampling rate carefully. If it is too high, the volume of trace data can overwhelm your tracing infrastructure; if it is too low, you might miss important failures.

Essential concepts in distributed tracing include:
  • Traces: The complete journey of a request through the system
  • Spans: Individual units of work within a trace
  • Sampling: Strategies for deciding which requests to trace
  • Context Propagation: Passing trace context across service boundaries
  • Aggregation: Combining trace data for analysis
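
Sampling and context propagation can be sketched together. The example below assumes a head-based sampling decision derived from the trace ID, and it carries the context in a `traceparent` header laid out like the W3C Trace Context format (`version-traceid-parentid-flags`). That header choice, the `SAMPLE_RATE` value, and the helper names are assumptions made for illustration; the projects above each have their own propagation formats and sampling options.

```python
import secrets
from typing import Optional, Tuple

SAMPLE_RATE = 0.1  # hypothetical rate: keep roughly 10% of traces


def should_sample(trace_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Head-based sampling: decide once per trace from the trace ID alone,
    so any service that sees the same ID reaches the same decision."""
    bucket = int(trace_id[-8:], 16) / 0x1_0000_0000  # uniform in [0, 1)
    return bucket < rate


def inject(headers: dict, trace_id: str, span_id: str, sampled: bool) -> None:
    """Write the trace context into outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers: dict) -> Optional[Tuple[str, str, bool]]:
    """Read (trace_id, parent_span_id, sampled) from incoming headers."""
    value = headers.get("traceparent")
    if value is None:
        return None
    parts = value.split("-")
    if len(parts) != 4 or len(parts[1]) != 32 or len(parts[2]) != 16:
        return None  # malformed header: treat the request as untraced
    _version, trace_id, parent_id, flags = parts
    return trace_id, parent_id, (int(flags, 16) & 1) == 1


# Service A starts a trace and calls service B.
trace_id = secrets.token_hex(16)  # 32 hex characters
span_a = secrets.token_hex(8)     # 16 hex characters
outgoing: dict = {}
inject(outgoing, trace_id, span_a, should_sample(trace_id))

# Service B continues the same trace, using A's span as the parent.
ctx = extract(outgoing)
```

Deriving the decision from the trace ID keeps a trace either fully recorded or fully dropped across services, which avoids partial traces. It cannot favor failed requests, since the decision is made before the outcome is known; that is the trade-off the sampling-rate note above describes.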

Additional Resources

For more information on monitoring and observability in distributed systems:
