Observability is the ability to understand what a Linux system is doing internally by examining the signals it emits: metrics, logs, traces, and events. This guide maps observability tools to the layers of the Linux operating system they inspect, from user-space applications down to hardware, providing both a practical reference and a mental model for selecting the right tool during troubleshooting, performance engineering, capacity planning, incident response, and DevSecOps work.

System Layers Overview

  • Application & User-Space: Process behavior, system calls, library calls, and application-level metrics
  • System Libraries & Syscalls: Transitions between user-space and kernel-space, syscall latency
  • Kernel Subsystems: Filesystems, memory management, scheduling, networking internals
  • Device Drivers & Block Layer: I/O flow through the Linux block subsystem
  • Storage & Swap: Physical disks, logical volumes, controllers, swap usage
  • Network Stack & NICs: Network interfaces, Ethernet drivers, ports, NIC statistics
  • Hardware: CPU, RAM, buses, performance counters, NUMA
  • System-Wide Tools: Multi-layer observability and historical metrics

1. Application & User-Space Observability

These tools inspect behavior at the process and application level, including interactions with system libraries.

Tools

Tool      Purpose
strace    Traces system calls made by an application
ltrace    Traces dynamic library calls
ss        Modern socket statistics (replacement for netstat)
netstat   Legacy connection-state overview
sysdig    System-wide syscall/event capture and filtering
lsof      Lists open files, sockets, and pipes
pidstat   Per-process CPU, memory, I/O, and thread statistics
pcstat    Page cache statistics for specific files

When to Use

  • Application Debugging: Why an application is slow or blocked
  • Network Analysis: Network usage per process
  • Security Auditing: Open files and ports audit
  • Performance Tuning: Syscall patterns for optimization
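Much of what lsof reports can be read straight from /proc when the tool itself is unavailable. The sketch below (assuming a Linux host with /proc mounted) lists the current shell's open descriptors:

```shell
# Minimal substitute for `lsof -p <PID>` using only /proc (Linux).
# Here we inspect the current shell's own descriptors.
pid=$$
echo "Open descriptors for PID $pid:"
for fd in /proc/"$pid"/fd/*; do
    # Each entry is a symlink to the file, socket, or pipe behind the fd
    printf '  fd %s -> %s\n' "${fd##*/}" "$(readlink "$fd")"
done
```

Sockets appear as `socket:[inode]`, which can be cross-referenced against /proc/net/tcp during audits.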

2. System Libraries & Syscall Interface

This layer sits between applications and the kernel, helping examine transitions between user-space and kernel-space.

Tools

Tool             Purpose
strace / ltrace  Observe execution flow into syscalls and libraries
perf trace       Syscall latency, profiling, hotspots
ftrace           Built-in kernel tracer for syscalls and function calls
SystemTap        Programmable probes for syscalls
LTTng            High-performance tracing for production systems
eBPF / bpftrace  Modern, safe kernel-level instrumentation
For production systems, use eBPF/bpftrace or LTTng for high-resolution tracing with minimal overhead.
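Even without a tracer installed, the cost of crossing the syscall boundary can be made visible with nothing but dd: the same 64 KiB is copied once as 65,536 one-byte read/write pairs and once as a single large pair. This is a rough demonstration of syscall overhead, not a benchmark:

```shell
# Crude illustration of syscall transition cost: identical bytes moved,
# wildly different syscall counts.
t0=$(date +%s%N)
dd if=/dev/zero of=/dev/null bs=1 count=65536 2>/dev/null     # ~131,072 syscalls
t1=$(date +%s%N)
dd if=/dev/zero of=/dev/null bs=65536 count=1 2>/dev/null     # ~2 syscalls
t2=$(date +%s%N)
echo "65536 tiny reads/writes: $(( (t1 - t0) / 1000000 )) ms"
echo "1 large read/write:      $(( (t2 - t1) / 1000000 )) ms"
```

A real tracer (perf trace, bpftrace) would show the same effect per syscall rather than in aggregate.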

3. Kernel Subsystems Observability

The kernel handles filesystems, memory management, scheduling, and networking. These tools inspect internal mechanisms.

Core Tools

Tool     Function
perf     Scheduler behavior, CPU cycles, kernel hotspots
tcpdump  Raw packet capture at IP/Ethernet layers
iptraf   Lightweight network utilization monitor
vmstat   Processes, memory, swap, I/O, interrupts
slabtop  Kernel slab allocator usage
free     Memory allocation breakdown
pidstat  Scheduler awareness and per-thread stats
tiptop   Per-thread metrics using hardware counters

Use Cases

1. Memory Issues: Identify memory pressure, leaks, or slab exhaustion
2. Network Problems: Determine packet loss or congestion
3. Scheduler Latency: Analyze scheduler-induced delays
4. Kernel Performance: Understand kernel-side bottlenecks
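For memory-pressure questions, it helps to remember that free and vmstat are formatters over /proc/meminfo. A minimal sketch that derives the available-memory percentage directly from that file:

```shell
# Memory-pressure signal straight from /proc/meminfo
# (the same source free and vmstat read)
awk '/^MemTotal:/     {total = $2}
     /^MemAvailable:/ {avail = $2}
     END {printf "MemAvailable: %d kB (%.1f%% of %d kB total)\n",
                 avail, 100 * avail / total, total}' /proc/meminfo
```

MemAvailable (kernel 3.14+) is a better signal than MemFree because it accounts for reclaimable page cache.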

4. Device Drivers & Block Layer Observability

These tools examine I/O as it flows through the Linux block subsystem.

Tools

  • iostat: Block device throughput and latency
  • iotop: Per-process disk I/O usage
  • blktrace: Detailed block layer tracing
  • perf / tiptop: Device driver profiling

Use these tools for troubleshooting slow disk I/O, detecting I/O starvation, or analyzing LVM/RAID performance issues.
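iostat is itself a formatter over /proc/diskstats, so reading the raw counters is a useful fallback on minimal systems. Field positions follow the kernel's Documentation/admin-guide/iostats.rst:

```shell
# Raw block-device counters behind iostat:
# column 3 = device name, 4 = reads completed, 8 = writes completed,
# 12 = I/Os currently in flight
awk '{printf "%-12s reads=%s writes=%s in_flight=%s\n", $3, $4, $8, $12}' /proc/diskstats
```

Sampling this twice and diffing gives per-interval rates, which is exactly what `iostat -x 1` does.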

5. Storage & Swap Observability

This layer focuses on physical disks, logical volumes, controllers, and swap usage.

Key Tools

# View I/O performance
iostat -x 1

# Identify I/O-intensive processes
iotop -o

# Detailed I/O event tracing
blktrace -d /dev/sda

# Check swap usage
swapon -s
High swap usage often indicates memory pressure. Use vmstat and free to diagnose memory issues before they impact performance.
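To distinguish "pages parked in swap" from "actively swapping right now", compare the kernel's cumulative swap-in/out counters over time. A sketch using /proc directly:

```shell
# Cumulative swap-in/out page counts (the counters behind vmstat's si/so)
grep -E '^(pswpin|pswpout) ' /proc/vmstat
# Swap capacity and headroom
grep -E '^Swap(Total|Free):' /proc/meminfo
```

If pswpin/pswpout grow between two samples, the system is swapping now; a static SwapFree value alone cannot tell you that.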

6. Network Stack & NIC Observability

Tools for examining network interfaces, Ethernet drivers, ports, and NIC statistics.

Network Tools

Tool          Purpose
tcpdump       Packet-level visibility
ss / netstat  Connections and sockets
iptraf        Per-interface traffic charts
ethtool       NIC driver settings and link state
nicstat       Interface utilization
lldptool      LLDP neighbor discovery
snmpget       SNMP-based network metrics

Common Use Cases

  • Packet Issues: Drops, retransmits, MTU mismatches
  • NIC Tuning: Offload settings (TSO, GRO, etc.)
  • Link Problems: Speed/duplex mismatch troubleshooting
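iptraf and nicstat poll /proc/net/dev, and parsing it directly is handy on stripped-down hosts. A sketch printing byte and drop counters per interface:

```shell
# Per-interface counters from /proc/net/dev.
# sub() guarantees a space after the interface name's colon, since the
# kernel may emit "eth0:123456" with no separator when counters are wide.
awk 'NR > 2 {sub(/:/, " ")
             printf "%-8s rx_bytes=%s rx_drop=%s tx_bytes=%s tx_drop=%s\n",
                    $1, $2, $5, $10, $13}' /proc/net/dev
```

Rising rx_drop/tx_drop between samples is the fast first check for the packet issues listed above.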

7. Hardware Observability

Insights into how the hardware itself behaves — CPU frequency, power states, performance counters, NUMA locality, memory pressure, cache behavior, and bus throughput.

CPU Tools

Tool        Purpose
mpstat      Reports CPU usage per core, showing utilization, steal time, IRQ time, and more
top / htop  Real-time process monitoring with CPU, load average, and per-thread breakdowns
ps          Snapshot of process states, CPU usage, memory usage, and scheduling information
pidstat     Per-thread and per-process CPU utilization, context switching, and scheduling metrics
perf        Hardware performance counter profiler (cycles, cache misses, branch mispredictions)
turbostat   Intel-specific tool showing CPU frequencies, C-states, P-states, and turbo boost behavior
rdmsr       Reads CPU model-specific registers (MSRs) for extremely low-level introspection

Memory Tools

Tool                  Function
vmstat                Paging, swapping, memory pressure, interrupts
free                  Total, used, cached, available memory
slabtop               Kernel slab allocator statistics
numastat              NUMA locality, node memory distribution
perf (memory events)  Hardware counters for RAM, cache, memory bus

When to Use Hardware Tools

1. NUMA Analysis: Debug NUMA locality and cross-node memory access
2. CPU Throttling: Investigate frequency scaling or thermal throttling
3. Memory Pressure: Analyze leaking workloads or kernel slab issues
4. High-Performance Tuning: Optimize compute-heavy or latency-sensitive workloads
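As a first pass on CPU-throttling questions, core count and current frequencies are available without turbostat, from /proc and sysfs. Note the cpufreq nodes may be absent inside VMs or containers:

```shell
# Logical CPU count
grep -c '^processor' /proc/cpuinfo
# Current frequency per core in kHz, where the cpufreq driver is loaded
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq 2>/dev/null \
    || echo "cpufreq not exposed on this system"
```

Frequencies pinned well below the nominal clock under load suggest thermal or power throttling, which turbostat can then confirm with C-state and P-state detail.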

8. System-Wide Observability Tools

These tools cover multiple layers at once, providing holistic system visibility.

Tools

  • sar: Historic performance logs across CPU, memory, I/O, network
  • dstat: Live multi-metric system aggregation
  • sysdig: Holistic tracing across syscalls, network, containers
  • /proc filesystem: Raw kernel data for metrics, states, drivers, interfaces

Use system-wide tools for incident response, baselining, long-term trending, and anomaly detection.
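When none of these tools are installed (minimal container images, rescue shells), /proc alone yields a serviceable one-shot snapshot. A sketch:

```shell
# One-shot system snapshot from /proc, no tools required
read l1 l5 l15 runq _ < /proc/loadavg
echo "load averages: $l1 $l5 $l15 (runnable/total threads: $runq)"
awk '/^MemAvailable:/ {print "mem available:", $2, "kB"}' /proc/meminfo
awk '/^cpu / {print "cpu jiffies (user nice system idle):", $2, $3, $4, $5}' /proc/stat
```

Running it twice a few seconds apart and diffing the jiffies is the same trick dstat applies every interval.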

Practical Use Cases

Root Cause Analysis (RCA)

1. Identify the Layer: Determine whether the slowdown is CPU, memory, network, or storage related
2. Trace the Issue: Follow a misbehaving process through syscalls into the kernel
3. Compare Baselines: Compare observed performance against baseline metrics
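The layer-identification step can be partly automated: a 1-minute load average persistently above the core count points at CPU or run-queue saturation, while a slow system with low load points elsewhere. A rough sketch of that heuristic:

```shell
# Heuristic for RCA step 1: is the bottleneck CPU-side?
cores=$(grep -c '^processor' /proc/cpuinfo)
load1=$(awk '{print $1}' /proc/loadavg)
# awk does the floating-point comparison the shell cannot
if awk -v l="$load1" -v c="$cores" 'BEGIN {exit !(l > c)}'; then
    echo "load $load1 > $cores cores: suspect CPU / run-queue saturation"
else
    echo "load $load1 <= $cores cores: check memory, storage, or network next"
fi
```

Caveat: Linux load average also counts tasks in uninterruptible sleep (D state), so a high load can equally mean I/O stalls; confirm with vmstat and iostat before concluding.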

Performance Tuning

  • Scheduler tracing for latency-sensitive workloads
  • NIC tuning via ethtool for high-throughput environments
  • Storage insight for LVM/RAID/SSD/HDD tuning

DevSecOps / Security

A secure system is one that is understood, not just hardened.
  • eBPF tools for detecting suspicious syscalls
  • lsof for auditing unexpected open sockets and files
  • sysdig rules for behavioral anomaly detection
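When auditing a host where ss, netstat, and lsof may themselves be missing (or untrusted after a compromise), listening TCP sockets can be decoded from /proc/net/tcp: state 0A is LISTEN and addresses are hex-encoded. A bash sketch:

```shell
# List listening TCP ports by decoding /proc/net/tcp directly
# (column 2 = local address:port in hex, column 4 = state, 0A = LISTEN).
# Uses bash's base-16 arithmetic to convert the hex port.
while read -r _ laddr _ state _; do
    [ "$state" = "0A" ] || continue
    port_hex=${laddr##*:}
    echo "listening on TCP port $(( 16#$port_hex )) (hex $port_hex)"
done < /proc/net/tcp
```

IPv6 listeners live in /proc/net/tcp6 with the same layout.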

Observability in DevSecOps

Observability is not just operational — it is security-critical:

  • Intrusion Detection: Detect unusual syscall patterns (possible intrusion)
  • Crypto Mining Detection: Identify crypto miners via CPU and scheduler patterns
  • Exfiltration Detection: Spot data exfiltration via abnormal NIC or TCP behavior
  • Hardening Validation: Validate that hardening changes improve rather than degrade performance

Quick Command Reference

# Application layer
strace -p <PID>              # Trace system calls
lsof -p <PID>                # List open files
ss -tunap                    # Socket statistics

# Kernel subsystems
vmstat 1                     # Memory and I/O stats
perf top                     # Real-time kernel profiling
slabtop                      # Kernel memory allocator

# Storage layer
iostat -x 1                  # Disk I/O statistics
iotop -o                     # I/O per process

# Network layer
tcpdump -i eth0              # Packet capture
ethtool eth0                 # NIC settings

# Hardware layer
mpstat -P ALL 1              # Per-CPU statistics
numastat                     # NUMA memory distribution

# System-wide
sar -A                       # Historical system metrics
dstat -tclmdrn               # Live multi-metric view
Start with high-level tools like top, vmstat, and iostat to identify the problem layer, then drill down with specialized tools.

References

  • Brendan Gregg — Linux Performance Tools
  • Kernel documentation — https://www.kernel.org/doc/
  • Sysdig, LTTng, SystemTap official documentation
  • eBPF / bpftrace reference guides
