This guide maps observability tools to layers of the Linux operating system, from user-space applications down to hardware, providing a mental model for selecting the right tool during analysis or incident response.
System Layers Overview
| Layer | Focus |
|---|---|
| Application & User-Space | Process behavior, system calls, library calls, and application-level metrics |
| System Libraries & Syscalls | Transitions between user-space and kernel-space, syscall latency |
| Kernel Subsystems | Filesystems, memory management, scheduling, networking internals |
| Device Drivers & Block Layer | I/O flow through the Linux block subsystem |
| Storage & Swap | Physical disks, logical volumes, controllers, swap usage |
| Network Stack & NICs | Network interfaces, Ethernet drivers, ports, NIC statistics |
| Hardware | CPU, RAM, buses, performance counters, NUMA |
| System-Wide Tools | Multi-layer observability and historical metrics |
1. Application & User-Space Observability
These tools inspect behavior at the process and application level, including interactions with system libraries.
Tools
| Tool | Purpose |
|---|---|
| strace | Traces system calls made by an application |
| ltrace | Traces dynamic library calls |
| ss | Modern socket statistics (replacement for netstat) |
| netstat | Legacy connection state overview |
| sysdig | System-wide syscall/event capture and filtering |
| lsof | Lists open files, sockets, pipes |
| pidstat | Per-process CPU, memory, I/O, threads |
| pcstat | Page cache statistics for specific files |
When to Use
- Application debugging: why an application is slow or blocked
- Network analysis: network usage per process
- Security auditing: open files and ports audit
- Performance tuning: syscall patterns for optimization
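Much of what lsof reports for a single process can also be read from /proc/<pid>/fd. The sketch below is a minimal illustration, assuming a Linux host with /proc mounted and permission to inspect the target process. The function name open_descriptors is hypothetical, and the output is far less complete than what lsof prints.

```python
import os


def open_descriptors(pid="self"):
    """Return {fd_number: target} for a process, read from /proc/<pid>/fd."""
    fd_dir = f"/proc/{pid}/fd"
    targets = {}
    for name in os.listdir(fd_dir):
        try:
            # Each entry is a symlink to a path, "socket:[inode]", or "pipe:[inode]".
            targets[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return targets
```

Calling open_descriptors(1234) for a hypothetical PID 1234 would list that process's files, sockets, and pipes, provided the caller has permission to read its /proc entry.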
2. System Libraries & Syscall Interface
This layer sits between applications and the kernel. Its tools examine transitions between user-space and kernel-space.
Tools
| Tool | Purpose |
|---|---|
| strace / ltrace | Observe execution flow into syscalls and libraries |
| perf | Syscall latency, profiling, hotspots |
| ftrace | Built-in kernel tracer for syscalls and function calls |
| SystemTap (stap) | Programmable probes for syscalls |
| LTTng | High-performance tracing for production systems |
| eBPF / bpftrace | Modern, safe kernel-level instrumentation |
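Syscall latency captured with strace -T can be aggregated with a few lines of scripting. This sketch assumes the usual strace -T line format, in which each completed call ends with its duration in angle brackets. The function name is hypothetical, and unfinished or resumed calls are simply skipped.

```python
import re
from collections import defaultdict

# Matches lines such as: read(3, "...", 4096) = 20 <0.000010>
# An optional "[pid N] " or "N " prefix (seen with strace -f) is tolerated.
LINE_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(\w+)\(.*<(\d+\.\d+)>$")


def summarize_syscall_latency(lines):
    """Return {syscall: (call_count, total_seconds)} for completed calls."""
    totals = defaultdict(lambda: [0, 0.0])
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # exit markers, signals, unfinished/resumed calls
        name, seconds = match.group(1), float(match.group(2))
        totals[name][0] += 1
        totals[name][1] += seconds
    return {name: (count, total) for name, (count, total) in totals.items()}
```

One way to use it: record a trace with strace -T -o trace.txt followed by the command under test, then pass the open file object to summarize_syscall_latency and sort the result by total time.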
3. Kernel Subsystems Observability
The kernel handles filesystems, memory management, scheduling, and networking. These tools inspect internal mechanisms.
Core Tools
| Tool | Function |
|---|---|
| perf | Scheduler behavior, CPU cycles, kernel hotspots |
| tcpdump | Raw packet capture at IP/Ethernet layers |
| iptraf | Lightweight network utilization monitor |
| vmstat | Processes, memory, swap, I/O, interrupts |
| slabtop | Kernel slab allocator usage |
| free | Memory allocation breakdown |
| pidstat | Scheduler awareness and per-thread stats |
| tiptop | Per-thread metrics using hardware counters |
Use Cases
4. Device Drivers & Block Layer Observability
These tools examine I/O as it flows through the Linux block subsystem.
Tools
| Tool | Purpose |
|---|---|
| iostat | Block device throughput and latency |
| iotop | Per-process disk I/O usage |
| blktrace | Detailed block layer tracing |
| perf / tiptop | Device driver profiling |
Use these tools for troubleshooting slow disk I/O, detecting I/O starvation, or analyzing LVM/RAID performance issues.
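The throughput and utilization figures that iostat reports are derived from counters in /proc/diskstats. The following sketch shows that arithmetic on two samples. It assumes the standard field layout, in which sectors are counted in 512-byte units, and the helper names are hypothetical.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


def parse_diskstats(text):
    """Return {device: counters} from the contents of /proc/diskstats."""
    stats = {}
    for line in text.splitlines():
        f = line.split()
        if len(f) < 14:
            continue
        stats[f[2]] = {
            "sectors_read": int(f[5]),
            "sectors_written": int(f[9]),
            "io_ms": int(f[12]),  # milliseconds the device spent doing I/O
        }
    return stats


def throughput(before, after, interval_s):
    """Compute per-device rates between two samples taken interval_s apart."""
    rates = {}
    for dev, now in after.items():
        prev = before.get(dev)
        if prev is None:
            continue  # device appeared between samples
        read = (now["sectors_read"] - prev["sectors_read"]) * SECTOR_BYTES
        written = (now["sectors_written"] - prev["sectors_written"]) * SECTOR_BYTES
        busy_ms = now["io_ms"] - prev["io_ms"]
        rates[dev] = {
            "read_bytes_per_s": read / interval_s,
            "write_bytes_per_s": written / interval_s,
            "util_pct": 100.0 * busy_ms / (interval_s * 1000),
        }
    return rates
```

To use it, read /proc/diskstats twice with a known pause in between and pass both parsed samples to throughput. A util_pct value that stays near 100 suggests the device is saturated, which is the same signal iostat exposes as %util.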
5. Storage & Swap Observability
This layer covers physical disks, logical volumes, controllers, and swap usage.
Key Tools
6. Network Stack & NIC Observability
Tools for examining network interfaces, Ethernet drivers, ports, and NIC statistics.
Network Tools
| Tool | Purpose |
|---|---|
| tcpdump | Packet-level visibility |
| ss / netstat | Connections and sockets |
| iptraf | Per-interface traffic charts |
| ethtool | NIC driver settings and link state |
| nicstat | Interface utilization |
| lldptool | LLDP neighbor discovery |
| snmpget | SNMP-based network metrics |
Common Use Cases
- Packet issues: drops, retransmits, MTU mismatches
- NIC tuning: offload settings (TSO, GRO, etc.)
- Link problems: speed/duplex mismatch troubleshooting
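Per-interface drop and error counters, one of the first things to check for packet issues, are exposed in /proc/net/dev. The sketch below parses that file's text and lists interfaces with non-zero drop counters. It assumes the standard 16-column layout, and the function names are hypothetical. Retransmits are not in this file and are not covered here.

```python
def parse_net_dev(text):
    """Return {interface: counters} from the contents of /proc/net/dev."""
    stats = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        f = [int(v) for v in data.split()]
        if len(f) < 16:
            continue
        stats[name.strip()] = {
            "rx_bytes": f[0], "rx_packets": f[1], "rx_errs": f[2], "rx_drop": f[3],
            "tx_bytes": f[8], "tx_packets": f[9], "tx_errs": f[10], "tx_drop": f[11],
        }
    return stats


def interfaces_with_drops(stats):
    """Return the sorted names of interfaces that have dropped packets."""
    return sorted(n for n, c in stats.items() if c["rx_drop"] or c["tx_drop"])
```

The counters are cumulative since boot, so comparing two readings taken some time apart is usually more informative than a single snapshot.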
7. Hardware Observability
Insights into how the hardware itself behaves: CPU frequency, power states, performance counters, NUMA locality, memory pressure, cache behavior, and bus throughput.
CPU Tools
| Tool | Function |
|---|---|
| mpstat | Reports CPU usage per core, showing utilization, steal time, IRQ time, and more |
| top | Real-time process monitoring with CPU, load average, and per-thread breakdowns |
| ps | Snapshot of process states, CPU usage, memory usage, and scheduling information |
| pidstat | Per-thread and per-process CPU utilization, context switching, and scheduling metrics |
| perf | Hardware performance counter profiler (cycles, cache misses, branch mispredictions) |
| turbostat | Tool for x86 processors (originally Intel-focused) showing CPU frequencies, C-states, P-states, and turbo boost behavior |
| rdmsr | Reads CPU model-specific registers (MSRs) for extremely low-level introspection |
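Per-core utilization, steal time, and IRQ time of the kind mpstat reports can be derived from the cpu lines of /proc/stat. The sketch below shows the calculation on two samples. It assumes the usual column order (user, nice, system, idle, iowait, irq, softirq, steal), and the function names are hypothetical.

```python
def parse_cpu_times(text):
    """Return {cpu: [user, nice, system, idle, iowait, irq, softirq, steal]}."""
    cpus = {}
    for line in text.splitlines():
        parts = line.split()
        if parts and parts[0].startswith("cpu"):
            values = [int(v) for v in parts[1:9]]
            values += [0] * (8 - len(values))  # older kernels expose fewer fields
            cpus[parts[0]] = values
    return cpus


def cpu_percentages(before, after):
    """Return busy, steal, and irq percentages per CPU between two samples."""
    result = {}
    for cpu, now in after.items():
        if cpu not in before:
            continue
        delta = [n - p for n, p in zip(now, before[cpu])]
        total = sum(delta)
        if total <= 0:
            continue  # no ticks elapsed between the samples
        idle = delta[3] + delta[4]  # idle plus iowait
        result[cpu] = {
            "busy": 100.0 * (total - idle) / total,
            "steal": 100.0 * delta[7] / total,
            "irq": 100.0 * (delta[5] + delta[6]) / total,
        }
    return result
```

The aggregate line named cpu is handled the same way as cpu0, cpu1, and so on, so the result contains both the system-wide figure and the per-core breakdown.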
Memory Tools
| Tool | Function |
|---|---|
| vmstat | Paging, swapping, memory pressure, interrupts |
| free | Total, used, cached, available memory |
| slabtop | Kernel slab allocator statistics |
| numastat | NUMA locality, node memory distribution |
| perf (memory events) | Hardware counters for RAM, cache, memory bus |
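The totals shown by free come from /proc/meminfo. As a rough sketch of how available memory and swap usage can be summarized from that file, the following code parses its text and computes two percentages. The function names and the choice of percentages are illustrative assumptions, not the output format of free or vmstat.

```python
def parse_meminfo(text):
    """Return {field: value} from /proc/meminfo; sizes are in kB as reported."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            info[key.strip()] = int(fields[0])
    return info


def memory_summary(info):
    """Summarize available memory and swap usage as percentages."""
    total = info["MemTotal"]
    # MemAvailable is missing on old kernels; MemFree is a pessimistic fallback.
    available = info.get("MemAvailable", info.get("MemFree", 0))
    swap_total = info.get("SwapTotal", 0)
    swap_used = swap_total - info.get("SwapFree", 0)
    return {
        "available_pct": 100.0 * available / total,
        "swap_used_pct": 100.0 * swap_used / swap_total if swap_total else 0.0,
    }
```

A low available_pct combined with a rising swap_used_pct across successive readings is a common sign of memory pressure worth confirming with vmstat.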
When to Use Hardware Tools
8. System-Wide Observability Tools
These tools cover multiple layers at once, providing holistic system visibility.
Tools
| Tool | Purpose |
|---|---|
| sar | Historic performance logs across CPU, memory, I/O, network |
| dstat | Live multi-metric system aggregation |
| sysdig | Holistic tracing across syscalls, network, containers |
| /proc filesystem | Raw kernel data for metrics, states, drivers, interfaces |
Use system-wide tools for incident response, baselining, long-term trending, and anomaly detection.
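Baselining and anomaly detection can be illustrated with a very simple rule: flag a reading that sits well above the mean of historical samples. The sketch below uses a mean plus k standard deviations threshold. This is a generic statistical shortcut chosen for illustration, not something sar or dstat does, and the function name and default k are assumptions.

```python
import statistics


def is_anomalous(history, value, k=3.0):
    """Flag a value more than k sample standard deviations above the baseline mean."""
    if len(history) < 2:
        raise ValueError("need at least two baseline samples")
    mean = statistics.mean(history)
    spread = statistics.stdev(history)
    return value > mean + k * spread
```

Here history could be, for example, CPU utilization percentages collected at the same hour on previous days. Real workloads with daily or weekly cycles usually need a baseline per time window rather than a single global one.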
Practical Use Cases
Root Cause Analysis (RCA)
Performance Tuning
- Scheduler tracing for latency-sensitive workloads
- NIC tuning via ethtool for high-throughput environments
- Storage insight for LVM/RAID/SSD/HDD tuning
DevSecOps / Security
- eBPF tools for detecting suspicious syscalls
- lsof for auditing unexpected open sockets/files
- sysdig rules for behavioral anomaly detection
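Auditing for unexpected listening sockets, as in the lsof item above, can also be sketched directly against /proc/net/tcp. The code below decodes the hexadecimal local addresses and reports listeners whose port is not on an allowlist. It assumes a little-endian host (which determines the byte order of the address field), covers IPv4 only, and uses hypothetical function names.

```python
import socket
import struct

TCP_LISTEN = "0A"  # state code for LISTEN in /proc/net/tcp


def decode_ipv4_endpoint(hex_endpoint):
    """Convert '0100007F:0016' to ('127.0.0.1', 22) on a little-endian host."""
    hex_ip, hex_port = hex_endpoint.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(hex_ip, 16)))
    return ip, int(hex_port, 16)


def listening_sockets(text):
    """Return [(ip, port)] for sockets in LISTEN state."""
    result = []
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 4 and fields[3] == TCP_LISTEN:
            result.append(decode_ipv4_endpoint(fields[1]))
    return result


def unexpected_listeners(text, allowed_ports):
    """Return listeners whose port is not in the allowlist."""
    return [(ip, port) for ip, port in listening_sockets(text)
            if port not in allowed_ports]
```

Passing the contents of /proc/net/tcp together with a set such as {22, 443} returns any other listening endpoints. Mapping a result back to its owning process still requires lsof or ss -p.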
Observability in DevSecOps
Observability is not just operational; it is security-critical:
- Intrusion detection: detect unusual syscall patterns (possible intrusion)
- Crypto mining detection: identify crypto miners via CPU and scheduler patterns
- Exfiltration detection: spot data exfiltration via abnormal NIC or TCP behavior
- Hardening validation: validate that hardening changes improve rather than degrade performance
Quick Command Reference
References
- Brendan Gregg — Linux Performance Tools
- Kernel documentation — https://www.kernel.org/doc/
- Sysdig, LTTng, SystemTap official documentation
- eBPF / bpftrace reference guides