Observability is the ability to understand what a Linux system is doing internally by examining the signals it emits: metrics, logs, traces, and events. This guide maps observability tools to the layers of the Linux operating system they inspect, from user-space applications down to hardware, providing both a practical reference and a mental model for selecting the right tool during troubleshooting, performance engineering, capacity planning, incident response, and DevSecOps work.

System Layers Overview

  • Application & User-Space: Process behavior, system calls, library calls, and application-level metrics
  • System Libraries & Syscalls: Transitions between user-space and kernel-space, syscall latency
  • Kernel Subsystems: Filesystems, memory management, scheduling, networking internals
  • Device Drivers & Block Layer: I/O flow through the Linux block subsystem
  • Storage & Swap: Physical disks, logical volumes, controllers, swap usage
  • Network Stack & NICs: Network interfaces, Ethernet drivers, ports, NIC statistics
  • Hardware: CPU, RAM, buses, performance counters, NUMA
  • System-Wide Tools: Multi-layer observability and historical metrics

1. Application & User-Space Observability

These tools inspect behavior at the process and application level, including interactions with system libraries.

Tools

Tool      Purpose
strace    Traces system calls made by an application
ltrace    Traces dynamic library calls
ss        Modern socket statistics (replacement for netstat)
netstat   Legacy connection-state overview
sysdig    System-wide syscall/event capture and filtering
lsof      Lists open files, sockets, and pipes
pidstat   Per-process CPU, memory, I/O, and thread statistics
pcstat    Page cache statistics for specific files

When to Use

  • Application Debugging: Why an application is slow or blocked
  • Network Analysis: Network usage per process
  • Security Auditing: Open files and ports audit
  • Performance Tuning: Syscall patterns for optimization
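Much of what lsof reports can be read straight from /proc when the tool itself is unavailable. The sketch below (assuming a Linux host with /proc mounted) lists the current shell's open descriptors:

```shell
# Minimal substitute for `lsof -p <PID>` using only /proc (Linux).
# Here we inspect the current shell's own descriptors.
pid=$$
echo "Open descriptors for PID $pid:"
for fd in /proc/"$pid"/fd/*; do
    # Each entry is a symlink to the file, socket, or pipe behind the fd
    printf '  fd %s -> %s\n' "${fd##*/}" "$(readlink "$fd")"
done
```

Sockets appear as `socket:[inode]`, which can be cross-referenced against /proc/net/tcp during audits.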

2. System Libraries & Syscall Interface

This layer sits between applications and the kernel, helping examine transitions between user-space and kernel-space.

Tools

Tool             Purpose
strace / ltrace  Observe execution flow into syscalls and libraries
perf trace       Syscall latency, profiling, hotspots
ftrace           Built-in kernel tracer for syscalls and function calls
SystemTap        Programmable probes for syscalls
LTTng            High-performance tracing for production systems
eBPF / bpftrace  Modern, safe kernel-level instrumentation
For production systems, use eBPF/bpftrace or LTTng for high-resolution tracing with minimal overhead.
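Even without a tracer installed, the cost of crossing the syscall boundary can be made visible with nothing but dd: the same 64 KiB is copied once as 65,536 one-byte read/write pairs and once as a single large pair. This is a rough demonstration of syscall overhead, not a benchmark:

```shell
# Crude illustration of syscall transition cost: identical bytes moved,
# wildly different syscall counts.
t0=$(date +%s%N)
dd if=/dev/zero of=/dev/null bs=1 count=65536 2>/dev/null     # ~131,072 syscalls
t1=$(date +%s%N)
dd if=/dev/zero of=/dev/null bs=65536 count=1 2>/dev/null     # ~2 syscalls
t2=$(date +%s%N)
echo "65536 tiny reads/writes: $(( (t1 - t0) / 1000000 )) ms"
echo "1 large read/write:      $(( (t2 - t1) / 1000000 )) ms"
```

A real tracer (perf trace, bpftrace) would show the same effect per syscall rather than in aggregate.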

3. Kernel Subsystems Observability

The kernel handles filesystems, memory management, scheduling, and networking. These tools inspect internal mechanisms.

Core Tools

Tool     Function
perf     Scheduler behavior, CPU cycles, kernel hotspots
tcpdump  Raw packet capture at IP/Ethernet layers
iptraf   Lightweight network utilization monitor
vmstat   Processes, memory, swap, I/O, interrupts
slabtop  Kernel slab allocator usage
free     Memory allocation breakdown
pidstat  Scheduler awareness and per-thread stats
tiptop   Per-thread metrics using hardware counters

Use Cases

1. Memory Issues: Identify memory pressure, leaks, or slab exhaustion
2. Network Problems: Determine packet loss or congestion
3. Scheduler Latency: Analyze scheduler-induced delays
4. Kernel Performance: Understand kernel-side bottlenecks
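For memory-pressure questions, it helps to remember that free and vmstat are formatters over /proc/meminfo. A minimal sketch that derives the available-memory percentage directly from that file:

```shell
# Memory-pressure signal straight from /proc/meminfo
# (the same source free and vmstat read)
awk '/^MemTotal:/     {total = $2}
     /^MemAvailable:/ {avail = $2}
     END {printf "MemAvailable: %d kB (%.1f%% of %d kB total)\n",
                 avail, 100 * avail / total, total}' /proc/meminfo
```

MemAvailable (kernel 3.14+) is a better signal than MemFree because it accounts for reclaimable page cache.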

4. Device Drivers & Block Layer Observability

These tools examine I/O as it flows through the Linux block subsystem.

Tools

  • iostat: Block device throughput and latency
  • iotop: Per-process disk I/O usage
  • blktrace: Detailed block layer tracing
  • perf / tiptop: Device driver profiling

Use these tools for troubleshooting slow disk I/O, detecting I/O starvation, or analyzing LVM/RAID performance issues.
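iostat is itself a formatter over /proc/diskstats, so reading the raw counters is a useful fallback on minimal systems. Field positions follow the kernel's Documentation/admin-guide/iostats.rst:

```shell
# Raw block-device counters behind iostat:
# column 3 = device name, 4 = reads completed, 8 = writes completed,
# 12 = I/Os currently in flight
awk '{printf "%-12s reads=%s writes=%s in_flight=%s\n", $3, $4, $8, $12}' /proc/diskstats
```

Sampling this twice and diffing gives per-interval rates, which is exactly what `iostat -x 1` does.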

5. Storage & Swap Observability

This layer focuses on physical disks, logical volumes, controllers, and swap usage.

Key Tools

# View I/O performance
iostat -x 1

# Identify I/O-intensive processes
iotop -o

# Detailed I/O event tracing
blktrace -d /dev/sda

# Check swap usage
swapon -s
High swap usage often indicates memory pressure. Use vmstat and free to diagnose memory issues before they impact performance.
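To distinguish "pages parked in swap" from "actively swapping right now", compare the kernel's cumulative swap-in/out counters over time. A sketch using /proc directly:

```shell
# Cumulative swap-in/out page counts (the counters behind vmstat's si/so)
grep -E '^(pswpin|pswpout) ' /proc/vmstat
# Swap capacity and headroom
grep -E '^Swap(Total|Free):' /proc/meminfo
```

If pswpin/pswpout grow between two samples, the system is swapping now; a static SwapFree value alone cannot tell you that.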

6. Network Stack & NIC Observability

Tools for examining network interfaces, Ethernet drivers, ports, and NIC statistics.

Network Tools

Tool          Purpose
tcpdump       Packet-level visibility
ss / netstat  Connections and sockets
iptraf        Per-interface traffic charts
ethtool       NIC driver settings and link state
nicstat       Interface utilization
lldptool      LLDP neighbor discovery
snmpget       SNMP-based network metrics

Common Use Cases

  • Packet Issues: Drops, retransmits, MTU mismatches
  • NIC Tuning: Offload settings (TSO, GRO, etc.)
  • Link Problems: Speed/duplex mismatch troubleshooting
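iptraf and nicstat poll /proc/net/dev, and parsing it directly is handy on stripped-down hosts. A sketch printing byte and drop counters per interface:

```shell
# Per-interface counters from /proc/net/dev.
# sub() guarantees a space after the interface name's colon, since the
# kernel may emit "eth0:123456" with no separator when counters are wide.
awk 'NR > 2 {sub(/:/, " ")
             printf "%-8s rx_bytes=%s rx_drop=%s tx_bytes=%s tx_drop=%s\n",
                    $1, $2, $5, $10, $13}' /proc/net/dev
```

Rising rx_drop/tx_drop between samples is the fast first check for the packet issues listed above.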

7. Hardware Observability

Insights into how the hardware itself behaves — CPU frequency, power states, performance counters, NUMA locality, memory pressure, cache behavior, and bus throughput.

CPU Tools

Tool        Purpose
mpstat      Reports CPU usage per core, showing utilization, steal time, IRQ time, and more
top / htop  Real-time process monitoring with CPU, load average, and per-thread breakdowns
ps          Snapshot of process states, CPU usage, memory usage, and scheduling information
pidstat     Per-thread and per-process CPU utilization, context switching, and scheduling metrics
perf        Hardware performance counter profiler (cycles, cache misses, branch mispredictions)
turbostat   Intel-specific tool showing CPU frequencies, C-states, P-states, and turbo boost behavior
rdmsr       Reads CPU model-specific registers (MSRs) for extremely low-level introspection

Memory Tools

Tool                  Function
vmstat                Paging, swapping, memory pressure, interrupts
free                  Total, used, cached, available memory
slabtop               Kernel slab allocator statistics
numastat              NUMA locality, node memory distribution
perf (memory events)  Hardware counters for RAM, cache, memory bus

When to Use Hardware Tools

1. NUMA Analysis: Debug NUMA locality and cross-node memory access
2. CPU Throttling: Investigate frequency scaling or thermal throttling
3. Memory Pressure: Analyze leaking workloads or kernel slab issues
4. High-Performance Tuning: Optimize compute-heavy or latency-sensitive workloads
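As a first pass on CPU-throttling questions, core count and current frequencies are available without turbostat, from /proc and sysfs. Note the cpufreq nodes may be absent inside VMs or containers:

```shell
# Logical CPU count
grep -c '^processor' /proc/cpuinfo
# Current frequency per core in kHz, where the cpufreq driver is loaded
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq 2>/dev/null \
    || echo "cpufreq not exposed on this system"
```

Frequencies pinned well below the nominal clock under load suggest thermal or power throttling, which turbostat can then confirm with C-state and P-state detail.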

8. System-Wide Observability Tools

These tools cover multiple layers at once, providing holistic system visibility.

Tools

  • sar: Historic performance logs across CPU, memory, I/O, network
  • dstat: Live multi-metric system aggregation
  • sysdig: Holistic tracing across syscalls, network, containers
  • /proc filesystem: Raw kernel data for metrics, states, drivers, interfaces

Use system-wide tools for incident response, baselining, long-term trending, and anomaly detection.
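When none of these tools are installed (minimal container images, rescue shells), /proc alone yields a serviceable one-shot snapshot. A sketch:

```shell
# One-shot system snapshot from /proc, no tools required
read l1 l5 l15 runq _ < /proc/loadavg
echo "load averages: $l1 $l5 $l15 (runnable/total threads: $runq)"
awk '/^MemAvailable:/ {print "mem available:", $2, "kB"}' /proc/meminfo
awk '/^cpu / {print "cpu jiffies (user nice system idle):", $2, $3, $4, $5}' /proc/stat
```

Running it twice a few seconds apart and diffing the jiffies is the same trick dstat applies every interval.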

Practical Use Cases

Root Cause Analysis (RCA)

1. Identify the Layer: Determine whether the slowdown is CPU, memory, network, or storage related
2. Trace the Issue: Follow a misbehaving process through syscalls into the kernel
3. Compare Baselines: Compare observed performance against baseline metrics
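The layer-identification step can be partly automated: a 1-minute load average persistently above the core count points at CPU or run-queue saturation, while a slow system with low load points elsewhere. A rough sketch of that heuristic:

```shell
# Heuristic for RCA step 1: is the bottleneck CPU-side?
cores=$(grep -c '^processor' /proc/cpuinfo)
load1=$(awk '{print $1}' /proc/loadavg)
# awk does the floating-point comparison the shell cannot
if awk -v l="$load1" -v c="$cores" 'BEGIN {exit !(l > c)}'; then
    echo "load $load1 > $cores cores: suspect CPU / run-queue saturation"
else
    echo "load $load1 <= $cores cores: check memory, storage, or network next"
fi
```

Caveat: Linux load average also counts tasks in uninterruptible sleep (D state), so a high load can equally mean I/O stalls; confirm with vmstat and iostat before concluding.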

Performance Tuning

  • Scheduler tracing for latency-sensitive workloads
  • NIC tuning via ethtool for high-throughput environments
  • Storage insight for LVM/RAID/SSD/HDD tuning

DevSecOps / Security

A secure system is one that is understood, not just hardened.
  • eBPF tools for detecting suspicious syscalls
  • lsof for auditing unexpected open sockets and files
  • sysdig rules for behavioral anomaly detection
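When auditing a host where ss, netstat, and lsof may themselves be missing (or untrusted after a compromise), listening TCP sockets can be decoded from /proc/net/tcp: state 0A is LISTEN and addresses are hex-encoded. A bash sketch:

```shell
# List listening TCP ports by decoding /proc/net/tcp directly
# (column 2 = local address:port in hex, column 4 = state, 0A = LISTEN).
# Uses bash's base-16 arithmetic to convert the hex port.
while read -r _ laddr _ state _; do
    [ "$state" = "0A" ] || continue
    port_hex=${laddr##*:}
    echo "listening on TCP port $(( 16#$port_hex )) (hex $port_hex)"
done < /proc/net/tcp
```

IPv6 listeners live in /proc/net/tcp6 with the same layout.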

Observability in DevSecOps

Observability is not just operational — it is security-critical:

  • Intrusion Detection: Detect unusual syscall patterns (possible intrusion)
  • Crypto Mining Detection: Identify crypto miners via CPU and scheduler patterns
  • Exfiltration Detection: Spot data exfiltration via abnormal NIC or TCP behavior
  • Hardening Validation: Validate that hardening changes improve rather than degrade performance

Quick Command Reference

# Application layer
strace -p <PID>              # Trace system calls
lsof -p <PID>                # List open files
ss -tunap                    # Socket statistics

# Kernel subsystems
vmstat 1                     # Memory and I/O stats
perf top                     # Real-time kernel profiling
slabtop                      # Kernel memory allocator

# Storage layer
iostat -x 1                  # Disk I/O statistics
iotop -o                     # I/O per process

# Network layer
tcpdump -i eth0              # Packet capture
ethtool eth0                 # NIC settings

# Hardware layer
mpstat -P ALL 1              # Per-CPU statistics
numastat                     # NUMA memory distribution

# System-wide
sar -A                       # Historical system metrics
dstat -tclmdrn               # Live multi-metric view
Start with high-level tools like top, vmstat, and iostat to identify the problem layer, then drill down with specialized tools.

References

  • Brendan Gregg — Linux Performance Tools
  • Kernel documentation — https://www.kernel.org/doc/
  • Sysdig, LTTng, SystemTap official documentation
  • eBPF / bpftrace reference guides
