IOTA nodes expose comprehensive metrics for monitoring performance, health, and network participation. This guide covers setting up monitoring infrastructure and understanding key metrics.
## Metrics Endpoint

Nodes expose Prometheus-compatible metrics on the configured metrics address:

```yaml
# In node.yaml
metrics-address: "0.0.0.0:9184"
```

Access metrics at: `http://your-node:9184/metrics`
### Quick Health Check

Verify your node is exposing metrics:

```shell
curl http://localhost:9184/metrics
```

You should see Prometheus-formatted metrics output.
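If you want to script checks on top of this endpoint, the Prometheus text format is simple to parse with standard tools. A minimal sketch — the sample output and label values below are illustrative, not taken from a real node:

```shell
# Extract the value of the `uptime` metric from Prometheus text-format
# output. Comment lines start with '#'; metric lines end with the value.
sample='# HELP uptime Node uptime in seconds
# TYPE uptime gauge
uptime{process="fullnode",version="1.0.0"} 86400'

# In practice you would pipe `curl -s http://localhost:9184/metrics` here.
echo "$sample" | awk '/^uptime[{ ]/ { print $NF }'   # prints 86400
```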
## Prometheus Setup

### Install Prometheus

```shell
# Using Docker
docker pull prom/prometheus:latest

# Or download a release binary from prometheus.io
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
```
### Configure Prometheus

Create `prometheus.yml`:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'iota-node'
    static_configs:
      - targets: ['localhost:9184']
        labels:
          instance: 'node-1'
          chain: 'mainnet'
```
### Start Prometheus

```shell
docker run -d \
  --name prometheus \
  -p 9090:9090 \
  -v $PWD/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus:latest
```
### Verify data collection

Access the Prometheus UI at `http://localhost:9090` and query `up{job="iota-node"}`; a value of `1` means the node's metrics endpoint is being scraped successfully.
## Metrics Push Service

For centralized monitoring, configure metrics push:

```yaml
metrics:
  # Push interval in seconds (default: 60)
  push-interval-seconds: 60
  # Remote push endpoint
  push-url: "https://metrics-gateway.example.com/push"
```

The node periodically pushes metrics to the configured endpoint using authenticated requests.
## Key Metrics Reference

### Node Uptime and Version

```
# Node uptime in seconds
uptime

# Labels include:
# - process: "validator" or "fullnode"
# - version: binary version
# - chain_identifier: network identifier
# - os_version: operating system
# - is_docker: whether running in Docker
```
### Task and Future Monitoring

```
# Number of running tasks
monitored_tasks{callsite="..."}

# Number of pending futures
monitored_futures{callsite="..."}

# Active duration of futures in nanoseconds
monitored_future_active_duration_ns{name="..."}
```
### Channel Metrics

```
# Items in flight in channels
monitored_channel_inflight{name="..."}

# Items sent through channels
monitored_channel_sent{name="..."}

# Items received from channels
monitored_channel_received{name="..."}
```
### Scope Monitoring

```
# Number of scope entrances
monitored_scope_entrance{name="..."}

# Total scope iterations
monitored_scope_iterations{name="..."}

# Scope duration in nanoseconds
monitored_scope_duration_ns{name="..."}
```
### Thread Stall Detection

```
# Thread stall duration histogram
thread_stall_duration_sec_bucket
thread_stall_duration_sec_sum
thread_stall_duration_sec_count
```
### System Invariant Violations

```
# Count of system invariant violations
system_invariant_violations{name="..."}
```

Any non-zero value of `system_invariant_violations` indicates a serious issue that requires immediate investigation.
### gRPC API Metrics

```
# In-flight gRPC requests
inflight_grpc{path="..."}

# Total gRPC requests
grpc_requests{path="...", status="..."}

# gRPC request latency histogram
grpc_request_latency_bucket{path="..."}
grpc_request_latency_sum{path="..."}
grpc_request_latency_count{path="..."}
```
### zkLogin Metrics

```
# JWK requests by provider
jwk_requests{provider="..."}

# JWK request errors
jwk_request_errors{provider="..."}

# Total JWKs
total_jwks{provider="..."}

# Invalid JWKs
invalid_jwks{provider="..."}

# Unique JWKs
unique_jwks{provider="..."}
```
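A per-provider failure ratio can be derived from these counters, which is often more actionable than the raw error count — a sketch (pair it with whatever threshold suits your alerting):

```promql
# Fraction of JWK fetches that failed, per provider, over 5 minutes
rate(jwk_request_errors[5m]) / rate(jwk_requests[5m])
```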
### Hardware Metrics

The `iota-metrics` crate includes hardware monitoring capabilities:

- CPU usage, memory, and disk I/O
- Network interface statistics
- System load averages

These metrics are collected automatically when enabled in the node configuration.
## Grafana Dashboard

### Install Grafana

```shell
docker run -d \
  --name grafana \
  -p 3000:3000 \
  grafana/grafana:latest
```
### Add Prometheus data source

1. Access Grafana at `http://localhost:3000` (default credentials: `admin` / `admin`).
2. Go to **Configuration > Data Sources**.
3. Add Prometheus with URL `http://prometheus:9090`.

Note that the `prometheus` hostname only resolves if both containers share a Docker network; otherwise use the host's address.
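If you manage Grafana declaratively, the same data source can be added with a provisioning file instead of the UI. A sketch using Grafana's standard provisioning format (the file path is an example):

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```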
### Create dashboards

Create custom dashboards tracking:

- Node uptime and version
- gRPC request rates and latency
- Channel queue depths
- Thread stall events
- System resource utilization
## Example Prometheus Queries

### Request Rate (QPS)

```promql
# gRPC requests per second, by path
rate(grpc_requests[5m])
```
### Error Rate

```promql
# gRPC error rate by status code
rate(grpc_requests{status!="Ok"}[5m])
```
### Latency Percentiles

```promql
# 95th percentile gRPC latency
histogram_quantile(0.95,
  rate(grpc_request_latency_bucket[5m])
)

# 99th percentile
histogram_quantile(0.99,
  rate(grpc_request_latency_bucket[5m])
)
```
### Channel Backlog

```promql
# Items waiting in channels
monitored_channel_inflight

# Channel throughput
rate(monitored_channel_sent[5m])
```
### Thread Health

```promql
# Thread stalls per minute
rate(thread_stall_duration_sec_count[1m]) * 60

# Average stall duration (seconds)
rate(thread_stall_duration_sec_sum[5m]) /
rate(thread_stall_duration_sec_count[5m])
```
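To sanity-check the average-stall formula, you can compute the same Δsum / Δcount ratio by hand from two raw counter samples (the values below are made up):

```shell
# Two scrapes of the stall-duration counters (illustrative values):
#   t0: sum=1.2s over 4 stalls;  t1: sum=2.0s over 8 stalls
# Average stall duration between scrapes = (2.0 - 1.2) / (8 - 4) = 0.2s
awk 'BEGIN { printf "%.2f\n", (2.0 - 1.2) / (8 - 4) }'   # prints 0.20
```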
## Alerting Rules

Create Prometheus alerting rules for critical conditions:

```yaml
groups:
  - name: iota_node_alerts
    interval: 30s
    rules:
      - alert: NodeDown
        expr: up{job="iota-node"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "IOTA node is down"

      - alert: HighGrpcErrorRate
        expr: |
          rate(grpc_requests{status!="Ok"}[5m]) /
          rate(grpc_requests[5m]) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High gRPC error rate detected"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            rate(grpc_request_latency_bucket[5m])
          ) > 5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High request latency (p95 > 5s)"

      - alert: ThreadStalls
        expr: rate(thread_stall_duration_sec_count[5m]) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Frequent thread stalls detected"

      - alert: InvariantViolation
        expr: increase(system_invariant_violations[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "System invariant violation detected"
```
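Save the rules to a file and reference it from `prometheus.yml` so they are evaluated each cycle (the file name here is an example):

```yaml
# In prometheus.yml
rule_files:
  - "iota_node_alerts.yml"
```

Prometheus loads rule files at startup; send it a SIGHUP, or POST to `/-/reload` if `--web.enable-lifecycle` is set, to pick up changes.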
## Admin Interface

The admin interface provides runtime control and diagnostics:

```yaml
# Default: 127.0.0.1:1337 (localhost only)
admin-interface-address: "127.0.0.1:1337"
```

The admin interface should only be accessible from localhost or over secure channels. Never expose it publicly.
### Logging Configuration

The admin interface allows dynamic tracing and logging configuration:

```shell
# Access admin interface
curl http://localhost:1337/admin/logging
```
## Network Metrics

Monitor P2P network health:

- Network message metrics
- Peer connection counts
- State sync progress
- Checkpoint download rates

These metrics are exposed through the P2P subsystem’s metrics integration.
## Best Practices

- **Set up alerting**: Don’t rely on manual monitoring; configure alerts for critical conditions.
- **Monitor trends**: Track metrics over time to identify degradation before it becomes critical.
- **Correlate metrics**: Look at multiple metrics together to diagnose issues.
- **Regular review**: Periodically review dashboards and adjust thresholds.
- **Retention**: Configure appropriate metrics retention based on your needs.
- **Security**: Protect metrics endpoints with authentication in production.

## Next Steps