IOTA nodes expose comprehensive metrics for monitoring performance, health, and network participation. This guide covers setting up monitoring infrastructure and understanding key metrics.
## Metrics Endpoint

Nodes expose Prometheus-compatible metrics on the configured metrics address:

```yaml
# In node.yaml
metrics-address: "0.0.0.0:9184"
```

Access metrics at: `http://your-node:9184/metrics`
### Quick Health Check

Verify your node is exposing metrics:

```shell
curl http://localhost:9184/metrics
```

You should see Prometheus-formatted metrics output.
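If you want to script checks on top of this endpoint, the Prometheus text format is simple to parse with standard tools. A minimal sketch — the sample output and label values below are illustrative, not taken from a real node:

```shell
# Extract the value of the `uptime` metric from Prometheus text-format
# output. Comment lines start with '#'; metric lines end with the value.
sample='# HELP uptime Node uptime in seconds
# TYPE uptime gauge
uptime{process="fullnode",version="1.0.0"} 86400'

# In practice you would pipe `curl -s http://localhost:9184/metrics` here.
echo "$sample" | awk '/^uptime[{ ]/ { print $NF }'   # prints 86400
```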
## Prometheus Setup

### Install Prometheus

```shell
# Using Docker
docker pull prom/prometheus:latest

# Or download a release binary from prometheus.io
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
```
### Configure Prometheus

Create `prometheus.yml`:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'iota-node'
    static_configs:
      - targets: ['localhost:9184']
        labels:
          instance: 'node-1'
          chain: 'mainnet'
```
### Start Prometheus

```shell
docker run -d \
  --name prometheus \
  -p 9090:9090 \
  -v $PWD/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus:latest
```
### Verify data collection

Access the Prometheus UI at `http://localhost:9090` and query `up{job="iota-node"}`; a value of `1` means the node's metrics endpoint is being scraped successfully.
## Metrics Push Service

For centralized monitoring, configure metrics push:

```yaml
metrics:
  # Push interval in seconds (default: 60)
  push-interval-seconds: 60
  # Remote push endpoint
  push-url: "https://metrics-gateway.example.com/push"
```

The node periodically pushes metrics to the configured endpoint using authenticated requests.
## Key Metrics Reference

### Node Uptime and Version

```
# Node uptime in seconds
uptime

# Labels include:
# - process: "validator" or "fullnode"
# - version: binary version
# - chain_identifier: network identifier
# - os_version: operating system
# - is_docker: whether running in Docker
```
### Task and Future Monitoring

```
# Number of running tasks
monitored_tasks{callsite="..."}

# Number of pending futures
monitored_futures{callsite="..."}

# Active duration of futures in nanoseconds
monitored_future_active_duration_ns{name="..."}
```
### Channel Metrics

```
# Items in flight in channels
monitored_channel_inflight{name="..."}

# Items sent through channels
monitored_channel_sent{name="..."}

# Items received from channels
monitored_channel_received{name="..."}
```
### Scope Monitoring

```
# Number of scope entrances
monitored_scope_entrance{name="..."}

# Total scope iterations
monitored_scope_iterations{name="..."}

# Scope duration in nanoseconds
monitored_scope_duration_ns{name="..."}
```
### Thread Stall Detection

```
# Thread stall duration histogram
thread_stall_duration_sec_bucket
thread_stall_duration_sec_sum
thread_stall_duration_sec_count
```
### System Invariant Violations

```
# Count of system invariant violations
system_invariant_violations{name="..."}
```

Any non-zero value of `system_invariant_violations` indicates a serious issue that requires immediate investigation.
### gRPC API Metrics

```
# In-flight gRPC requests
inflight_grpc{path="..."}

# Total gRPC requests
grpc_requests{path="...", status="..."}

# gRPC request latency histogram
grpc_request_latency_bucket{path="..."}
grpc_request_latency_sum{path="..."}
grpc_request_latency_count{path="..."}
```
### zkLogin Metrics

```
# JWK requests by provider
jwk_requests{provider="..."}

# JWK request errors
jwk_request_errors{provider="..."}

# Total JWKs
total_jwks{provider="..."}

# Invalid JWKs
invalid_jwks{provider="..."}

# Unique JWKs
unique_jwks{provider="..."}
```
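A per-provider failure ratio can be derived from these counters, which is often more actionable than the raw error count — a sketch (pair it with whatever threshold suits your alerting):

```promql
# Fraction of JWK fetches that failed, per provider, over 5 minutes
rate(jwk_request_errors[5m]) / rate(jwk_requests[5m])
```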
### Hardware Metrics

The `iota-metrics` crate includes hardware monitoring capabilities:

- CPU usage, memory, and disk I/O
- Network interface statistics
- System load averages

These metrics are collected automatically when enabled in the node configuration.
## Grafana Dashboard

### Install Grafana

```shell
docker run -d \
  --name grafana \
  -p 3000:3000 \
  grafana/grafana:latest
```
### Add Prometheus data source

1. Access Grafana at `http://localhost:3000` (default credentials: `admin` / `admin`).
2. Go to **Configuration > Data Sources**.
3. Add Prometheus with URL `http://prometheus:9090`.

Note that the `prometheus` hostname only resolves if both containers share a Docker network; otherwise use the host's address.
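If you manage Grafana declaratively, the same data source can be added with a provisioning file instead of the UI. A sketch using Grafana's standard provisioning format (the file path is an example):

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```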
### Create dashboards

Create custom dashboards tracking:

- Node uptime and version
- gRPC request rates and latency
- Channel queue depths
- Thread stall events
- System resource utilization
## Example Prometheus Queries

### Request Rate (QPS)

```promql
# gRPC requests per second, by path
rate(grpc_requests[5m])
```
### Error Rate

```promql
# gRPC error rate by status code
rate(grpc_requests{status!="Ok"}[5m])
```
### Latency Percentiles

```promql
# 95th percentile gRPC latency
histogram_quantile(0.95,
  rate(grpc_request_latency_bucket[5m])
)

# 99th percentile
histogram_quantile(0.99,
  rate(grpc_request_latency_bucket[5m])
)
```
### Channel Backlog

```promql
# Items waiting in channels
monitored_channel_inflight

# Channel throughput
rate(monitored_channel_sent[5m])
```
### Thread Health

```promql
# Thread stalls per minute
rate(thread_stall_duration_sec_count[1m]) * 60

# Average stall duration (seconds)
rate(thread_stall_duration_sec_sum[5m]) /
rate(thread_stall_duration_sec_count[5m])
```
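To sanity-check the average-stall formula, you can compute the same Δsum / Δcount ratio by hand from two raw counter samples (the values below are made up):

```shell
# Two scrapes of the stall-duration counters (illustrative values):
#   t0: sum=1.2s over 4 stalls;  t1: sum=2.0s over 8 stalls
# Average stall duration between scrapes = (2.0 - 1.2) / (8 - 4) = 0.2s
awk 'BEGIN { printf "%.2f\n", (2.0 - 1.2) / (8 - 4) }'   # prints 0.20
```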
## Alerting Rules

Create Prometheus alerting rules for critical conditions:

```yaml
groups:
  - name: iota_node_alerts
    interval: 30s
    rules:
      - alert: NodeDown
        expr: up{job="iota-node"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "IOTA node is down"

      - alert: HighGrpcErrorRate
        expr: |
          rate(grpc_requests{status!="Ok"}[5m]) /
          rate(grpc_requests[5m]) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High gRPC error rate detected"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            rate(grpc_request_latency_bucket[5m])
          ) > 5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High request latency (p95 > 5s)"

      - alert: ThreadStalls
        expr: rate(thread_stall_duration_sec_count[5m]) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Frequent thread stalls detected"

      - alert: InvariantViolation
        expr: increase(system_invariant_violations[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "System invariant violation detected"
```
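Save the rules to a file and reference it from `prometheus.yml` so they are evaluated each cycle (the file name here is an example):

```yaml
# In prometheus.yml
rule_files:
  - "iota_node_alerts.yml"
```

Prometheus loads rule files at startup; send it a SIGHUP, or POST to `/-/reload` if `--web.enable-lifecycle` is set, to pick up changes.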
## Admin Interface

The admin interface provides runtime control and diagnostics:

```yaml
# Default: 127.0.0.1:1337 (localhost only)
admin-interface-address: "127.0.0.1:1337"
```

The admin interface should only be accessible from localhost or over secure channels. Never expose it publicly.
### Logging Configuration

The admin interface allows dynamic tracing and logging configuration:

```shell
# Access admin interface
curl http://localhost:1337/admin/logging
```
## Network Metrics

Monitor P2P network health:

- Network message metrics
- Peer connection counts
- State sync progress
- Checkpoint download rates

These metrics are exposed through the P2P subsystem’s metrics integration.
## Best Practices

- **Set up alerting**: Don’t rely on manual monitoring; configure alerts for critical conditions.
- **Monitor trends**: Track metrics over time to identify degradation before it becomes critical.
- **Correlate metrics**: Look at multiple metrics together to diagnose issues.
- **Regular review**: Periodically review dashboards and adjust thresholds.
- **Retention**: Configure appropriate metrics retention based on your needs.
- **Security**: Protect metrics endpoints with authentication in production.

## Next Steps