Proper monitoring is essential for running reliable Sui nodes. This guide covers metrics, logging, and alerting.

Metrics Collection

Sui nodes expose metrics in Prometheus text format on an HTTP endpoint.

Metrics Endpoint

By default, metrics are available at:
http://localhost:9184/metrics
Configuration:
metrics-address: 0.0.0.0:9184

Viewing Metrics

View all metrics:
curl -s http://localhost:9184/metrics
Search for specific metrics:
curl -s http://localhost:9184/metrics | grep checkpoint

Key Metrics

Synchronization Metrics

Checkpoint Progress
# Highest synced checkpoint
curl -s http://localhost:9184/metrics | grep highest_synced_checkpoint

# Highest verified checkpoint
curl -s http://localhost:9184/metrics | grep highest_verified_checkpoint

# Last executed checkpoint
curl -s http://localhost:9184/metrics | grep last_executed_checkpoint
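Each metric appears in the output as a plain `name value` line, preceded by `# HELP` and `# TYPE` comment lines, so a grep also returns the comments. An awk pattern anchored on the metric name skips them and prints only the value; a sketch against sample output (the numbers are made up):

```shell
# Sample /metrics output in Prometheus text format
cat > /tmp/sample_metrics.txt <<'EOF'
# HELP highest_verified_checkpoint Highest verified checkpoint
# TYPE highest_verified_checkpoint gauge
highest_verified_checkpoint 1234567
# HELP last_executed_checkpoint Last executed checkpoint
# TYPE last_executed_checkpoint gauge
last_executed_checkpoint 1234500
EOF

# Anchoring with ^ skips the "# HELP" / "# TYPE" comment lines
awk '/^highest_verified_checkpoint/ {print $2}' /tmp/sample_metrics.txt
# -> 1234567
```

Against a live node, replace the sample file with `curl -s http://localhost:9184/metrics`.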
Sync Lag
# How far this node is behind the network (PromQL)
sum(highest_verified_checkpoint) - sum(last_executed_checkpoint)

Consensus Metrics (Validators)

Round and Commit Metrics
# Current consensus round
curl -s http://localhost:9184/metrics | grep current_round

# Committed subdags
curl -s http://localhost:9184/metrics | grep committed_subdags

# Leader timeout rate
curl -s http://localhost:9184/metrics | grep leader_timeout
Transaction Processing
# Transactions per second
curl -s http://localhost:9184/metrics | grep total_transaction_effects

# Certificate processing rate
curl -s http://localhost:9184/metrics | grep total_certificates

Performance Metrics

Database Metrics
# RocksDB metrics
curl -s http://localhost:9184/metrics | grep rocksdb

# Cache hit rates
curl -s http://localhost:9184/metrics | grep cache_hit
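The RocksDB gauges are typically emitted per column family, so a quick on-disk total means summing the series in the shell. A sketch against sample output; the `cf` label names and byte counts here are made up for illustration:

```shell
# Sample per-column-family SST size gauges
cat > /tmp/rocksdb_metrics.txt <<'EOF'
rocksdb_total_sst_files_size{cf="objects"} 1073741824
rocksdb_total_sst_files_size{cf="transactions"} 536870912
EOF

# Sum every series of the metric and print the total in GiB
awk '/^rocksdb_total_sst_files_size/ {sum += $2}
     END {printf "%.1f GiB\n", sum / 1024 / 1024 / 1024}' /tmp/rocksdb_metrics.txt
# -> 1.5 GiB
```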
Execution Metrics
# Transaction execution latency
curl -s http://localhost:9184/metrics | grep execution_latency

# Checkpoint execution time
curl -s http://localhost:9184/metrics | grep checkpoint_exec

Network Metrics

P2P Metrics
# Connected peers
curl -s http://localhost:9184/metrics | grep connected_peers

# Inbound/outbound connections
curl -s http://localhost:9184/metrics | grep network_peer

# State sync metrics
curl -s http://localhost:9184/metrics | grep state_sync
RPC Metrics
# RPC request rate
curl -s http://localhost:9184/metrics | grep rpc_requests_total

# RPC latency
curl -s http://localhost:9184/metrics | grep rpc_request_latency

Logging

Log Configuration

Logs are controlled via environment variables:
# Log levels
export RUST_LOG="info,sui_core=debug,consensus=debug,jsonrpsee=error"

# Enable JSON logging
export RUST_LOG_JSON=1

# Enable backtraces
export RUST_BACKTRACE=1
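Exports in a shell only last for that session; for a systemd-managed node these variables are usually persisted in a drop-in unit. A sketch, where the unit name (`sui-node`) and drop-in path are assumptions to match to your deployment:

```shell
# Persist logging env vars for a systemd-managed node via a drop-in unit.
# DROPIN_DIR assumes a unit named sui-node; adjust to your setup.
DROPIN_DIR="${DROPIN_DIR:-/etc/systemd/system/sui-node.service.d}"
mkdir -p "$DROPIN_DIR"
cat > "$DROPIN_DIR/logging.conf" <<'EOF'
[Service]
Environment=RUST_LOG=info,sui_core=debug,consensus=debug,jsonrpsee=error
Environment=RUST_LOG_JSON=1
Environment=RUST_BACKTRACE=1
EOF

# Apply with:
#   sudo systemctl daemon-reload && sudo systemctl restart sui-node
```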

Log Levels

Available log levels (from most to least verbose):
  • trace: Very detailed debugging information
  • debug: Detailed debugging information
  • info: General informational messages
  • warn: Warning messages
  • error: Error messages
Module-specific logging:
# Different levels for different modules
RUST_LOG="warn,sui_core=info,consensus=debug,narwhal=trace"

Viewing Logs

Systemd Deployment
# Follow logs
journalctl -u sui-node -f

# View last 100 lines
journalctl -u sui-node -n 100

# View logs since timestamp
journalctl -u sui-node --since "2024-01-01 00:00:00"

# Search logs
journalctl -u sui-node -g "error|panic"

# Follow with specific pattern
journalctl -u sui-node -f | grep checkpoint
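For quick triage without Prometheus, journal output pipes into a one-line awk summary of log levels. Sketched here against sample lines rather than a live journal:

```shell
# Sample lines standing in for `journalctl -u sui-node` output
cat > /tmp/sui-node.log <<'EOF'
2024-01-01T00:00:01Z  INFO sui_core: checkpoint executed
2024-01-01T00:00:02Z  WARN consensus: leader timeout
2024-01-01T00:00:03Z ERROR sui_core: failed to sync checkpoint
2024-01-01T00:00:04Z  INFO sui_core: checkpoint executed
EOF

# Count occurrences per log level
awk '{for (i = 1; i <= NF; i++) if ($i ~ /^(INFO|WARN|ERROR)$/) c[$i]++}
     END {for (l in c) print l, c[l]}' /tmp/sui-node.log | sort
# -> ERROR 1
#    INFO 2
#    WARN 1
```

On a live host, substitute `journalctl -u sui-node -n 10000` for the sample file.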
Docker Deployment
# Follow logs
docker compose logs -f validator

# Last 100 lines
docker logs --tail 100 validator

# Logs since 10 minutes ago
docker logs --since 10m validator

# Follow with timestamps
docker compose logs -f --timestamps validator

Dynamic Log Configuration

Change log levels at runtime using the admin interface:
# View current log configuration
curl localhost:1337/logging

# Change to info level
curl localhost:1337/logging -d "info"

# Enable debug for specific module
curl localhost:1337/logging -d "info,sui_core=debug,consensus=trace"

Prometheus Setup

Install Prometheus

# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.47.0/prometheus-2.47.0.linux-amd64.tar.gz
tar xvfz prometheus-*.tar.gz
cd prometheus-*

Configure Prometheus

Create prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'sui-validator'
    static_configs:
      - targets: ['localhost:9184']
        labels:
          node_type: 'validator'
          network: 'mainnet'
For multiple nodes:
scrape_configs:
  - job_name: 'sui-nodes'
    static_configs:
      - targets: 
        - 'validator1.example.com:9184'
        - 'validator2.example.com:9184'
        labels:
          node_type: 'validator'
      - targets:
        - 'fullnode1.example.com:9184'
        - 'fullnode2.example.com:9184'
        labels:
          node_type: 'fullnode'
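For larger fleets, Prometheus `file_sd_configs` keeps the target list in a separate file that Prometheus re-reads automatically on change. A sketch that generates such a file from a plain host list; the hostnames, output path, and label values are illustrative:

```shell
# Turn a plain host list into a Prometheus file_sd target file
cat > /tmp/hosts.txt <<'EOF'
validator1.example.com
validator2.example.com
EOF

awk 'BEGIN { print "[" }
     { printf "%s  {\"targets\": [\"%s:9184\"], \"labels\": {\"node_type\": \"validator\"}}", (NR > 1 ? ",\n" : ""), $0 }
     END { print "\n]" }' /tmp/hosts.txt > /tmp/sui-targets.json

cat /tmp/sui-targets.json
```

Reference it from the job in prometheus.yml with `file_sd_configs: [{files: ['/tmp/sui-targets.json']}]` instead of `static_configs`.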

Start Prometheus

./prometheus --config.file=prometheus.yml
Access Prometheus UI at http://localhost:9090.

Grafana Setup

Install Grafana

# Add Grafana repository
sudo apt-get install -y software-properties-common
sudo add-apt-repository "deb https://packages.grafana.com/oss/deb stable main"
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -

# Install Grafana
sudo apt-get update
sudo apt-get install grafana

# Start Grafana
sudo systemctl start grafana-server
sudo systemctl enable grafana-server
Access Grafana at http://localhost:3000 (default login: admin/admin).

Add Prometheus Data Source

  1. Navigate to Configuration > Data Sources
  2. Click “Add data source”
  3. Select “Prometheus”
  4. Set URL to http://localhost:9090
  5. Click “Save & Test”

Import Dashboards

Sui provides reference dashboards in the repository:
# Download dashboard
wget https://raw.githubusercontent.com/MystenLabs/sui/main/docker/grafana-local/dashboards/validator-dashboard.json
Import in Grafana:
  1. Click ”+” > “Import”
  2. Upload JSON file or paste content
  3. Select Prometheus data source
  4. Click “Import”

Using Docker Compose for Monitoring Stack

Create docker-compose.monitoring.yaml:
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana-datasources.yaml:/etc/grafana/provisioning/datasources/datasources.yaml:ro
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false

volumes:
  prometheus-data:
  grafana-data:
Start the monitoring stack:
docker compose -f docker-compose.monitoring.yaml up -d
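When scripting the bring-up, a small readiness loop avoids racing the containers: Prometheus serves a `/-/healthy` endpoint and Grafana serves `/api/health`. A sketch:

```shell
# Poll a URL until it responds successfully, or give up after N tries
wait_for() {
  url=$1
  tries=${2:-30}
  i=0
  while [ "$i" -lt "$tries" ]; do
    curl -sf "$url" > /dev/null && return 0
    i=$((i + 1))
    sleep 1
  done
  echo "timeout waiting for $url" >&2
  return 1
}

# Usage, after `docker compose -f docker-compose.monitoring.yaml up -d`:
#   wait_for http://localhost:9090/-/healthy    # Prometheus readiness endpoint
#   wait_for http://localhost:3000/api/health   # Grafana health endpoint
```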

Public Dashboards

Community-maintained public dashboards track network-wide validator metrics, including validator stake and status.

Alerting

Prometheus Alert Rules

Create alerts.yml:
groups:
  - name: sui_node_alerts
    interval: 30s
    rules:
      # Node is down
      - alert: NodeDown
        expr: up{job="sui-nodes"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Sui node {{ $labels.instance }} is down"
          description: "Node has been down for more than 2 minutes"

      # Sync lag too high
      - alert: HighSyncLag
        expr: (highest_verified_checkpoint - last_executed_checkpoint) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High sync lag on {{ $labels.instance }}"
          description: "Node is {{ $value }} checkpoints behind"

      # Low peer count
      - alert: LowPeerCount
        expr: connected_peers < 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low peer count on {{ $labels.instance }}"
          description: "Only {{ $value }} peers connected"

      # High error rate
      - alert: HighErrorRate
        expr: rate(sui_errors_total[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.instance }}"

Alertmanager Configuration

Create alertmanager.yml:
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'instance']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'

receivers:
  - name: 'default'
    email_configs:
      - to: '[email protected]'
        from: '[email protected]'
        smarthost: 'smtp.gmail.com:587'
        auth_username: '[email protected]'
        auth_password: 'password'  # placeholder; load real credentials from a secrets file, not plain text

Essential Queries

Checkpoint Sync Status

# Checkpoints behind the verified tip (sync lag)
highest_verified_checkpoint - last_executed_checkpoint

# Checkpoint processing rate
rate(last_executed_checkpoint[5m])

Transaction Metrics

# Transactions per second
rate(total_transaction_effects[1m])

# Transaction execution latency (p95)
histogram_quantile(0.95, rate(execution_latency_bucket[5m]))

Resource Usage

# Database size
sum(rocksdb_total_sst_files_size)

# Memory usage (if node_exporter is installed)
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes

Health Checks

Create automated health check scripts:
#!/bin/bash
# health-check.sh

METRICS_URL="http://localhost:9184/metrics"

# Check if node is responding
if ! curl -sf "$METRICS_URL" > /dev/null; then
    echo "ERROR: Metrics endpoint not responding"
    exit 1
fi

# Check sync lag
VERIFIED=$(curl -s "$METRICS_URL" | awk '/^highest_verified_checkpoint/ {print $2}')
EXECUTED=$(curl -s "$METRICS_URL" | awk '/^last_executed_checkpoint/ {print $2}')
LAG=$((VERIFIED - EXECUTED))

if [ "$LAG" -gt 100 ]; then
    echo "WARNING: Sync lag is $LAG checkpoints"
    exit 1
fi

echo "OK: Node is healthy"
exit 0
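Before scheduling the check, the sync-lag math can be sanity-checked offline against a captured metrics dump (the numbers below are made up):

```shell
# Captured /metrics snippet standing in for the live endpoint
cat > /tmp/metrics.dump <<'EOF'
highest_verified_checkpoint 1000150
last_executed_checkpoint 1000100
EOF

VERIFIED=$(awk '/^highest_verified_checkpoint/ {print $2}' /tmp/metrics.dump)
EXECUTED=$(awk '/^last_executed_checkpoint/ {print $2}' /tmp/metrics.dump)
echo $((VERIFIED - EXECUTED))
# -> 50
```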
Schedule with cron:
# Run health check every 5 minutes
*/5 * * * * /opt/sui/scripts/health-check.sh
