Overview

NMIS collects and stores performance metrics as time-series data using RRDtool (Round Robin Database). This provides efficient storage with automatic data aging and aggregation, enabling both real-time monitoring and historical trend analysis.

RRD Storage

Efficient circular buffer storage with fixed size and automatic aggregation

Data Resolution

Multiple resolution levels, from 5-minute samples to daily averages kept for years

Polling Engine

Automated collection cycles with configurable intervals

Graphing

Built-in graphing of collected metrics with customizable views

RRDtool Storage Architecture

Round Robin Database Concept

RRD files use a circular buffer structure that:
  • Fixed Size: Never grows beyond configured size
  • Automatic Aging: Oldest data automatically overwritten
  • Consolidation: High-resolution data aggregated over time
  • Efficiency: O(1) read/write operations regardless of data age
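The circular-buffer behaviour in the list above can be sketched in a few lines of shell. This only illustrates the concept and is not how RRDtool lays out its files; `ring_demo` is a hypothetical helper name, and the slot arithmetic is delegated to awk.

```shell
# ring_demo SIZE VALUE...
# Write each VALUE into a ring of SIZE slots, then print the surviving
# values oldest-first. Illustrative only; RRDtool does this internally.
ring_demo() {
    size=$1; shift
    printf '%s\n' "$@" | awk -v size="$size" '
        { slot[NR % size] = $1; n = NR }        # newest value overwrites the oldest slot
        END {
            count = (n < size) ? n : size       # number of values still stored
            for (i = n - count + 1; i <= n; i++)
                printf "%s%s", slot[i % size], (i < n ? " " : "\n")
        }'
}

ring_demo 3 10 20 30 40 50    # prints: 30 40 50
```

Each write touches exactly one slot, which is why the cost of an update does not depend on how much history the buffer holds.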
# From rrdfunc.pm:29-68
package NMISNG::rrdfunc;

our $VERSION = "9.5.1";

use feature 'state';  # required for the state variable in require_RRDs
use Statistics::Lite;
use POSIX qw();  # for strftime

# Initialize RRDs module
sub require_RRDs {
    my (%args) = @_;
    state $RRD_included = 0;
    
    if( !$RRD_included ) {
        $RRD_included = 1;
        require RRDs;
        RRDs->import;
    }
}

# Track module errors
my $_last_error;
sub getRRDerror {
    return $_last_error;
}

Data Storage Locations

RRD files are organized by node and data type:
# Default RRD storage structure
/var/nmis9/database/
└── nodes/
    ├── router1/
    │   ├── health/
    │   │   ├── cpu.rrd
    │   │   ├── memory.rrd
    │   │   └── health.rrd
    │   ├── interface/
    │   │   ├── GigabitEthernet0_0-ifInOctets.rrd
    │   │   ├── GigabitEthernet0_0-ifOutOctets.rrd
    │   │   └── GigabitEthernet0_1-ifInOctets.rrd
    │   └── pkts/
    │       └── GigabitEthernet0_0-pkts.rrd
    └── switch1/
        └── ...
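A small helper can make the layout concrete by building the path for a given node, data type, interface and metric. This is a sketch under an assumption: the replacement of `/` with `_` in interface names is inferred from the example filenames in the listing, and the real naming is driven by the NMIS models and configuration. `rrd_path` is a hypothetical name.

```shell
# rrd_path NODE TYPE INTERFACE METRIC
# Build the expected file path under the default database directory.
# Assumption: "/" in interface names is stored as "_", as in the listing.
rrd_path() {
    safe=$(printf '%s' "$3" | tr '/' '_')
    printf '/var/nmis9/database/nodes/%s/%s/%s-%s.rrd\n' "$1" "$2" "$safe" "$4"
}

rrd_path router1 interface GigabitEthernet0/0 ifInOctets
# prints: /var/nmis9/database/nodes/router1/interface/GigabitEthernet0_0-ifInOctets.rrd
```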

Data Collection Process

Polling Cycles

NMIS operates on multiple collection cycles:
Update Cycle (default 5 minutes):
  • System information (uptime, name, location)
  • Catchall data (global device metrics)
  • Interface discovery
Collect Cycle (default 5 minutes):
  • Interface statistics (octets, errors, discards)
  • CPU and memory utilization
  • Environmental sensors
  • Protocol statistics
Services Cycle (configurable):
  • Service availability checks
  • Response time measurements
Polling intervals can be customized per node or group using polling policies. For example, critical devices might be polled every 1-2 minutes and less important ones every 15-30 minutes.

Data Source Types

RRD databases contain multiple data sources (DS), each with a type:
  • GAUGE: Value that can increase or decrease (CPU %, temperature, voltage)
  • COUNTER: Ever-increasing counter that wraps (interface octets, packet counts)
  • DERIVE: Counter that can increase or decrease (signed counter)
  • ABSOLUTE: Counter that resets to zero after each read
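To show what the COUNTER type implies, the sketch below turns two successive counter readings into a per-second rate and compensates for a single wrap. It is a simplified illustration, not RRDtool's exact algorithm; `counter_rate` is a hypothetical helper, and because awk works in floating point the result is only exact for 32-bit counters.

```shell
# counter_rate PREV CURR SECONDS [BITS]
# Per-second rate between two COUNTER readings, allowing for one wrap
# of a BITS-wide counter (default 32).
counter_rate() {
    prev=$1; curr=$2; secs=$3; bits=${4:-32}
    awk -v p="$prev" -v c="$curr" -v s="$secs" -v b="$bits" 'BEGIN {
        delta = c - p
        if (delta < 0) delta += 2 ^ b    # reading wrapped past the counter maximum
        printf "%.2f\n", delta / s
    }'
}

counter_rate 1000 4000 300          # prints: 10.00
counter_rate 4294967000 704 300     # prints: 3.33 (wrapped 32-bit counter)
```

A GAUGE needs no such step: the reading itself is the stored value.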

Round Robin Archives (RRAs)

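Each RRD file contains multiple archives with different resolutions. When data is read back, getRRDasHash matches the requested resolution against the resolutions the file provides natively: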
# From rrdfunc.pm:70-130
sub getRRDasHash {
    my %args = @_;
    my $db = $args{database};
    
    my $minhr = (defined $args{hour_from}? $args{hour_from} : 0);
    my $maxhr = (defined $args{hour_to}? $args{hour_to} :  24) ;
    my $wantedresolution = $args{resolution};
    
    my @rrdargs = ($db, $args{mode});
    my ($bucketsize, $resolution);
    
    if (defined($wantedresolution) && $wantedresolution > 0) {
        # Determine native resolutions available
        my ($error, @available) = getRRDResolutions($db, $args{mode});
        return ({},[], { error => $error }) if ($error);
        
        # Match desired resolution to available RRAs
        if (grep($_ == $wantedresolution, @available)) {
            $resolution = $wantedresolution;
        }
        elsif ( $wantedresolution % $available[0] == 0) {
            # Bucketize ourselves
            $bucketsize = $wantedresolution / $available[0];
            $resolution = $available[0];
        }
    }
    # (excerpt ends here; the fetch and return logic appears under "Using rrdfunc Module")
}
Standard NMIS RRA Configuration:
| Archive | Step | Rows | Period | Consolidation |
|---|---|---|---|---|
| RRA:AVERAGE | 5 min | 576 | 2 days | Average of raw samples |
| RRA:AVERAGE | 30 min | 672 | 2 weeks | Average of 6 samples |
| RRA:AVERAGE | 2 hours | 744 | 2 months | Average of 24 samples |
| RRA:AVERAGE | 1 day | 1460 | 4 years | Average of 288 samples |
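The Period column follows from the other values: retention equals base step × steps per row × rows. A minimal sketch of that arithmetic, assuming the 300-second base step used throughout this page (`rra_retention_days` is a hypothetical helper):

```shell
# rra_retention_days STEP STEPS ROWS
# Retention of one RRA in whole days: STEP seconds * STEPS per row * ROWS.
rra_retention_days() {
    echo $(( $1 * $2 * $3 / 86400 ))
}

# The four standard archives as "steps rows" pairs.
for rra in "1 576" "6 672" "24 744" "288 1460"; do
    set -- $rra
    echo "steps=$1 rows=$2 -> $(rra_retention_days 300 "$1" "$2") days"
done
# prints 2, 14, 62 and 1460 days respectively
```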

Data Collection Configuration

Polling Intervals

# Configure node polling interval
nmis-cli act=set node=router1 poll_interval=300

# Set faster polling for critical device
nmis-cli act=set node=core-switch1 poll_interval=60

# Slower polling for edge devices
nmis-cli act=set node=remote-router poll_interval=900

Data Retention

Retention is controlled by RRA configuration in model files:
{
  "database": {
    "type": {
      "ifInOctets": {
        "rrd": {
          "step": 300,
          "heartbeat": 900,
          "ds": {
            "ifInOctets": {
              "type": "COUNTER",
              "min": "0",
              "max": "U"
            }
          },
          "rra": [
            { "cf": "AVERAGE", "xff": 0.5, "steps": 1, "rows": 576 },
            { "cf": "AVERAGE", "xff": 0.5, "steps": 6, "rows": 672 },
            { "cf": "AVERAGE", "xff": 0.5, "steps": 24, "rows": 744 },
            { "cf": "AVERAGE", "xff": 0.5, "steps": 288, "rows": 1460 }
          ]
        }
      }
    }
  }
}
RRA Parameters:
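  • CF: Consolidation function applied when combining data points (AVERAGE, MIN, MAX, LAST)
  • XFF: xfiles factor, the fraction of a consolidation interval that may be unknown while the consolidated value still counts as known (0.5 = up to 50% unknown)
  • Steps: Number of primary data points consolidated into one row
  • Rows: Number of consolidated rows kept in the archive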
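The effect of XFF is easiest to see on a single consolidation interval. The sketch below averages the known samples and returns `U` (unknown) when the unknown fraction exceeds XFF. It is a simplified model that ignores partially filled intervals; `consolidate_avg` is a hypothetical helper.

```shell
# consolidate_avg XFF VALUE...
# Average one consolidation interval; pass U for an unknown sample.
consolidate_avg() {
    xff=$1; shift
    printf '%s\n' "$@" | awk -v xff="$xff" '
        $1 == "U" { unknown++; next }          # count unknown samples separately
        { sum += $1; known++ }
        END {
            total = known + unknown
            if (known == 0 || unknown / total > xff) print "U"
            else printf "%.2f\n", sum / known
        }'
}

consolidate_avg 0.5 10 20 U 30 U U    # prints: 20.00 (exactly 50% unknown is tolerated)
consolidate_avg 0.5 10 U U U          # prints: U (75% unknown exceeds XFF)
```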

Retrieving Performance Data

Using rrdfunc Module

# From rrdfunc.pm:70-200
sub getRRDasHash {
    my %args = @_;
    my $db = $args{database};
    
    return ({},[], { error => "database required!"}) 
        if (!$db or !-f $db);
    
    my @rrdargs = ($db, $args{mode});
    push @rrdargs, ("--start",$args{start},"--end",$args{end});
    
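    my ($begin,$step,$name,$data) = RRDs::fetch(@rrdargs);
    # RRDs::fetch reports failure through RRDs::error, not its return values
    if (my $error = RRDs::error()) {
        return ({}, [], { error => "fetch failed for $db: $error" });
    }
    my @dsnames = defined $name ? @$name : ();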
    my %s;
    my $time = $begin;
    my $rowswithdata;
    
    # Loop over readings over time
    for(my $row = 0; $row <= $#{$data}; ++$row, $time += $step) {
        my $thisrow = $data->[$row];
        my $datapresent;
        
        # Loop over datasets per reading
        for(my $dsidx = 0; $dsidx <= $#{$thisrow}; ++$dsidx) {
            $s{$time}->{ $dsnames[$dsidx] } = $thisrow->[$dsidx];
            $datapresent ||= 1 if (defined $thisrow->[$dsidx]);
        }
        
        ++$rowswithdata if ($datapresent);
    }
    
    return (\%s, \@dsnames, { 
        step => $step, 
        start => $begin, 
        end => $time,
        rows => scalar @$data, 
        rows_with_data => $rowswithdata 
    });
}
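The row counting that getRRDasHash reports (rows and rows_with_data) can be approximated from the command line by parsing `rrdtool fetch` output. This sketch assumes the usual fetch layout: a header line of DS names followed by `timestamp: value ...` rows, with unknowns printed as `nan` or `-nan`. `fetch_summary` is a hypothetical helper, and the sample input stands in for real fetch output.

```shell
# fetch_summary: read rrdtool fetch output on stdin and count data rows
# and rows in which at least one data source has a known value.
fetch_summary() {
    awk '
        /^[0-9]+:/ {
            rows++
            for (i = 2; i <= NF; i++)
                if ($i !~ /nan/) { with_data++; break }
        }
        END { printf "rows=%d rows_with_data=%d\n", rows, with_data }'
}

# Sample input standing in for: rrdtool fetch /path/to/file.rrd AVERAGE -s -1hour
printf '%s\n' '            ds0  ds1' '' \
    '1700000000: 1.0e+01 nan' \
    '1700000300: -nan -nan' \
    '1700000600: 2.0e+00 3.0e+00' | fetch_summary
# prints: rows=3 rows_with_data=2
```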

CLI Data Retrieval

# Get latest values
nmis-cli act=get-rrd node=router1 inventory=cpu

# Get data for time range
nmis-cli act=get-rrd node=router1 \
  inventory=interface \
  index=GigabitEthernet0/0 \
  start="-1day" \
  end="now"

# Export to CSV
nmis-cli act=export-rrd node=router1 \
  inventory=cpu \
  start="-1week" \
  format=csv > cpu-data.csv

RRD File Management

Creating RRD Files

RRD files are created automatically during the first data collection, based on the model definitions. To create one manually:
# Create RRD from command line
rrdtool create /path/to/file.rrd \
  --step 300 \
  DS:metric:GAUGE:900:0:U \
  RRA:AVERAGE:0.5:1:576 \
  RRA:AVERAGE:0.5:6:672 \
  RRA:AVERAGE:0.5:24:744 \
  RRA:AVERAGE:0.5:288:1460

Updating RRD Files

# Manual RRD update
rrdtool update /path/to/file.rrd N:value

# Update with specific timestamp
rrdtool update /path/to/file.rrd 1234567890:42

RRD File Information

# View RRD structure
rrdtool info /var/nmis9/database/nodes/router1/health/cpu.rrd

# Check last update time
rrdtool lastupdate /var/nmis9/database/nodes/router1/health/cpu.rrd

# Dump RRD data to XML
rrdtool dump /path/to/file.rrd > file.xml

# Restore from XML
rrdtool restore file.xml /path/to/file.rrd

Resizing RRD Files

Change retention periods by resizing RRAs:
# Use NMIS resize script
/usr/local/nmis9/admin/rrd_resize.pl \
  node=router1 \
  type=health \
  file=cpu.rrd \
  archive=2 \
  rows=1488

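# Or use rrdtool directly (RRA numbers start at 0).
# rrdtool resize does not modify the file in place: it writes the result
# to resize.rrd in the current directory, which must then replace the original.
rrdtool resize /path/to/file.rrd 2 GROW 100
rrdtool resize /path/to/file.rrd 2 SHRINK 100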
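Before resizing, it helps to work out how many rows the target retention needs. The sketch below assumes the standard 2-hour archive from this page (300-second step, 24 steps per row, currently 744 rows); `rows_for_days` is a hypothetical helper.

```shell
# rows_for_days DAYS STEP STEPS
# Rows an RRA needs to cover DAYS days at STEP seconds * STEPS per row.
rows_for_days() {
    echo $(( $1 * 86400 / ($2 * $3) ))
}

target=$(rows_for_days 124 300 24)          # 1488 rows, about four months
echo "grow by $(( target - 744 )) rows"     # prints: grow by 744 rows
```

The difference is the row count to pass to GROW.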

Performance Monitoring

Common Metrics Collected

Interface Metrics:
  • ifInOctets / ifOutOctets (traffic volume)
  • ifInUcastPkts / ifOutUcastPkts (packets)
  • ifInErrors / ifOutErrors
  • ifInDiscards / ifOutDiscards
System Health:
  • avgBusy (CPU utilization)
  • MemoryUsedPROC / MemoryFreePROC
  • bufferFail, bufferSwap (buffer statistics)
Environmental:
  • Temperature sensors
  • Fan speeds
  • Power supply status
  • Voltage levels
Application:
  • CBQoS (Cisco class-based QoS)
  • IP SLA metrics
  • VPN statistics
  • Call Manager metrics

Data Aggregation

NMIS automatically aggregates high-resolution data:
Raw (5 min) --> 30 min avg --> 2 hour avg --> 1 day avg
  576 rows       672 rows       744 rows      1460 rows
   (2 days)      (2 weeks)      (2 months)    (4 years)
Aggregation functions:
  • AVERAGE: Mean of values in period
  • MIN: Minimum value in period
  • MAX: Maximum value in period
  • LAST: Most recent value in period
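Applying the four functions to the same bucket of samples shows how differently they treat a short spike. This is an illustrative sketch only; `consolidate` is a hypothetical helper and the sample values are made up.

```shell
# consolidate CF VALUE...
# Apply one consolidation function to a bucket of samples.
consolidate() {
    cf=$1; shift
    printf '%s\n' "$@" | awk -v cf="$cf" '
        { v = $1 + 0; sum += v; last = v
          if (NR == 1 || v < min) min = v
          if (NR == 1 || v > max) max = v }
        END {
            if      (cf == "AVERAGE") print sum / NR
            else if (cf == "MIN")     print min
            else if (cf == "MAX")     print max
            else if (cf == "LAST")    print last
        }'
}

# Six 5-minute samples forming one 30-minute row; 90 is a brief spike.
for cf in AVERAGE MIN MAX LAST; do
    echo "$cf: $(consolidate "$cf" 10 20 90 40 50 30)"
done
# prints AVERAGE: 40, MIN: 10, MAX: 90, LAST: 30 (one per line)
```

With only AVERAGE archives the spike of 90 is flattened to 40 in the 30-minute row; a MAX archive would preserve it.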

Graphing Performance Data

Graph Types

Interface Graphs:
  • Traffic (bits/octets per second)
  • Packets per second
  • Errors and discards
  • Utilization percentage
Health Graphs:
  • CPU utilization over time
  • Memory usage trends
  • Buffer statistics
  • Response time
Custom Graphs:
  • Defined in model files
  • Support for VDEF, CDEF operations
  • Multiple data sources
  • Stacked and overlaid views

Accessing Graphs

# Generate graph via CLI
nmis-cli act=graph node=router1 \
  graph=cpu \
  start="-1day" \
  end="now" \
  output=/tmp/cpu-graph.png

# Generate multiple graphs
nmis-cli act=graph-all node=router1 \
  start="-1week" \
  outputdir=/tmp/graphs/

Data Export and Analysis

Export Formats

# Export with rrdtool xport (output is XML by default; add --json for JSON)
rrdtool xport \
  --start -1day \
  --end now \
  --step 300 \
  DEF:metric=/path/to/file.rrd:ds:AVERAGE \
  XPORT:metric:"Metric Name" \
  > data.xml

# Export using NMIS
nmis-cli act=export-data node=router1 \
  type=interface \
  format=csv \
  start="-1month" \
  > interface-data.csv

Statistical Analysis

# Get statistics from RRD
rrdtool graph /dev/null \
  --start -1day --end now \
  DEF:metric=/path/to/file.rrd:ds:AVERAGE \
  VDEF:avg=metric,AVERAGE \
  VDEF:min=metric,MINIMUM \
  VDEF:max=metric,MAXIMUM \
  PRINT:avg:"Average: %6.2lf" \
  PRINT:min:"Minimum: %6.2lf" \
  PRINT:max:"Maximum: %6.2lf"

Troubleshooting

Missing Data

# Check last update
rrdtool lastupdate /path/to/file.rrd

# Confirm the file is readable and show its last update timestamp
rrdtool info /path/to/file.rrd | grep last_update

# Check for unknown data
rrdtool fetch /path/to/file.rrd AVERAGE -s -1hour | grep nan

RRD Errors

“illegal attempt to update using time X”
  • Cause: The timestamp of the update is not later than the last update already stored in the file
  • Fix: Ensure system time is correct and that updates arrive in time order
“expected X data source readings”
  • Wrong number of values provided
  • Fix: Match update to DS count in RRD
“unknown RRA”
  • Requesting non-existent archive
  • Fix: Use rrdtool info to see available RRAs
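The first two errors can be caught before calling `rrdtool update` with simple pre-checks. This is a sketch with hypothetical helper names; in practice the last update time could come from `rrdtool last /path/to/file.rrd`.

```shell
# update_allowed LAST NEW
# Succeeds only if NEW is later than LAST (both in epoch seconds).
update_allowed() {
    [ "$2" -gt "$1" ]
}

# update_value_count UPDATE
# Number of values after the timestamp in an update string "time:v1:v2:...".
update_value_count() {
    printf '%s\n' "$1" | awk -F: '{ print NF - 1 }'
}

update_allowed 1234567890 1234567890 || echo "rejected: timestamp not newer than last update"
update_value_count "N:12:34"    # prints: 2
```

The value count should match the number of data sources listed by `rrdtool info`.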

Performance Issues

# Check RRD file size
ls -lh /var/nmis9/database/nodes/*/health/*.rrd

# Count RRD files
find /var/nmis9/database -name "*.rrd" | wc -l

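# Monitor processes that are actively doing disk I/O
# (NMIS updates RRDs through the RRDs Perl module, so there may be no
# separate rrdtool process to filter on)
iotop -o -P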

# Check for filesystem issues
df -h /var/nmis9/database

Best Practices

  1. Plan Retention: Balance storage space vs historical needs
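  2. Monitor Disk Space: Individual RRD files keep a fixed size, but the number of files grows as nodes and interfaces are added
  3. Backup Strategically: Focus on configuration; RRD files are recreated automatically, but their collected history is not, so back up the RRDs whose history matters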
  4. Regular Validation: Check for stale RRD files from deleted nodes
  5. Optimize Polling: Don’t over-poll; 5 minutes is usually sufficient
  6. Use Compression: Modern filesystems with compression help
  7. Archive Historical Data: Export old data before RRD ages it out
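For item 4, stale files can be found by modification time, since an RRD that is still being polled is rewritten every cycle. A minimal sketch, assuming an age threshold in days is an acceptable proxy for "node deleted" (`stale_rrds` is a hypothetical helper; review the list before removing anything):

```shell
# stale_rrds DIR DAYS
# List .rrd files under DIR that have not been modified for more than DAYS days.
stale_rrds() {
    find "$1" -type f -name '*.rrd' -mtime +"$2"
}
```

For example, `stale_rrds /var/nmis9/database 30` lists files untouched for more than 30 days.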

Advanced Topics

Custom RRD Definitions

Create custom data sources in model files:
{
  "database": {
    "type": {
      "custom-metric": {
        "rrd": {
          "step": 300,
          "ds": {
            "mymetric": {
              "type": "GAUGE",
              "min": 0,
              "max": 100
            }
          }
        }
      }
    }
  }
}

Data Manipulation

Use CDEF and VDEF for calculations:
# Calculate average rate
DEF:bytes=file.rrd:ds:AVERAGE
CDEF:bits=bytes,8,*
VDEF:avgbits=bits,AVERAGE

Next Steps

Event Management

Configure thresholds and alerts based on performance data

SNMP Monitoring

Understand how data is collected via SNMP
