
Overview

NMIS provides a comprehensive event management system that detects problems, tracks their lifecycle, escalates based on severity and duration, and delivers notifications through multiple channels. The event system is the core of NMIS’s alerting and notification capabilities.

Event Detection

Automatic detection of down, degraded, and threshold violations

Escalation

Multi-level escalation based on severity and event duration

Notifications

Email, syslog, SNMP traps, and custom notification methods

Event Lifecycle

Track events from creation through acknowledgment to closure

Event Object Model

Events are first-class objects in NMIS with defined properties and lifecycle:
# From Event.pm:30-83
package NMISNG::Event;
use strict;
use warnings;
use Carp qw(confess);    # provides confess() used in new()

our $VERSION = "9.6.5";

# Known event attributes
my %known_attrs = (
    _id            => 1,
    ack            => 1,
    active         => 1,
    cluster_id     => 1,
    context        => 1,
    details        => 1,
    element        => 1,
    escalate       => 1,
    event_previous => 1,
    expire_at      => 1,
    historic       => 1,
    inventory_id   => 1,
    lastupdate     => 1,
    level_previous => 1,
    logged         => 1,
    node_name      => 1,
    node_uuid      => 1,
    notify         => 1,
    startdate      => 1,
    stateless      => 1,
    user           => 1,
    configuration  => {"group" => 1}
);

sub new {
    my ( $class, %args ) = @_;
    confess "nmisng required" if ( ref( $args{nmisng} ) ne "NMISNG" );
    
    # Need enough data to find in DB
    if ( !$args{_id} && ( !$args{node_uuid} && !$args{event} ) ) {
        confess "not enough info to create event";
    }
    
    my $self = bless({
        _nmisng => $args{nmisng},
        data    => \%args
    }, $class);
    
    return $self;
}

Event Properties

  • event (string, required): Event name (e.g., “Node Down”, “Interface Down”, “Proactive CPU”)
  • node_name (string, required): Name of the affected node
  • node_uuid (string, required): UUID of the affected node
  • level (string, required): Event severity (Fatal, Critical, Major, Minor, Warning, or Normal)
  • element (string): Affected element (interface name, disk, process, etc.)
  • details (string): Additional event information and context
  • active (boolean, default: true): Whether the event is currently active (not resolved)
  • ack (boolean, default: false): Whether an operator has acknowledged the event
  • escalate (number, default: -1): Current escalation level (-1 = not escalated)
  • stateless (boolean, default: false): Whether the event resets after the dampening period (used for traps and alerts)
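If you want to experiment outside NMIS, the property list above maps naturally onto a record type. The Python sketch below is illustrative only, since NMIS itself is written in Perl. The `Event` class, the `LEVELS` tuple and the severity check are assumptions; the field defaults are copied from the list.

```python
from dataclasses import dataclass

# Severity names documented for the `level` property.
LEVELS = ("Fatal", "Critical", "Major", "Minor", "Warning", "Normal")


@dataclass
class Event:
    """Illustrative Python mirror of the documented event properties (not NMIS code)."""

    event: str               # e.g. "Node Down", "Proactive CPU"
    node_name: str           # name of the affected node
    node_uuid: str           # UUID of the affected node
    level: str               # one of LEVELS
    element: str = ""        # interface, disk, process, ...
    details: str = ""        # free-form context
    active: bool = True      # unresolved until the UP/Closed event is raised
    ack: bool = False        # set when an operator acknowledges the event
    escalate: int = -1       # -1 = not yet escalated
    stateless: bool = False  # traps/alerts: reset after the dampening period

    def __post_init__(self) -> None:
        # Reject severities outside the documented set (an assumption; NMIS
        # itself may be more permissive).
        if self.level not in LEVELS:
            raise ValueError(f"unknown level: {self.level!r}")


e = Event("Interface Down", "router1", "uuid-1234", "Major", element="Gi0/0")
print(e.active, e.ack, e.escalate)  # True False -1
```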

Event Types

System Events

Node Availability:
  • Node Down: Device unreachable via ping
  • Node Up: Device becomes reachable
  • SNMP Down: SNMP polling failed
  • WMI Down: WMI collection failed
  • Node Polling Failover: Failover to backup host
Interface Events:
  • Interface Down: Interface operStatus changed to down
  • Interface Up: Interface recovered
Service Events:
  • Service Down: Service check failed
  • Service Up: Service recovered
  • Service Degraded: Service slow but responding

Threshold Events (Proactive)

Generated when metrics exceed defined thresholds:
  • Proactive CPU: CPU utilization threshold exceeded
  • Proactive Memory: Memory usage threshold exceeded
  • Proactive Interface: Interface utilization threshold
  • Proactive Disk: Disk space threshold
  • Proactive [Custom]: Custom threshold violations
# From Event.pm:368-396
if ( $self->event =~ /Proactive/ ) {
    my ( $value, $reset ) = @args{"value", "reset"};
    if ( defined $value and defined $reset ) {
        # Only close once the value has cleared the reset level by 10%:
        # thresholds where high = good use cutoff = reset * 1.1 (falling dampening default)
        # thresholds where low = good use cutoff = reset * 0.9 (rising dampening default)
        my $cutoff = $reset * (
              $value >= $reset
            ? $C->{'threshold_falling_reset_dampening'}
            : $C->{'threshold_rising_reset_dampening'}
        );
        
        if ( $value >= $reset && $value <= $cutoff ) {
            $S->nmisng->log->debug(
                "Proactive Event value $value too low for dampening limit $cutoff. Not closing.");
            return;
        }
        elsif ( $value < $reset && $value >= $cutoff ) {
            $S->nmisng->log->debug(
                "Proactive Event value $value too high for dampening limit $cutoff. Not closing.");
            return;
        }
    }
    $new_event = $self->event . " Closed";
}
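The dampening rule can be checked in isolation. This Python sketch assumes the default factors named in the comments (1.1 falling, 0.9 rising). It also assumes the check is only called once the metric is back on the healthy side of the reset level. `proactive_should_close` is a hypothetical helper, not an NMIS function.

```python
def proactive_should_close(value, reset, falling=1.1, rising=0.9):
    """Return True if a Proactive event may close, following the dampening rule.

    `falling` and `rising` stand in for threshold_falling_reset_dampening and
    threshold_rising_reset_dampening (defaults 1.1 and 0.9 per the comments).
    """
    if value is None or reset is None:
        return True  # nothing to dampen against; the event closes
    cutoff = reset * (falling if value >= reset else rising)
    if reset <= value <= cutoff:
        return False  # high = good: not yet 10% above the reset level
    if cutoff <= value < reset:
        return False  # low = good: not yet 10% below the reset level
    return True


# CPU-style threshold (low = good), reset level 90%:
print(proactive_should_close(85, 90))  # False: 85 is above the 81 cutoff
print(proactive_should_close(80, 90))  # True: cleared by more than 10%
```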

Alert Events

Custom alerts generated by external systems or scripts:
  • Alert: [Custom]: Custom alert conditions
  • Alert names prefixed with “Alert:” or “Alert :”

SNMP Trap Events

Generated from received SNMP traps:
  • TRAP: Generic SNMP trap event
  • Traps are typically stateless; acknowledging an active TRAP event deletes it

Event Lifecycle

Event States

Events progress through defined states:
  1. Created: Event detected and added to database
  2. Active: Event is current and unresolved (active=1, historic=0)
  3. Acknowledged: Operator acknowledged (ack=1)
  4. Resolved: Condition cleared (active=0, historic=0)
  5. Historic: Event archived for history (historic=1)
# From Event.pm:282-326 (abridged)
sub check {
    my ( $self, %args ) = @_;
    my $S = $args{sys};
    # $details and $level for the UP event are prepared in code omitted here
    
    my $exists = $self->exists();
    
    # Event exists and is active: turn it into the matching UP event
    if ( $exists && $self->active ) {
        # Compute outage period
        my $outage = NMISNG::Util::convertSecsHours( time() - $self->startdate );
        
        # Determine UP event name, starting from the current name
        my $new_event = $self->event;
        if ( $self->event eq "Node Down" ) {
            $new_event = "Node Up";
        }
        elsif ( $self->event eq "Interface Down" ) {
            $new_event = "Interface Up";
        }
        elsif ( $self->event =~ /Proactive/ ) {
            # (threshold dampening check omitted; see Threshold Events)
            $new_event = $self->event . " Closed";
        }
        elsif ( $self->event =~ /down/i ) {
            $new_event =~ s/down/Up/i;    # e.g. "Service Down" becomes "Service Up"
        }
        
        $details .= ( $details ? " " : "" ) . "Time=$outage";
        
        # Mark event inactive and save; notification and deletion follow (not shown)
        $self->active(0);
        $self->event($new_event);
        $self->details($details);
        $self->level($level);
        
        my $error = $self->save();
        return $error;
    }
    return;
}
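The naming rule in check() is easy to test on its own. This Python sketch mirrors the branch order above. `up_event_name` is a hypothetical helper. Returning unmatched names unchanged is an assumption, because the excerpt does not cover that case.

```python
import re


def up_event_name(event: str) -> str:
    """Return the name check() gives an event when its condition clears."""
    if event == "Node Down":
        return "Node Up"
    if event == "Interface Down":
        return "Interface Up"
    if "Proactive" in event:  # case-sensitive, like /Proactive/
        return event + " Closed"
    if re.search("down", event, flags=re.IGNORECASE):
        # Replace the first "down" in any case with "Up", like s/down/Up/i
        return re.sub("down", "Up", event, count=1, flags=re.IGNORECASE)
    return event  # assumption: other names are left unchanged


for name in ("Node Down", "SNMP Down", "Service Down", "Proactive CPU"):
    print(name, "->", up_event_name(name))
```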

State Transitions
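The event states listed above can be expressed as a small transition table. This Python sketch is illustrative only. The state names come from this page. The action names (`save`, `acknowledge`, `unacknowledge`, `clear`, `archive`) and the table itself are assumptions, not NMIS identifiers.

```python
# Illustrative transition table for the documented event states.
TRANSITIONS = {
    ("Created", "save"): "Active",                # save(): active=1, historic=0
    ("Active", "acknowledge"): "Acknowledged",    # ack=1
    ("Acknowledged", "unacknowledge"): "Active",  # ack=0
    ("Active", "clear"): "Resolved",              # check(): active=0
    ("Acknowledged", "clear"): "Resolved",
    ("Resolved", "archive"): "Historic",          # historic=1
}


def next_state(state: str, action: str) -> str:
    """Apply one action, rejecting transitions the table does not allow."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"{action!r} is not valid in state {state!r}") from None


state = "Created"
for action in ("save", "acknowledge", "clear", "archive"):
    state = next_state(state, action)
print(state)  # Historic
```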

Event Creation and Management

Creating Events

# From Events.pm:125-153
sub eventAdd {
    my ( $self, %args ) = @_;
    
    my $node = $args{node};
    return "Cannot create event without node object" 
        if ( ref($node) ne 'NMISNG::Node' );
        
    $args{node_name} = $node->name;
    $args{node_uuid} = $node->uuid;
    
    my $event_obj = $self->event(%args);
    return $event_obj->save();
}

Using Events API

# Create event via node object
$node->eventAdd(
    event   => "Proactive CPU",
    level   => "Major",
    element => "cpu",
    details => "CPU utilization 95% exceeds threshold 90%"
);

# Check if event exists
if ($node->eventExist("Node Down")) {
    # Node is down
}

# Load specific event
my ($error, $event) = $node->eventLoad(
    event   => "Interface Down",
    element => "GigabitEthernet0/0"
);
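To illustrate how events are identified, here is a hypothetical in-memory stand-in for these calls, written in Python. It assumes events are keyed by event name plus element, as the `eventLoad` example suggests. `NodeEvents` and its methods are invented names, not part of NMIS.

```python
class NodeEvents:
    """Hypothetical in-memory stand-in for a node's event methods (not NMIS code)."""

    def __init__(self, name: str, uuid: str):
        self.name, self.uuid = name, uuid
        self._events = {}  # (event name, element) -> event record

    def event_add(self, event, level, element="", details=""):
        # Like eventAdd(): copy the node's identity into the event, then store it.
        self._events[(event, element)] = {
            "node_name": self.name, "node_uuid": self.uuid,
            "event": event, "level": level,
            "element": element, "details": details,
        }
        return None  # None means "no error", assumed to mirror save()

    def event_exist(self, event, element=""):
        return (event, element) in self._events

    def event_load(self, event, element=""):
        # Returns (error, event) like the Perl eventLoad() call above.
        found = self._events.get((event, element))
        if found is None:
            return f"no {event!r} event for element {element!r}", None
        return None, found


node = NodeEvents("router1", "uuid-1234")
node.event_add("Interface Down", "Major", element="GigabitEthernet0/0")
error, ev = node.event_load("Interface Down", element="GigabitEthernet0/0")
print(error, ev["node_name"])  # None router1
```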

CLI Event Operations

# List active events
nmis-cli act=list-events

# List events for node
nmis-cli act=list-events node=router1

# Show event details
nmis-cli act=show-event id=507f1f77bcf86cd799439011

# Acknowledge event
nmis-cli act=ack-event id=507f1f77bcf86cd799439011 \
  user="admin" \
  ack=true

# Clear event manually
nmis-cli act=clear-event node=router1 event="Node Down"

Event Escalation

Escalation Levels

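Events escalate based on duration and severity. Every new event starts unescalated: on first save, save() fills in defaults, including escalate = -1:
# From Event.pm:897-948 (abridged)
sub save {
    my ( $self, %args ) = @_;
    my $update = $args{update};    # true when re-saving an existing event, e.g. save( update => 1 )
    
    # Set defaults for new events
    if ( !$update ) {
        $self->{data}{active}    //= 1;
        $self->{data}{historic}  //= 0;
        $self->{data}{startdate} //= time;
        $self->{data}{ack}       //= 0;
        $self->{data}{escalate}  //= -1;
        $self->{data}{notify}    //= "";
        $self->{data}{stateless} //= 0;
    }
}
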
Escalation Levels:
  • -1: Not yet escalated
  • 0: Initial escalation
  • 1-N: Subsequent escalation levels
Escalation Process:
  1. Event created with escalate=-1
  2. First escalation after initial delay: escalate=0
  3. Subsequent escalations: escalate++
  4. Different contacts notified at each level
  5. Stops at maximum configured level

Escalation Configuration

Configure in escalation policies:
{
  "escalate": {
    "0": {
      "level": "Fatal,Critical,Major",
      "start": 5,
      "frequency": 60,
      "contact": "admin"
    },
    "1": {
      "level": "Fatal,Critical",
      "start": 30,
      "frequency": 120,
      "contact": "oncall"
    },
    "2": {
      "level": "Fatal",
      "start": 120,
      "frequency": 240,
      "contact": "manager"
    }
  }
}
Parameters:
  • level: Event severity levels this escalation applies to
  • start: Minutes after event creation before first notification
  • frequency: Minutes between repeat notifications
  • contact: Contact group to notify
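The policy above can be evaluated with a few lines of code. This Python sketch makes several assumptions: `start` and `frequency` are whole minutes measured from event creation, and the current escalation is the highest level whose severity list and start time both match (so it stays -1 before the first one). It also assumes repeats fire every `frequency` minutes after `start`. The helper names are hypothetical.

```python
import json

# The escalation policy from the example above.
POLICY = json.loads("""
{"escalate": {
  "0": {"level": "Fatal,Critical,Major", "start": 5,   "frequency": 60,  "contact": "admin"},
  "1": {"level": "Fatal,Critical",       "start": 30,  "frequency": 120, "contact": "oncall"},
  "2": {"level": "Fatal",                "start": 120, "frequency": 240, "contact": "manager"}
}}
""")


def matching_rules(policy, severity):
    """Yield (level number, rule) for rules that apply to this severity, in order."""
    for key in sorted(policy["escalate"], key=int):
        rule = policy["escalate"][key]
        if severity in rule["level"].split(","):
            yield int(key), rule


def escalation_level(policy, severity, minutes):
    """Highest level whose start time has passed; -1 before the first one."""
    current = -1
    for number, rule in matching_rules(policy, severity):
        if minutes >= rule["start"]:
            current = number
    return current


def contacts_due(policy, severity, minutes):
    """Contacts notified at this whole minute: at `start`, then every `frequency`."""
    return [
        rule["contact"]
        for _, rule in matching_rules(policy, severity)
        if minutes >= rule["start"] and (minutes - rule["start"]) % rule["frequency"] == 0
    ]


print(escalation_level(POLICY, "Critical", 45))  # 1
print(contacts_due(POLICY, "Fatal", 120))        # ['manager']
```

Because the level is the highest matching rule, escalation naturally stops at the last configured level for each severity.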

Event Acknowledgment

# From Event.pm:206-280
sub acknowledge {
    my ( $self, %args ) = @_;
    my $ack  = $args{ack};
    my $user = $args{user};
    # $wantlog (whether to write an event log entry) is set in code omitted here
    
    $ack = NMISNG::Util::getbool($ack);
    
    # Load current event state
    if ( my $error = $self->load() ) {
        $self->nmisng->log->error( "cannot find event id:" . $self->_id );
        return "cannot find event id:" . $self->_id;
    }
    return if ( !$self->active );
    
    # TRAP events are deleted when acknowledged
    if ( $ack and !$self->ack and $self->event eq "TRAP" ) {
        if ( my $error = $self->delete() ) {
            $self->nmisng->log->error("failed to delete event: $error");
        }
        $self->log(
            event => "deleted event: " . $self->event,
            level => "Normal",
        ) if ($wantlog);
    }
    else {
        # Nothing to do if ack state unchanged
        if ( $ack != $self->ack ) {
            $self->ack($ack);
            $self->user($user);
            my $error = $self->save( update => 1 );
            
            $self->log(
                level   => "Normal",
                details => "acknowledge=$ack ($user)"
            ) if $wantlog;
        }
    }
}
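The branches of acknowledge() reduce to a small decision function. This Python sketch works on a plain dict rather than an NMIS event object. The function and its return strings are illustrative assumptions.

```python
def acknowledge(event: dict, ack: bool, user: str) -> str:
    """Apply an (un)acknowledgment to a dict-based event and report the outcome."""
    if not event.get("active"):
        return "ignored"            # inactive events are left alone
    if ack and not event.get("ack") and event.get("event") == "TRAP":
        return "deleted"            # first ack of a TRAP removes it
    if bool(ack) != bool(event.get("ack")):
        event["ack"] = bool(ack)    # record the new state and who set it
        event["user"] = user
        return "updated"
    return "unchanged"              # ack state already matches


ev = {"event": "Node Down", "active": True, "ack": False}
print(acknowledge(ev, True, "john.doe"), ev["ack"])  # updated True
```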

Acknowledgment via CLI

# Acknowledge single event
nmis-cli act=ack-event node=router1 \
  event="Node Down" \
  user="john.doe" \
  ack=true

# Acknowledge by event ID
nmis-cli act=ack-event id=507f1f77bcf86cd799439011 \
  user="admin" \
  ack=true

# Unacknowledge (re-enable notifications)
nmis-cli act=ack-event node=router1 \
  event="Interface Down" \
  element="GigabitEthernet0/0" \
  user="admin" \
  ack=false

# Acknowledge all events for node
nmis-cli act=ack-all-events node=router1 user="admin"

Event Logging

Event Log File

Events are appended to the event log file (subject to the Log setting in Events.nmis), one comma-separated line per event:
# From Events.pm:288-337 (abridged)
use Fcntl qw(:DEFAULT :flock);    # O_WRONLY, O_APPEND, O_CREAT, LOCK_EX

sub logEvent {
    my ( $self, %args ) = @_;
    
    my $node_name = $args{node_name};
    my $event     = $args{event};
    my $element   = $args{element} // "";
    my $level     = $args{level};
    my $details   = $args{details} // "";
    $details =~ s/,//g;  # Strip commas so the line keeps six columns
    
    if ( !$node_name or !$event or !$level ) {
        return "cannot log event, required argument missing";
    }
    
    my $time = time();
    my $C    = NMISNG::Util::loadConfTable();
    
    sysopen( my $fh, $C->{event_log}, O_WRONLY | O_APPEND | O_CREAT )
        or return "Cannot open $C->{event_log}: $!";
    flock( $fh, LOCK_EX )
        or return "Cannot lock $C->{event_log}: $!";
    
    print $fh "$time,$node_name,$event,$level,$element,$details\n";
    close($fh) or return "Cannot close $C->{event_log}: $!";
    return;    # no error
}
Event Log Location:
  • Default: /usr/local/nmis9/logs/event.log
  • Format: CSV with timestamp (Unix epoch seconds), node, event, level, element, details
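The line layout is simple to reproduce, for example when generating test fixtures. This Python sketch follows the field order of logEvent() and strips commas from details in the same way. `format_event_line` and its `now` parameter are hypothetical.

```python
import time


def format_event_line(node_name, event, level, element="", details="", now=None):
    """Build one event.log line in the order logEvent() writes it."""
    if not node_name or not event or not level:
        raise ValueError("cannot log event, required argument missing")
    details = (details or "").replace(",", "")  # commas would add CSV columns
    timestamp = int(time.time() if now is None else now)  # Unix epoch seconds
    return f"{timestamp},{node_name},{event},{level},{element or ''},{details}\n"


print(format_event_line("router1", "Node Down", "Critical",
                        details="Ping failed, 100% loss", now=1700000000), end="")
# 1700000000,router1,Node Down,Critical,,Ping failed 100% loss
```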

Viewing Event Logs

# Tail event log
tail -f /usr/local/nmis9/logs/event.log

# Search for specific events
grep "Node Down" /usr/local/nmis9/logs/event.log

# Events in last hour
awk -F, -v since=$(date -d '1 hour ago' +%s) '$1 > since' \
  /usr/local/nmis9/logs/event.log

# Count events by type
cut -d, -f3 /usr/local/nmis9/logs/event.log | sort | uniq -c
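For analysis beyond grep and awk, the log is easy to parse in Python. The sketch below assumes the six-field layout described above. It splits each line at most five times, so any stray commas end up in the final details column. The function names are hypothetical.

```python
from collections import Counter

FIELDS = ("time", "node", "event", "level", "element", "details")


def parse_event_log(text):
    """Parse event.log text into dicts; details is the last column."""
    records = []
    for line in text.splitlines():
        parts = line.split(",", len(FIELDS) - 1)  # keep extra commas in details
        if len(parts) != len(FIELDS) or not parts[0].isdigit():
            continue  # skip blank or malformed lines
        record = dict(zip(FIELDS, parts))
        record["time"] = int(record["time"])
        records.append(record)
    return records


def count_by_event(records, since=0):
    """Count events by name, optionally only those newer than `since` (epoch seconds)."""
    return Counter(r["event"] for r in records if r["time"] > since)


sample = """1700000000,router1,Node Down,Critical,,Ping failed
1700000300,router1,Node Up,Normal,,recovered
1700000400,switch2,Interface Down,Major,Gi0/1,
"""
records = parse_event_log(sample)
print(count_by_event(records, since=1700000100))
```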

Notifications

Notification Methods

NMIS supports multiple notification channels:
Email Notifications:
  • HTML or plain text
  • Customizable templates
  • Multiple recipients per contact
  • Attachment support
Syslog:
  • RFC 3164/5424 format
  • Configurable facility and severity
  • Remote syslog servers
SNMP Traps:
  • SNMPv2c traps
  • Custom trap definitions
  • Includes event details as varbinds
Custom Scripts:
  • Execute external programs
  • Pass event data as arguments
  • Integrate with ticketing systems

Notification Configuration

Configure in Contacts.nmis:
{
  "admin": {
    "email": "admin@example.com",
    "mobile": "+1-555-0100",
    "notify": {
      "node_down": true,
      "interface_down": true,
      "proactive": true
    },
    "escalate": {
      "0": true,
      "1": true
    }
  }
}

Event Filtering and Control

Events.nmis Configuration

Control which events are active, logged, and notified:
{
  "Node Down": {
    "Status": "true",
    "Log": "true",
    "Notify": "true"
  },
  "Interface Down": {
    "Status": "true",
    "Log": "true",
    "Notify": "true",
    "Filter": "ifAdminStatus eq 'up'"
  },
  "Proactive CPU": {
    "Status": "true",
    "Log": "true",
    "Notify": "true"
  }
}
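A tool that reads this table needs to turn the string flags into booleans. The Python sketch below shows one way to do it. `event_flags` is a hypothetical helper. It assumes JSON input as shown here, compares "true" case-insensitively, and treats unlisted events as disabled. It does not evaluate `Filter` expressions.

```python
import json

# Excerpt of the Events.nmis example above.
EVENTS_CONF = json.loads("""
{
  "Node Down": {"Status": "true", "Log": "true", "Notify": "true"},
  "Interface Down": {"Status": "true", "Log": "true", "Notify": "true",
                     "Filter": "ifAdminStatus eq 'up'"}
}
""")


def event_flags(conf, name):
    """Return (status, log, notify) for an event name as booleans."""
    entry = conf.get(name, {})

    def flag(key):
        # Assumption: only the string "true" (any case) enables a flag.
        return str(entry.get(key, "false")).strip().lower() == "true"

    return flag("Status"), flag("Log"), flag("Notify")


print(event_flags(EVENTS_CONF, "Node Down"))      # (True, True, True)
print(event_flags(EVENTS_CONF, "Proactive Foo"))  # (False, False, False)
```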

Event Status Levels

Severity Mapping:
# From Event.pm:545-619 (abridged; $node, the model $M and $pol_event,
# the event name used for the model lookup, are set up in code omitted here)
sub getLogLevel {
    my ( $self, %args ) = @_;
    my ( $S, $event, $level ) = @args{'sys', 'event', 'level'};
    my ( $mdl_level, $log, $syslog );
    
    my $role = $node->configuration->{roleType} || 'access';
    my $type = $node->configuration->{nodeType} || 'router';
    my $default_event_level = $self->nmisng->config->{default_event_level} // 'Major';
    
    # Get level from the model: the event's entry for this role first,
    # then the model's default entry for the role, then the config default
    if ( $mdl_level = $M->{event}{event}{lc $pol_event}{lc $role}{level} ) {
        $log    = $M->{event}{event}{lc $pol_event}{lc $role}{logging};
        $syslog = $M->{event}{event}{lc $pol_event}{lc $role}{syslog};
    }
    elsif ( $mdl_level = $M->{event}{event}{default}{lc $role}{level} ) {
        $log    = $M->{event}{event}{default}{lc $role}{logging};
        $syslog = $M->{event}{event}{default}{lc $role}{syslog};
    }
    else {
        $mdl_level = $default_event_level;
    }
    
    return ( $mdl_level, $log, $syslog );
}
Standard Levels:
  • Fatal: System failure, immediate action required
  • Critical: Critical condition, escalate immediately
  • Major: Major problem affecting service
  • Minor: Minor issue, may not need immediate action
  • Warning: Warning condition, informational
  • Normal: Normal operation, UP events
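The lookup order in getLogLevel() can be sketched independently of NMIS. This Python version uses the event name directly as the key; the Perl code uses `$pol_event`, whose derivation is not shown. It also assumes lowercase model keys and uses a hypothetical model fragment.

```python
def get_log_level(model, event, role="access", default_level="Major"):
    """Resolve (level, logging, syslog) using getLogLevel()'s fallback order."""
    table = model.get("event", {}).get("event", {})
    # Event-specific entry for the role first, then the "default" entry.
    for key in (event.lower(), "default"):
        entry = table.get(key, {}).get(role.lower(), {})
        if entry.get("level"):
            return entry["level"], entry.get("logging"), entry.get("syslog")
    return default_level, None, None  # like default_event_level


# Hypothetical model fragment; real model files hold many more entries.
model = {"event": {"event": {
    "node down": {"core": {"level": "Critical", "logging": "true", "syslog": "true"}},
    "default": {"core": {"level": "Minor", "logging": "true", "syslog": "false"}},
}}}
print(get_log_level(model, "Node Down", role="core"))       # ('Critical', 'true', 'true')
print(get_log_level(model, "Interface Down", role="core"))  # ('Minor', 'true', 'false')
print(get_log_level(model, "Node Down", role="access"))     # ('Major', None, None)
```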

Stateless Events

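Stateless events re-arm automatically. Once stateless_event_dampening seconds have passed since the event started, saving it again resets its start time, escalation and acknowledgment: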
# From Event.pm:897-919
if ( $exists && $self->stateless ) {
    my $stateless_event_dampening = $self->nmisng->config->{stateless_event_dampening} || 900;
    
    # If stateless time exceeds dampening, reset escalation
    if ( time() > $self->startdate + $stateless_event_dampening ) {
        $self->active(1);
        $self->historic(0);
        $self->startdate(time);
        $self->escalate(-1);
        $self->ack(0);
    }
}
Stateless Event Types:
  • SNMP Traps
  • Custom Alerts
  • Some threshold events
Dampening Period:
  • Default: 900 seconds (15 minutes)
  • Configurable via stateless_event_dampening
  • Prevents notification flooding
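The reset rule can be exercised with fixed timestamps. This Python sketch mirrors the stateless save() excerpt on a plain dict. `maybe_reset_stateless` is a hypothetical helper, and `dampening` stands in for stateless_event_dampening.

```python
import time


def maybe_reset_stateless(event: dict, now=None, dampening=900) -> bool:
    """Re-arm a stateless event whose dampening window has passed; True on reset."""
    now = time.time() if now is None else now
    if not event.get("stateless"):
        return False
    if now > event["startdate"] + dampening:
        # Same fields the Perl excerpt resets.
        event.update(active=1, historic=0, startdate=now, escalate=-1, ack=0)
        return True
    return False


trap = {"event": "TRAP", "stateless": 1, "startdate": 1000, "escalate": 2, "ack": 1}
print(maybe_reset_stateless(trap, now=1500))  # False: inside the 900 s window
print(maybe_reset_stateless(trap, now=2000), trap["escalate"])  # True -1
```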

Troubleshooting

Events Not Created

# Check event configuration
cat /usr/local/nmis9/conf/Events.nmis | grep "Node Down"

# Verify node is active
nmis-cli act=show node=router1 property=active

# Check polling logs
tail -f /usr/local/nmis9/logs/nmis.log | grep Event

# Force update to trigger event check
nmis-cli act=update node=router1

Notifications Not Sent

# Check contact configuration
cat /usr/local/nmis9/conf/Contacts.nmis

# Verify email settings
nmis-cli act=test-email user@example.com

# Check escalation log
tail -f /usr/local/nmis9/logs/escalate.log

# Test notification manually
nmis-cli act=test-notify contact=admin event="Test Event"

Event Not Clearing

# Check if condition actually cleared
nmis-cli act=update node=router1

# View event details
nmis-cli act=show-event node=router1 event="Node Down"

# Manually clear if needed
nmis-cli act=clear-event node=router1 event="Node Down"

Best Practices

  1. Configure Escalations: Set appropriate escalation levels and timing
  2. Use Acknowledgment: Acknowledge events to suppress notifications
  3. Review Event Logs: Regularly review event.log for patterns
  4. Tune Thresholds: Adjust to reduce false positives
  5. Document Events: Use details field for troubleshooting context
  6. Test Notifications: Regularly test notification delivery
  7. Monitor Event Count: Excessive events may indicate configuration issues

Next Steps

Performance Data

Configure thresholds based on collected performance metrics

Device Management

Manage nodes and configure event-related properties
