We use Nagios XI to monitor our IT infrastructure: over 2000 hosts and 11000 services in a heterogeneous environment.  We recently noticed that on a few hosts the performance graphs were displayed but blank, with no data.  The problem was limited to some of the disk monitoring graphs.

We use two different command definitions, service templates and service configurations to monitor the status of disks on the Unix and Windows hosts using SNMP.  The disk monitor graphs were displayed properly on the majority of the hosts, so the first problem was to find all the hosts on which they were blank.

A closer look at the XML files of the "Disk_Monitor" service for which the graphs were blank revealed two suspicious lines:

<RC>256</RC>
<TXT>update failed</TXT>

This was a good hint.  A quick Bash one-liner listed all the offending hosts with blank disk monitor graphs:

$ cd /usr/local/nagios/share/perfdata
$ find . -type f -name "Disk_Monitor.xml" -exec grep -iH --color 'update failed' {} +
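
If only the host names are needed, the same search can be wrapped in a small function. This is a minimal sketch, not part of Nagios XI: the function name `list_failed_hosts` is made up for illustration, and it assumes that each host has its own subdirectory under the perfdata directory, so that the parent directory of a matching XML file is the host name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the unique host directory names whose
# Disk_Monitor.xml contains "update failed".
# Assumption: the layout is <perfdata_dir>/<host>/Disk_Monitor.xml.
list_failed_hosts() {
    local perfdata_dir="${1:-/usr/local/nagios/share/perfdata}"

    # grep -l prints only the names of the files that match.
    find "$perfdata_dir" -type f -name 'Disk_Monitor.xml' \
        -exec grep -il 'update failed' {} + |
    while IFS= read -r xml; do
        # The parent directory of the XML file is taken as the host name.
        basename "$(dirname "$xml")"
    done | sort -u
}
```

Called without an argument it searches the default perfdata directory; a different directory can be passed as the first argument, which is handy for trying it on a copy of the data first.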

Nagios XI stores the performance data in the /usr/local/nagios/share/perfdata directory.  There is an RRD file and an associated XML file for each host and service being monitored.
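
A quick sanity check of that pairing can be sketched as follows. The function name `find_unpaired_xml` is hypothetical, and the sketch assumes that an XML file and its RRD file share the same base name and differ only in the extension, as the file names in this post suggest.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list every XML file that has no RRD file next to it.
# Assumption: <name>.xml is paired with <name>.rrd in the same directory.
find_unpaired_xml() {
    local perfdata_dir="${1:-/usr/local/nagios/share/perfdata}"

    find "$perfdata_dir" -type f -name '*.xml' |
    while IFS= read -r xml; do
        # Strip the .xml suffix and look for the sibling .rrd file.
        if [ ! -f "${xml%.xml}.rrd" ]; then
            printf '%s\n' "$xml"
        fi
    done
}
```

An empty result means every XML file found has a matching RRD file.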

The next step was to remove all the offending files (.rrd and .xml) and start afresh, which resulted in another quick Bash one-liner:

$ for i in $(find . -type f -name "Disk_Monitor.xml" -exec grep -il 'update failed' {} +) ; do rm -v "$i" "${i%.xml}.rrd" ; done

Unfortunately, all the old data on the problematic hosts is gone, as there seems to be no easy way to fix or recover the files.  The good news is that the new graphs are now being displayed.
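
Since the removal cannot be undone, a more cautious variant would move the files aside instead of deleting them. The following is only a sketch of that idea, not what was actually run here: the function name `quarantine_failed` and the backup directory are hypothetical, the backup directory should live outside the perfdata directory, and the layout is again assumed to be one subdirectory per host.

```shell
#!/usr/bin/env bash
# Hypothetical helper: move each failing Disk_Monitor.xml and its RRD file
# into <backup_dir>/<host>/ instead of deleting them.
# Assumption: the layout is <perfdata_dir>/<host>/Disk_Monitor.{xml,rrd}
# and backup_dir is NOT located inside perfdata_dir.
quarantine_failed() {
    local perfdata_dir="$1" backup_dir="$2"

    find "$perfdata_dir" -type f -name 'Disk_Monitor.xml' \
        -exec grep -il 'update failed' {} + |
    while IFS= read -r xml; do
        # Keep one backup subdirectory per host.
        dest="$backup_dir/$(basename "$(dirname "$xml")")"
        mkdir -p "$dest"
        mv -v "$xml" "$dest/"

        # Replace only the trailing .xml, so host names containing
        # "xml" are left alone.
        rrd="${xml%.xml}.rrd"
        if [ -f "$rrd" ]; then
            mv -v "$rrd" "$dest/"
        fi
    done
}
```

It takes the perfdata directory as the first argument and the backup directory as the second.  Once the new graphs look fine, the backup directory can be removed by hand.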

