Nagios

Nagios Plugin for dmesg Monitoring

So far I found no easy solution to monitor for Linux kernel messages. So here is a simple Nagios plugin to scan dmesg output for interesting stuff:

#!/bin/bash

SEVERITIES="err,alert,emerg,crit"
WHITELIST="microcode: |\
Firmware Bug|\
i8042: No controller|\
Odd, counter constraints enabled but no core perfctrs detected|\
Failed to access perfctr msr|\
echo 0 > /proc/sys"

# Check for critical dmesg lines from this day
date=$(date "+%a %b %e")
output=$(dmesg -T -l "$SEVERITIES" | egrep -v "$WHITELIST" | grep "$date" | tail -5)

if [ "$output" == "" ]; then
	echo "All is fine."
	exit 0
fi

echo "$output" | xargs
exit 1

"Features" of the script above:

  • It gives you the 5 most recent messages from today
  • It allows to whitelist common but useless errors in $WHITELIST
  • It uses "dmesg" to work when you already have disk I/O errors and to be faster than syslog parsing

This script helped a lot to early on detect I/O errors, recoverable as well as corruptions. It even worked when entire root partition wasn't readable anymore, because then the Nagios check failed with "NRPE: unable to read output" which indicated that dmesg didn't work anymore. By always showing all errors from the entire day one cannot miss recovered errors that happened in non-office hours.

Another good thing about the check is detecting OOM kills or fast spawning of processes.

Simple Chef to Nagios Hostgroup Export

When you are automatizing with chef and use plain Nagios for monitoring you will find duplication quite some configuration. One large part is the hostgroup definitions which usually map many of the chef roles. So if the roles are defined in chef anyway they should be sync'ed to Nagios.

Using "knife" one can extract the roles of a node like this

knife node show -a roles $node | grep -v "^roles:"

Scripting The Role Dumping

Note though that knife only shows roles that were applied on the server already. But this shouldn't be a big problem for a synchronization solution. Next step is to create a usable hostgroup definition in Nagios. To avoid colliding with existing hostgroups let's prefix the generated hostgroup names with "chef-". The only challenge is the regrouping of the role lists given per node by chef into host name lists per role. In Bash 4 using an fancy hash this could be done like this:

declare -A roles

for node in $(knife node list); do
   for role in $(knife node show -a roles $i |grep -v "roles" ); do
      roles["$role"]=${roles["$role"]}"$i "
   done
done

Given this it is easy to dump Icinga hostgroup definitions. For example

for role in ${!roles[*]}; do
   echo "define hostgroup {
   hostgroup_name chef-$role
   members ${roles[$role]}
}
"
done

That makes ~15 lines of shell script and a cronjob entry to integrate Chef with Nagios. Of course you also need to ensure that each host name provided by chef has a Nagios host definition. If you know how it resolves you could just dump a host definition while looping over the host list. In any case there is no excuse not to export the chef config :-)

Easy Migrating

Migrating to such an export is easy by using the "chef-" namespace prefix for generated hostgroups. This allows you to smoothly migrate existing Nagions definitions at your own pace. Be sure to only reload Nagios and not restart via cron and to do it at reasonable time to avoid breaking things.

Syndicate content