Monitoring

Nagios Plugin for dmesg Monitoring

So far I have found no easy solution for monitoring Linux kernel messages. So here is a simple Nagios plugin that scans the dmesg output for interesting stuff:

#!/bin/bash

# Kernel log levels that should trigger an alert
SEVERITIES="err,alert,emerg,crit"

# Common but harmless messages to ignore
WHITELIST="microcode: |\
Firmware Bug|\
i8042: No controller|\
Odd, counter constraints enabled but no core perfctrs detected|\
Failed to access perfctr msr|\
echo 0 > /proc/sys"

# Check for critical dmesg lines from this day
date=$(date "+%a %b %e")
output=$(dmesg -T -l "$SEVERITIES" | egrep -v "$WHITELIST" | grep "$date" | tail -5)

if [ "$output" == "" ]; then
	echo "All is fine."
	exit 0
fi

echo "$output" | xargs
exit 1

"Features" of the script above:

  • It gives you the 5 most recent messages from today
  • It allows you to whitelist common but useless errors in $WHITELIST
  • It uses dmesg, so it still works when you already have disk I/O errors, and it is faster than parsing syslog

This script helped a lot to detect I/O errors early, recoverable ones as well as corruptions. It even worked when the entire root partition wasn't readable anymore: the Nagios check then failed with "NRPE: unable to read output", which indicated that dmesg didn't work anymore. And by always showing all errors from the entire day, one cannot miss recovered errors that happened outside office hours.

Another good thing about the check is that it also detects OOM kills and fast-spawning processes.
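
To hook the script into Nagios, a minimal NRPE wiring could look like this (the plugin path and the "check_dmesg" command name are assumptions, adjust them to your setup):

# /etc/nagios/nrpe.cfg on the monitored host
command[check_dmesg]=/usr/local/lib/nagios/plugins/check_dmesg

# service definition on the Nagios server
define service {
   use                  generic-service
   host_name            somehost
   service_description  dmesg
   check_command        check_nrpe!check_dmesg
}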

Simple Chef to Nagios Hostgroup Export

When you automate with Chef and use plain Nagios for monitoring, you will find yourself duplicating quite some configuration. One large part is the hostgroup definitions, which usually map to many of the Chef roles. So if the roles are defined in Chef anyway, they should be synced to Nagios.

Using "knife" one can extract the roles of a node like this

knife node show -a roles $node | grep -v "^roles:"

Scripting The Role Dumping

Note though that knife only shows roles that were already applied on the server. But this shouldn't be a big problem for a synchronization solution. The next step is to create a usable hostgroup definition in Nagios. To avoid colliding with existing hostgroups, let's prefix the generated hostgroup names with "chef-". The only challenge is regrouping the per-node role lists given by Chef into per-role host name lists. In Bash 4, using a fancy hash (associative array), this could be done like this:

declare -A roles

for node in $(knife node list); do
   # collect each node's name under all of its roles
   for role in $(knife node show -a roles $node | grep -v "^roles:"); do
      roles["$role"]=${roles["$role"]}"$node "
   done
done

Given this, it is easy to dump Icinga hostgroup definitions, for example:

for role in ${!roles[*]}; do
   echo "define hostgroup {
   hostgroup_name chef-$role
   members ${roles[$role]}
}
"
done

That makes ~15 lines of shell script and a cronjob entry to integrate Chef with Nagios. Of course you also need to ensure that each host name provided by Chef has a Nagios host definition. If you know how the host names resolve, you could just dump a host definition while looping over the host list, as sketched below. In any case there is no excuse not to export the Chef config :-)
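
Assuming the Chef node names resolve via DNS, such a dump could look like this (the "generic-host" template name is an assumption from typical Nagios setups):

for node in $(knife node list); do
   echo "define host {
   use generic-host
   host_name $node
   address $node
}
"
done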

Easy Migrating

Migrating to such an export is easy thanks to the "chef-" namespace prefix for generated hostgroups. It allows you to smoothly migrate existing Nagios definitions at your own pace. Be sure to only reload Nagios (not restart) via cron, and to do it at a reasonable time, to avoid breaking things.

HowTo: Munin and rrdcached on Ubuntu 12.04

Let's assume you already have Munin installed and working, and that you want to reduce disk I/O and improve responsiveness by adding rrdcached... Here are the complete steps to integrate rrdcached:

Basic Installation

First install the stock package

apt-get install rrdcached

and integrate it with Munin:

  1. Enable the rrdcached socket line in /etc/munin/munin.conf (see the example after this list)
  2. Disable the munin-html and munin-graph calls in /usr/bin/munin-cron
  3. Create /usr/bin/munin-graph with
    #!/bin/bash
    
    nice /usr/share/munin/munin-html "$@" || exit 1
    
    nice /usr/share/munin/munin-graph --cron "$@" || exit 1

    and make it executable

  4. Add a cron job (e.g. to /etc/cron.d/munin) to start munin-graph:
    10 * * * *      munin if [ -x /usr/bin/munin-graph ]; then /usr/bin/munin-graph; fi
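
For step 1, the munin.conf change is a single line; a minimal sketch assuming the default Debian/Ubuntu socket path (it must match the -l switch of your rrdcached):

# /etc/munin/munin.conf
rrdcached_socket /var/run/rrdcached.sock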

The Critical Stuff

To get Munin to use rrdcached on Ubuntu 12.04, make sure to follow these vital steps:

  1. Add "-s <webserver group>" to $OPTS in /etc/init.d/rrdcached (in front of the first -l switch)
  2. Change "-b /var/lib/rrdcached/db/" to "-b /var/lib/munin" (or wherever you keep your RRDs)

So a patched /etc/init.d/rrdcached on a default Debian/Ubuntu with Apache would have:

OPTS="-s www-data -l unix:/var/run/rrdcached.sock"
OPTS="$OPTS -j /var/lib/rrdcached/journal/ -F"
OPTS="$OPTS -b /var/lib/munin/ -B"

If you do not set the socket user with "-s", you will see "Permission denied" in /var/log/munin/munin-cgi-graph.log:

[RRD ERROR] Unable to graph /var/lib/munin/
cgi-tmp/munin-cgi-graph/[...].png : Unable to connect to rrdcached: 
Permission denied

If you do not change the rrdcached working directory, you will see "rrdc_flush" errors in /var/log/munin/munin-cgi-graph.log:

[RRD ERROR] Unable to graph /var/lib/munin/
cgi-tmp/munin-cgi-graph/[...].png : 
rrdc_flush (/var/lib/munin/[...].rrd) failed with status -1.

Some details on this can be found in the Munin wiki.

How to decrease the Munin log level

Munin tends to spam the log files it writes (munin-update.log, munin-graph.log, ...) with many lines at the INFO log level. It also doesn't respect syslog log levels, as it uses Log4perl.

Unlike the munin-node configuration (munin-node.conf), there is no "log_level" setting in munin.conf, at least not in the versions I'm using right now.

So let's fix the code. In /usr/share/munin/munin-<update|graph> find the following lines:

logger_open($config->{'logdir'});
logger_debug() if $config->{debug} or defined($ENV{CGI_DEBUG});

and change them to:

logger_open($config->{'logdir'});
logger_debug() if $config->{debug} or defined($ENV{CGI_DEBUG});
logger_level("warn"); 

As a parameter to logger_level() you can provide "debug", "info", "warn", "error" or "fatal" (see the "Munin::Master::Logger" manpage).

And finally: silent logs!

Memcache Monitoring GUIs

When using memcached or memcachedb everything is fine as long as it is running. But from an operations perspective memcached is a black box. There is no real logging; you can only use the -v/-vv/-vvv switches when not running in daemon mode to see what your instance does. And it becomes even more complex if you run multiple or distributed memcached instances on different hosts and ports.

So the question is: How to monitor your distributed memcache setup?

There are not many tools out there, but some useful ones exist. We'll have a look at the following tools. Note that some can monitor multiple memcached instances, while others can only monitor a single instance at a time.

Name               Multi-Instance   Complexity/Features
telnet             no               Simple CLI via telnet
memcache-top       yes              CLI
stats-proxy        yes              Simple Web GUI
memcache.php       yes              Simple Web GUI
PhpMemcacheAdmin   yes              Complex Web GUI
Memcache Manager   yes              Complex Web GUI

1. Use telnet!

memcached already brings its own statistics, which are accessible via telnet for manual interaction and which are the basis for all other monitoring tools. Read more about using the telnet interface.
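
For example, a quick manual session against a local instance could look like this (host, port and the values shown are illustrative):

$ telnet localhost 11211
Trying 127.0.0.1...
Connected to localhost.
stats
STAT pid 1234
STAT uptime 86400
STAT curr_connections 10
STAT get_hits 4711
STAT get_misses 42
[...]
END
quit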

2. Live Console Monitoring with memcache-top

You can use memcache-top for live monitoring of one or more memcached instances. It will give you the I/O throughput, the number of evictions, the current hit ratio, and if run with "--commands" it will also provide the number of GET/SET operations per interval.
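
A typical invocation could look like the following; the "--instances" option name is an assumption here, so check memcache-top --help for your version:

memcache-top --instances 10.50.11.5:11211,10.50.11.5:11212,10.50.11.5:11213 --commands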

memcache-top v0.6       (default port: 11211, color: on, refresh: 3 seconds)

INSTANCE                USAGE   HIT %   CONN    TIME    EVICT/s GETS/s  SETS/s  READ/s  WRITE/s 
10.50.11.5:11211        88.9%   69.7%   1661    0.9ms   0.3     47      9       13.9K   9.8K    
10.50.11.5:11212        88.8%   69.9%   2121    0.7ms   1.3     168     10      17.6K   68.9K   
10.50.11.5:11213        88.9%   69.4%   1527    0.7ms   1.7     48      16      14.4K   13.6K   
[...]

AVERAGE:                84.7%   72.9%   1704    1.0ms   1.3     69      11      13.5K   30.3K   

TOTAL:          19.9GB/ 23.4GB          20.0K   11.7ms  15.3    826     132     162.6K  363.6K  
(ctrl-c to quit.)

(Example output)

3. Live browser monitoring with statsproxy

Using the statsproxy tool you get a browser-based statistics tool for multiple memcached instances. The basic idea of statsproxy is to provide the unmodified memcached statistics via HTTP. It also provides a synthetic health check for service monitoring tools like Nagios.

To compile statsproxy on Debian:

# Ensure you have bison
sudo apt-get install bison

# Unpack the downloaded tarball
tar zxvf statsproxy-1.0.tgz
cd statsproxy-1.0
make

Now you can run the "statsproxy" binary, but it will inform you that it needs a configuration file. I suggest redirecting the output to a new file, e.g. "statsproxy.conf", removing the informational text at the top and bottom, and then modifying the configuration section as needed:

./statsproxy > statsproxy.conf 2>&1

Be sure to add as many "proxy-mapping" sections as you have memcached instances. In each "proxy-mapping" section, ensure that "backend" points to your memcached instance and that "frontend" points to a port on your webserver where you want to access the information for this backend.
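
Conceptually, each mapping ties one frontend HTTP endpoint to one memcached backend, along the lines of the following sketch (the syntax shown is illustrative only; keep the exact format of the template that statsproxy printed above):

# one proxy-mapping section per memcached instance (illustrative syntax)
proxy-mapping {
    frontend 10.50.11.5:8080     # HTTP port serving the stats
    backend  10.50.11.5:11211    # memcached instance to query
}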

Once finished run:

./statsproxy -F statsproxy.conf

Below you find a screenshot of what stats-proxy looks like:

4. Live browser monitoring with memcache.php

Using this PHP script you can quickly add memcached statistics to a webserver of your choice. Most useful is the global memory usage graph, which helps to identify problematic instances in a distributed environment.

Here is how it should look (screenshot from the project homepage):

When using this script, ensure access is protected and take care not to trigger the "flush_all" menu option by accident. Also, on large memcached instances, refrain from dumping the keys as it might cause some load on your server.
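
When serving it with Apache (as shipped with Ubuntu 12.04), one way to protect it is classic basic auth; the file name and htpasswd path below are assumptions:

<Files "memcache.php">
    AuthType Basic
    AuthName "Memcache Statistics"
    AuthUserFile /etc/apache2/htpasswd
    Require valid-user
</Files>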
