Nagios Check Plugin for nofile Limit

Following the recent post on how to investigate limit related issues which gave instructions what to check if you suspect a system limit to be hit I want to share this Nagios check to cover the open file descriptor limit. Note that existing Nagios plugins like this only check the global limit, only check one application or do not output all problems. So here is my solution which does:

  1. Check the global file descriptor limit
  2. Uses lsof to check all processes "nofile" hard limit
It has two simple parameters -w and -c to specify a percentage threshold. An example call:
./check_nofile_limit.sh -w 70 -c 85
could result in the following output indicating two problematic processes:
WARNING memcached (PID 2398) 75% of 1024 used CRITICAL apache (PID 2392) 94% of 4096 used
Here is the check script doing this:
#!/bin/bash


# Check "nofile" limit for all running processes using lsof

MIN_COUNT=0 # default "nofile" limit is usually 1024, so no checking for # processes with much less open fds needed

WARN_THRESHOLD=80 # default warning: 80% of file limit used CRITICAL_THRESHOLD=90 # default critical: 90% of file limit used

while getopts "hw:c:" option; do case $option in w) WARN_THRESHOLD=$OPTARG;; c) CRITICAL_THRESHOLD=$OPTARG;; h) echo "Syntax: $0 [-w <warning percentage>] [-c <critical percentage>]"; exit 1;; esac done

results=$( # Check global limit global_max=$(cat /proc/sys/fs/file-nr 2>&1 |cut -f 3) global_cur=$(cat /proc/sys/fs/file-nr 2>&1 |cut -f 1) ratio=$(( $global_cur * 100 / $global_max))

if [ $ratio -ge $CRITICAL_THRESHOLD ]; then echo "CRITICAL global file usage $ratio% of $global_max used" elif [ $ratio -ge $WARN_THRESHOLD ]; then echo "WARNING global file usage $ratio% of $global_max used" fi

# We use the following lsof options: # # -n to avoid resolving network names # -b to avoid kernel locks # -w to avoid warnings caused by -b # +c15 to get somewhat longer process names # lsof -wbn +c15 2>/dev/null | awk '{print $1,$2}' | sort | uniq -c |\ while read count name pid remainder; do # Never check anything above a sane minimum if [ $count -gt $MIN_COUNT ]; then # Extract the hard limit from /proc limit=$(cat /proc/$pid/limits 2>/dev/null| grep 'open files' | awk '{print $5}')

# Check if we got something, if not the process must have terminated if [ "$limit" != "" ]; then ratio=$(( $count * 100 / $limit )) if [ $ratio -ge $CRITICAL_THRESHOLD ]; then echo "CRITICAL $name (PID $pid) $ratio% of $limit used" elif [ $ratio -ge $WARN_THRESHOLD ]; then echo "WARNING $name (PID $pid) $ratio% of $limit used" fi fi fi done )

if echo $results | grep CRITICAL; then exit 2 fi if echo $results | grep WARNING; then exit 1 fi

echo "All processes are fine."
Use the script with caution! At the moment it has no protection against a hanging lsof. So the script might mess up your system if it hangs for some reason. If you have ideas how to improve it please share them in the comments!