Cheat Sheets


Nagios Check Plugin for nofile Limit

  1. Checks the global file descriptor limit
  2. Uses lsof to check all processes against their "nofile" hard limit
It has two simple parameters, -w and -c, to specify the warning and critical percentage thresholds. An example call:
./check_nofile_limit.sh -w 70 -c 85
could result in the following output indicating two problematic processes:
WARNING memcached (PID 2398) 75% of 1024 used
CRITICAL apache (PID 2392) 94% of 4096 used
Here is the check script doing this:
#!/bin/bash


# MIT License
#
# Copyright (c) 2017  Lars Windolf 
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
# 
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
# 
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

# Check "nofile" limit for all running processes using lsof

MIN_COUNT=0	# default "nofile" limit is usually 1024, so processes
		# with far fewer open fds do not need checking

WARN_THRESHOLD=80	# default warning:  80% of file limit used
CRITICAL_THRESHOLD=90	# default critical: 90% of file limit used

while getopts "hw:c:" option; do
	case $option in
		w) WARN_THRESHOLD=$OPTARG;;
		c) CRITICAL_THRESHOLD=$OPTARG;;	
		h) echo "Syntax: $0 [-w <warning percentage>] [-c <critical percentage>]"; exit 1;;
	esac
done

results=$(
# Check global limit
global_max=$(cat /proc/sys/fs/file-nr 2>&1 |cut -f 3)
global_cur=$(cat /proc/sys/fs/file-nr 2>&1 |cut -f 1)
ratio=$(( $global_cur * 100 / $global_max))

if [ $ratio -ge $CRITICAL_THRESHOLD ]; then
	echo "CRITICAL global file usage $ratio% of $global_max used"
elif [ $ratio -ge $WARN_THRESHOLD ]; then
	echo "WARNING global file usage $ratio% of $global_max used"
fi

# We use the following lsof options:
#
# -n 	to avoid resolving network names
# -b	to avoid kernel locks
# -w	to avoid warnings caused by -b
# +c15	to get somewhat longer process names
#
lsof -wbn +c15 2>/dev/null | awk '{print $1,$2}' | sort | uniq -c |\
while read count name pid remainder; do
	# Skip processes below a sane minimum of open fds
	if [ $count -gt $MIN_COUNT ]; then
		# Extract the hard limit from /proc
		limit=$(cat /proc/$pid/limits 2>/dev/null| grep 'open files' | awk '{print $5}')

		# Check if we got something, if not the process must have terminated
		if [ "$limit" != "" ]; then
			ratio=$(( $count * 100 / $limit ))
			if [ $ratio -ge $CRITICAL_THRESHOLD ]; then
				echo "CRITICAL $name (PID $pid) $ratio% of $limit used"
			elif [ $ratio -ge $WARN_THRESHOLD ]; then
				echo "WARNING $name (PID $pid) $ratio% of $limit used"
			fi
		fi
	fi
done
)

if echo "$results" | grep CRITICAL; then
	exit 2
fi
if echo "$results" | grep WARNING; then
	exit 1
fi

echo "All processes are fine."
Use the script with caution! At the moment it has no protection against a hanging lsof, so if lsof blocks for some reason the check will hang too and might pile up on your system. If you have ideas how to improve it please share them in the comments!
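One way to close that gap (a sketch, not part of the original check) is the coreutils timeout wrapper, which kills lsof after a deadline so the check can report UNKNOWN instead of hanging:

```shell
# Hypothetical guard: give lsof 30 seconds, then give up.
# timeout exits with status 124 when the deadline is hit.
listing=$(timeout 30 lsof -wbn +c15 2>/dev/null)
if [ $? -eq 124 ]; then
	echo "UNKNOWN lsof timed out"
	exit 3	# Nagios exit code for UNKNOWN
fi
```

The rest of the script would then consume $listing instead of piping lsof directly.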

Prometheus and M3DB in Docker in 5min

Preconditions

Quick Setup

Download stuff
docker pull quay.io/m3/m3dbnode:latest
wget https://github.com/prometheus/prometheus/releases/download/v2.6.0/prometheus-2.6.0.linux-amd64.tar.gz
tar zxvf prometheus-2.6.0.linux-amd64.tar.gz
Start M3DB
mkdir m3db_data
docker run -p 7201:7201 -p 9003:9003 --name m3db -v $(pwd)/m3db_data:/var/lib/m3db quay.io/m3/m3dbnode:latest
Configure and Start Prometheus
cd prometheus-2.6.0

# Add m3coordinator as read/write backend
cat <<EOT >>prometheus.yml
remote_read:
  - url: "http://localhost:7201/api/v1/prom/remote/read"
    # To test reading even when local Prometheus has the data
    read_recent: true
remote_write:
  - url: "http://localhost:7201/api/v1/prom/remote/write"
EOT

./prometheus --config.file="prometheus.yml"

Test

Verify M3DB is running by accessing http://localhost:7201/api/v1/openapi. Then check that Prometheus is running and returning its self-monitoring metrics at http://localhost:9090/graph?g0.range_input=1d&g0.expr=go_memstats_alloc_bytes&g0.tab=0

curl and HTTP 1.1 keepalive test traffic

 curl -w "$(date +%FT%T)    dns %{time_namelookup}    connect %{time_connect}   firstbyte %{time_starttransfer}   total %{time_total}   HTTP %{http_code}\n" -o /dev/null -s "https://example.com"
which executed in a loop will give you a nice request trace like this:
2018-10-30T23:08:53    dns 0,012667    connect 0,112453   firstbyte 0,440164   total 0,440420   HTTP 200
2018-10-30T23:08:54    dns 0,060853    connect 0,161769   firstbyte 0,506141   total 0,506381   HTTP 200
2018-10-30T23:08:56    dns 0,028415    connect 0,128208   firstbyte 0,463084   total 0,463375   HTTP 200
2018-10-30T23:08:57    dns 0,012420    connect 0,113948   firstbyte 0,460305   total 0,460630   HTTP 200
2018-10-30T23:08:59    dns 0,028618    connect 0,128600   firstbyte 0,465260   total 0,465624   HTTP 200
[...]
Now the columns help you identify problem classes: 'dns' is obvious, 'connect' (time to connect) points to OS or network issues, 'firstbyte' hints at app server responsiveness, and the difference between 'firstbyte' and 'total' usually indicates the actual application response time. But what about HTTP/1.1 and persistent connections? There your test client has to open one connection and send subsequent requests on this connection. Even this is possible using the -K switch of curl, which allows you to pass a file of URLs to fetch. Together with --keepalive curl will execute all URLs on the same server connection. Here is an example:
curl -sw "$(date +%FT%T)    dns %{time_namelookup}    connect %{time_connect}   firstbyte %{time_starttransfer}   total %{time_total}   HTTP %{http_code}\n" --keepalive -K <(printf 'url="https://example.com/"\n%.0s' {1..10000}) 2>/dev/null | grep dns
The curl command looks similar to before, except that you no longer need a loop, as we use three pieces of bash magic to create it. First we pass a sequence of numbers {1..10000} to printf, which is the number of requests we want to perform, but we choose a 'useless' conversion %.0s so the numbers themselves are not printed and the string stays static (just the URL we want). This way we get the URL printed 10000 times as input for curl. Using the bash construct <() we create an ad-hoc file handle which we pass to -K. Note how output redirection (-o) doesn't work when fetching multiple URLs with -K, as curl wants a separate output handle per URL which we cannot provide. We work around this by filtering for our intended -w output with "| grep dns". Running this gets us
2018-10-30T23:28:19    dns 0,012510    connect 0,114990   firstbyte 0,465874   total 0,466031   HTTP 200
2018-10-30T23:28:19    dns 0,000087    connect 0,000093   firstbyte 0,195553   total 0,195697   HTTP 200
2018-10-30T23:28:19    dns 0,000065    connect 0,000069   firstbyte 0,104634   total 0,104775   HTTP 200
2018-10-30T23:28:19    dns 0,000101    connect 0,000105   firstbyte 0,103206   total 0,103354   HTTP 200
2018-10-30T23:28:19    dns 0,000068    connect 0,000073   firstbyte 0,103591   total 0,104184   HTTP 200
[...]
Note how only the first 'dns' and 'connect' times are in the millisecond range, while all following 'dns' and 'connect' values are mere microseconds, indicating a reused connection and no new DNS query. Hope this helps! How do you do ad-hoc debugging of problems with persistent connections, especially finding sporadic errors? Would be great if you could leave a comment with suggestions!
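The bash pieces from above can be tried in isolation (using a count of 3 instead of 10000):

```shell
# printf reuses its format string for every argument; the %.0s
# conversion consumes each number without printing anything, so
# only the static url=... line is repeated once per argument.
printf 'url="https://example.com/"\n%.0s' 1 2 3

# <(...) is bash process substitution: it exposes the printf
# output as an ad-hoc file, which is what curl's -K expects.
wc -l < <(printf 'url="https://example.com/"\n%.0s' 1 2 3)
```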

Filter AWS EC2 JSON with jq

aws ec2 describe-instances --filters "Name=instance-type,Values=m1.small,m1.medium" "Name=availability-zone,Values=us-west-2c"
you might need to do dozens of queries to find different information. And in the end you still have to do the JSON parsing on all of the results, despite just wanting some IP, some tags or instance states... So why not issue
aws ec2 describe-instances >output.json
and use the mad jq syntax. Remember jq? The awesome command line tool that forgot all about XPath or jquery-like DOM lookup syntax that at least some people find intuitive, invented an even sicker filter language, and has a manpage from which simply no one can extract any results? Well, it is still a useful CLI tool available in most Linux distros. So why not use it? For example: extract all external EC2 names for a given tag $mytag from our cached JSON:
jq -r --arg tag "$mytag" '.Reservations[].Instances[] | select(.Tags | length > 0) | select(.Tags[].Value == $tag) | .PublicDnsName' output.json
If you analyze the query step by step it is actually not that bad. Easy, right? No JSON parsing at all! If you dare to script using AWS CLI and jq, check out the online test tool jqplay.
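If jq still defeats you, the same filter is only a few lines of Python (a sketch over a hypothetical, minimal output.json):

```python
import json

# Tiny stand-in for "aws ec2 describe-instances" output
doc = json.loads("""
{"Reservations": [{"Instances": [
  {"PublicDnsName": "ec2-1-2-3-4.us-west-2.compute.amazonaws.com",
   "Tags": [{"Key": "Name", "Value": "web"}]},
  {"PublicDnsName": "ec2-5-6-7-8.us-west-2.compute.amazonaws.com",
   "Tags": [{"Key": "Name", "Value": "db"}]}
]}]}
""")

mytag = "web"
names = [i["PublicDnsName"]
         for r in doc["Reservations"]
         for i in r["Instances"]
         if any(t["Value"] == mytag for t in i.get("Tags", []))]
print(names)   # ['ec2-1-2-3-4.us-west-2.compute.amazonaws.com']
```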

Network Split Test Scripts

Script Usage

./network_split.sh <filter1> <filter2> <hosts>
./network_join.sh <filter1> <filter2> <hosts>
The script expects SSH equivalency and sudo on the target hosts. The filters are grep patterns.

network_split.sh

#!/bin/bash

group1_filter=$1; shift
group2_filter=$1; shift
hosts=$*

hosts1=$(echo $hosts | xargs -n1 | grep "$group1_filter")
hosts2=$(echo $hosts | xargs -n1 | grep "$group2_filter")

if [ "$hosts1" == "" -o "$hosts2" == "" ]; then
	echo "ERROR: Syntax: $0 <filter1> <filter2> <hosts>"
	exit 1
fi

for h in $hosts1; do
	echo "Blacklisting other zone on $h"
	for i in $hosts2; do
		ssh $h sudo route add $i gw 127.0.0.1 lo
	done
done
for h in $hosts2; do
	echo "Blacklisting other zone on $h"
	for i in $hosts1; do
		ssh $h sudo route add $i gw 127.0.0.1 lo
	done
done

network_join.sh

#!/bin/bash

group1_filter=$1; shift
group2_filter=$1; shift
hosts=$*

hosts1=$(echo $hosts | xargs -n1 | grep "$group1_filter")
hosts2=$(echo $hosts | xargs -n1 | grep "$group2_filter")

if [ "$hosts1" == "" -o "$hosts2" == "" ]; then
	echo "ERROR: Syntax: $0 <filter1> <filter2> <hosts>"
	exit 1
fi

for h in $hosts1; do
	echo "De-blacklisting other zone on $h"
	for i in $hosts2; do
		ssh $h sudo route del $i gw 127.0.0.1 lo
	done
done
for h in $hosts2; do
	echo "De-blacklisting other zone on $h"
	for i in $hosts1; do
		ssh $h sudo route del $i gw 127.0.0.1 lo
	done
done
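The host-partitioning one-liner at the top of both scripts can be tried standalone (hypothetical hostnames):

```shell
hosts="db1.zone-a db2.zone-a web1.zone-b"

# xargs -n1 puts one host per line so grep can filter by zone
hosts1=$(echo $hosts | xargs -n1 | grep "zone-a")
hosts2=$(echo $hosts | xargs -n1 | grep "zone-b")

echo "$hosts1"   # db1.zone-a and db2.zone-a
echo "$hosts2"   # web1.zone-b
```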

Sequence definitions with kwalify

Defining Arbitrary Scalar Sequences

So how to define a list in kwalify? The user guide gives this example:
---
list:
  type: seq
  sequence:
     - type: str
This gives us a list of strings. But many lists also contain numbers and some contain structured data. For my use case I want to exclude structured data AND allow numbers. So "type: any" cannot be used: it would require defining the mapping for anything that may appear, which in a validation use case, where we just want to ensure that the list holds only scalars, we cannot know. The great thing is there is a type "text" which you can use to allow a list of strings, numbers or both, like this:
---
list:
  type: seq
  sequence:
     - type: text

Building a key name + type validation schema

As already mentioned the need for this is to have a whitelisting schema with simple type validation. Below you see an example for such a schema:
---
type: map
mapping:
  "default_definition": &allow_hash
     type: map
     mapping:
       =:
         type: any

  "default_list_definition": &allow_list
     type: seq
     sequence:
       # Type text means string or number
       - type: text

  "key1": *allow_hash
  "key2": *allow_list
  "key3":
     type: str

  =:
    type: number
    range: { max: 29384855, min: 29384855 }
At the top there are two dummy keys "default_definition" and "default_list_definition" which we use to define two YAML references "allow_hash" and "allow_list" for generic hashes and scalar-only lists. In the middle of the schema you see three whitelisted keys which, using the references, are typed as hash, list and string. Finally, for this to be a whitelist, we need to refuse all other keys. Note that '=' as a key name stands for the default definition. Now we want to say: the default is "not allowed". Sadly kwalify has no mechanism for this and does not allow expressing something like
---
  =:
    type: invalid
Therefore we resort to an absurd type definition (that we hopefully never use): a number that has to be exactly 29384855. All other keys not listed in the whitelist above will hopefully fail to be this number and cause kwalify to throw an error. This is how the kwalify YAML whitelist works.
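For illustration, here is a document that the schema above accepts (any key not in the whitelist would hit the '=' fallback and fail validation):

```yaml
key1:              # *allow_hash: any mapping is fine
  foo: bar
  answer: 42
key2:              # *allow_list: scalars only
  - one
  - 2
key3: "a string"   # plain str
```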


Puppet Agent Settings Issue

What first confused me was that "splay" is not on by default. Of course when using the open source version it makes sense to have it off; having it on by default sounds more like an enterprise feature :-) No matter the default, after deploying an agent config with settings like this
[agent]
runInterval = 3600
splay = true
splayLimit = 3600
... nothing happened. Runs were still not randomized. Checking the active configuration with
# puppet config print | grep splay
splay=false
splayLimit=1800
it turned out that my config settings were not working at all. What was utterly confusing is that even the runInterval was reported as 1800 (which is the default value). But while the splay just did not work, the effective runInterval was 3600! After hours of debugging I happened to read the puppet documentation section that covers the config sections like [agent] and [main]. It says that [main] configures global settings and other sections can override the settings in [main], which makes sense. But it just doesn't work this way. In the end the solution was using [main] as config section instead of [agent]:
[main]
runInterval=3600
splay=true
splayLimit=3600
and with this config "puppet config print" finally reported the settings as effective and the runtime behaviour had the expected randomization. Maybe I misread something somewhere, but this was really hard to debug. And INI files are not really a good fit on Unix: overriding works better with default files and drop-in directories.

Python re.sub Examples

Syntax

import re

result = re.sub(pattern, repl, string, count=0, flags=0)

Simple Examples

result = re.sub('abc',  '',    input)           # Delete pattern abc
result = re.sub('abc',  'def', input)           # Replace pattern abc -> def
result = re.sub(r'\s+', ' ',   input)           # Collapse duplicate whitespace
result = re.sub('abc(def)ghi', r'\1', input)    # Replace a string with a part of itself
Note: Take care to always prefix patterns containing \ escapes with raw strings (by adding an r in front of the string). Otherwise the \ is used as an escape sequence and the regex won't work.

Advanced Usage

Replacement Function

Instead of a replacement string you can provide a function performing dynamic replacements based on the match string like this:
def my_replace(m):
    if <some condition on match m>:
       return <replacement variant 1>
    return <replacement variant 2>

result = re.sub(r"\w+", my_replace, input)
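A concrete, runnable variant of the template above (the digit-masking rule is just an example):

```python
import re

def my_replace(m):
    # Mask anything that is purely digits, keep other words
    if m.group(0).isdigit():
        return "<NUM>"
    return m.group(0)

result = re.sub(r"\w+", my_replace, "order 66 shipped 3 items")
print(result)   # order <NUM> shipped <NUM> items
```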

Count Replacements

When you want to know how many replacements happened use re.subn() instead:
result = re.subn(pattern, replacement, input)
print ('Result: ', result[0])
print ('Replacements: ', result[1])


Helm Error: cannot connect to Tiller

$ helm status
Error: could not find tiller
It took me some minutes to find the root cause. First I thought the tiller installation was gone or broken, but it turned out to be fine. The actual cause was that the helm client didn't select the correct namespace and stayed in the current namespace (where tiller isn't located). This is due to the use of the environment variable $TILLER_NAMESPACE (as suggested in the setup docs), which I forgot to persist in my shell. So running
$ TILLER_NAMESPACE=tiller helm status
solved the issue.

Using Linux keyring secrets from your scripts

libsecret

Since 2008 the Secret Service API has been standardized via freedesktop.org and is implemented by GnomeKeyring and ksecretservice. Effectively there is a standard interface to access secrets on Linux desktops. Sadly the CLI tools are rarely installed by default, so you have to add them manually. On Debian:
apt install libsecret-tools

Using secret-tool

There are two important modes:

Fetching passwords

The "lookup" command prints the password to STDOUT
/usr/bin/secret-tool lookup <key> <name>

Storing passwords

Note that with "store" you do not pass the password, as a dialog is raised to add it.
/usr/bin/secret-tool store <key> <name>

Scripting with secret-tool

Here is a simple example Bash script to automatically ask for, store and use a secret:
#!/bin/bash


ST=/usr/bin/secret-tool
LOGIN="my-login"		# Unique id for your login
LABEL="My special login"	# Human readable label

get_password() {
   $ST lookup "$LOGIN" "$USER"
}

password=$( get_password )
if [ "$password" = "" ]; then
    $ST store --label "$LABEL" "$LOGIN" "$USER"
    password=$( get_password )
fi

if [ "$password" = "" ]; then
    echo "ERROR: Failed to fetch password!"
else
    echo "Credentials: user=$USER password=$password"
fi

Note that the secret will appear in the "Login" keyring. On GNOME you can check the secret with "seahorse".

How to install Helm on Openshift

What is Helm?

Before going into details: helm is a self-proclaimed "Kubernetes Package Manager". While this is not entirely false, in my opinion it is more than that: when looking closer it does more of the stuff that automation tools like Puppet, Chef and Ansible do.

Current Installation Issues

Since kubernetes v1.6.1, which introduced RBAC (role based access control), it became harder to properly install helm. Actually the simple installation as suggested on the homepage
# Download and...
helm init
seems to work, but as soon as you run commands like
helm list
you get permission errors. This is of course caused by the tighter access control now being in place. Sadly, even now at kubernetes 1.8, helm still wasn't updated to take care of the proper permissions.

Openshift to the rescue...

As Redhat somewhat pioneered RBAC in Openshift with their namespace-based "projects" concept, they are also the ones with a good solution for the helm RBAC troubles.

Setting up Helm on Openshift

Client installation (helm)

curl -s https://storage.googleapis.com/kubernetes-helm/helm-v2.6.1-linux-amd64.tar.gz | tar xz
sudo mv linux-amd64/helm /usr/local/bin
sudo chmod a+x /usr/local/bin/helm

helm init --client-only

Server installation (tiller)

helm is only the client; it needs a server-side agent named "tiller" on the kubernetes cluster. Therefore we create a project (namespace) for this agent and install it with "oc create":
export TILLER_NAMESPACE=tiller
oc new-project tiller
oc project tiller
oc process -f https://github.com/openshift/origin/raw/master/examples/helm/tiller-template.yaml -p TILLER_NAMESPACE="${TILLER_NAMESPACE}" | oc create -f -
oc rollout status deployment tiller

Preparing your projects (namespaces)

Finally you have to give tiller access to each of the namespaces you want someone to manage using helm:
export TILLER_NAMESPACE=tiller
oc project 
oc policy add-role-to-user edit "system:serviceaccount:${TILLER_NAMESPACE}:tiller"
After you did this you can deploy your first service, e.g.
helm install stable/redis --namespace 

See also

Apply changes to limits.conf immediately

Instant Applying Limits to Running Processes

Actually you might want to apply the changes directly to a running process in addition to changing /etc/security/limits.conf. In recent Linux distributions (e.g. Debian Jessie) there is a tool "prlimit" to get/set limits. The usage for changing limits for a PID is

prlimit --pid <pid> --<limit>=<soft>:<hard>
for example
prlimit --pid 12345 --nofile=1024:2048
If you are unlucky and do not have prlimit yet, check out "man 2 prlimit" for instructions on how to compile your own version: despite the missing user-space tool, the prlimit() system call has been in the kernel for quite a while (since 2.6.36).
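If you script such changes, note that Python exposes the same get/set mechanism for the current process and its children via the resource module (a sketch; for other PIDs you would still need prlimit):

```python
import resource

# Read the current soft/hard "nofile" limits of this process
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"nofile soft={soft} hard={hard}")

# Lower the soft limit; the hard limit is the ceiling and can
# only be raised by a privileged process
new_soft = min(1024, hard) if hard != resource.RLIM_INFINITY else 1024
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
```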

Alternative #1: Re-Login with "sudo -i"

If you do not have prlimit yet and want a changed limit configuration to become visible, you might want to try "sudo -i". The reason: you need to re-login, as limits from /etc/security/* are only applied on login! But wait: what about users without login? In such a case you log in as root (which might not share their limits) and sudo into the user, so there is no real login as the user. In this case you must ensure to use the "-i" option of sudo:
sudo -i -u <user>
to simulate an initial login with sudo. This will apply the new limits.

Alternative #2: Make it work for sudo without "-i"

Whether you need "-i" depends on the PAM configuration of your Linux distribution. If you need it, then PAM probably loads "pam_limits.so" only in /etc/pam.d/login, which means at login time but not on sudo. This was introduced in Ubuntu Precise for example. By adding this line

session    required   pam_limits.so
to /etc/pam.d/sudo, limits will also be applied when running sudo without "-i". Still, using "-i" might be easier.

Finally: Always Check Effective Limits

The best way is to change the limits and check them by running
prlimit               # for current shell
prlimit --pid <pid>   # for a running process
because it shows both soft and hard limits together. Alternatively call
ulimit -a                # for current shell
cat /proc/<pid>/limits   # for a running process
with the affected user.
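For scripting, the /proc view is the easiest one to parse (Linux only; a sketch):

```shell
# Extract soft ($4) and hard ($5) "nofile" limits of the current
# shell from its /proc limits table
awk '/Max open files/ {print "soft="$4, "hard="$5}' /proc/self/limits
```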

Nagios Check

You might also want to have a look at the nofile limit Nagios check.