Cheat Sheets

Recent Posts

Prometheus and M3DB in Docker in 5min


Quick Setup

Download stuff
docker pull
tar zxvf prometheus-2.6.0.linux-amd64.tar.gz
Start M3DB
mkdir m3db_data
docker run -p 7201:7201 -p 9003:9003 --name m3db -v $(pwd)/m3db_data:/var/lib/m3db
Configure and Start Prometheus
cd prometheus-2.6.0

# Add m3coordinator as read/write backend cat <<EOT remote_read: - url: "http://localhost:7201/api/v1/prom/remote/read" # To test reading even when local Prometheus has the data read_recent: true remote_write: - url: "http://localhost:7201/api/v1/prom/remote/write" EOT >>prometheus.yml

./prometheus --config.file="prometheus.yml


Verify M3DB is running by accessing http://localhost:7201/api/v1/openapi.

Check that Prometheus is running and returning its self-monitoring metrics http://localhost:9090/graph?g0.range_input=1d&g0.expr=go_memstats_alloc_bytes&

curl and HTTP 1.1 keepalive test traffic

Curl is really helpful for debugging when after recognizing a problem you need to decide wether it is a network issue, a DNS problem, an app server problem or an application performance problem.

To try to isolate the problem type you can run curl like this

 curl -w "$(date +%FT%T)    dns %{time_namelookup}    connect %{time_connect}   firstbyte %{time_starttransfer}   total %{time_total}   HTTP %{http_code}\n" -o /dev/null -s ""
which executed in a loop will give you a nice request trace like this:
2018-10-30T23:08:53    dns 0,012667    connect 0,112453   firstbyte 0,440164   total 0,440420   HTTP 200
2018-10-30T23:08:54    dns 0,060853    connect 0,161769   firstbyte 0,506141   total 0,506381   HTTP 200
2018-10-30T23:08:56    dns 0,028415    connect 0,128208   firstbyte 0,463084   total 0,463375   HTTP 200
2018-10-30T23:08:57    dns 0,012420    connect 0,113948   firstbyte 0,460305   total 0,460630   HTTP 200
2018-10-30T23:08:59    dns 0,028618    connect 0,128600   firstbyte 0,465260   total 0,465624   HTTP 200
Now the columns help you identifying problem classes with 'dns' being obvious, 'connect' (meaning time to connect) helping you identifying OS or network issues, while 'firstbyte' giving a hint on app server responsiveness and the difference of 'firstbyte' and 'total' usually indicates actual application response time.

But what about HTTP/1.1 and persistent connections. So your test client has to open 1 connection and send subsequent requests on this connection?

Even this is possible using the -K switch of curl which allows you to pass a file of URLs to fetch. Together with --keepalive curl will execute all URLs on the same server connection. Here is an example:
curl -sw "$(date +%FT%T)    dns %{time_namelookup}    connect %{time_connect}   firstbyte %{time_starttransfer}   total %{time_total}   HTTP %{http_code}\n" --keepalive -K <(printf 'url=""\n%.0s' {1..10000}) 2>/dev/null | grep dns
The curl command looks similar to before expect that you now do not need a loop, as we use three pieces of bash magic to create the loop. First we pass a sequence of numbers {1..10000} to printf, which is the number of requests we want to perform, but we choose a 'useless' pattern in printf "%.0s" so the number is not printed and the string stays static (just the URL we want). This way we get the URL printed 10000 times as input for curl. Using the bash construct "<()" we create an ad-hoc file handle which we pass to -K. Note how output redirection (-o) doesn't work when fetching multiple URLs with -K as curl want a separat output handle per URL which we cannot provide. We workaround this by filtering for our intended -w output with "|grep dns".

Running this get us
2018-10-30T23:28:19    dns 0,012510    connect 0,114990   firstbyte 0,465874   total 0,466031   HTTP 200
2018-10-30T23:28:19    dns 0,000087    connect 0,000093   firstbyte 0,195553   total 0,195697   HTTP 200
2018-10-30T23:28:19    dns 0,000065    connect 0,000069   firstbyte 0,104634   total 0,104775   HTTP 200
2018-10-30T23:28:19    dns 0,000101    connect 0,000105   firstbyte 0,103206   total 0,103354   HTTP 200
2018-10-30T23:28:19    dns 0,000068    connect 0,000073   firstbyte 0,103591   total 0,104184   HTTP 200
Note how only the first 'dns' and 'connect' time was in millisecond range, while all following 'dns' and 'connect' values are mere ┬Áseconds indicating a reused connection and no new DNS query.

Hope this helps! How do you guys do ad-hoc debugging of problems with persistent connection? Especially finding sporadic errors. Would be great if you could leave a comment with suggestions!

Filter AWS EC2 JSON with jq

The AWS CLI is fine, but dumping stuff becomes a pain the more stuff is in your account and when you want to extract multiple things depending on each other.

While you can run filter queries like

aws ec2 describe-instances --filters "Name=instance-type,Values=m1.small,m1.medium" "Name=availability-zone,Values=us-west-2c"
you might need to do dozens of queries to find different infos. And in the end you still have to do the JSON parsing on all of the results, despite just wanting results like some IP or some tags or instance states...

So why not issue

aws ec2 describe-instances >output.json
and use the mad jq syntax. Remember jq? The awesome command line tool that forgot all about XPath or jquery like DOM lookup syntax, that at least some people find intuitive and can use, and invented an even more sick filter language and has a manpage from which simply no one can extract any results from? Well it is still a useful CLI tool available in most Linux distros. So why not use it?

For example: extract all external EC2 names for a given tag $mytag from our cached JSON:

cat output.json | jq -r ".Reservations[].Instances[] | select(.Tags | length > 0) | select( .Tags[].Value == \""$mytag"\") | {PublicDnsName} | flatten | .[0]"
If you analyze the query you might noticeEasy right? No JSON parsing at all!

If you dare to script using AWS CLI and jq check out the jq - Cheat Sheet and the online test tool jqplay.

Network Split Test Scripts

Today I want to share two simple scripts for simulating a network split and rejoin between two groups of hosts. The split is done by adding per-host network blackhole routes on each host for all hosts of the other group.

Please be careful with using this. Forgetting a blackhole route can result in long hours of debugging as this is something you probably rarely use nowadays.

Script Usage

./ <filter1> <filter2> <hosts>
./ <filter1> <filter2> <hosts>
The script expects SSH equivalency and sudo on the target hosts. The filters are grep patterns.


group1_filter=$1; shift group2_filter=$1; shift hosts=$*

hosts1=$(echo $hosts | xargs -n1 | grep "$group1_filter") hosts2=$(echo $hosts | xargs -n1 | grep "$group2_filter")

if [ "$hosts1" == "" -o "$hosts2" == "" ]; then echo "ERROR: Syntax: $0 " exit 1 fi

for h in $hosts1; do echo "backlisting other zone on $h" for i in $hosts2; do ssh $h sudo route add $i gw lo done done for h in $hosts2; do echo "Backlisting other zone on $h" for i in $hosts1; do ssh $h sudo route add $i gw lo done done


group1_filter=$1; shift group2_filter=$1; shift hosts=$*

hosts1=$(echo $hosts | xargs -n1 | grep "$group1_filter") hosts2=$(echo $hosts | xargs -n1 | grep "$group2_filter")

if [ "$hosts1" == "" -o "$hosts2" == "" ]; then echo "ERROR: Syntax: $0 " exit 1 fi

for h in $hosts1; do echo "De-blacklisting other zone on $h" for i in $hosts2; do ssh $h sudo route del $i gw lo done done for h in $hosts2; do echo "De-blacklisting other zone on $h" for i in $hosts1; do ssh $h sudo route del $i gw lo done done

Sequence definitions with kwalify

After guess-trying a lot on how to define a simple sequence in kwalify (which I do use as a JSON/YAML schema validator) I want to share this solution for a YAML schema.

So my use case is whitelisting certain keys and somehow ensuring their types. Using this I want to use kwalify to validate YAML files. Doing this for scalars are simple, but hashes and lists of scalar elements are not. Most problematic was the lists...

Defining Arbitrary Scalar Sequences

So how to define a list in kwalify? The user guide gives this example:
  type: seq
     - type: str
This gives us a list of strings. But many lists also contain numbers and some contain structured data. For my use case I want to exclude structured date AND allow numbers. So "type: any" cannot be used. Also "type: any" would'nt work because it would require defining the mapping for any, which in a validation use case where we just want to ensure the list as a type, we cannot know. The great thing is there is a type "text" which you can use to allow a list of strings or number or both like this:
  type: seq
     - type: text

Building a key name + type validation schema

As already mentioned the need for this is to have a whitelisting schema with simple type validation. Below you see an example for such a schema:
type: map
  "default_definition": &allow_hash
     type: map
         type: any

"default_list_definition": &allow_list type: seq sequence: # Type text means string or number - type: text

"key1": *allow_hash "key2": *allow_list "key3": type: str

=: type: number range: { max: 29384855, min: 29384855 }
At the top there are two dummy keys "default_definition" and "default_list_definition" which we use to define two YAML references "allow_hash" and "allow_list" for generic hashes and scalar only lists.

In the middle of the schema you see three keys which are whitelisted and using the references are typed as hash/list and also as a string.

Finally for this to be a whitelist we need to refuse all other keys. Note that '=' as a key name stands for a default definition. Now we want to say: default is "not allowed". Sadly kwalify has no mechanism for this that allows expressing something like
    type: invalid
Therefore we resort to an absurd type definition (that we hopefully never use) for example a number that has to be exactly 29384855. All other keys not listed in the whitelist above, hopefully will fail to be this number an cause kwalify to throw an error.

This is how the kwalify YAML whitelist works.

PyPI does brownouts for legacy TLS

Nice! Reading through the maintenance notices on my status page aggregator I learned that PyPI started intentionally blocking legacy TLS clients as a way of getting people to switch before TLS 1.0/1.1 support is gone for real.

Here is a quote from their status page:

In preparation for our CDN provider deprecating TLSv1.0 and TLSv1.1 protocols, we have begun rolling brownouts for these protocols for the first ten (10) minutes of each hour.

During that window, clients accessing with clients that do not support TLSv1.2 will receive an HTTP 403 with the error message "This is a brown out of TLSv1 support. TLSv1 support is going away soon, upgrade to a TLSv1.2+ capable client.".

I like this action as a good balance of hurting as much as needed to help end users to stop putting of updates.

Puppet Agent Settings Issue

Experienced a strange puppet agent 4.8 configuration issue this week. To properly distribute the agent runs over time to even out puppet master load I wanted to configure the splay settings properly. There are two settings:

What first confused me was the "splay" was not on per-default. Of course when using the open source version it makes sense to have it off. Having it on per-default sounds more like an enterprise feature :-)

No matter the default after deploying an agent config with settings like this
runInterval = 3600
splay = true
splayLimit = 3600
... nothing happened. Runs were still not randomized. Checking the active configuration with
# puppet config print | grep splay
turned out that my config settings were not working at all. What was utterly confusing is that even the runInterval was reported as 1800 (which is the default value). But while the splay just did not work the effective runInterval was 3600!

After hours of debugging it, I happened to read the puppet documentation section that covers the config sections like [agent] and [main]. It says that [main] configures global settings and other sections can override the settings in [main], which makes sense.

But it just doesn't work this way. In the end the solution was using [main] as config section instead of [agent]:
and with this config "puppet config print" finally reported the settings as effective and the runtime behaviour had the expected randomization.

Maybe I misread something somewhere, but this is really hard to debug. And INI file are not really helpful in Unix. Overriding works better default files and with drop dirs.

Python re.sub Examples

Example for re.sub() usage in Python


import re

result = re.sub(pattern, repl, string, count=0, flags=0);

Simple Examples

num = re.sub('abc',  '',    input)           # Delete pattern abc
num = re.sub('abc',  'def', input)           # Replace pattern abc -> def
num = re.sub(r'\s+', ' ',   input)           # Eliminate duplicate whitespaces
num = re.sub('abc(def)ghi', r'\1', input)    # Replace a string with a part of itself
Note: Take care to always prefix patterns containing \ escapes with raw strings (by adding an r in front of the string). Otherwise the \ is used as an escape sequence and the regex won't work.

Advance Usage

Replacement Function

Instead of a replacement string you can provide a function performing dynamic replacements based on the match string like this:
def my_replace(m):
    if :
       return <replacement variant 1>
    return <replacement variant 2>

result = re.sub(r"\w+", my_replace, input)

Count Replacements

When you want to know how many replacements did happen use re.subn() instead:
result = re.sub(pattern, replacement, input)
print ('Result: ', result[0])
print ('Replacements: ', result[1])

See also: Python Syntax Python re.match Python re.sub

Helm Error: cannot connect to Tiller

Today I ran "helm" and got the following error:

$ helm status
Error: could not find tiller
It took me some minutes to find the root cause. First thing I thought was, that the tiller installation was gone/broken, which turned out to be fine. The root cause was that the helm client didn't select the correct namespace and probably stayed in the current namespace (where tiller isn't located).

This is due to the use of an environment variable $TILLER_NAMESPACE (as suggested in the setup docs) which I forgot to persist in my shell.

So running
$ TILLER_NAMESPACE=tiller helm status
solved the issue.

Using Linux keyring secrets from your scripts

When you write script that need to perform remote authentication you don't want to include passwords plain text in the script itself. And if the credentials are personal credentials you cannot deliver them with the script anyway.


Since 2008 the Secret Service API is standardized via and is implemented by GnomeKeyring and ksecretservice. Effectivly there is standard interface to access secrets on Linux desktops.

Sadly the CLI tools are rarely installed by default so you have to add them manually. On Debian
apt install libsecret-tools

Using secret-tool

There are two important modes:

Fetching passwords

The "lookup" command prints the password to STDOUT
/usr/bin/secret-tool lookup <key> <name>

Storing passwords

Note that with "store" you do not pass the password, as a dialog is raised to add it.
/usr/bin/secret-tool store <key> <name>

Scripting with secret-tool

Here is a simple example Bash script to automatically ask, store and use a secret:

ST=/usr/bin/secret-tool LOGIN="my-login" # Unique id for your login LABEL="My special login" # Human readable label

get_password() { $ST lookup "$LOGIN" "$USER" }

password=$( get_password ) if [ "$password" = "" ]; then $ST store --label "$LABEL" "$LOGIN" "$USER" password=$( get_password ) fi

if [ "$password" = "" ]; then echo "ERROR: Failed to fetch password!" else echo "Credentials: user=$USER password=$password" fi

Note that the secret will appear in the "Login" keyring. On GNOME you can check the secret with "seahorse".

How to install Helm on Openshift

This is a short summary of things to consider when installing Helm on Openshift.

What is Helm?

Before going into details: helm is a self-proclaimed "Kubernetes Package Manager". While this is not entirly false in my opinion it is three thingsWhen looking closer it does more of the stuff that automation tools like Puppet, Chef and Ansible do.

Current Installation Issues

Since kubernetes v1.6.1, which introduced RBAC (role based access control) it became harder to properly install helm. Actually the simple installation as suggested on the homepage
# Download and...
helm init
seems to work, but as soon as you run commands like
helm list
you get permission errors. This of course being caused by the tighter access control now being in place. Sadly even now being at kubernetes 1.8 helm still wasn't updated to take care of the proper permissions.

Openshift to the rescue...

As Redhat somewhat pioneered RBAC in Openshift with their namespace based "projects" concept they are also the ones with a good solution for the helm RBAC troubles.

Setting up Helm on Openshift

Client installation (helm)

curl -s | tar xz
sudo mv linux-amd64/helm /usr/local/bin
sudo chmod a+x /usr/local/bin/helm

helm init --client-only

Server installation (tiller)

With helm being the client only, Helm needs an agent named "tiller" on the kubernetes cluster. Therefore we create a project (namespace) for this agent an install it with "oc create"
export TILLER_NAMESPACE=tiller
oc new-project tiller
oc project tiller
oc process -f -p TILLER_NAMESPACE="${TILLER_NAMESPACE}" | oc create -f -
oc rollout status deployment tiller

Preparing your projects (namespaces)

Finally you have to give tiller access to each of the namespaces you want someone to manage using helm:
export TILLER_NAMESPACE=tiller
oc project 
oc policy add-role-to-user edit "system:serviceaccount:${TILLER_NAMESPACE}:tiller"
After you did this you can deploy your first service, e.g.
helm install stable/redis --namespace 

See also Helm - Cheat Sheet kubernetes - Cheat Sheet Openshift - Cheat Sheet


See also ulimit - Cheat Sheet

Sometimes you need to increase the open file limit for an application server or the maximum shared memory for your ever-growing master database. In such a case you edit your /etc/security/limits.conf and then wonder how to get the changed limits to be visible to check wether you have set them correctly. You do not want to find out that they were wrong after your master DB doesn't come up after some incident in the middle of the night...

Instant Applying Limits to Running Processes

Actually you might want to apply the changes directly to a running process additionally to changing /etc/security/limits.conf. In recent edge Linux distributions (e.g. Debian Jessie) there is a tool "prlimit" to get/set limits.

Usage for changing limits for a PID is

prlimit --pid <pid> --<limit>=<soft>:<hard>
for example
prlimit --pid 12345 --nofile=1024:2048
If you are unlucky and do not have prlimit yet check out "man 2 prlimit" for instructions on how to compile your own version because despite missing user tool the prlimit() system call is in the kernel for quite a while (since 2.6.36).

Alternative #1: Re-Login with "sudo -i"

If you do not have prlimit yet and want a changed limit configuration to become visible you might want to try "sudo -i". The reason: you need to re-login as limits from /etc/security/* are only applied on login!

But wait: what about users without login? In such a case you login as root (which might not share their limits) and sudo into the user: so no real login as the user. In this case you must ensure to use the "-i" option of sudo:
sudo -i -u <user>
to simulate an initial login with sudo. This will apply the new limits.

Alternative #2: Make it work for sudo without "-i"

Wether you need "-i" depends on the PAM configuration of your Linux distribution. If you need it then PAM probably loads "" only in /etc/pam.d/login which means at login time but no on sudo. This was introduced in Ubuntu Precise for example. By adding this line

session    required
in /etc/pam.d/sudo limits will also be applied when running sudo without "-i". Still using "-i" might be easier.

Finally: Always Check Effective Limits

The best way is to change the limits and check them by running
prlimit               # for current shell
prlimit --pid <pid>   # for a running process
because it shows both soft and hard limits together. Alternatively call
ulimit -a                # for current shell
cat /proc/<pid>/limits   # for a running process
with the affected user.

Nagios Check

You might also want to have a look at the nofile limit Nagios check.

Nagios Check Plugin for nofile Limit

Following the recent post on how to investigate limit related issues which gave instructions what to check if you suspect a system limit to be hit I want to share this Nagios check to cover the open file descriptor limit. Note that existing Nagios plugins like this only check the global limit, only check one application or do not output all problems. So here is my solution which does:

  1. Check the global file descriptor limit
  2. Uses lsof to check all processes "nofile" hard limit
It has two simple parameters -w and -c to specify a percentage threshold. An example call:
./ -w 70 -c 85
could result in the following output indicating two problematic processes:
WARNING memcached (PID 2398) 75% of 1024 used CRITICAL apache (PID 2392) 94% of 4096 used
Here is the check script doing this:

# Check "nofile" limit for all running processes using lsof

MIN_COUNT=0 # default "nofile" limit is usually 1024, so no checking for # processes with much less open fds needed

WARN_THRESHOLD=80 # default warning: 80% of file limit used CRITICAL_THRESHOLD=90 # default critical: 90% of file limit used

while getopts "hw:c:" option; do case $option in w) WARN_THRESHOLD=$OPTARG;; c) CRITICAL_THRESHOLD=$OPTARG;; h) echo "Syntax: $0 [-w <warning percentage>] [-c <critical percentage>]"; exit 1;; esac done

results=$( # Check global limit global_max=$(cat /proc/sys/fs/file-nr 2>&1 |cut -f 3) global_cur=$(cat /proc/sys/fs/file-nr 2>&1 |cut -f 1) ratio=$(( $global_cur * 100 / $global_max))

if [ $ratio -ge $CRITICAL_THRESHOLD ]; then echo "CRITICAL global file usage $ratio% of $global_max used" elif [ $ratio -ge $WARN_THRESHOLD ]; then echo "WARNING global file usage $ratio% of $global_max used" fi

# We use the following lsof options: # # -n to avoid resolving network names # -b to avoid kernel locks # -w to avoid warnings caused by -b # +c15 to get somewhat longer process names # lsof -wbn +c15 2>/dev/null | awk '{print $1,$2}' | sort | uniq -c |\ while read count name pid remainder; do # Never check anything above a sane minimum if [ $count -gt $MIN_COUNT ]; then # Extract the hard limit from /proc limit=$(cat /proc/$pid/limits 2>/dev/null| grep 'open files' | awk '{print $5}')

# Check if we got something, if not the process must have terminated if [ "$limit" != "" ]; then ratio=$(( $count * 100 / $limit )) if [ $ratio -ge $CRITICAL_THRESHOLD ]; then echo "CRITICAL $name (PID $pid) $ratio% of $limit used" elif [ $ratio -ge $WARN_THRESHOLD ]; then echo "WARNING $name (PID $pid) $ratio% of $limit used" fi fi fi done )

if echo $results | grep CRITICAL; then exit 2 fi if echo $results | grep WARNING; then exit 1 fi

echo "All processes are fine."
Use the script with caution! At the moment it has no protection against a hanging lsof. So the script might mess up your system if it hangs for some reason. If you have ideas how to improve it please share them in the comments!

Solving rtl8812au installation on Ubuntu 17.10

After several fruitless attempts on getting my new dual band Wifi stick to work on my PC I went the hard way to compiling the driver.

Figuring out which driver to use

As drivers do support device ids the first thing is to determine the id
$ lsusb
Bus 002 Device 013: ID 0bda:a811 Realtek Semiconductor Corp. 
So the id being "0bda:a811" you can search online for a list of driver names. Google suggests rtl8812au as related searches...

Finding a source repo

At github you can find several source repos for the rtl8812au driver in of different age. It seems that Realtek is supplying the source on some driver CDs and different people independently put them online. The hard part is to find the most recent one, as only this has fixes for recent kernels.

One with patches for kernel 4.13.x is the one from Vital Koshalev which has already a PR against the most commonly referenced repo of diederikdehaas which doesn't work yet!.

Getting the right source

So fetch the source as following
git clone
git checkout driver-4.3.22-beta-mod

Compilation + Installation of rtl8812AU

The instructions are from the and are to be run as root:
mkdir /usr/src/${DRV_NAME}-${DRV_VERSION}
git archive driver-${DRV_VERSION} | tar -x -C /usr/src/${DRV_NAME}-${DRV_VERSION}
dkms add -m ${DRV_NAME} -v ${DRV_VERSION}
dkms build -m ${DRV_NAME} -v ${DRV_VERSION}
dkms install -m ${DRV_NAME} -v ${DRV_VERSION}
See also DKMS - Cheat Sheet

Loading rtl8812AU

If everything worked well you should now be able to issue
modprobe 8812au
Note the missing "rtl"!!!

If the module loads, but the wifi stick doesn't work immediately it might be that the rtlwifi driver is preventing the self-compiled module from working. So remove it with
rmmod rtlwifi
It will complain about dependencies. You need to rmmod those too. Afterwards the new driver should load properly. To make disabling rtlwifi persistent add it to the modprobe blacklist:
echo "blacklist rtlwifi" >>/etc/modprobe.d/blacklist.conf
Please leave feedback on the instructions if you have problems!

Ensure secure Javascript dependencies

When you write Javascript code or when you want to know if a 3rd party code bases dependencies are secure check out which is an online scanner for github repos package.json contents.

This tool is able to generate badges and gives you details on dependencies

Here is a screenshot of some vulnerable deps

and the badge as seen on the corresponding github page:

While I do not like the badge explosion on it still is an amazingly useful tool to know the issue with this library just looking at the github project.

Openshift S2I and Spring profiles

When porting Springboot applications to Openshift using S2I (source to image) directly from a git repo you cannot rely on a start script passing the proper<profile name> parameter like this

java -jar yourApplication.jar
The only proper ways for injecting application configuration are
  1. Add it to the artifact/repository (meeh)
  2. Mount it using a config map
  3. Pass it via Docker environment variables
And for Openshift variant #3 works fine as there is an environment variable SPRING_PROFILES_ACTIVE which you can add in your deployment configuration and set it to your favourite spring profile name.

USB seq nnnn is taking a long time

When your dmesg/journalctl/syslog says something like

Nov 14 21:43:12 Wolf systemd-udevd[274]: seq 3634 '/devices/pci0000:00/0000:00:14.0/usb2/2-1' is taking a long time
then know that the only proper manly response can be
systemctl restart udev
Don't allow for disrespectful USB messages!!!