Recent Posts

Workaround OpenSSH 7.0 Problems

OpenSSH 7+ deprecates the weak key exchange algorithm diffie-hellman-group1-sha1 and DSA public keys for both host and user keys, which leads to the following error messages:

Unable to negotiate with 172.16.0.10 port 22: no matching key exchange method found. Their offer: diffie-hellman-group1-sha1
or a plain "Permission denied" when authenticating with a DSA user key, or
Unable to negotiate with 127.0.0.1: no matching host key type found. Their offer: ssh-dss
when connecting to a host with a DSA host key.

Workaround

Allow the deprecated features in your ~/.ssh/config:
Host myserver
  # To make pub ssh-dss keys work again
  PubkeyAcceptedKeyTypes +ssh-dss

  # To make host ssh-dss keys work again
  HostkeyAlgorithms +ssh-dss

  # To allow weak remote key exchange algorithm
  KexAlgorithms +diffie-hellman-group1-sha1
Alternatively pass those options on the command line using -o. For example, to allow the weak key exchange when running SSH:
ssh -oKexAlgorithms=+diffie-hellman-group1-sha1 <host>

Solution

Replace all your DSA keys to avoid them stopping to work, and upgrade all SSH installations to avoid offering legacy key exchange algorithms.
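A minimal sketch for replacing a DSA user key (file names are the OpenSSH defaults, adjust as needed):

# Generate a new Ed25519 key pair
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519

# Install the new public key on the server...
ssh-copy-id -i ~/.ssh/id_ed25519.pub myserver

# ...and only then remove the old DSA key from the server and locally
ssh myserver 'sed -i "/ssh-dss/d" ~/.ssh/authorized_keys'
rm ~/.ssh/id_dsa ~/.ssh/id_dsa.pub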

Scan Linux for Vulnerable Packages

How do you know whether your Linux server (which has no desktop update notifier or unattended security updates running) needs to be updated? Of course an

apt-get update && apt-get --dry-run upgrade
might give an indication. But which of the package upgrades are security relevant and which are only simple bugfixes you do not care about?

Check using APT

One useful possibility is apticron, which will tell you which packages should be upgraded and why. It presents the package ChangeLog so you can decide whether you want to upgrade a package or not. Similar, but less detailed, is cron-apt which also informs you of new package updates.

Analyze Security Advisories

Now with all those CERT newsletters, security mailing lists and even security news feeds out there: why can't we check the other way around? Why not find out:
  1. Which security advisories affect my system?
  2. Which ones have I already complied with?
  3. And which vulnerabilities are still there?
My mad idea was to take those security news feeds (as a start I tried with the ones from Ubuntu and CentOS) and parse out the package versions and compare them to the installed packages. The result was a script producing the following output:

screenshot of lpvs-scan.pl

In the output you see lines starting with "CEBA-2012-xxxx", which is the CentOS security advisory naming scheme (while Ubuntu uses USN-xxxx-x). Yellow means the security advisory doesn't apply because the relevant packages are not installed. Green means the most recent package version is installed and the advisory shouldn't affect the system anymore. Finally red, of course, means the machine is vulnerable.
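The core of such a check is comparing the installed package version against the version from the advisory, which dpkg can do for you (package name and versions below are made up):

installed=$(dpkg-query -W -f '${Version}' openssl 2>/dev/null)
if [ -n "$installed" ] && dpkg --compare-versions "$installed" lt "1.0.1e-2+deb7u17"; then
    echo "openssl is vulnerable ($installed)"
fi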

Does it Work Reliably?

The script producing this output can be found here. I'm not yet satisfied with how it works, and I'm not sure if it can be maintained at all given the brittle nature of the arbitrarily formatted/rich news feeds provided by the distros. But I like how it gives a clear indication of current advisories and their effect on the system.

Maybe persuading the Linux distributions to use a common feed format with easy-to-parse metadata would be a good idea...

How do you check your systems? What do you think of a package scanner using XML security advisory feeds?

Do Not List Iptables NAT Rules Without Care

What I do not want to do ever again is running

iptables -L -t nat
on a core production server with many many connections.

And why?

Well, because running "iptables -L" auto-loads the table-specific iptables kernel module, which for the "nat" table is "iptable_nat", which in turn depends on "nf_conntrack".

While "iptables_nat" doesn't do anything when there are no configured iptables rules, "nf_conntrack" immediately starts to drop connections as it cannot handle the many many connections the server has.

Probably the only safe way to check for NAT rules is:
grep -q ^nf_conntrack /proc/modules && iptables -L -t nat

Gerrit Howto Remove Changes

Sometimes you want to delete a Gerrit change to make it invisible to everyone (for example when you committed an unencrypted secret...). AFAIK this is only possible via the SQL interface, which you can enter with

ssh -p 29418 <gerrit host> gerrit gsql
and issue a delete with:
update changes set status='d' where change_id='<change id>';
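Depending on the Gerrit version the change may still show up until the caches are refreshed; flushing them over the same SSH interface should help (assuming your account has the required capability):
ssh -p 29418 <gerrit host> gerrit flush-caches --all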
For more Gerrit hints check out the Gerrit Cheat Sheet

SSH ProxyCommand Examples

Use "ProxyCommand" in your ~/.ssh/config to easily access servers hidden behind port knocking and jump hosts.

Also check out the SSH - Cheat Sheet.

Use Gateway/Jumphost

Host unreachable_host
  ProxyCommand ssh gateway_host exec nc %h %p

Automatic Jump Host Proxying

Host <your jump host>
  ForwardAgent yes
  Hostname <your jump host>
  User <your user name on jump host>

# Note the server list can have wild cards, e.g. "webserver-* database*"
Host <server list>
  ForwardAgent yes
  User <your user name on all these hosts>
  ProxyCommand ssh -q <your jump host> nc -q0 %h 22

Automatic Port Knocking

Host myserver
   User myuser
   Hostname myserver.com
   ProxyCommand bash -c '/usr/bin/knock %h 1000 2000 3000 4000; sleep 1; exec /bin/nc %h %p'

Nagios Plugin for dmesg Monitoring

So far I found no easy solution to monitor for Linux kernel messages. So here is a simple Nagios plugin to scan dmesg output for interesting stuff:

#!/bin/bash


SEVERITIES="err,alert,emerg,crit"

WHITELIST="microcode: |\
Firmware Bug|\
i8042: No controller|\
Odd, counter constraints enabled but no core perfctrs detected|\
Failed to access perfctr msr|\
echo 0 > /proc/sys"

# Check for critical dmesg lines from this day
date=$(date "+%a %b %e")
output=$(dmesg -T -l "$SEVERITIES" | egrep -v "$WHITELIST" | grep "$date" | tail -5)

if [ "$output" == "" ]; then
    echo "All is fine."
    exit 0
fi

echo "$output" | xargs
exit 1
"Features" of the script above: This script helped a lot to early on detect I/O errors, recoverable as well as corruptions. It even worked when entire root partition wasn't readable anymore, because then the Nagios check failed with "NRPE: unable to read output" which indicated that dmesg didn't work anymore. By always showing all errors from the entire day one cannot miss recovered errors that happened in non-office hours.

Another good thing about the check is detecting OOM kills or fast spawning of processes.

Providing Links into Grafana Templates

As a Grafana user it is not obvious how to share links to template-based dashboards.

Grafana does not change the request URI to reflect template variables you might enter (e.g. the server name).

Solution

There is a hidden feature: you can pass all template values via URL parameters in the following syntax
var-<parameter name>=<value>
Example link:
http://mygrafana.local/#/dashboard/db/mydashboard?var-server=web01
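Multiple template variables can be combined like any other URL parameters (the var-env variable below is just an illustration):
http://mygrafana.local/#/dashboard/db/mydashboard?var-server=web01&var-env=production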

Hubot Setup Problems

When setting up Hubot you can run into

Error: EACCES, permission denied '/home/xyz/.config/configstore/insight-yo.yml'
when installing Hubot with yeoman (check out GitHub #1292).

The solution is simple:
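assuming the configstore files ended up owned by root from an earlier sudo run, give them back to your user (path taken from the error message above):
sudo chown -R "$USER" ~/.config/configstore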

Recent Node.js with Hubot Hipchat Adapter

Today I had a strange issue when setting up Hubot with Hipchat according to the installation instructions from hubot-hipchat.

The build with

yo hubot --adapter hipchat
fails because it downloads the most recent hubot-hipchat NPM package 2.12.0 and then tries to extract 2.7.5, which of course does not work.

The simple workaround is
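probably to install the hubot-hipchat version that was actually downloaded explicitly, so the generator's outdated pin no longer matters (a guess based on the version numbers above):
npm install hubot-hipchat@2.12.0 --save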

Apply Changes to limits.conf Immediately

Sometimes you need to increase the open file limit for some application server or the maximum shared memory for your ever-growing master database. In such a case you edit your /etc/security/limits.conf and then wonder how to make the changed limits visible, to check whether you have set them correctly. You do not want to find out that they were wrong when your master DB doesn't come up after some incident in the middle of the night...
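For example, raising the open file limit for an application user could look like this (user name and numbers are just placeholders):

# /etc/security/limits.conf
appuser   soft   nofile   8192
appuser   hard   nofile   16384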

The best way is to change the limits and check them by running

ulimit -a
as the affected user.

Workaround: Re-Login with "sudo -i"

But often you won't see the changes. The reason: you need to re-login as limits are only applied on login!

But wait: what about users without login? In such a case you log in as root (which might not share their limits) and sudo into the user: so no real login as the user. In this case you must ensure to use the "-i" option of sudo:
sudo -i -u <user>
to simulate an initial login with sudo. This will apply the new limits.

Make it work for sudo without "-i"

Whether you need "-i" depends on the PAM configuration of your Linux distribution. If you need it then PAM probably loads "pam_limits.so" only in /etc/pam.d/login, which means at login time but not on sudo. This was introduced in Ubuntu Precise for example. By adding this line

session    required   pam_limits.so
to /etc/pam.d/sudo, limits will also be applied when running sudo without "-i". Still, using "-i" might be easier.
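To see in which PAM services pam_limits is already enabled on your system:
grep -R pam_limits /etc/pam.d/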

Instant Applying Limits to Running Processes

Actually you might want to apply the changes directly to a running process, in addition to changing /etc/security/limits.conf. In bleeding edge Linux versions there is a tool "prlimit" to get/set limits. To find out if you already have it, run

prlimit
It should print a list of all active limits. So if you have prlimit installed (comes with util-linux-2.21) use it like this
prlimit --pid <pid> --<limit>=<soft>:<hard>
for example
prlimit --pid 12345 --nofile=1024:2048
If you are unlucky and do not have prlimit yet, check out this instruction to compile your own version, because despite the missing userspace tool the prlimit() system call has been in the kernel for quite a while (since 2.6.36).

See also ulimit - Cheat Sheet

Port Knocking And SSH ProxyCommand

When you use a port knocker like knockd you might want to do the knocking automatically from your ~/.ssh/config using "ProxyCommand".

Example Config

Host myserver
   User myuser
   Hostname myserver.com
   ProxyCommand bash -c '/usr/bin/knock %h 1000 2000 3000 4000; sleep 1; exec /bin/nc %h %p'
It is important not to forget the "exec" before invoking netcat!

See also SSH - Cheat Sheet

Visualizing Configuration Drift with Polscan

In the last two months I've spent quite some time revising and improving the visualization of a larger number of findings when managing many hosts. Aside from result tables that you can filter ad hoc and group by some attribute, host maps grouped by certain criteria are the most useful.

The Polscan Hosts Map

Here is a screenshot of the current implementation in Polscan.



Even without a legend the traffic-light colors give the facts away easily. You probably spot the six red hosts that have a real problem right away. Also you see a certain spread of more common yellow warnings in all groups, with some of them being more affected than others.

Just visually checking this representation already gives you some insight about ~350 hosts. The same wouldn't work on a result list, even if you group it by the same schema.

Good Grouping Possibilities

Having a host map of only ungrouped boxes would not be very helpful. After all we want to correlate findings, their scope and maybe guess at possible causes by group relationships.

If we group by, let's say, "Puppet Role" and find all six red boxes in the same group, then it is instantly clear that the Puppet code for this role broke.

Another time the "Subdomain Prefix" might pinpoint an issue. If only one subdomain is impacted the cause of a finding might be network related or it could be an organizational issue. Maybe the datacenter guys did some dirty stuff on all the hardware hosts.

The more groups I can apply to the hosts map for a certain finding type, the more structural reasons for a finding I can test. To help with this Polscan has so-called host group providers which query host group sources.

Polscan already comes with a set of host group providers, so in the best case you should not need to write your own. I plan to add Nagios/Icinga providers because IT Ops often keep well defined host groups there.

Thinking more on the organizational level: what about providers for teams and products? If you implement such providers (and I did) you can quickly identify responsibility. You can limit the findings you want to work on by filtering on your team and the product you are currently involved with. Additionally you can identify per-product/per-team backlog, quality and effort based on the frequency and amount of findings.

Problems With Group Sources

In Nagios you can easily have multiple host groups per host; the same goes for Chef where run lists consist of multiple cookbooks and recipes. On a hosts map IMO each host box should appear in only one group. But which one?

So for Polscan it is important to only get the "primary" group. Usually there are some conventions that help to identify it. After all these are the groups you probably attach your notification settings to. So it should be possible to write host group providers for them.

Conclusion

The "Hosts Map" view really helps in daily work. It creates a form of visibility I miss from existing Puppet related tools like Foreman that give me nothing but total numbers. Also it helps to recheck solved issues in just a few clicks without much thinking. Sending links to the host map helps a lot sharing structural findings with fellow sysadmins.

As the GUI is static HTML reading static JSON results only, I plan to put a demo with a larger result set online soon. So stay tuned...

Usage Scenarios for Polscan

The generic sysadmin policy scanner for Debian based distros "Polscan" (https://github.com/lwindolf/polscan) that I wrote about recently is coming along further. Right now I am focussing on how to make it really useful in daily work with a lot of systems, which usually means a lot of findings. And the question is: how does the presentation of the findings help you with working on all of them?

For me there are roughly four scenarios when working with any sort of auditing tool or policy scanner.

Possible Scenarios

1. Everything under control

Scenario: That's the easy one. Your system automation is top notch, there are no messy legacy systems, no hacks, no old construction sites, no migrations. Everything is polished and when a new issue appears you automate it away, and 10 minutes later the fix gets silently applied on all your systems.

Presentation of Findings: You are in control, so you have a top level view, a bird's eye perspective. You spot aberrations and tackle them. You can optically find the rogue policies/groups with a red number. And if there are none you work on reducing warnings, because you are bored. You spend most of your time in the summary view, waiting for the auditor, so you can present him with full compliance to everything he asks :-)


2. I'm swamped!

Scenario: You are afraid of adding more policies, as it would look even worse. You feel like you will never be able to get a clean system, and at the same time your professional pride tells you that you have to get it under control!

Presentation of Findings: If you cannot see any progress you will not even try to fix anything. So it's most important to make progress visible. What you care most about is the trend curve of all the findings. It gives you hope that one day all systems will be clean.

The problem here is that a ternary state OK/WARNING/FAILED does not cover that policies have different priorities. And that 2 findings out of all 500 might be absolutely critical, while 200 others are low impact issues. A trending curve does not show that you have fixed the 2 critical ones, but it nags you about not fixing all those 500 problems.


3. Let's improve something today

Scenario: It's like scenario #2, but with a positive psychological perspective. You do not care that there are a lot of issues, but you are highly motivated to solve some of them. You browse through the results intending to pick low hanging fruits and will eliminate them with your "Just do it" attitude.

Presentation of Findings: Skimming results quickly is important. Statistics are too, because you want to work on stuff that affects a lot of systems. And you would like to see metrics of your progress instantly.


What works already

I personally usually find myself in scenario #2, but I know colleagues often have the spontaneous motivation and perspective of scenario #3. And I believe that as the sole sysadmin in a small startup company with only a few systems you might find yourself in scenario #1 (happy you!).

With all three scenarios being realistic use cases I want them to work in polscan. Currently the main screen of polscan looks like this:



So how are the different scenarios supported already and where not?
  1. Scenario #1: "Everything under control"
    • Overview with drill down links is implemented
    • Well supported scenario
  2. Scenario #2: "I'm swamped!"
    • Overview has 30 days trending graph for critical findings
    • Policy/Group drill down result views also have the trending graph
    • Progress is easy to track
    • Overview has 'New' and 'Solved' tables giving delta statistics
    • 'New' and 'Solved' result drill-down is still missing
  3. Scenario #3: "Let's fix something"
    • The per-policy grouping in the overview allows tackling large blocks of findings.
    • No support yet to group hosts (e.g. with same security updates) to work on those
    • No instant feedback on achievements

What I'm working on

The next things to improve for the scenarios are the missing points from the list above. I guess I'll stop here, as too much concept work takes away implementation time!

Nonetheless if you've read through here I want to hear your opinion!
What is your use case? In which mode are you working and what do you need most?

Building a Generic Sysadmin Policy Scanner

After writing the same scripts several times I decided it is time for a generic solution to check Debian servers for configuration consistency. As incidents and mistakes happen, each organization collects a set of learnings (let's call them policies) that should be followed in the future. And one important truth is that the free automation and CM tools we use (Chef, Puppet, Ansible, cfengine, Saltstack...) allow us to implement policies, but do not seem to care much about proving that the automation is correct.

How to ensure following policies?

But how do you really ensure these policies are followed? The only way is by checking them and revisiting the check results frequently. One could build a script and send a daily/weekly mail report. This is always a custom solution and that's what I did several times already. So I do it one final time, but this time in a generic way.

Generic Policy Scanning

For me a generic configuration consistency / policy scanner has at least the following requirements:
  1. Optional generic pre-defined policies
  2. Optional custom user-defined policies
  3. Policies checked locally on the host
  4. Policies checked from CM systems
  5. Per host/hostgroup policy enabling
  6. Generic discovery of your hosts
  7. Dynamic per policy/group/host result filtering
  8. Customizable mail reports
  9. Result archival for audits
  10. Some simple trending
  11. Daily diffs, New findings, Resolved Issues
  12. Acknowledging Findings
I started implementing a simple solution (entirely bash and SSH based, realizing requirements 1, 2, 3, 4, 6, 7, 9 and 10) with https://github.com/lwindolf/polscan. It is quite easy to set up by configuring the type of host discovery, and you can run it instantly with the default set of policy scanners (which of course do not necessarily all make sense for all types of systems).

Implemented Scanners

By setting up the results and the static HTML (instructions in README.md) in some webserver document root you can browse through the results.

Screenshots

Result overview:

Filter details:

Debugging hiera-eyaml Encryption, Decryption failed

When Hiera works without any problems everything is fine. But when it doesn't, it is quite hard to debug why. Here is a troubleshooting list for Hiera when used with hiera-eyaml-gpg.

hiera-eyaml-gpg Decryption failed

First check your GPG key list
gpg --list-keys --homedir=<.gnupg dir>
Check that at least one of the keys listed is among the recipients you use for decrypting. The recipients you used are either listed in your Hiera/Eyaml config file or in a file referenced from there.
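A quick round-trip test also shows whether a key in that keyring can actually encrypt and decrypt for a given recipient (the recipient id is a placeholder):
echo test | gpg --homedir=<.gnupg dir> --encrypt -r <recipient> | gpg --homedir=<.gnupg dir> --decrypt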

To verify what your active config is, run eyaml in tracing mode. Note that the "-t" option is only available in newer Eyaml versions (e.g. 2.0.5):
eyaml decrypt -v -t -f somefile.yaml
Trace output
[hiera-eyaml-core]           (Symbol) trace_given        =        (TrueClass) true              
[hiera-eyaml-core]           (Symbol) gpg_always_trust   =       (FalseClass) false             
[hiera-eyaml-core]           (Symbol) trace              =        (TrueClass) true              
[hiera-eyaml-core]           (Symbol) encrypt_method     =           (String) pkcs7             
[hiera-eyaml-core]           (Symbol) gpg_gnupghome      =           (String) /etc/hiera/.gnupg      
[hiera-eyaml-core]           (Symbol) pkcs7_private_key  =           (String) ./keys/private_key.pkcs7.pem
[hiera-eyaml-core]           (Symbol) version            =       (FalseClass) false             
[hiera-eyaml-core]           (Symbol) gpg_gnupghome_given =        (TrueClass) true              
[hiera-eyaml-core]           (Symbol) help               =       (FalseClass) false             
[hiera-eyaml-core]           (Symbol) quiet              =       (FalseClass) false             
[hiera-eyaml-core]           (Symbol) gpg_recipients_file =           (String) ./gpg_recipients  
[hiera-eyaml-core]           (Symbol) string             =         (NilClass)                   
[hiera-eyaml-core]           (Symbol) file_given         =        (TrueClass) true   
Alternatively try manually enforcing recipients and .gnupg location to make it work.
eyaml decrypt -v -t -f somefile.yaml --gpg-recipients-file=<recipients> --gpg-gnupghome=<.gnupg dir>
If it works manually you might want to add the key ":gpg_recipients_file:" to hiera.yaml and ensure that the mandatory key ":gpg_gnupghome:" is correct.
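For reference, a minimal hiera.yaml fragment could look roughly like this (key names follow the trace output above, paths are only placeholders):

:backends:
  - eyaml
:eyaml:
  :datadir: /etc/puppet/hieradata
  :encrypt_method: gpg
  :gpg_gnupghome: /etc/hiera/.gnupg
  :gpg_recipients_file: /etc/hiera/gpg_recipients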

Checking Necessary Gems

hiera-eyaml-gpg can be run with different GPG libraries depending on the version you run. Check the dependencies on GitHub.

A possible stack is the following
gem list
[...]
gpgme (2.0.5)
hiera (1.3.2)
hiera-eyaml (2.0.1)
hiera-eyaml-gpg (0.4)
[...]
The gpgme gem additionally needs the GPGME C library:
dpkg -l "*gpg*"
||/ Name                Version             Description
+++-===================-===================-======================================================
ii  libgpgme11          1.2.0-1.2+deb6u1    GPGME - GnuPG Made Easy

Using Correct Ruby Version

Another pitfall is running multiple Ruby versions. Ensure you install the gems into the correct one. On Debian consider using "ruby-switch" or manually running "update-alternatives" for "gem" and "ruby".

Ruby Switch

apt-get install ruby-switch
ruby-switch --set ruby1.9.1

update-alternatives

# Print available versions
update-alternatives --list ruby
update-alternatives --list gem

# Show current config
update-alternatives --display ruby
update-alternatives --display gem

# If necessary change it
update-alternatives --set ruby /usr/bin/ruby1.9.1
update-alternatives --set gem /usr/bin/gem1.9.1
See also Puppet - Cheat Sheet

Debugging dovecot ACL Shared Mailboxes Not Showing in Thunderbird

When you can't get ACL shared mailboxes visible with Dovecot and Thunderbird, here are some debugging tips:

  1. Thunderbird fetches the ACLs on startup (and maybe at some other interval). So for testing restart Thunderbird on each change you make.
  2. Ensure the shared mailboxes index can be written. You probably have it configured like
    plugin {
      acl_shared_dict = file:/var/lib/dovecot/db/shared-mailboxes.db
    }
    Check if such a file was created and is populated with new entries when you add ACLs from the mail client. As long as entries do not appear here, nothing can work.
  3. Enable debugging in the dovecot log or use the "debug" flag and check the ACLs for the user who should see a shared mailbox like this:
    doveadm acl debug -u [email protected] shared/users/box
    • Watch out for missing directories
    • Watch out for permission issues
    • Watch out for strangely created paths; this could hint at a misconfigured namespace prefix (see the example below)
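For comparison, a shared namespace following the usual Dovecot documentation pattern looks roughly like this (prefix and location are examples, adapt them to your layout):

namespace {
  type = shared
  separator = /
  prefix = shared/%%u/
  location = maildir:%%h/Maildir:INDEXPVT=~/Maildir/shared/%%u
  subscriptions = no
  list = children
}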
See also Mail - Cheat Sheet

The damage of one second

Update: According to the AWS status page the incident was a problem related to BGP route leaking. AWS does not hint at a leap second related incident as originally suggested by this post!

Tonight we had another leap second, and not without some suffering at the same time. At the end of the post you can find two screenshots of outages suggested by downdetector.com. The screenshots were taken shortly after midnight UTC and you can easily spot the sites with problems by the distinct peak at the right side of the graph.

AWS Outage

What many of the affected sites have in common: they are hosted at AWS, which had some problems.

Quote:
[RESOLVED] Internet connectivity issues

Between 5:25 PM and 6:07 PM PDT we experienced an Internet connectivity issue with a provider outside of our network which affected traffic from some end-user networks. The issue has been resolved and the service is operating normally.

The root cause of this issue was an external Internet service provider incorrectly accepting a set of routes for some AWS addresses from a third-party who inadvertently advertised these routes. Providers should normally reject these routes by policy, but in this case the routes were accepted and propagated to other ISPs affecting some end-user’s ability to access AWS resources. Once we identified the provider and third-party network, we took action to route traffic around this incorrect routing configuration. We have worked with this external Internet service provider to ensure that this does not reoccur.

Incident Details

Graphs from downdetector.com

Note that those graphs indicate user reported issues: