Automation

Puppet Agent Noop Pitfalls

The puppet agent command has a --noop switch that allows you to perform a dry-run of your Puppet code.

puppet agent -t --noop

It doesn't change anything, it just tells you what it would change. The prediction is more or less exact, limited only by dependencies that come into existence through runtime changes. But it is pretty helpful and all Puppet users I know use it from time to time.

Unexpected Things

But there are some unexpected things about the noop mode:

  1. A --noop run does trigger the report server.
  2. The --noop run rewrites the YAML state files in /var/lib/puppet.
  3. And there is no state on the local machine that gives you the last "real" run result after you overwrite the state files with the --noop run.

Why might this be a problem?

Or the other way around: why does Puppet think this is not a problem? Probably because Puppet as an automation tool is supposed to overwrite, and the past state doesn't really matter. If you use PE or Puppet with PuppetDB or Foreman you have reporting for past runs anyway, so there is no need to keep a history on the Puppet client.

Why I still do not like it: it prevents safe and simple local Nagios checks. Using the state YAML you might want to build a simple script checking for run errors, because you might want a Nagios alert about all errors that appear, or about hosts that did not run Puppet for quite some time (for example when I disabled Puppet on a server for some action and forgot to re-enable it). Such a check reports false positives each time someone does a --noop run, until the next normal run. This hides errors.
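For illustration, such a check could be a small sketch like the following, assuming the Puppet 3 default state file /var/lib/puppet/state/last_run_summary.yaml and its usual keys (paths, keys and thresholds may need adapting):

#!/bin/bash
# Sketch of a local Nagios check against Puppet's last run summary.
# Assumes the default Puppet 3 path and YAML keys; adapt as needed.
SUMMARY=/var/lib/puppet/state/last_run_summary.yaml
MAX_AGE=7200    # seconds; alert when the last run is older than 2h

last_run=$(awk '/last_run:/ {print $2}' "$SUMMARY" 2>/dev/null)
failures=$(awk '/failure:/ {print $2; exit}' "$SUMMARY" 2>/dev/null)
now=$(date +%s)

if [ -z "$last_run" ]; then
    echo "UNKNOWN: cannot read $SUMMARY"; exit 3
elif [ $((now - last_run)) -gt $MAX_AGE ]; then
    echo "CRITICAL: last Puppet run $((now - last_run))s ago"; exit 2
elif [ "${failures:-0}" -gt 0 ]; then
    echo "CRITICAL: $failures failed events in last Puppet run"; exit 2
else
    echo "OK: Puppet ran $((now - last_run))s ago without failures"; exit 0
fi

A --noop run overwrites exactly this file, so right after it the failure counter and timestamp no longer reflect the last real run.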

Of course you can build all this with cool DevOps-style SQL/REST/... queries against PuppetDB/Foreman, but checking state locally seems a bit more like the old-style, robust and simple sysadmin way. Actively asking the Puppet master or report server for the client state seems wrong. The client should know too.

From a software usability perspective I do not expect a tool to change its state when I pass --noop. It's unexpected. Of course the documentation is carefully phrased:

Use 'noop' mode where the daemon runs in a no-op or dry-run mode. This is useful for seeing what changes Puppet will make without actually executing the changes.

Puppet Apply Only Specific Classes

If you want to apply Puppet changes in a selective manner you can run

puppet agent -t --tags Some::Class

on the client node to only run the single class named "Some::Class".

Why does this work? Because Puppet automatically creates tags for all classes you have. Make sure to upper-case all parts of the class name, because even if your actual class is "some::class" the Puppet tag will be "Some::Class".

Puppet Check ERBs for Dynamic Scoping

If you ever need to upgrade a code base to Puppet 3.0 and strip all dynamic scoping from your templates:

for file in $(find . -name "*.erb" | sort); do
    echo "------------ [ $file ]"
    # templates with multi-line ERB blocks need the sed range, the rest a simple grep
    if grep -q "<%[^>]*$" "$file"; then
        content=$(sed '/<%/,/%>/!d' "$file")
    else
        content=$(grep "<%" "$file")
    fi
    echo "$content" | egrep "(.each|if |%=)" | egrep -v "scope.lookupvar|@|scope\["
done

This is of course just a fuzzy match, but it should catch quite a few of the dynamic scope expressions out there. The limits of this solution are:

  • false positives on loop and locally declared variables that must not be scoped
  • false negatives when correct and missing scoping are mixed in the same line

So use with care.
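To illustrate what gets flagged, here is a tiny made-up ERB fragment pushed through the same filter pipeline (only the single-line branch of the script above):

cat > /tmp/scope-demo.erb <<'EOF'
<%= ntp_servers.join(" ") %>
<%= @ntp_servers.join(" ") %>
<% [1,2,3].each do |i| %>
server <%= i %>
<% end %>
EOF
# Line 1 is correctly reported (dynamically scoped lookup), line 2 is
# skipped (@ variable), while the literal loop and its variable "i" show
# up as false positives:
grep "<%" /tmp/scope-demo.erb | egrep "(.each|if |%=)" | egrep -v "scope.lookupvar|@|scope\["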

Puppet: List Changed Files

If you want to know which files were changed by Puppet in the last days:

cd /var/lib/puppet
for i in $(find clientbucket/ -name paths); do
	echo "$(stat -c %y $i | sed 's/\..*//')       $(cat $i)";
done | sort -n

will give you an output like

[...]
2015-02-10 12:36:25       /etc/resolv.conf
2015-02-17 10:52:09       /etc/bash.bashrc
2015-02-20 14:48:18       /etc/snmp/snmpd.conf
2015-02-20 14:50:53       /etc/snmp/snmpd.conf
[...]

How to dry-run with chef-client

The answer is simple: do not "dry-run", do "why-run"!

chef-client --why-run
chef-client -W

And the output looks nicer when using "-Fmin"

chef-client -Fmin -W

As with all other automation tools, the dry-run mode cannot fully predict what a real run will do. Still it might indicate some of the things that will happen.

Chef Gets Push in Q1/2014

Sysadvent features a Puppet Labs sponsored article (yes, honestly, check the bottom of the page!) about Chef enterprise getting push support. It is supposed to be included in the open source release in Q1/2014.

With this change you can use a push jobs cookbook to define jobs and an extended "knife" with new commands to start jobs and query their status:

knife job start ...
knife job list

and

knife node status ...

will tell about job execution status on the remote node.

At first glance it seems nice. Then again I feel worried when this is intended to get rid of SSH keys. Why exactly do we need to get rid of them? And in exchange for what?

Simple Chef to Nagios Hostgroup Export

When you are automating with Chef and use plain Nagios for monitoring, you will find yourself duplicating quite some configuration. One large part is the hostgroup definitions, which usually map to many of the Chef roles. So if the roles are defined in Chef anyway they should be synced to Nagios.

Using "knife" one can extract the roles of a node like this

knife node show -a roles $node | grep -v "^roles:"

Scripting The Role Dumping

Note though that knife only shows roles that were already applied on the server. But this shouldn't be a big problem for a synchronization solution. The next step is to create a usable hostgroup definition in Nagios. To avoid colliding with existing hostgroups let's prefix the generated hostgroup names with "chef-". The only challenge is regrouping the per-node role lists provided by Chef into per-role host name lists. In Bash 4 this can be done with an associative array like this:

declare -A roles

for node in $(knife node list); do
   for role in $(knife node show -a roles $node | grep -v "^roles:"); do
      roles["$role"]=${roles["$role"]}"$node "
   done
done

Given this it is easy to dump Icinga hostgroup definitions. For example

for role in ${!roles[*]}; do
   echo "define hostgroup {
   hostgroup_name chef-$role
   members ${roles[$role]}
}
"
done

That makes ~15 lines of shell script and a cronjob entry to integrate Chef with Nagios. Of course you also need to ensure that each host name provided by Chef has a Nagios host definition. If you know how the names resolve you could just dump a host definition while looping over the host list, as sketched below. In any case there is no excuse not to export the Chef config :-)
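For example, if the node names resolve in DNS, a sketch dumping minimal host definitions in the same loop could look like this (the "linux-server" template name is just an assumption):

for node in $(knife node list); do
   echo "define host {
   use        linux-server
   host_name  $node
   address    $node
}
"
done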

Easy Migrating

Migrating to such an export is easy by using the "chef-" namespace prefix for generated hostgroups. This allows you to smoothly migrate existing Nagios definitions at your own pace. Be sure to only reload Nagios (not restart it) from the cron job, and to do it at a reasonable time to avoid breaking things.
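A matching cron entry could look like this; the export script name, config path and init script are assumptions for a Debian-style setup:

# /etc/cron.d/chef-nagios-export (hypothetical export script and paths)
30 6 * * *  root  /usr/local/sbin/chef-nagios-export.sh > /etc/nagios3/conf.d/chef-hostgroups.cfg && /etc/init.d/nagios3 reload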

Chef: How To Debug Active Attributes

If you experience problems with attribute inheritance on a Chef client and stare at the chef-client output without knowing which attributes are effective, you can either look at the Chef GUI or check them on the console using "shef", or "chef-shell" in newer Chef releases.

So run

chef-shell -z

The "-z" is important to get chef-shell to load the currently active run list for the node that a "chef-client" run would use.

Then enter "attributes" to switch to attribute mode

chef > attributes
chef:attributes >

and query anything you like by specifying the attribute path as you do in recipes:

chef:attributes > default["authorized_keys"]
[...]
chef:attributes > node["packages"]
[...]

By just querying for "node" you get a full dump of all attributes.

Solving chef-client Errors

Problem:

merb : chef-server (api) : worker (port 4000) ~ Connection refused - connect(2) - (Errno::ECONNREFUSED)

Solution: Check why Solr is not running and start it:

/etc/init.d/chef-solr start

Problem:

merb : chef-server (api) : worker (port 4000) ~ Net::HTTPFatalError: 503 "Service Unavailable" - (Chef::Exceptions::SolrConnectionError)

Solution: You need to check the Solr logs for errors. You can find

  • the access log in /var/log/chef/2013_03_01.jetty.log (adapt the date)
  • the solr error log in /var/log/chef/solr.log

Hopefully you find an error trace there.
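A quick first pass over those logs might be something like this (adapt the jetty log date):

egrep -i "error|exception|severe" /var/log/chef/solr.log /var/log/chef/2013_03_01.jetty.log | tail -n 20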


Problem:

# chef-expander -n 1
/usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require': cannot load such file -- http11_client (LoadError)
        from /usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
        from /usr/lib/ruby/vendor_ruby/em-http.rb:8:in `'
        from /usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
        from /usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
        from /usr/lib/ruby/vendor_ruby/em-http-request.rb:1:in `'
        from /usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
        from /usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
        from /usr/lib/ruby/vendor_ruby/chef/expander/solrizer.rb:24:in `'
        from /usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
        from /usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
        from /usr/lib/ruby/vendor_ruby/chef/expander/vnode.rb:26:in `'
        from /usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
        from /usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
        from /usr/lib/ruby/vendor_ruby/chef/expander/vnode_supervisor.rb:28:in `'
        from /usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
        from /usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
        from /usr/lib/ruby/vendor_ruby/chef/expander/cluster_supervisor.rb:25:in `'
        from /usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
        from /usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
        from /usr/bin/chef-expander:27:in `'

Solution: This is a gem dependency issue with the HTTP client gem. Read about it here: http://tickets.opscode.com/browse/CHEF-3495. You might want to check the active Ruby version you have on your system, e.g. on Debian run

update-alternatives --config ruby

to find out and change it. Note that the em-http package from Opscode might require a specific Ruby version. You can check by listing the package files:

dpkg -L libem-http-request-ruby
/.
/usr
/usr/share
/usr/share/doc
/usr/share/doc/libem-http-request-ruby
/usr/share/doc/libem-http-request-ruby/changelog.Debian.gz
/usr/share/doc/libem-http-request-ruby/copyright
/usr/lib
/usr/lib/ruby
/usr/lib/ruby/vendor_ruby
/usr/lib/ruby/vendor_ruby/em-http.rb
/usr/lib/ruby/vendor_ruby/em-http-request.rb
/usr/lib/ruby/vendor_ruby/em-http
/usr/lib/ruby/vendor_ruby/em-http/http_options.rb
/usr/lib/ruby/vendor_ruby/em-http/http_header.rb
/usr/lib/ruby/vendor_ruby/em-http/client.rb
/usr/lib/ruby/vendor_ruby/em-http/http_encoding.rb
/usr/lib/ruby/vendor_ruby/em-http/multi.rb
/usr/lib/ruby/vendor_ruby/em-http/core_ext
/usr/lib/ruby/vendor_ruby/em-http/core_ext/bytesize.rb
/usr/lib/ruby/vendor_ruby/em-http/mock.rb
/usr/lib/ruby/vendor_ruby/em-http/decoders.rb
/usr/lib/ruby/vendor_ruby/em-http/version.rb
/usr/lib/ruby/vendor_ruby/em-http/request.rb
/usr/lib/ruby/vendor_ruby/1.8
/usr/lib/ruby/vendor_ruby/1.8/x86_64-linux
/usr/lib/ruby/vendor_ruby/1.8/x86_64-linux/em_buffer.so
/usr/lib/ruby/vendor_ruby/1.8/x86_64-linux/http11_client.so

The listing above for example indicates ruby1.8.

Solving 100% CPU usage of Chef beam.smp (RabbitMQ)

Search for the Chef 100% CPU issue and you will find a lot of suggestions, ranging from rebooting the server, to restarting RabbitMQ, and often to checking the kernel max file limit.

None of those help! What does help is checking RabbitMQ with

rabbitmqctl report | grep -A3 file_descriptors

and have a look at the printed limits and usage. Here is an example:

 {file_descriptors,[{total_limit,8900},
                    {total_used,1028},
                    {sockets_limit,8008},
                    {sockets_used,2}]},

In my case the 100% CPU usage was caused by all of the file handles being used up, which for some reason causes RabbitMQ 2.8.4 to go into an endless loop, rarely responding at all.

The "total_limit" value is the "nofile" limit for the maximum number of open files you can check using "ulimit -n" as RabbitMQ user. Increase it permanently by defining a RabbitMQ specific limit for example in /etc/security/limits.d/rabbitmq.conf:

rabbitmq    soft   nofile   10000

or using for example

ulimit -n 10000

from the start script or login scripts. Then restart RabbitMQ. The CPU usage should be gone.
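To catch the situation earlier next time, a small local check on the same rabbitmqctl output might help. A rough sketch, assuming a single RabbitMQ node and an arbitrary 80% threshold:

#!/bin/bash
# Rough sketch: warn when RabbitMQ has used more than 80% of its file descriptors.
# Assumes a single node, i.e. one file_descriptors block in the report.
read -r limit used < <(rabbitmqctl report \
    | grep -A1 file_descriptors \
    | grep -o '[0-9]\+' | head -2 | xargs)
if [ -z "$limit" ]; then
    echo "UNKNOWN: could not parse rabbitmqctl report"; exit 3
elif [ "$used" -gt $((limit * 80 / 100)) ]; then
    echo "WARNING: $used of $limit file descriptors used"; exit 1
else
    echo "OK: $used of $limit file descriptors used"; exit 0
fi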

Update: This problem only affects RabbitMQ releases up to 2.8.4 and should be fixed starting with 2.8.5.
