Over the past five years I’ve come to experience the delights of Puppet, CFEngine and Chef across a wide range of deployments, from a couple of web servers hosting two or three hundred sites to thousands of servers underpinning an OpenStack-based cloud solution.

I’d like to share a couple of thoughts on what I’ve learned, how to avoid making the same mistakes that I’ve made, and how to ensure that the next time you reach for your modules or cookbooks, you do it in a structured and sensible manner.

1) Starting from scratch is *always* an option

There are many of us who have inherited various test and build systems with massive variation between the documentation and reality.  We have also probably created a few along the way without even realising it!

These systems saddle us with a huge amount of technical debt. Often an outage has no single root cause; instead there are a myriad of small issues that were patched, then those patches were patched to solve something else, then a third layer of patches was applied on top. The error doesn’t become apparent until the tenth layer of patching, at which point it is too late for a quick fix.

If you find yourself in this situation then once the panic has died down and the incident is over, I thoroughly recommend taking a step back and making sure that your design for your configuration management (and you did design it, didn’t you?) meets the requirements that you are now dealing with.

Many people will throw their hands up in horror at the idea of throwing away the broken things and rewriting them from scratch. However, in my experience it is often quicker to rewrite something than to refactor it, in the same way that it should be quicker to rebuild a broken system than to repair it.

Rewriting instead of refactoring often also provides an opportunity to re-assess the original design, which almost always ends up with a better design, better fault detection and better code.

2) Watch your dependencies…

If you’re anything like me, you have a “baseline” set of modules or cookbooks.  These are inevitably wrapped into a Chef Role or a Puppet class, and this is the first thing that gets applied to your servers once they are brought online.

I recently had an issue where I wanted to rebuild my Chef Server.  This in itself wasn’t an issue, however I then needed to upload the various cookbooks and data bags that made up my “baseline” Role, and that’s where the fun started.

For various reasons, I try to keep my cookbooks in “function-specific” git repositories which are then included as submodules in a “master” repo.  For example, I have a git repo that holds all the cookbooks I need for monitoring/metrics such as Icinga and Graphite, another for all the “basics” such as NTP, SSH and ResolvConf, and another for “webservers” (Nginx, Apache, PHP, Rails etc.).  This is fine; however, when I went to upload the cookbooks to the Chef Server, I hit the following issue:

  • Upload all the “basic” cookbooks
  • Run the “baseline” role on a client
    • It fails because “baseline” requires NRPE, which is in the monitoring git repo
  • Upload the NRPE cookbook from the monitoring git repo
    • It fails because it requires the SSL cookbook which is in the “security” repo
  • Upload the SSL Cookbook from security on its own
  • Upload NRPE
  • Run chef on the client and watch it pass.

This occurred because my original design differed from how I ended up deploying servers.  I took a step back, looked at the options that I had, and redesigned my cookbook layouts and git repositories so that any “clients” (NRPE, NSCA, Munin-Node etc.) live in their own cookbooks as part of the “basic” repo, along with their dependencies (SSL support on a system is pretty basic, so why had I put it somewhere else to begin with?!).  Now I can upload the cookbooks and build a server to the “baseline” standard without worrying about dependencies.
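
One way to catch this kind of ordering problem before an upload is to write the dependencies down and sort them. What follows is a minimal Python sketch and purely illustrative: the inventory dictionary, the repo names and both helper functions are made up for this example, and it treats the “baseline” role as just another node in the graph. It assumes Python 3.9 or later for the standard-library graphlib module.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical inventory: item -> (git repo it lives in, items it depends on).
# The names mirror the example in the text; the structure is an assumption.
COOKBOOKS = {
    "baseline": ("basics", ["ntp", "ssh", "resolvconf", "nrpe"]),
    "ntp": ("basics", []),
    "ssh": ("basics", []),
    "resolvconf": ("basics", []),
    "nrpe": ("monitoring", ["ssl"]),
    "ssl": ("security", []),
}


def upload_order(cookbooks):
    """Return names ordered so every dependency comes before its dependents."""
    graph = {name: deps for name, (_repo, deps) in cookbooks.items()}
    # static_order() raises graphlib.CycleError if the dependencies loop.
    return list(TopologicalSorter(graph).static_order())


def cross_repo_dependencies(cookbooks):
    """List (item, dependency) pairs where the dependency is in another repo."""
    pairs = []
    for name, (repo, deps) in cookbooks.items():
        for dep in deps:
            if cookbooks[dep][0] != repo:
                pairs.append((name, dep))
    return pairs


if __name__ == "__main__":
    print("Upload in this order:", upload_order(COOKBOOKS))
    for name, dep in cross_repo_dependencies(COOKBOOKS):
        print(f"{name} depends on {dep}, which lives in a different repo")
```

Run against the layout described above, the second function flags baseline depending on nrpe and nrpe depending on ssl as crossing repo boundaries, which is the same pair of failures I hit by hand.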

If you’re using Chef, the role-spaghetti plugin can help you here…

3) Don’t re-invent the wheel

So many times I see people trying to solve problems that have already been fixed.

There are some really smart people out there who have probably already solved your problem for you.  Don’t subscribe to NIH (Not Invented Here); trust me, those people are probably smarter than we both are, and their solutions will work.

4) Get to as many conferences as you can

Simple one this – I’ve learnt more in the pub after a day at a conference whilst drinking with other delegates than I’ve ever learnt from books!

 

Do you have any steps to add to the above? Why not leave a comment below and let me know how you have reached CM paradise…