Having received 500 emails from various alerting systems overnight for the nth day in a row (where n is any value larger than 400), I’ve decided to take a stand against so-called “monitoring systems” that just pump spam into my inbox each morning, only for it to be dismissed with a simple “right-click -> Mark all as read”.
I’ve decided to call this “Inbox Blindness” – it’s a bit like snow-blindness in that you can’t see where you’re meant to be going (or in this case what you’re meant to be fixing) because the amount of snow in front of (or data being presented to) you is so overwhelming.
Time and again I have encountered the mentality that “if we can ping it, then it’s online. If we can’t, there’s a problem and we need to know about it”. The solution is almost inevitably “let’s get it to send us X when it goes wrong so we can fix it quickly”, where “X” is an email/Jabber message/Skype call/text message/pager alert. Before you know it, the methodology has been applied to all of your services (even the ones that are not business critical) and your inbox is rapidly filling with alerts…
“I know!”, says some bright spark, “let’s put a priority in the header/subject/body so we can filter the messages and prioritise them accordingly…”
“Great idea!”, says someone else, and before you know it you’re getting bombarded with alerts ranging from level 0 to level 10. The levels are based upon someone’s classification of what a “critical” issue is, made without consulting any of the business units that rely on those systems, and you’re still none the wiser about what needs fixing.
“Email alerts aren’t working!”, shouts someone, “I get too many emails now and I can’t keep track of them…”. At last! Could this be the voice of reason? “Let’s start sending SMS/pager alerts for all the level 10s so we know it’s important”, comes the reply… and you realise that it wasn’t the voice of reason at all, just the frantic wailing of someone who had too much data and not enough information.
“OK, so what’s the difference between Data and Information?” I hear you ask…
Data is:
groups of information that represent the qualitative or quantitative attributes of a variable or set of variables
whereas information is:
knowledge acquired through study or experience or instruction.
Basically we can have as much data thrown at us as our brains can cope with, but unless we have the ability to turn that data into knowledge, it is completely useless.
Recently I’ve started looking at Cucumber-Nagios as a way of checking systems are online. It allows you to test situations and scenarios instead of ports and response times. Let’s see an example:
How do you currently check that your corporate website is online?
- Ping the server, if it is up, assume the site must be online?
- Connect to the webserver’s ip address on port 80, if it responds, assume the site must be online?
- Browse to the website and check that it is displaying as it is meant to?
The truth is that only the third option will tell you whether your website is actually online. The server could be up and your HTTP service could be running, but what if the vhost for your website has not been enabled, or the code for your website is broken but the server still returns a 200 status because it is displaying a page of some sort? That’s where Cucumber-Nagios comes in.
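Before getting to Cucumber-Nagios itself, here is a rough sketch in Python of the gap between the second and third options above. It is only an illustration under my own assumptions: the function names and the expected text are made up for the example, and I’ve left ping out entirely since it tells you even less than the port check does.

```python
import socket
import urllib.error
import urllib.request


def port_is_open(host, port=80, timeout=5.0):
    """Option 2: can we open a TCP connection to the web server?

    True only means something is listening on the port; it says
    nothing about whether the right site is being served.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def fetch(url, timeout=10.0):
    """Fetch a URL and return (status, body).

    HTTP error statuses (404, 500, ...) are returned rather than raised.
    Network-level failures (DNS, refused connection) still raise URLError.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        return err.code, err.read().decode("utf-8", errors="replace")


def page_looks_right(status, body, expected_text):
    """Option 3: a 200 alone is not enough.

    The page must also contain something a real visitor would expect
    to see. expected_text is a hypothetical marker you choose yourself.
    """
    return status == 200 and expected_text in body
```

A default vhost serving “It works!” passes the port check and returns a 200, but `page_looks_right(200, body, "Welcome to Example Corp")` would still come back `False`, which is exactly the case that ping and port checks miss.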
The cucumber-nagios front page has an excellent example feature that checks whether Google is letting you search, and it returns its output in the same format as NRPE or other Nagios checks:
    Feature: google.com
      It should be up
      And I should be able to search for things

      Scenario: Searching for things
        When I visit "http://www.google.com"
        And I fill in "q" with "wikipedia"
        And I press "Google Search"
        Then I should see "www.wikipedia.org"
What’s the first thing that you notice about the above? Yep, it’s written in “plain” English! This means that as long as you follow the Cucumber guidelines on writing “Given, When, Then” features, you can test the entire navigation of your site and even check database availability by performing actions on your website that require the database to be available.
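If you haven’t written a Nagios check before, “the same format as NRPE or other Nagios checks” boils down to a single line of status text plus an exit code: 0 for OK, 1 for WARNING, 2 for CRITICAL and 3 for UNKNOWN, with optional performance data after a `|`. The sketch below shows how a set of scenario results could be collapsed into that shape. It is my own illustration in Python, not how cucumber-nagios does it internally, and the function name and the wording of the status line are assumptions.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def nagios_result(check_name, passed, failed):
    """Collapse scenario counts into (exit_code, status_line).

    check_name, passed and failed are hypothetical inputs: a label for
    the check and the number of scenarios that passed and failed.
    """
    if passed + failed == 0:
        # Nothing ran, so we cannot claim the service is up or down.
        return UNKNOWN, f"{check_name} UNKNOWN - no scenarios were run"

    if failed:
        code, label = CRITICAL, "CRITICAL"
    else:
        code, label = OK, "OK"

    # Text after the pipe is performance data that Nagios can graph.
    line = (
        f"{check_name} {label} - {passed} passed, {failed} failed"
        f" | passed={passed} failed={failed}"
    )
    return code, line
```

A wrapper script would print the status line and exit with the code, for example `code, line = nagios_result("GOOGLE", 4, 0)` followed by `print(line)` and `sys.exit(code)`. The point is that Nagios gets one answer about the scenario as a whole, rather than a pile of raw port and ping results.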
Remember, you might be able to ping it, you might be able to connect to the port and you’ll definitely be able to send emails when you encounter what you perceive to be an issue, but if you’re only getting data, not information, then it’s completely useless to you…