Big Flashing DevOps Thing

What you are about to read is a story of DevOps monitoring,
and the solution one brave team found.


DevOps

As the FT moves into the brave new world of DevOps, a number of challenges have arisen. One of them, the topic of this article, is how best to stay informed about the health of our stack. Is everything working, is something working, is anything working? These questions are particularly relevant as we build a new API, because the environment and the applications change constantly! Keeping track of what is working and what isn’t has proven to be a royal pain!

Nagios

Nagios is a great tool that practically everyone uses in one form or another. But it doesn’t have a common interface across multiple Nagios servers. Nagios kindly sends an email every time a check changes state. Emails are a great way to keep in touch with friends and to tell your boss what you think of him or her, but they are a terrible method for communicating that your critical application has just crashed.

I imagine most people do exactly what I do: create a Google filter to send all Nagios emails straight to the bin. *@ft.com > bin.

Our Nagios servers have been configured to check every important parameter, from basic disk and CPU checks to HTTP, application, database and jconsole via Jolokia. All we need is some way to communicate clearly when a check fails.

Monitor Screen

We love monitor screens. We have them hanging from the ceiling all over the office. Displaying information. It looks like good information. I have no idea what all that information is.

Our screens have a viewing angle of about 10 degrees. They are small and have low contrast. They display too much information and cycle through different pages every 10 seconds or so. They never seem to show the page I want. I’m still waiting for a page to scroll by that tells me whether my app is working.

I’m not bashing monitoring screens. They can be and often are extremely useful. It’s just that ours are not.

Ops Cops

An Ops Cop is a reaction to the failure of the above two communication methods. An Ops Cop is a member of the team tasked for a week with keeping an eye on Nagios emails, monitor screens, etc., and responding accordingly. I’m glad I have never been asked to be an Ops Cop; something about this seems wrong.

SAWS

None of the above satisfied our needs. Something was missing. When something fails I want an alarm bell, a siren, or a flashing light that is so bright my eyes explode. A warning system that is in everyone’s face. No escape. There should be no excuse for anyone not to know when something in the stack has broken. “What do you mean you didn’t know the site was down? There is a mongoose running around the office!”

Introducing SAWS! “Silvano’s Awesome Warning System”.

Well, I did spend my evenings and weekends making this, so forgive me for naming it after myself.

I scaled back my plans for a herd of mongooses and a 50 gigawatt light bulb. Instead I bought a BlinkyTape, a “super-cool LED strip with full-color RGB LEDs and an integrated microcontroller”, with 60 independently controllable LEDs.

I bought some perspex, glue, sellotape, nuts and bolts, and wired it all to a Raspberry Pi. I wrote some bash and Python (OK, so, after my bad attempts a friend wrote most of the Python; thanks, Mark). Now cue the music!

Conclusion

A good monitoring system should display the health status of the stack to as many people as possible in as simple a format as possible. The more people who know the health of the stack, the better the chance of someone picking up a problem and resolving it quickly.

SAWS groups its LEDs so that each group shows whether one Nagios server has an error. Green, orange, yellow, red and flashing red LEDs represent OK, Unknown, Warning, Critical and Critical for over 30 minutes respectively. Blue LEDs swoosh back and forth like a Cylon to indicate that the Python script is running and the data is up to date.
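
The colour scheme is simple enough to sketch in a few lines of Python. This is only an illustration, not the code from the SAWS repository. The state numbers are the standard Nagios return codes, but the function names, the RGB values and the rule that a group shows the worst state among its checks are all assumptions made for the example.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Colours from the article; the exact RGB values are assumed.
COLOURS = {
    OK: (0, 255, 0),         # green
    UNKNOWN: (255, 100, 0),  # orange
    WARNING: (255, 255, 0),  # yellow
    CRITICAL: (255, 0, 0),   # red
}

# Assumed precedence when one server has several checks: the
# state furthest to the right wins.
SEVERITY = [OK, UNKNOWN, WARNING, CRITICAL]

# Critical for over 30 minutes means the red LEDs flash.
FLASH_AFTER_SECONDS = 30 * 60


def worst_state(states):
    """Return the most severe state in `states` (OK if there are none)."""
    return max(states, key=SEVERITY.index, default=OK)


def led_for_server(states, critical_since, now):
    """Return ((r, g, b), flashing) for one Nagios server's LED group.

    `critical_since` is the time in seconds at which the server first
    went Critical, or None if it is not Critical. `now` is the current
    time in the same units.
    """
    state = worst_state(states)
    flashing = (
        state == CRITICAL
        and critical_since is not None
        and now - critical_since > FLASH_AFTER_SECONDS
    )
    return COLOURS[state], flashing
```

With these assumptions, a server whose checks report OK, Warning and Unknown gets a steady yellow group, and a server that went Critical more than 30 minutes ago gets a flashing red one. Fetching the states from Nagios and driving the BlinkyTape are left out on purpose.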

 

Since installing the SAWS device above the desk of the content team the reaction has been surprising. It’s not an exaggeration to say people love it. The question now is why?

Maybe because it displays just enough information and nothing more. When a Nagios server has an alert the LEDs change to the corresponding colour. It doesn’t tell you which check is failing, just that something is failing. It’s bright and colourful like candy and is visible from every corner of the office.

Another reason why people have taken to it might be that, for the first time in human history, there is a monitoring system that is fun!

“Hey, I think the photos have been edited. I don’t believe it can be that bright!”
“No way.”

The source code can be found on GitHub: https://github.com/muce/SAWS

14 thoughts on “Big Flashing DevOps Thing”

  1. Fantastic post! A very pretty and practical solution to a tough problem. I suspect that another benefit of this will be added pressure on dev teams to fix flapping alerts.

  2. Awesome work!

    I was fooling around at last year’s company summit with an Arduino and a few LEDs, trying to sell roughly the same basic idea, but was lacking the glamour and elegance of this solution. My poor marketing attempts failed then, but thanks to you, now the team is inspired to build a similar installation here!

    This has the potential to become widely adopted, in my opinion, especially if you take steps to make it really fast to pick up, like open sourcing the base code on GitHub.

    Cheers,
    Stefan

  3. Hi Stefan

    I have been planning to put the code on GitHub. I hope to do this in the next week. I will post the link here once done.

    Silvano

  4. Awesome cool.
    Are you aware that Nagios can send you a text in addition to email? That is how we get notifications that servers or services are down: text to our cell phone and email.

  5. Love your work Silvano!

    I’m thinking of doing something similar in my office. I was looking at the source code. I’m no Python expert (or Nagios come to that) but I couldn’t see how you were picking up the feed from Nagios. Am I missing something?

    Thanx – Mark

  6. Hi Silvano,

    That’s a big help, thanks. Makes sense now :-).

    My BlinkyTape arrived this morning. Time to start playing!

    Thanx – Mark

  7. Hi Silvano,

    Started looking at integrating with our Nagios environment properly today and hit the first hurdle. We’re using Nagios XI, which doesn’t permit anonymous access. I can access the monitored server status through an API, but it returns XML which I’ll need to wade through to get the status code. I’m presuming that you’re running Nagios Core and so don’t have these challenges?

    Thanx – Mark
