What you are about to read is a story of DevOps monitoring,
and the solution one brave team found.
As the FT moves into the brave new world of DevOps, a number of challenges have arisen. One of them, the topic of this article, is how best to keep informed about the health of our stack. Is everything working, is something working, is anything working? These questions are particularly relevant as we build a new API, where the environment and the applications change constantly! Keeping track of what is working and what isn’t has proven to be a royal pain!
Nagios is a great tool that practically everyone uses in one form or another. But it doesn’t have a common interface across multiple Nagios servers. Nagios kindly sends an email every time a check changes state. Emails are a great way to keep in touch with friends and to tell your boss what you think of him or her, but they are a terrible method for communicating that your critical application has just crashed.
I imagine most people do exactly what I do: create a Google filter to send all Nagios emails straight to the bin. *@ft.com > bin.
Our Nagios servers have been configured to check every important parameter, from basic disk and CPU checks to HTTP, application, database and jconsole via Jolokia. All we need is some way to communicate clearly when a check fails.
We love monitor screens. We have them hanging from the ceiling all over the office. Displaying information. It looks like good information. I have no idea what all that information is.
Our screens have a viewing angle of about 10 degrees. They are small and have low contrast. The screens display too much information and cycle through different pages every 10 seconds or so. They never seem to show the page I want. I’m still waiting for a page to scroll by that informs me if my app is working.
I’m not bashing monitoring screens. They can be and often are extremely useful. It’s just that ours are not.
An Ops Cop is a reaction to the failure of the above two communication methods. An Ops Cop is a member of the team tasked for a week with keeping an eye on Nagios emails, monitor screens, etc., and responding accordingly. I’m glad I have never been asked to be an Ops Cop; something about this seems wrong.
None of the above satisfied our needs. Something is missing. When something fails I want an alarm bell, a siren, or a flashing light that is so bright my eyes explode. A warning system that is in everyone’s face. No escape. There should be no excuse for anyone to not know when something in the stack has broken. “What do you mean you didn’t know the site was down? There is a mongoose running around the office!”
Introducing SAWS! “Silvano’s Awesome Warning System”.
Well, I did spend my evenings and weekends making this, so forgive me for naming it after myself.
I scaled back my plans for a herd of mongoose and a 50 gigawatt light bulb. I bought a BlinkyTape, a “super-cool LED strip with full-color RGB LEDs and an integrated microcontroller”, with 60 independently controllable LEDs.
Bought some perspex, some glue, sellotape, nuts and bolts, wired it to a Raspberry Pi. Wrote some bash and Python (OK, so, after my bad attempts a friend wrote most of the Python, thanks Mark). Now cue the music!
A good monitoring system should display the health status of the stack to as many people as possible in as simple a format as possible. The more people that know the health state of the stack, the better the chance of someone picking it up and resolving the issue quickly.
SAWS simply shows, by grouping LEDs, whether each Nagios server has an error. Green, orange, yellow, red and flashing red LEDs represent OK, Unknown, Warning, Critical and Critical for over 30 minutes respectively. Blue LEDs swoosh back and forth like a Cylon to indicate that the Python script is running and the data is up to date.
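The colour scheme above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code in the SAWS repository: the names ServerStatus, colour_for and render_frame are made up for this example, the RGB values are guesses, and the sketch assumes the 60 LEDs are split evenly between the Nagios servers. The numeric states follow the standard Nagios plugin return codes (0 OK, 1 Warning, 2 Critical, 3 Unknown). Talking to the BlinkyTape and the blue Cylon sweep are left out.

```python
from dataclasses import dataclass

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# RGB values are illustrative guesses, not taken from SAWS.
GREEN = (0, 255, 0)
ORANGE = (255, 128, 0)
YELLOW = (255, 255, 0)
RED = (255, 0, 0)
OFF = (0, 0, 0)

LED_COUNT = 60                 # the BlinkyTape has 60 LEDs
FLASH_AFTER_SECONDS = 30 * 60  # Critical for over 30 minutes starts flashing

STATE_COLOURS = {OK: GREEN, UNKNOWN: ORANGE, WARNING: YELLOW, CRITICAL: RED}


@dataclass
class ServerStatus:
    """Worst current state of one Nagios server (hypothetical structure)."""
    name: str
    state: int
    seconds_in_state: float = 0.0


def colour_for(status, blink_on):
    """Return the RGB colour for one server's group of LEDs."""
    long_critical = (status.state == CRITICAL
                     and status.seconds_in_state > FLASH_AFTER_SECONDS)
    if long_critical:
        # Flash by alternating between red and off on successive frames.
        return RED if blink_on else OFF
    # Treat any unrecognised state as Unknown.
    return STATE_COLOURS.get(status.state, ORANGE)


def render_frame(statuses, blink_on, led_count=LED_COUNT):
    """Build a list of led_count RGB tuples, one equal group per server."""
    if not statuses:
        return [OFF] * led_count
    group_size = led_count // len(statuses)
    frame = []
    for status in statuses:
        frame.extend([colour_for(status, blink_on)] * group_size)
    # Any LEDs left over after an uneven split stay dark.
    frame.extend([OFF] * (led_count - len(frame)))
    return frame
```

Calling render_frame with blink_on alternating between True and False on each refresh gives the flashing effect for a server that has been Critical for more than half an hour, while every other group holds a steady colour.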
Since installing the SAWS device above the desk of the content team, the reaction has been surprising. It’s not an exaggeration to say people love it. The question now is why?
Maybe because it displays just enough information and nothing more. When a Nagios server has an alert, the LEDs change to the corresponding colour. It doesn’t tell you which check is failing, just that something is failing. It’s bright and colourful like candy and is visible from every corner of the office.
Another reason why people have taken to it might be that, for the first time in human history, there is a monitoring system that is fun!
“Hey, I think the photos have been edited. I don’t believe it can be that bright!”
The source code can be found on GitHub: https://github.com/muce/SAWS