Alerting for Brains

Here is the output of our production monitoring. Please read this information and answer the question below:


Please name all 3 servers that are currently in a critical state.

Tricky, huh? I actually have no idea if there are three servers or zero servers alerting. You didn’t expect me to actually read that information, did you? You think that because I was writing a blog post I might have bothered? No chance. Nothing is worth that pain.

The fact of the matter is that email – the default output for most alerting systems – was never designed to handle real-time state transitions across thousands of monitors and checks. In a microservices universe, scalability challenges like this become scalability failures. As you build out services you will boil your alerting frog. If you are running DevOps (and you should be) then that means you are in hot water.

A colleague did the math:

That’s more than one alert every 60 seconds. If you try to keep up, you will fail. You would add more value to your employer by reading a random novel – say, Snow Crash by Neal Stephenson. That is not what I was doing when we realised we were failing to keep track of PROD servers (I was trying to write some code!), but I did read it while staying in San Diego for a conference. In the bar, a helpful Silicon Valley veteran wandered over to tell me that I was fifteen years behind the curve. Thanks, I think.

What does Neal Stephenson say about processing vast amounts of information? Mostly, he suggests that we can. In particular, he reminds us that we do. For example, a human face reveals thousands of clues about a speaker – in real time – that most of us have learned to follow, digest and act upon. The primary device in the book is a carefully crafted pattern, like white noise or ‘snow’ on a broken TV, that transmits a virus to your brain. He gives dramatic examples of getting information across by shaping it for processing by your brain, not by conscious rational effort but automatically by biological firmware.

Can we design monitoring that appeals to our biological firmware? Something so obvious we can just glance at it and know what we need to know? While the boss was away I set about a secret experiment. Incidentally, it says a lot about FT Technology that this secret experiment was accepted, adopted in multiple teams, nominated for an internal award, and shown off – with attribution – by senior engineers at Velocity, CodeMotion and at DevOps Summit. See below for a link to the Velocity slides.

I started out with a Nagios web page which shows, in real time, when checks fail. If three failures happen in a row you get an alert, so this page shows three times as much data as the email does. I captured all of it.
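In Nagios itself this behaviour comes from the retry configuration, but the three-failures-in-a-row rule can be sketched in a few lines of Java. The class and check names below are illustrative, not anything from the real setup:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of "alert only after three consecutive failures".
public class ConsecutiveFailureFilter {
    private final int threshold;
    private final Map<String, Integer> streaks = new HashMap<>();

    public ConsecutiveFailureFilter(int threshold) {
        this.threshold = threshold;
    }

    /** Record one check result; returns true when an alert should fire. */
    public boolean record(String check, boolean failed) {
        int streak = failed ? streaks.getOrDefault(check, 0) + 1 : 0;
        streaks.put(check, streak);
        return streak == threshold; // fire once, on the Nth consecutive failure
    }

    public static void main(String[] args) {
        ConsecutiveFailureFilter filter = new ConsecutiveFailureFilter(3);
        boolean[] results = {true, true, false, true, true, true, true};
        for (boolean failed : results) {
            if (filter.record("db-ping", failed)) {
                System.out.println("ALERT: db-ping failed 3 times in a row");
            }
        }
    }
}
```

The point of capturing the raw page rather than the alerts is visible here: the filter throws away the first two failures of every streak, and the chart gets to keep them.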

I screen-scraped that using TagSoup to get standard XML, and used XSLT to generate a version in JSON. I ran that process every 60 seconds to build up 24 hours of data, which I cached on the heap of a little DropWizard microservice. I used Java Graphics2D and ImageIO to render a chart as a PNG and made it accessible on an HTML page with an auto-refresh. Later I added clickable hotspots and JSON APIs, and I gave it a name: “Nagios Chart”… but what did I see on the chart?
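The rendering step could be sketched roughly like this: a grid with one row per check and one one-minute column per sample, drawn with Graphics2D and written out with ImageIO. The state codes, cell sizes, colours and file name are assumptions for illustration, not the real Nagios Chart internals:

```java
import javax.imageio.ImageIO;
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

// Rough sketch: render check states as a coloured grid, one row per
// check, one narrow column per one-minute sample.
public class NagiosChartRenderer {
    static final int CELL_W = 2, CELL_H = 10;

    // states[row][minute]: 0 = OK, 1 = WARNING, 2 = CRITICAL
    public static BufferedImage render(int[][] states) {
        int rows = states.length, cols = states[0].length;
        BufferedImage img = new BufferedImage(cols * CELL_W, rows * CELL_H,
                BufferedImage.TYPE_INT_RGB);
        Graphics2D g = img.createGraphics();
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                switch (states[r][c]) {
                    case 1:  g.setColor(Color.ORANGE); break;
                    case 2:  g.setColor(Color.RED);    break;
                    default: g.setColor(Color.WHITE);
                }
                g.fillRect(c * CELL_W, r * CELL_H, CELL_W, CELL_H);
            }
        }
        g.dispose();
        return img;
    }

    public static void main(String[] args) throws IOException {
        int[][] states = new int[3][60]; // 3 checks, one hour of samples
        states[1][30] = 2;              // a one-minute critical blip
        ImageIO.write(render(states), "png", new File("chart.png"));
    }
}
```

Serving the resulting PNG from a DropWizard resource and letting the HTML page auto-refresh is then a small amount of glue.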

Patterns! I had not been able to predict what the patterns would be, but they were there as predicted. Was it possible to learn to interpret them? It was.

Here are some basic ones.

First, the big blue bar. That’s a warning which someone has acknowledged in Nagios. Essentially the server is marked as in maintenance mode. But isn’t 24 hours a long time to be in maintenance? It happens that this service is in the process of being commissioned, but the wide-blue-bar pattern is a trigger for team leaders to start asking questions.

Similarly, there is a bit of scattered noise. Is there a question there? That depends a lot on your systems, but such noise is usually tolerable. Another virtue of relying on your biology is that scattered alerting noise is not visually appealing. Why? It just isn’t – your biology is filtering it out for you.

What about that strong vertical pattern of critical states? That kind of pattern usually reflects a dependency on a broken system. In this case the database cluster had a short wobble – another trigger for engineers to start asking questions.

This shot shows a collection of weak horizontal patterns. Two apparently unrelated systems are reporting critical state intermittently for short periods, and it has been going on for 24 hours at least. Time to start investigating what the cause is.

Delegates to Velocity apparently chuckled at this “firewall upgrade” that became a widespread outage. A classic tall pattern that also became broad, because the relevant team struggled to restore service. I guess this is the point of having a test environment!

Notice how the firewall outage built up over time, like a storm brewing? Here is that pattern again, this time associated with a database cluster which was having a bad day.

We dubbed this one the Croydon Skyscraper, after the tower at Saffron Square. The pink bars reflect trouble for Nagios Chart itself, and there is a mix of other alerts across the estate.

I’m very proud of this one: it was an issue with the VPN out of AWS. There was a clear link between a specific texture of alerts and a very specific problem, which is exactly what I wanted to get from the visualisation. When that texture came back, we knew exactly what to do.

The database again. The unique feature of our RDF database is the way it pre-calculates new facts using Prolog-like rules, making writes naturally expensive. We also do replication, which adds another overhead. When we import new data-items (which we do at serious scale) we see a rectangle of checks timing out for the apps that are involved with read operations.

Note that the smallest red bars on all these diagrams reflect a short one-minute alert period, but in many cases they add up to a prolonged pattern of disruption. One-off outages like these are not reported by default in our Nagios configuration because that would cause too many alerts.

This visualisation is the only way to see certain categories of disruption, and it makes other categories much easier to reason about. Non-disruptive, wasteful alerts blend into the background. Over and over again, shaping the available data in a way that respects our perceptual biology has driven up the availability of our systems – and we turned off all those emails.

Many thanks to Sarah Wells for images from her Velocity Conference talk slides, which featured this tool.

Many thanks also to Kenneth Ockland for the image of Sarah!