Over the past few years at the FT we have steadily been increasing our disaster recovery testing with planned failovers and deliberate acts of sabotage to our own estate. Taking a leaf from the Netflix Simian Army, we have gradually improved our services’ resilience with what we have learnt from these exercises. For example, powering down DNS or Active Directory servers has highlighted areas of weakness that we have been able to mitigate or remedy completely.
We have always been keen on Business Continuity and Disaster Recovery (which are slightly different but intrinsically linked concepts). In fact, our Business Continuity manager received an award for it a while back. However, no matter how many accolades you receive, this is not an area you can afford to be complacent in, because when the brown stuff hits the whirly thing you need to be ready.
Earlier this year, during what should have been a routine network switch upgrade, we were hit by an interesting bug that resulted in the loss of both parts of a redundant pair of network switches. The impact was the (temporary) loss of almost all of the connectivity to one of our data centres and large parts of our private cloud. There were knock-on effects on connectivity and access to our systems in other data centres and the public cloud.
We recovered reasonably quickly with little impact to our customers or the FT’s reputation. Some hasty rollbacks and reboots and the Financial Times could continue to cover business news rather than be business news.
We were sufficiently unnerved by what happened to want to better protect ourselves. Although we do regular individual system/service based disaster recovery tests (by failing over systems between data centres), we had never tested our ability to recover from an incident that impacts multiple applications/systems at the same time.
We had been talking about the Netflix chaos gorilla for a while before this incident. It gave us the impetus to actually get on with trying our first complete ‘black out’ test of a data centre.
As the Financial Times moves more systems into public clouds, this sort of test will evolve into deliberately knocking out availability zones. However, we will be using a hybrid environment for some time yet, so knowledge of how our services behave, how the loss of a data centre will affect our customers, and how we recover from it is still vital.
Interestingly, the Technology team’s perception of how hard it would be to get ‘permission’ from senior management to carry out such a test was at odds with how easy it actually was. We really expected to have to fight for it. There was almost a sense of the team preparing for the test saying to each other “They are letting us do this? Are they crazy?”. But we contacted the various business units that we predicted would be affected (Editorial, Customer Services, Finance, etc.) and they were kind enough to say ‘go for it’.
So a couple of weekends ago we sent a brave network engineer to a dark, dank datacentre on the outskirts of London to yank out a couple of cables. Effectively a manual chaos gorilla, stopping all ingress and egress of network traffic to the datacentre.
Then we watched as our monitoring aggregation dashboards went red (there is a problem) or worse, grey (I can’t even connect to the monitoring tool).
The FT’s services are arranged in service tiers (platinum is the most important and from there the ranking goes down through gold, silver and bronze for services that we are willing to tolerate a little downtime on for the sake of agility and lower architectural investment). This gave us a good indication of what the priority order for checking and fixing services should be. The main website, www.ft.com, kept serving content throughout the duration of the blackout test, globally load balancing seamlessly. There was, however, no time for laurel resting as many other services (particularly those that can’t run in an active/active state) needed manual intervention to recover.
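As a rough illustration of how tiers can drive the order of checking and fixing, here is a minimal Python sketch. The tier names and the red/grey dashboard states come from this post; the numeric ranks, the `Service` class and the service names are hypothetical and are not a description of our actual tooling.

```python
from dataclasses import dataclass

# Tier names are from the post; the numeric ranks are an assumption used for sorting.
TIER_RANK = {"platinum": 0, "gold": 1, "silver": 2, "bronze": 3}

# Dashboard states from the post. Grey (monitoring unreachable) is assumed
# to be worse than red, so it sorts first within a tier.
STATUS_RANK = {"grey": 0, "red": 1, "green": 2}


@dataclass(frozen=True)
class Service:
    name: str    # hypothetical service name
    tier: str    # one of the keys in TIER_RANK
    status: str  # one of the keys in STATUS_RANK


def triage_order(services):
    """Return unhealthy services, most important tier first, grey before red."""
    unhealthy = [s for s in services if s.status != "green"]
    return sorted(unhealthy, key=lambda s: (TIER_RANK[s.tier], STATUS_RANK[s.status]))


if __name__ == "__main__":
    # Hypothetical estate, for illustration only.
    estate = [
        Service("internal-wiki", "bronze", "red"),
        Service("main-website", "platinum", "green"),
        Service("email-alerts", "gold", "grey"),
        Service("account-admin", "gold", "red"),
    ]
    for service in triage_order(estate):
        print(f"{service.tier:<9}{service.status:<6}{service.name}")
```

Healthy services drop out of the list entirely, so the people fixing things only see what still needs attention, in the order the tiers say it matters.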
Customers of some of our ‘smaller’ sites may well have noticed maintenance pages or a hiatus in service. We also removed the paywall from www.ft.com for a short while in order to ensure our customers could get at our content while we recovered systems.
I cannot pretend that our readers were all blissfully unaware of the deliberately engineered crisis affecting many of the FT’s products. The number of calls made to our customer services desk had an unfortunate uplift of around 22% in comparison with the previous two Saturdays. That said, that is only about 30 extra calls while we ‘took out’ a large part of our technology estate. The majority of those were related either to delayed emails or to people wishing to tweak their account details. The hope is that by paying that price now we can better serve our readers when we are really in trouble.
For many of our private cloud based services we use Site Recovery Manager which is a tried and tested tool. However (and crucially) we not only stretched that as far as we ever had, we also exposed numerous connectivity and DNS shaped issues, many of which we corrected on the day. After several hours checking, fixing and learning we declared ourselves satisfied and called the network engineer up to shove the cables back in.
For us the most important part of the test was the learning part. We dealt with many things at the time, but we have also generated a list of tasks and user stories that will help (and are already helping) us to be more confident about our ability to recover from a major incident.
We learnt a lot about the dependencies built up between systems in a hybrid environment. Your new Node.js app may be behind an elastic load balancer and have a sophisticated auto scaling configuration, but if it relies on a single DNS entry for a postfix server back in your datacentre to send your customers confirmation mails, it’s going to let you down some day. That said, the FT’s newer public and private cloud based systems generally fared better than expected and better than the services in what was left of our legacy systems.
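One cheap way to hunt for this kind of hidden dependency is to audit the hostnames an application relies on and flag any that resolve to a single address. The Python sketch below is an illustration under stated assumptions, not the check we actually ran: the function names and example hostnames are hypothetical, and only the standard library `socket.getaddrinfo` call is real.

```python
import socket


def resolve_addresses(hostname):
    """Return the set of distinct IP addresses a hostname resolves to."""
    infos = socket.getaddrinfo(hostname, None)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # the first element of sockaddr is the IP address.
    return {info[4][0] for info in infos}


def single_points_of_failure(hostnames, resolver=resolve_addresses):
    """Return the hostnames that resolve to fewer than two distinct addresses.

    The resolver is injectable so the audit can be exercised without
    touching real DNS. Hosts that fail to resolve are flagged as well.
    """
    flagged = []
    for host in hostnames:
        try:
            addresses = resolver(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            addresses = set()
        if len(addresses) < 2:
            flagged.append(host)
    return flagged


if __name__ == "__main__":
    # Hypothetical dependency list for one application.
    dependencies = ["smtp.example.internal", "api.example.internal"]
    for host in single_points_of_failure(dependencies):
        print(f"possible single point of failure: {host}")
```

Two addresses are no guarantee of resilience, since both could sit in the same datacentre, so treat anything this flags as a prompt for a conversation rather than a verdict.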
In an effort to glean as much ‘learning’ as possible from the test, we have looked at it from an incident management perspective as well as a technology perspective. What worked and didn’t work applies not only to the systems but also to the way the teams responded at the time. To relentlessly rake over the coals of the day, we organised a retrospective and sent out an anonymous questionnaire to all those involved to ask how well the day was managed. The results will be tremendously useful if (and when) a problem of this size occurs for real.
The net result of this exercise is not only more resilient services and teams better equipped to deal with a large scale incident, it is also a confidence in our own abilities and the systems we have built. As a measure of that confidence, in the local pub at the end of the day, over the proverbial ‘well earned’, the conversation quickly turned to what havoc we could wreak next. We have ambitious plans, and after proving we can do this once there is a tremendous appetite to do more of the same.
Postscript: Since we started writing this post we lost a datacentre in New Jersey for a couple of days. There were air conditioning issues at our hosting provider. This particular datacentre mostly hosts front end services and we were able to divert traffic back to the UK, so it was less serious than the DR test described in this blog post. But it was still a pretty big problem to deal with. We are confident that our readers did not notice a thing. That said, we have still learnt from the incident. For example, some of our ‘fail stale’ configuration with our CDN provider did not work quite as expected. Crucially, however, our operations teams handled the situation brilliantly. This is partly due to their dedication (some unsocial hours were needed to nurse our systems along) but also partly due to exercises like the one described above.