Over the past few years at the FT we have steadily been increasing our disaster recovery testing with planned failovers and deliberate acts of sabotage to our own estate. Taking a leaf from the Netflix Simian Army, we have gradually improved our services' resilience with what we have learnt from these exercises. For example, powering down DNS or Active Directory servers has highlighted areas of weakness that we have been able to mitigate or remedy completely.
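For flavour, here is a minimal sketch of the kind of deliberate sabotage exercise this involves, in the spirit of Chaos Monkey: pick one running instance from a tagged group at random, stop it, and watch how the service copes. This assumes an AWS estate managed with boto3; the region, tag key and tag value are hypothetical, not the FT's actual tooling.

```python
import random

import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")  # assumed region


def stop_random_instance(tag_value: str = "chaos-target") -> str:
    """Stop one running instance carrying the given tag and return its id."""
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Role", "Values": [tag_value]},  # assumed tag key
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    instances = [i for r in reservations for i in r["Instances"]]
    if not instances:
        raise RuntimeError("No running instances matched the chaos tag")
    victim = random.choice(instances)["InstanceId"]
    ec2.stop_instances(InstanceIds=[victim])
    return victim


if __name__ == "__main__":
    print(f"Stopped {stop_random_instance()} - now watch the dashboards")
```

The point of an exercise like this is not the script itself but what the team learns while the instance is down: which alerts fire, which dependencies fall over, and which runbooks turn out to be fiction.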
On Saturday mornings, my 5-year-old son goes to football training. This means I stand on a cold school playing field next to some other cold (and socially awkward) dads, making small talk about how cold it is. A few Saturdays ago, a dad who works in the IT department of a large city bank was on the phone for large parts of the session. When you work in IT, certain words and expressions trigger the cocktail party effect.
So, without wishing to eavesdrop, I could not help overhearing these phrases: "unix patching", "how long till we can restore service" and "rollback plan". I could tell he was dealing with a legacy piece of his IT estate, so I did not ask him why he did not have a resilient setup that allowed him to use A/B patching groups, and I chose not to comment that 'planned downtime' is still downtime and should not really be tolerated.
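For anyone unfamiliar with the idea, here is a rough sketch of what A/B patching groups look like: the fleet is split into two groups, so one group can be drained, patched and verified while the other keeps serving, then the two swap roles. Everything here is hypothetical, the host names, the patch command and the drain and health-check helpers stand in for whatever your load balancer and config management actually provide.

```python
import subprocess
import time

PATCH_GROUPS = {
    "A": ["app-a1", "app-a2"],  # assumed host names
    "B": ["app-b1", "app-b2"],
}


def drain(host: str) -> None:
    # Placeholder: in practice, remove the host from the load balancer pool.
    print(f"draining {host} from the load balancer")


def patch(host: str) -> None:
    # Apply OS patches over ssh; 'yum update -y' is an assumed command.
    subprocess.run(["ssh", host, "sudo", "yum", "update", "-y"], check=True)


def healthy(host: str) -> bool:
    # Placeholder health check; a real one would hit the service's
    # /health endpoint rather than just confirming ssh works.
    return subprocess.run(["ssh", host, "true"]).returncode == 0


def patch_fleet() -> None:
    for group, hosts in PATCH_GROUPS.items():
        print(f"patching group {group}; the other group keeps serving")
        for host in hosts:
            drain(host)
        for host in hosts:
            patch(host)
        # Only return the group to service once every host checks out,
        # so one group is always carrying traffic.
        while not all(healthy(h) for h in hosts):
            time.sleep(10)
        for host in hosts:
            print(f"returning {host} to the load balancer")


if __name__ == "__main__":
    patch_fleet()
```

Patch group A on one night and group B on another and the service never needs a maintenance window at all, which is rather the point.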