The Data Platform: moving from Buckaroo! to building with Lego

“The game centres on a model of a donkey named “Roo” (or “Buckaroo”). The mule begins the game standing on all four feet, with a blanket on its back. Players take turns placing various items onto the mule’s back without causing the mule to buck up on its front legs, throwing off all the accumulated items.” – Wikipedia, Buckaroo!
Since our migration to Redshift almost exactly two years ago, we’ve been playing a background game of Buckaroo. In this case, “Roo” is the framework that controls our ETL*. Left alone, the framework is stable and solid; it is the beating heart of the data platform. However, whenever we wanted to add new features (e.g. data quality checks), we risked the whole thing bucking up on its front legs and throwing off all the accumulated items.
Why is this? The framework was built by a third party as one huge app. It works well for what it was designed for, but it is monolithic: its code is highly interdependent, which makes it extremely difficult to know what impact a change will have.
What are we doing? Modularisation. This is where the Lego comes in. A modular framework (i.e. one made up of separate pieces of functionality) allows us to make changes and additions confidently, without fear of things tumbling down. The process involves taking functionality that currently lives inside the framework and making it run outside of it as a separate service.
Where have we got to? The first new service we have created redefines how data gets into the framework. In addition to creating the new, more reliable module, we have added file validation, which means any data coming in is now checked to see if it looks correct. This was not previously possible.
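To give a flavour of what “checked to see if it looks correct” can mean, here is a minimal sketch of a file validation step in Python. It is an illustration only, not our actual service: it assumes incoming files are CSV, and the names `validate_csv`, `ValidationResult` and the example columns are hypothetical.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class ValidationResult:
    """Outcome of a validation run (hypothetical type for this sketch)."""
    ok: bool
    errors: list[str] = field(default_factory=list)


def validate_csv(text: str, expected_columns: list[str]) -> ValidationResult:
    """Check that a CSV payload has the expected header and consistent rows.

    Assumption: the file is CSV with a header row. Real checks would
    depend on the formats the framework actually receives.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:
        return ValidationResult(False, ["file is empty"])

    errors = []
    if header != expected_columns:
        errors.append(f"unexpected header: {header}")

    row_count = 0
    # Data rows start on line 2, after the header.
    for line_no, row in enumerate(reader, start=2):
        row_count += 1
        if len(row) != len(expected_columns):
            errors.append(
                f"line {line_no}: expected {len(expected_columns)} "
                f"fields, got {len(row)}"
            )

    if row_count == 0:
        errors.append("file has a header but no data rows")

    return ValidationResult(not errors, errors)
```

Because a check like this runs in its own service before data reaches the framework, a bad file can be rejected with a clear list of errors instead of failing somewhere deep inside the load.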
What’s next? This first service lays the foundation for a more reliable, faster and more flexible framework. Modularisation will allow us to:
  • Add more data quality checks and monitoring
  • Remove load from Redshift (speeding it up)
  • Provide more ways to move data to other applications
  • Increase data security and transparency (think GDPR!)
All great work, especially from Michal Huniewicz and Vladimir Parshikov, with more to come!
*ETL (Extract Transform Load) is how we get data from one place to another, e.g. taking article data from the Content API and putting it into Redshift.