Tips for in-house teams in a free market software culture

For the past 18 months I’ve been running a team at the Financial Times called the “Origami Team”. Our team mission is to:

  • reduce the time the other development teams spend repeating work
  • unify design across the FT digital products

People at the FT seem to really like my team and what we’re building – the adoption of our products is high. This means that sometimes people ask me for pointers on what we’re doing, so I thought I’d write that down.

In this post I’ll cover: what your team is really for, how to measure it, treating your tools as products, and the practical details of support, documentation and comms.

I should point out that I joined this team 2 years ago and a lot of the things I’m going to talk about in this post were already in place when I started, put there by the many brilliant people who worked here before me.

Free market software teams

I’ve had this blog post by my colleague Matt Chadburn bookmarked since I joined the Financial Times 2 years ago. It is a great insight into how the software culture at the FT works. I’ll leave him to describe it succinctly:

The last few projects I’ve worked on have broadly followed the mechanics of free-market economy, where teams are allowed and encouraged to pick the best value tools for the job at hand, be they things developed and supported by internal teams or external to the company.

So, if you’re already running a tooling/SaaS/platforms team in this environment, or about to start, the best piece of advice I have for you is to read that blog post by Matt.

In this post I’ll cover some practical things we actually do, and some fuzzier culture things that the Origami team believes are important.

What is your team really for?

Foundational to being a successful in-house services team in a free market software culture is deciding what to build and why. There are probably a lot of ways to determine this but here is Origami’s.

In the first paragraph, I described our team’s mission as:

  • reduce the time the other development teams spend repeating work
  • unify design across the FT digital products

These are amazing points to work against (I don’t know who came up with them – probably Andrew Betts). They could have been things like “build and maintain services to help teams use images” or “maintain a component library” – both of which we do – however the success of our team is not determined by whether we built something, but by whether we saved people time. This mission allows us to stay quite free about what we’re doing, and it forces us to celebrate what actually matters – not “did we ship a tool?” but “did people actually use it, and did it save them time?”

Having a team mission can be tricky to nail down but it is hugely empowering when you get it right. Examples of good missions could be “reduce the number of cyber security incidents” or “reduce AWS spend across the business”.

Data

Once you’ve worked out what your team is for, it’s also probably important to work out how to measure it.

Origami has a couple of key performance indicators which we report on quarterly to the rest of the business. For our web services we report uptime and adoption which we can work out using Pingdom and logs which we send to Splunk. For our front end components system we look at the number of websites the FT has built which have shipped with one or more of our components. This is much trickier to measure.

Having some reasonably rich data about how people are using our stuff is also important for working out what things we could look at deprecating, which in turn, keeps the amount of code we have to maintain down.

Your tools and services are products

Because you’re building a tool to be used by (potentially not that many) other developers, it can be tempting to skip the design and research. This is very rarely a good idea. In Origami, though we don’t aim to make everything we build as polished as a consumer product, we still bring the core ideas and practices of building good products into the process.

This means doing some form of requirements gathering, prototyping, workshops with potential users, user research and iteration. Keep this in proportion with the thing you’re trying to build of course, but even a little CLI tool will double in usefulness if you’ve gotten feedback from at least one potential user before you ship it.

If you come from a consumer products background, where competition with other products is fierce, then this way of working should be obvious. If you’re used to having a closed market around your tools (ie people have to use them, they aren’t allowed to choose an alternative), this might not be something that was important before.

Getting people to use the thing you built

Say you have some tool and services and you’re pretty confident that they’re the right things for the job. They’re well engineered, they’re easy to use because you’ve involved users early on to remove any stumbling blocks, and they address a real problem that the developers at your company have.

That’s fantastic but your job still isn’t done.

Because you’re competing in a free market ecosystem not only do you need to build the right thing, but you also need to provide the following:

  • Feature requests / bug fixes – how do we decide what to work on
  • Support – how do we help people using Origami
  • Documentation – what technical documentation do we provide
  • Comms – how do we let people know about new features, deprecation, system outages

The way the Origami Team (4 software developers) does this is by making sure that, broadly, every person in the team understands that their job is writing software and also participating in everything listed above. It might not make sense for your team to work like this. An alternative approach could be to assign particular responsibilities in your team or hire specialists into these roles.

Team culture

Before I go into the practical aspects of comms, support, documentation and improvements, a quick note on team culture. I was lucky enough to be able to hire everyone into the current team (two external hires and one internal). When we interviewed, I made it clear that as well as writing code, the role would involve talking to other developers a lot – be that doing support for people trying to use our stuff, responding to issues, or reviewing patches submitted by other people.

I also encourage the team to be sociable with other teams in whichever forms they are happy with. For some of them it’s going to the pub after work, for some it’s playing board games at the FT’s board games night. For me personally (not a heavy drinker and honestly a bit too intimidated to go to the board games night) it’s just saying hello to people in the corridor and following up if people have asked questions to make sure their issue got resolved.

I haven’t hired a team of party animals or extroverts – but I’ve been careful to hire people who are comfortable with communicating with other developers (could be in person, or over Slack or email) and understand it’s part of their job to do that.

As the leader of such a team it’s up to me to help my team to do this work as best they can. This means being flexible about different communication preferences; don’t force people who prefer typing to go talk face to face, but help them if that would be a better option. I also have to do a bit of protecting my team from having to answer live support issues when they’re really busy or just having a bad day.

Bootcamps

A lot of development teams at the FT send their new hires on “bootcamps” which are (usually) week long rotations with other teams. For Origami, bootcamps are incredibly important because they lay the empathetic foundations for future work. Understanding how other teams work, what they’re doing with Origami, and what their time pressures are like, is an important part of working out how to help them.

Bootcamps also allow us to do some covert user research on how teams are using Origami. Sometimes getting into the nitty gritty of how a team is working day-to-day allows us to see things we can improve for them that they would not think to mention to us.

Secondments

Sometimes some work will come up that needs very close integration with another team’s work. In these cases we will send someone from the Origami team on a secondment with the other team to build the new thing while physically sitting with them. This, like a bootcamp, is a great hack for building empathy with our end users. The end result is also less likely to suffer from misunderstood requirements, as any questions along the way can be resolved with a quick conversation.

Feature requests / bug fixes

So – team culture tips out of the way, here is a list of things we actually do.

Something that the Origami team does which is quite unusual at the FT is we don’t have a backlog and we don’t work in sprints. This works very well for us but we are a team of 4 developers, please don’t take this advice wholesale into your own team without thinking carefully first.

Many of the other development teams at the FT work in sprints, and when they raise an issue or request a feature they want it fixed as soon as possible. Putting their issue into our next sprint would be too slow for them – by then they’d be out of their own sprint. To prevent this, at any one time the team is either working on a long-term strategic play (usually something nobody has directly asked for but that we think will help the teams or the business, decided at our six-monthly planning away days) or fixing a bug or adding a feature. When someone is blocked because of a problem with something Origami built, our top priority is to unblock them as quickly as possible.

Support

Good support pays dividends. Not only does it help to get people using your product, it also serves as an opportunity to find out what’s not really working with your product (missing features, bugs, unclear documentation), which means you can fix it for next time. As a team we work hard to keep all support interactions friendly and our response times quick.

In free market software culture, lots of support requests are much healthier than silence. Silence could mean you’ve made a completely flawless tool, but it could also mean it sucks and nobody can be bothered to tell you, which means they’ve found something better to use.

Slack

The FT uses Slack and the Origami team has a dedicated support channel there. People come to this channel with all kinds of issues and suggestions and everyone pitches in to answer questions (sometimes even people who aren’t on the team will help out!)

Keeping our support channel welcoming to newcomers is important to me for practical reasons, and on ethical grounds too. Sometimes this means challenging bad behaviour which requires some confidence and tact but is necessary.

A general measure for if you’re doing the approachable Slack thing well is imagining today is your first day and you’re a bit nervous. You go to the support channel and read some of the backlog. Does that backlog leave you feeling comfortable asking a question in that channel or is it full of people being a bit shirty with each other?

Email

Origami also has a shared email address, which can be useful for people who aren’t avid Slack users. At the FT there are a lot of teams who don’t use Slack in the same way product and technology do; email is a good way for those groups to reach us.

Open meetings

Every Friday we have an open meeting that anyone can come to and ask questions. If someone contacts us with a more involved feature request we might ask them to come along to the open meeting so the whole team can hear about it and discuss it.

Workshops

We run a workshop for new starters or people who want a refresher on the team and its projects. The workshop lasts about two hours and we cover what the team does, how to use our projects, and how to get help if you need it. This is often people’s first contact with the team so we try to make sure it’s a positive and engaging session.

Pairing

We do barely any pair programming within the team, but it is invaluable when trying to offer support to someone else who has a problem with an Origami component.

GitHub issues

All of our repositories are public on GitHub, so sometimes people will raise issues for us. As with all other forms of support request we welcome this and will try to make sure we respond quickly (though I’ll admit, this is one of the least effective ways of getting support from us, as we have so many repositories it’s easy to miss things).

Documentation

If doing all of the support things above seems like it might eat a lot of your time then documentation is your friend here. Documentation is part of a product maturation process, where you move from a scrappy MVP to something more serious. For Origami we have some getting started guides on http://origami.ft.com, and every component has its own README which has some documentation that’s specific to that component.

Services like the Image Service and Polyfill.io host their own documentation.

To help keep our documentation consistent we have a (brief) style-guide here: https://github.com/financial-times/ft-origami

Poor documentation erodes confidence in your product which, like poor support, is a push to use one of your competitors.

Communications

We’ve covered what happens when people come to you – support requests, github issues, etc. Sometimes you’ll need to get messages out to people.

Big change comms

In Origami we do some comms if a big change is happening – maybe we’ve deprecated something, or made a significant release that we want people to update to.

One thing we take very seriously is not breaking things for our users unless it is absolutely necessary, and in the case that we have to, making sure everyone impacted has 6 months to make whatever migration or fix is needed.

When we make a serious breaking change to a web service, someone in the team will lead on the comms, and that includes writing a comms plan (most often this is a spreadsheet so we can track which emails we’ve sent and who has responded).

Our plan is usually something like this:

  • 6 months before the service switch off: do a general broadcast about the new service or change and how to migrate. Include why the new thing is better, and why we’re switching off the old thing. Try to explain the likely impact of not switching over. This type of general post would go on our internal social network (we use Workplace, which is Facebook’s offering for companies), and in our Slack channel.
  • Also 6 months before service switch off: Send direct emails to all teams (usually the product owner, or a tech lead) using the old service to say the new service is live and the old one will be shut down in 6 months.
  • 3 months before service switch off: Identify who is still using the old service and send a follow up email. If it’s possible, include the product owner’s name and the names of their impacted projects in the email so they have a wider context for what we’re talking about.
  • 1 month before the service switch off: Again, identify who is still using the old service. If possible, chase them directly. Depending on the severity of not switching over, we might try a different contact method if we haven’t heard back from the team still using the old service. This is the point at which we’ll also check to see if we can help people with migrating by sending a developer to work with the team.
  • The day of service switch off: Email everyone announcing the switch off, celebrate the cost saving of not having to support the old version of the service anymore.

Some tips for these kinds of messages (be they for Slack, email or Workplace).

  • Keep them informal, brief, and scannable.
  • Put the turn off date in every message in bold.
  • Include what will happen if people don’t switch over. Will they incur an HTTP redirect? Will something break?
  • Include how to get in touch if they have further questions.
  • Tell them you’ll email them again in n months
  • We like to use emojis in the subject line to help the email stand out in people’s inboxes. Some people hate emojis so – YMMV.
  • If the change-over instructions can be put in an email (ie they are short) then do this! If they are more involved, link to somewhere with a more detailed migration guide.
  • Get someone to proofread the email. Believe me, it is deeply embarrassing when you send out an email with a typo, especially when the typo is in the switch over instructions.

Comms around screwing up

Something that will help build trust between you and your users is how you deal with the situation in which one of your products has broken.

I think it’s virtually inevitable that one day your service will go down, or you’ll release a version of your tool which breaks something in a live site that uses it. The difference between regular teams and great teams is how they deal with the issue as it happens and what they do after the fact.

In Origami, if someone reports that a release has broken something on a live site or a service has gone down, fixing that takes priority over all other work. While the fix is happening, someone will also be in charge of comms. This means making sure the status of the fix is posted to the Slack channel as it progresses, and, if things aren’t too hectic, that person will also start writing up the incident post mortem.

Our post mortems have three sections:

  • A summary of the incident. This should include what the problem was, what caused it (if you know), who was impacted, what the fix was, and any lasting issues.
  • A timeline of events. This is a list of relevant events including things like the first report of the incident, the time and point at which something was rolled back, or a patch was released. In the chaos of a live incident it can be really useful to have a timeline of what has been done. In the event that the issue is very difficult to fix or the incident goes on for a long time, having such a timeline will help new people who’ve come in to help.
  • A list of talking points. As you go through a live incident having a place at the bottom of the document to dump questions that come up or things you want to go back and fix later is really useful. This talking points list should be refined and then should form the basis for your post mortem meeting.

We publish our post mortems or send them around. We believe this is really important so everyone can learn from our mistakes. Once the incident has been dealt with we’ll also have a meeting about the incident. Anyone can attend these and we encourage anyone affected to come along. The main focus of this meeting is not to work out whose fault the incident was, but what we can learn from it and what we should change to prevent it happening in the future. We loosely follow the principles of blameless post mortems, which John Allspaw of Etsy has written about in detail here: https://codeascraft.com/2012/05/22/blameless-postmortems/

Broad reach comms

One thing we also do for Origami is broadcast messages to remind people we still exist and what we’re up to. These are usually quarterly Workplace posts, but sometimes they’re lightning talks. Developers don’t really care for these kinds of comms, but they’re good for reaching other people in the business who do have an impact on our future but don’t use Origami themselves (eg the finance team, product managers, etc)

Summary

As Matt points out in his original post on free market software teams, in-house teams have many significant advantages over third parties, especially when it comes to access to users. Where internal teams seem to go wrong is in not appreciating that the thing they’re building is still a product, and so it needs to compete with other products on the market. There isn’t a single thing (eg “have a workshop”, “hire an advocate”) that you have to do to get an in-house team really competing with an external company, but lots of small things that help set the team up for creating the right products in the right way.

Credits

Thanks to @jakedchampion, @rowanmanning, @commuterjoy, @tekin and @jwheare for the proof reading and edits. ❤️

What Happens When You Visit ft.com?

The Financial Times front page.

This is an overview of how the Financial Times serves requests to www.ft.com. Starting with our domains, going all the way down to our Heroku applications, and through everything in between.

Table of Contents

  1. Domain Name System
  2. Content Delivery System
  3. Preflight
  4. Router
  5. Service Registry
  6. Applications
  7. Elasticsearch
  8. The End Result

Domain Name System (DNS)

We use Dyn as our name server provider. They are a single point of failure, but caching of our domain-name records should help during short outages.

Two of our most important domains are www.ft.com and ft.com (which is also known as the apex domain). Both domains point to our content delivery network (CDN).

For the www.ft.com subdomain we have a CNAME record pointing to f3.shared.global.fastly.net. This delegates the DNS resolution to our CDN, Fastly.

Our apex record, however, cannot contain a CNAME, so we instead use four A records.

Typically the CNAME record for www.ft.com will resolve to the same four A records as ft.com.

;; ANSWER SECTION:
www.ft.com.  3205 IN CNAME f3.shared.global.fastly.net.
f3.shared.global.fastly.net. 14 IN A 151.101.2.109
f3.shared.global.fastly.net. 14 IN A 151.101.66.109
f3.shared.global.fastly.net. 14 IN A 151.101.130.109
f3.shared.global.fastly.net. 14 IN A 151.101.194.109
;; ANSWER SECTION:
ft.com.   13336 IN A 151.101.2.109
ft.com.   13336 IN A 151.101.66.109
ft.com.   13336 IN A 151.101.130.109
ft.com.   13336 IN A 151.101.194.109
Anycast routing.

Fastly maintain servers in over 50 locations around the world, but we only see 4 IP addresses in our DNS queries.

So how does our traffic end up talking with the closest available Fastly server?

Fastly manage traffic on their network using the border gateway protocol and Anycast routing, allowing them to send requests to the nearest point of presence while avoiding unplanned outages and locations that are down for maintenance.

Anycast is a network addressing and routing method in which datagrams from a single sender are routed to any one of several destination nodes, selected on the basis of which is the nearest, lowest cost, healthiest, with the least congested route, or some other distance measure.

Fastly route around outages in two ways. The first is at the DNS layer, updating their DNS records to avoid the problematic location. The second is at the network layer, broadcasting new routes using BGP; this alters the path that a request’s TCP packets take between routers.

At the end of all this we eventually connect to a Fastly server, so what happens next?

Content Delivery Network (CDN)

We use a CDN to reduce the number of requests made to our applications running in Heroku.

Much of our content is the same for all users, typically differing only slightly depending on whether you are logged in. If we cache these different versions in the CDN we can serve requests without even bothering the Heroku applications.

Caching

Our setup allows us to cache ~94% of all requests, with a cache hit rate of ~90%. So if we see something like 9,000,000 requests during a morning peak, by using the CDN’s cache we only pass on ~900,000 requests to our Heroku applications.

Fastly respect the Cache-Control or Surrogate-Control headers that our applications include in their response, as defined in the HTTP Caching Specification and the Edge Architecture Specification.

Let’s take a look at the caching headers for our home page (add a Fastly-Debug: 1 header to your request to see all these response headers).

GET / HTTP/1.1
Accept: */*
Host: www.ft.com
Fastly-Debug: 1
HTTP/1.1 200 OK
Age: 76
Content-Length: 41742
Content-Type: text/html; charset=utf-8
Date: Fri, 24 Nov 2017 09:24:39 GMT
Etag: W/"4fe7d-l04bmzZM7z5hmNTtslNXHn0d9L0"
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Surrogate-Control: max-age=86400, stale-while-revalidate=86400, stale-if-error=86400
Surrogate-Key: frontpage
Vary: Accept, Accept-Encoding

Here the main response headers we’re interested in are Age, Cache-Control, and Surrogate-Control.

Age defines how long this response has been cached by Fastly; it also indicates that this response was successfully served from the cache.

Cache-Control defines several directives, but in summary is saying this response should not be cached.

Surrogate-Control, however, states in its max-age directive (whose value is defined in seconds) that the response can be cached for 24 hours. Fastly will respect this header over any Cache-Control header. This allows us to define different caching rules for browsers and the CDN, as browsers ignore the Surrogate-Control header.

Serving Stale

We also define stale-while-revalidate and stale-if-error directives, which tell Fastly that we are happy to serve responses from the cache even if the cached object’s Age has exceeded what’s defined in max-age.

stale-while-revalidate allows us to respond with a stale response while grabbing a fresh copy in the background, ensuring we’re responding to requests as quickly as possible.

The difference between cache hit time and miss time shows why serving stale is so beneficial to our users.

stale-if-error is critical to how we deal with outages and errors: it tells Fastly to serve from the cache, including stale responses, if the backend is responding with errors. This gives us time to fix issues while reducing the impact on our users when things go wrong.

A quirk of Fastly means you must specify a max-age of over 61 minutes to ensure your response is cached on disk, and therefore available for a longer period of time in the CDN to serve stale. A cached object that’s only in memory can be removed for several reasons well before it’s deemed stale.

These two Cache-Control directives are part of an extension to the original Caching specification and are also supported in modern browsers.

Vary

The Vary header in the response is another part of the Caching specification.

It allows us to store different versions of a response depending on headers in the request.

Take the Accept-Encoding header in a request, and let’s say we make two requests to /search: the first with Accept-Encoding: gzip and the second with no such header.

We will actually serve two different responses, the first will come back with a header of Content-Encoding: gzip and will be compressed using gzip. The second will not contain a Content-Encoding header and will be uncompressed.

It would be pretty bad for us to serve a compressed version of the page if the client does not ask for it. For this reason we must cache these responses separately, and this is where the Vary header comes in.

In this example we would respond with Vary: Accept-Encoding. This indicates that caches should store a separate version of the response depending on the value of the Accept-Encoding header in the request. Such caches include a client’s browser and our Fastly service.
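Conceptually, a Vary-aware cache is building a bigger cache key. This sketch shows the idea (purely illustrative – Fastly’s real implementation lives in their Varnish fork):

```javascript
// The effective cache key is the URL plus the request's value for each
// header named in the response's Vary header. Requests that differ in a
// varied header therefore hit different cache slots.
function cacheKey(url, varyHeader, requestHeaders) {
  const varied = varyHeader
    .split(',')
    .map((h) => h.trim().toLowerCase())
    .map((h) => `${h}=${requestHeaders[h] || ''}`);
  return [url, ...varied].join('|');
}

// Two requests for /search with different Accept-Encoding values get
// distinct keys, so the gzipped and plain bodies never get mixed up.
cacheKey('/search', 'Accept-Encoding', { 'accept-encoding': 'gzip' });
// → '/search|accept-encoding=gzip'
cacheKey('/search', 'Accept-Encoding', {});
// → '/search|accept-encoding='
```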

For the website we actually take this a step further within Fastly and include several request headers that are decorated in preflight (as discussed later), so that when we serve different responses for A/B tests for example (see Vary: FT-Flags) we are still able to cache them in the CDN.

Purging

Given we tell Fastly to cache our front page for a whole day, how are we able to serve the latest version of the page to all our users?

By using the Fastly API we are able to purge cached content. We also have an event-driven system (using AWS Kinesis) that knows when content has changed; we can use this information to issue purge requests and serve the very latest news to our users.

Purge requests issued to Fastly on a typical Monday.

Fastly supports several types of purging. The simplest method is to issue a hard purge by URL, but this may result in a slower response for a few users.

Our autonomous systems make heavy use of soft purging by surrogate key, as this should result in no end user impact, and ensures all related content is purged, even if it exists on multiple URLs (e.g. /, /?edition=uk, and /?edition=international).

How does soft purging result in no end user impact? It is very similar to what we discussed earlier in our use of stale-while-revalidate. Soft purging in essence marks cached responses with the given surrogate key as stale, even if they are still fresh according to their max-age value. This then allows Fastly to serve the stale response until they’ve fetched a fresh version in the background.

Fastly’s points of presence around the planet.

The Fastly Black Box

The normal Fastly stack is a two layer system: first an HTTP proxy called h2o, used to manage TLS termination, and then a popular caching HTTP reverse proxy called Varnish.

Fastly maintain their own fork of Varnish and have heavily modified it to suit their platform, so while this means we define our logic in Varnish Configuration Language (VCL) we must refer to the Fastly documentation more than Varnish’s.

The Fastly black box.

For www.ft.com, however, we are not using the h2o part of the Fastly black box. In order to support TLS 1.0 and 1.1 (for IE 10) we instead point at a different bit of their infrastructure to handle TLS termination.

Decorating Requests

An important part of what we do to a request in Fastly is decorating it with a whole bunch of metadata (e.g. session state, A/B test groups, etc.). This is handled by our Preflight application.

There’s a complex bit of VCL that passes the request to Preflight, takes the response and enriches the original request, then restarts the Varnish state machine to either serve the request from cache or fetch a fresh response from our applications.

What makes us Platinum?

Simplifying what happens in Fastly, we allow their platform to do a bit of caching, and every now and then ask our applications for new content.

To be platinum we must be able to serve requests from two regions, to cater for an outage of a whole region. For us that means we run in Heroku’s EU and US regions.

When Fastly does talk to our applications, the request runs through a snippet of VCL that determines which region should serve it. Ideally this is the region closest to our visitor (e.g. a request from New York should be served by the US Heroku region). However, if a region is unhealthy, which we continually monitor for, our Fastly service will fall back to the other, hopefully healthy, region.
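The failover decision itself is simple. Here it is sketched in plain JavaScript rather than the actual VCL, with made-up region names and health flags:

```javascript
// Pick the region that should serve a request: prefer the region
// closest to the visitor, fall back to the other if it's unhealthy.
function chooseRegion(closestRegion, health) {
  const fallback = closestRegion === 'EU' ? 'US' : 'EU';
  if (health[closestRegion]) return closestRegion; // normal case
  if (health[fallback]) return fallback;           // whole-region outage
  throw new Error('no healthy region');            // total outage
}

chooseRegion('US', { EU: true, US: true });  // → 'US'
chooseRegion('US', { EU: true, US: false }); // → 'EU'
```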

Preflight

This is a Heroku application, which lives at https://github.com/Financial-Times/next-preflight.

Preflight forwards the user’s request for a web page to several other FT APIs in order to decorate the request with various properties.

Preflight gathers test information from our Ammit service, vanity URLs from our URL management service, subscription information from membership’s Access service, barrier page information from our Barrier Guru service, and finally session information from membership’s Session service.

By doing this in Preflight, in combination with Fastly, we avoid having to do all this work in each of our applications; they can just make use of the decorated request.
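The shape of the idea is a parallel fan-out that folds the answers back into request headers. The service names come from the post, but the function and header names below are made up for illustration (FT-Flags is the one we saw earlier in Vary):

```javascript
// Decorate a request with session, A/B test, and subscription data by
// asking several internal services in parallel. The services object is
// a stand-in for real HTTP clients to Session, Ammit, and Access.
async function decorate(request, services) {
  const [session, flags, access] = await Promise.all([
    services.session(request), // who is this user?
    services.ammit(request),   // which A/B test groups are they in?
    services.access(request),  // what are they allowed to read?
  ]);
  return {
    ...request,
    headers: {
      ...request.headers,
      'ft-session': session,
      'ft-flags': flags,
      'ft-access': access,
    },
  };
}
```

Downstream applications then read these headers instead of calling the membership services themselves.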

Router

This is another Heroku application, but it is a little different from our typical Express.js applications.

It lives at https://github.com/Financial-Times/next-router.

The router is a simple streaming HTTP proxy that takes a request and passes it on to the correct application. We define where requests should be sent to in our service registry, for example requests to ^/search are directed to the search page Heroku application.

The ft.com router.

Service Registry

Our service registry is a basic JSON document that is hosted as a platinum service. It’s stored in S3 across two regions, and uses a similar setup to our ft.com Fastly service to serve from both regions.

Here’s a little example snippet with some extra details removed. You should be able to spot a path, ^/__foo-bar, and a Heroku app foo-bar-eu.

[
  {
    "name": "foo-bar",
    "description": "An example service.",
    "host": "www.ft.com",
    "tier": "bronze",
    "paths": [
      "^/__foo-bar"
    ],
    "nodes": [
      {
        "region": "EU",
        "url": "https://foo-bar-eu.herokuapp.com"
      }
    ],
    "repository": "https://github.com/Financial-Times/next-foo-bar"
  }
]
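A consumer such as the router might resolve a request path against the registry like this (a sketch only; the real lookup logic may differ):

```javascript
// Find the first service whose path patterns match the request path.
function findService(registry, path) {
  return (
    registry.find((service) =>
      service.paths.some((pattern) => new RegExp(pattern).test(path))
    ) || null
  );
}

// The foo-bar entry from the snippet above, trimmed down.
const registry = [
  {
    name: 'foo-bar',
    paths: ['^/__foo-bar'],
    nodes: [{ region: 'EU', url: 'https://foo-bar-eu.herokuapp.com' }],
  },
];

findService(registry, '/__foo-bar/status'); // matches the foo-bar entry
```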

Heroku and the Host Header

As an aside, it is worth discussing how Heroku knows where to send requests.

Heroku is a platform that only supports HTTP/1.1 requests, as it depends on the Host header to know which application should receive a request.

This is why we have applications called foo-bar-eu.herokuapp.com and foo-bar-us.herokuapp.com: the router sets the Host header to pick the right app and region when forwarding requests.

While you can add custom domains, for the reasons above you cannot set the same custom domain on two different Heroku apps.

Applications

These are our standard Heroku applications, for example the front page or the stream page: typical Node.js applications running on Heroku. We use the Express.js framework.

We use components to share common functionality between all our applications, some examples being n-express and n-ui.

Typically the data sources for these applications will either be our Elasticsearch clusters, or the Next API.

Once our application has handled the request, it’ll travel all the way back through the stack, hopefully be cached by Fastly, and then sent on to our browser 🙌.

Going Platinum

With our microservice-based setup, no two applications are the same. Because of this, while www.ft.com is a platinum service, we don’t offer 24/7 support for the whole site. Our range of service “metals” is either bronze or platinum, though you may see gold and silver mentioned around the rest of the company.

For example, www.ft.com/search is a bronze service, but www.ft.com/?edition=international is platinum.

The main difference between bronze and platinum is that a bronze service only needs to run in a single region, while a platinum service, as discussed previously, must operate in two regions.

Elasticsearch

We run a platinum tier Elasticsearch endpoint, using two highly available clusters in two distinct AWS regions.

These clusters are our store of all content for www.ft.com, and are addressed using a single DNS record.

How does it work? We use a service provided by Dyn called Traffic Director, which achieves routing results similar to what we do in Fastly for www.ft.com.

The domain has two pools of addresses, one points at the US Elasticsearch cluster, the other at the EU cluster. If everything is healthy then Dyn advertises the closest pool to the request. If a pool is unhealthy then Dyn will not advertise it, falling back to the other healthy pool.

The difference between this and how we achieve platinum in Fastly is that this setup is entirely DNS based, and so when issues occur we will be advertising a different CNAME record (whereas in Fastly this all happens inside Varnish).

The End Result

What follows is a simplified overview of our stack.

Engine Room Live 2017 – The Low Down

This year we held our third ‘Engine Room Live’ conference for the Product & Technology teams at the FT. It being our third time, we had some previous learnings to bear in mind. The ‘original’ Engine Room committee decided it was time for ‘Gen 2’ to have a go at organising the event, for a fresh take on some hardy matters. So, with minimal hand-holding and a solid process in mind, 12 people raised their hands…

Step 1. Make a plan

Our new planning committee held its first meeting all the way back in June. The first thing we did was pick a date. We scanned our diaries and set our sights on a time after the summer holidays and the mad rush month that is September at the FT. We stumbled upon Friday 13th October. Were we asking for bad luck? Could this be a complete disaster? Never ones to be swayed by superstition, we settled on it. With four months to plan ahead, we kicked back on our metaphorical laurels, safe in the knowledge we had more than enough time to plan every minor detail. Then summer happened. Our team of around 12 helpers steadily diminished as people went on holiday, were pulled into pressing projects, and one volunteer even went to the extreme length of pregnancy to avoid further involvement (just kidding, that would be terrible grounds for creating a new life). We sent out a Google form with a few suggested topics and asked people in the Product & Technology teams to pick the subjects that appealed to them most.

Step 2. Easy pickin’s

Post-summer break, the ‘survivors’, now a measly 4–5 people, reconvened to discuss next steps and pick our panel topics. The favourite topics, by a landslide, were product goals, Agile project management, what we choose to measure, and tech culture at the FT. One topic which was a close runner-up was ‘How can we learn from failure?’, which is good food for thought. Maybe this is a topic we can pick up at next year’s Engine Room Live…

It was settled. We had our panels and now looked to the task at hand: finding willing panellists and panel moderators. We sent out a call to arms and were lucky enough to receive some replies. With a bit of prodding, several more volunteers appeared from the woodwork. Good stuff. We had everything in place panel-wise.

Step 3. Don’t forget the snacks

The most vital part of planning any event is providing a delicious incentive for guests to attend. Conferences have t-shirts. The Oscars have lavish goodie bags. We had PIZZA and BEER, two traditional tech staples. This year’s Engine Room Live also included highly requested soft drinks and some lighter snacks, so as to be inclusive of those who do not drink alcohol or would prefer a healthier option.

Step 4. Audience participation on the sly + the best quotes of the day

We wanted our audience to feel included in our panels without the interruption and hassle of microphones or catch boxes. Nobody likes microphones: the poor mic runners have to dash to make sure questions are heard without having to be repeated, then the microphone inevitably squeaks and crackles for its first 5 seconds of use, leaving the speaker so self-aware of their own voice that they adopt a warped tone and begin to audibly question their whole existence. Not fun for anyone. To avoid this shy introvert’s nightmare we used Slido, which allows audience members to ask questions anonymously, or by name, from their phones or laptops.

Our panellists and moderators were all excellent. Here are some of the top quotes of the day:

  • “I read a blog post on how to be a moderator so that’s why I’m so great at this”
  • “Instagram’s that photo app.. Right?”
  • “We’re a news company.. In case you didn’t know”
  • “It was the hoodies in the garage, not the suits in a meeting room!”
  • “You could say that a group of 12 men could have figured that out but actually, they didn’t”
  • “It’s not offensive because penguins aren’t a marginalised group”

Step 5. Humble brag

We had a great turnout, with over 200 members of staff attending in person or via livestream throughout the day. This was an excellent example of grassroots engagement: staff actively participated on stage, as audience members, or by putting questions to the panels.

Step 6. What did everyone else think?

The week after the conference the committee sent out a form requesting feedback from attendees. 83% rated the event 8/10 or higher for satisfaction. We aim to please!

Lots of people complimented the frank, impassioned discussions that happened and how panels felt ‘honest’. A new joiner commented that they found the conference ‘refreshing’ for its openness. Another person noted the panel on tech culture was ‘one of the most interesting explorations of the subject I’ve experienced’ and they were happy to see debates not dominated by the ‘usual suspects’. Several people commented that they were pleased by the ‘inclusiveness’ and diverse perspectives showcased.

On the flip side one person thought the panels were too long and would’ve preferred more, shorter panels. One person felt there were too few senior faces in the crowd, although they applauded the senior team members who moderated or participated as panellists. Finally, one person’s only negative suggestion was to ‘be less nice to each other’, which I personally wouldn’t call a sign of defeat.

We also asked people if anything ‘unexpected’ happened. The responses were very interesting. Some people were pleasantly surprised at the discussions which took place. One other unexpected aspect which surfaced was the candidness of our panellists and their willingness to talk about deeply personal experiences within the workplace both at the FT and previous jobs.

Step 7. So, what did we learn?

Here are some takeaways from my perspective:

  • We have a great culture of respect, openness and honesty in FT Product & Technology
  • Some people here will go the extra mile to help others without expecting anything in return
  • Apps like Slido are a great way to encourage and enable smooth audience participation
  • People are motivated by a combination of product goals, their managers, teams, personal objectives and remuneration
  • It is really interesting to hear diverse viewpoints and learn about others’ take on subjects such as goals, how we work and what we choose to focus on
  • Inclusivity means including everyone in the conversation and the implementation of change

If you are a member of FT staff you can watch the panel recordings on Workplace by following the links below:

‘Are people motivated by product goals?’

‘Do we only measure things which are easy to measure?’

‘Are we actually ‘Agile’ and does it matter anyway?’

‘If you could change, and keep, one thing about FT tech culture, what would you choose?’

Until next time…

Serverless meetup at the FT

The FT hosted its second London Serverless meetup on Wednesday 11th October. Around 60 people from across London came to hear about serverless in the FT’s Conference Suites.

What’s serverless? Serverless is the aggregation of third-party services (e.g. data stores), including ones that run simple business functions (Functions as a Service, known as FaaS). AWS Lambda is one of the best known of these. See the Serverless Architectures article on Martin Fowler’s site for more. Although it gets its name from not involving servers directly, everything runs on third-party servers underneath…

At this meetup Yan Cui, Senior Developer at Space Ape Games, spoke about “Lambda stories from the trenches”, describing some of the problems they have faced and gone on to solve running their mobile games platform on a serverless architecture. Yan is a prolific blogger; check out some of his posts here: https://hackernoon.com/@theburningmonk

Ant Stanley then did a live demo of the new serverless framework (https://arc.codes/) released recently by Brian Le Roux and his team at Begin. It’s a great framework if you’re focused on using serverless for websites or chat bots. Ant also ordered far too much pizza and drinks…

We have another Serverless meetup due on 15th November – please sign up here if you are interested in watching this area evolve.

Constructive sloppiness

TL;DR: for hackathons and the like, insecure database solutions can save you a big wodge of time, and Chrome extensions are a very versatile tool.

The other week, I took part in the FT’s annual internal hackathon. My team and I decided to play with our idea of bringing video-game-style ‘achievements’ into FT.com, with the aim of encouraging exploration and discovery of the site. It was great fun! Check it out:

poliwag screenshot

I got to try out some cringe-inducingly sloppy but highly effective techniques for quickly building rich prototypes. I’m not sure I should really be proud of them but I am.

A Privacy Policy and Terms of Service for Polyfill.io

We’ve just published two new legal documents to the hosted version of the polyfill service, Polyfill.io. If you’re hosting your own version of the polyfill service, these documents don’t affect you – they only apply to people using a version of the polyfill service that we host.

The privacy policy outlines what data we collect about requests to Polyfill.io and what we do with that data.

The terms of service document describes what you can expect from us when you use Polyfill.io on your site, and the actions you, as a user, might take that would cause us to revoke your access to Polyfill.io.

Why have we added these documents?

Adding a privacy policy is best practice for service providers. As a user of Polyfill.io, it is important that you understand what we do with your data.

As for the Terms of Service, Polyfill.io usage has been climbing since we launched it 3 years ago, and is now used by sites around the world. At the FT we both maintain the open source project and host Polyfill.io for free (and Fastly provides free global caching on their CDN). The Terms of Service help ensure we can keep doing this.

What it means for users of Polyfill.io

If you’re a user of Polyfill.io, you should read and understand the Terms of Service and Privacy Policy. Neither of these documents changes anything about Polyfill.io or the FT’s behaviour; they just document it.

Contact us

If you have any questions about these documents, you can reach us through the usual channels: https://twitter.com/polyfillio and https://github.com/Financial-Times/polyfill-service

Marshmallows

So, we’ve all heard of The Marshmallow Test, right? This is where children are tested on their ability to resist one marshmallow on the promise of getting two marshmallows later (a level of self-control that even as an adult I find a challenge!).

But what about the Other Marshmallow Test?

This is a game I conceived to illustrate the benefits of limiting work in progress – given as a talk at Agile in the City.

Cracking the WIP – The Other Marshmallow Test

The initial premise is about efficiency – can we complete our work faster (or consume our marshmallows more quickly) when doing it one piece at a time, or by multi-tasking? In itself it’s an interesting question, and in the game the answer often depends on the individuals who have volunteered and how much they enjoy marshmallows. There is so much more to observe when you try this out though, such as the impact on stress levels, managing risk, delivering value etc. You can download the slides here to get the full story.

The best thing about the game is that it shows just how far the concept of limiting work in progress applies to any work environment – this is not just about software! Upon seeing the game played in a lightning talk, a member of our legal team considered whether this could help them deal with the barrage of requests they get from all directions. An invitation to visit his team swiftly followed.

Kanban in the Legal Team

We played the game, talked about flow, and then I left them with Kate Sullivan’s talk about agile adoption within the legal team at Lonely Planet. A couple of weeks later I strolled by to witness the joy of a stand-up around a kanban board.

We talked about the benefits they were experiencing, as well as some of the challenges remaining. They still have things to improve (we all should continuously improve, after all) but they were finding a lot of blockers removed simply by visualising and verbalising them together. You can read more about how they have decided to apply agile in a post written by John Halton (Assistant GC) for Practical Law.

Learning at FT

One of the things I love about working at the FT is seeing teams from across departments learn from each other. Just as our legal team have learnt about agile from technology; our product and tech teams have learned a lot about how to use KPIs from our commercial teams; our editorial teams think more about reader engagement with help from our analytics teams; and here in engineering we continually exchange new lessons with every department we work with.

Removing the Tester Safety Net


Moving to Continuous Delivery and a Quality Focused Process

We’re all familiar with the waterfall approach to software development. It keeps skill-sets in silos and, from a tester’s point of view, we were the ones squeezed for time when projects overran.

Adopting agile in the latest Membership Programme incarnation at the Financial Times many years ago started to change this. Breaking work into smaller pieces and working much more closely as one unit removed the big-bang nature of these problems, but ultimately the problems still existed. Like most development teams, our testers were outnumbered by developers, yet had as much, if not more, to do. The introduction of automated testing, if anything, made matters worse. When you’re new to agile you can struggle to work out where to build automated tests into the process. We agreed that they needed to be part of the sprint from day one, but this meant we still had split skill-sets – manual and automated testers – and both were needed to get the work done.

Splunk HTTP Event Collector: Direct pipe to Splunk

In August 2016 the FT switched from on-premises Splunk to Splunk Cloud (SaaS). Since then we have seen big improvements in the service:

  1. Searches are faster than ever before
  2. Uptime is near 100%
  3. New features and security updates are deployed frequently

One interesting new feature of Splunk Cloud is called HTTP Event Collector (HEC). HEC is an API that enables applications to send data directly to Splunk without having to rely on intermediate forwarder nodes. Token-based authentication and SSL encryption ensure that communication between peers is secure.

HEC supports raw and JSON-formatted event payloads. Using JSON-formatted payloads lets you batch multiple events into a single document, which makes delivery more efficient: many events can be sent in a single HTTP request.

Time before HEC

Before I dive into technical details let’s look at what motivated us to start looking at HEC.

I’m a member of the Integration Engineering team and am currently embedded in the Universal Publishing (UP) team. The problem I was recently asked to investigate relates to log delivery to Splunk Cloud. Logs sent from UP clusters took several hours to appear in Splunk. This caused various issues with Splunk dashboards and alerts, and slowed down the troubleshooting process, as we didn’t have data instantly available in Splunk.

The following screenshot highlights the issue: an event that was logged at 7:45am (see Real Timestamp) appears in Splunk 8 hours and 45 minutes later, at 4:30pm (see Splunk Timestamp).