What Happens When You Visit ft.com?

The Financial Times front page.

This is an overview of how the Financial Times serves requests to www.ft.com, starting with our domains, going all the way down to our Heroku applications, and covering everything in between.

Table of Contents

  1. Domain Name System
  2. Content Delivery Network
  3. Preflight
  4. Router
  5. Service Registry
  6. Applications
  7. Elasticsearch
  8. The End Result

Domain Name System (DNS)

We use Dyn as our name server provider. They are a single point of failure, but caching of our domain name records should help during short outages.

Two of our most important domains are www.ft.com and ft.com (also known as the apex domain). Both domains point to our content delivery network (CDN).

For the www.ft.com subdomain we have a CNAME record pointing to f3.shared.global.fastly.net. This delegates the DNS resolution to our CDN, Fastly.

Our apex record however cannot contain a CNAME, so we instead use four A records.

Typically the CNAME record for www.ft.com will resolve to the same four A records as ft.com.

;; ANSWER SECTION:
www.ft.com.  3205 IN CNAME f3.shared.global.fastly.net.
f3.shared.global.fastly.net. 14 IN A 151.101.2.109
f3.shared.global.fastly.net. 14 IN A 151.101.66.109
f3.shared.global.fastly.net. 14 IN A 151.101.130.109
f3.shared.global.fastly.net. 14 IN A 151.101.194.109
;; ANSWER SECTION:
ft.com.   13336 IN A 151.101.2.109
ft.com.   13336 IN A 151.101.66.109
ft.com.   13336 IN A 151.101.130.109
ft.com.   13336 IN A 151.101.194.109
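
To make the comparison between the two answers concrete, here is a minimal Python sketch that parses the records shown above and follows the CNAME for www.ft.com. The record text is copied from the output above; the function names are illustrative, and the sketch performs no live DNS lookup.

```python
# Records copied from the two ANSWER SECTIONs above: name, TTL, class, type, value.
ANSWERS = """\
www.ft.com.  3205 IN CNAME f3.shared.global.fastly.net.
f3.shared.global.fastly.net. 14 IN A 151.101.2.109
f3.shared.global.fastly.net. 14 IN A 151.101.66.109
f3.shared.global.fastly.net. 14 IN A 151.101.130.109
f3.shared.global.fastly.net. 14 IN A 151.101.194.109
ft.com.   13336 IN A 151.101.2.109
ft.com.   13336 IN A 151.101.66.109
ft.com.   13336 IN A 151.101.130.109
ft.com.   13336 IN A 151.101.194.109
"""


def parse_records(text):
    """Group record values by (name, record type)."""
    records = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, _ttl, _cls, rtype, value = line.split()
        records.setdefault((name, rtype), []).append(value)
    return records


def resolve_a(name, records):
    """Follow CNAME records until a name with A records is reached."""
    seen = set()
    while (name, "CNAME") in records and name not in seen:
        seen.add(name)
        name = records[(name, "CNAME")][0]
    return sorted(records.get((name, "A"), []))


records = parse_records(ANSWERS)
www_addresses = resolve_a("www.ft.com.", records)
apex_addresses = resolve_a("ft.com.", records)
print(www_addresses == apex_addresses)
```

Both names end up at the same four addresses, which is what lets either hostname reach the CDN.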

Anycast routing.

Fastly maintain servers in over 50 locations around the world, but we only see 4 IP addresses in our DNS queries.

So how does our traffic end up talking with the closest available Fastly server?

Fastly manage traffic on their network using the Border Gateway Protocol (BGP) and Anycast routing, allowing them to send requests to the nearest point of presence while avoiding unplanned outages and locations that are down for maintenance.

Anycast is a network addressing and routing method in which datagrams from a single sender are routed to any one of several destination nodes, selected on the basis of which is the nearest, lowest cost, healthiest, with the least congested route, or some other distance measure.

Fastly route around outages in two ways. The first is at the DNS layer, updating their DNS records to avoid the problematic location. The second is at the network layer, broadcasting new routes using BGP, which alters the path that a request’s TCP packets will take between routers.

At the end of all this we eventually connect to a Fastly server, so what happens next?

Content Delivery Network (CDN)

We use a CDN to reduce the number of requests made to our applications running in Heroku.

Much of our content is the same for all users, typically only a little different if you are logged in or not. If we cache these different versions in the CDN we can serve requests without even bothering the Heroku applications.

Caching

Our setup allows us to cache ~94% of all requests, with a cache hit rate of ~90%. So if we see something like 9,000,000 requests during a morning peak, by using the CDN’s cache we only pass on ~900,000 requests to our Heroku applications.
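
The arithmetic behind that estimate can be sketched in a few lines of Python. This assumes the ~90% hit rate applies to all requests in the peak, and it leaves the ~94% cacheable figure out of the calculation; the function name is illustrative.

```python
def requests_passed_to_origin(total_requests, hit_rate):
    """Requests that miss the cache and have to be served by the backend."""
    return round(total_requests * (1 - hit_rate))


peak_requests = 9_000_000
hit_rate = 0.90  # assumption: ~90% of all requests are answered from the cache

print(requests_passed_to_origin(peak_requests, hit_rate))  # 900000
```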

Fastly respect the Cache-Control or Surrogate-Control headers that our applications include in their response, as defined in the HTTP Caching Specification and the Edge Architecture Specification.

Let’s take a look at the caching headers for our home page (add a Fastly-Debug: 1 header to your request to see all these response headers).

GET / HTTP/1.1
Accept: */*
Host: www.ft.com
Fastly-Debug: 1

HTTP/1.1 200 OK
Age: 76
Content-Length: 41742
Content-Type: text/html; charset=utf-8
Date: Fri, 24 Nov 2017 09:24:39 GMT
Etag: W/"4fe7d-l04bmzZM7z5hmNTtslNXHn0d9L0"
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Surrogate-Control: max-age=86400, stale-while-revalidate=86400, stale-if-error=86400
Surrogate-Key: frontpage
Vary: Accept, Accept-Encoding

Here the main response headers we’re interested in are Age, Cache-Control, and Surrogate-Control.

Age shows how long, in seconds, this response has been cached by Fastly. It also helps to indicate that this response was successfully served by the cache.

Cache-Control defines several directives, but in summary is saying this response should not be cached.

Surrogate-Control, however, states in its max-age directive that the response can be cached for 24 hours, the directive’s value being defined in seconds. In Fastly, this header takes precedence over any Cache-Control header. This allows us to define different caching rules for browsers and the CDN, as browsers ignore the Surrogate-Control header.
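
The precedence between the two headers can be illustrated with a small Python sketch. It is a simplification rather than Fastly’s actual logic: the function names are hypothetical, and no-cache and no-store are both treated as a lifetime of zero.

```python
def parse_directives(header_value):
    """Turn 'max-age=0, no-store' into {'max-age': '0', 'no-store': True}."""
    directives = {}
    for part in header_value.split(","):
        part = part.strip().lower()
        if not part:
            continue
        key, sep, value = part.partition("=")
        directives[key] = value if sep else True
    return directives


def cache_lifetime(headers, is_surrogate):
    """Seconds a cache may treat the response as fresh (0 means do not reuse)."""
    # A surrogate (the CDN) prefers Surrogate-Control when it is present.
    if is_surrogate and "Surrogate-Control" in headers:
        surrogate = parse_directives(headers["Surrogate-Control"])
        return int(surrogate.get("max-age", 0))
    # Browsers ignore Surrogate-Control and only look at Cache-Control.
    control = parse_directives(headers.get("Cache-Control", ""))
    if "no-store" in control or "no-cache" in control:
        return 0
    return int(control.get("max-age", 0))


home_page = {
    "Cache-Control": "max-age=0, no-cache, no-store, must-revalidate",
    "Surrogate-Control": "max-age=86400, stale-while-revalidate=86400, stale-if-error=86400",
}
print(cache_lifetime(home_page, is_surrogate=True))   # 86400
print(cache_lifetime(home_page, is_surrogate=False))  # 0
```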

Serving Stale

We also define stale-while-revalidate and stale-if-error directives, which tell Fastly that we are happy to serve responses from the cache, even if the cached object’s Age has exceeded what’s defined in max-age.

stale-while-revalidate allows us to respond using a stale response while grabbing a fresh copy in the background, ensuring we’re responding to requests as quickly as possible.

The difference between cache hit time and miss time shows why serving stale is so beneficial to our users.

stale-if-error is critical to how we deal with outages and errors. It tells Fastly to serve from the cache, including stale responses, if the backend is responding with errors. This gives us time to fix issues while reducing the impact on our users when things go wrong.
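
As a rough illustration of how the two directives interact, here is a simplified decision function in Python. It is a sketch of the general idea, not Fastly’s implementation: the function name and return labels are hypothetical, and backend_ok stands in for the outcome of a fetch to the backend.

```python
def choose_response(age, max_age, stale_while_revalidate, stale_if_error, backend_ok):
    """Decide how a cache answers a request for an object it already holds."""
    if age < max_age:
        return "fresh"
    staleness = age - max_age
    # Inside the revalidate window: answer immediately, refresh in the background.
    if staleness <= stale_while_revalidate:
        return "stale-while-revalidate"
    # Past that window the cache has to wait for the backend.
    if backend_ok:
        return "fetch"
    # The backend is failing: fall back to the stale copy if still allowed.
    if staleness <= stale_if_error:
        return "stale-if-error"
    return "error"


# The home page response shown earlier: Age 76, all three values 86400 seconds.
print(choose_response(76, 86400, 86400, 86400, backend_ok=True))  # fresh
```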

A quirk of Fastly means you must specify a max-age of over 61 minutes to ensure your response is cached on disk, and therefore available for a longer period of time in the CDN to serve stale. A cached object that’s only in memory can be removed for several reasons well before it’s deemed stale.

These two Cache-Control directives are part of an extension to the original Caching specification and are also supported in modern browsers.

Vary

The Vary header in the response is another part of the Caching specification.

It allows us to store different versions of a response depending on headers in the request.

Take the Accept-Encoding header in a request, and let’s say we make two requests, the first with Accept-Encoding: gzip and the second with no such header, both to /search.

We will actually serve two different responses, the first will come back with a header of Content-Encoding: gzip and will be compressed using gzip. The second will not contain a Content-Encoding header and will be uncompressed.

It would be pretty bad for us to serve a compressed version of the page if the client does not ask for it. For this reason we must cache these responses separately, and this is where the Vary header comes in.

In this example we would respond with Vary: Accept-Encoding. This indicates that caches should store a separate version of the response depending on the value of the Accept-Encoding header in the request. Such caches include a client’s browser and our Fastly service.
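
One way to picture this is as a cache key built from the URL plus the request headers named in Vary. The Python sketch below is illustrative only: the function name is made up, and real caches also normalise header values, which is skipped here.

```python
def cache_key(url, request_headers, vary):
    """Build a cache key from the URL and the request headers listed in Vary."""
    names = [name.strip().lower() for name in vary.split(",") if name.strip()]
    lowered = {key.lower(): value for key, value in request_headers.items()}
    # A header missing from the request is recorded as an empty value.
    varied = tuple((name, lowered.get(name, "")) for name in sorted(names))
    return (url, varied)


gzip_key = cache_key("/search", {"Accept-Encoding": "gzip"}, "Accept-Encoding")
plain_key = cache_key("/search", {}, "Accept-Encoding")
print(gzip_key != plain_key)  # True
```

Because the two keys differ, the compressed and uncompressed responses are stored as separate cache entries.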

For the website we actually take this a step further within Fastly and include several request headers that are decorated in preflight (as discussed later), so that when we serve different responses for A/B tests for example (see Vary: FT-Flags) we are still able to cache them in the CDN.

Purging

Given we tell Fastly to cache our front page for a whole day, how are we able to serve the latest version of the page to all our users?

By using the Fastly API we are able to purge the cached content. We also have an event driven system (using AWS Kinesis) that knows when content has changed, and we use this information to issue purge requests and serve the very latest news to our users.

Purge requests issued to Fastly on a typical Monday.

Fastly supports several types of purging. The simplest method is to issue a hard purge by URL, but this may result in a slower response for a few users.

Our autonomous systems make heavy use of soft purging by surrogate key, as this should result in no end user impact, and ensure all related content is purged, even if it exists on multiple URLs (e.g. /, /?edition=uk, and /?edition=international).

How does soft purging result in no end user impact? It is very similar to what we discussed earlier in our use of stale-while-revalidate. Soft purging in essence marks cached responses with the given surrogate key as stale, even if they are still fresh according to their max-age value. This then allows Fastly to serve the stale response until they’ve fetched a fresh version in the background.
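
The idea can be modelled with a toy in-memory cache in Python. This is a sketch of the concept only and does not use the Fastly API; the class and method names are hypothetical, and the frontpage key is the Surrogate-Key from the response shown earlier.

```python
class SurrogateCache:
    """Toy cache whose entries are tagged with surrogate keys."""

    def __init__(self):
        self.entries = {}  # url -> {"body": str, "keys": set, "stale": bool}

    def store(self, url, body, surrogate_keys):
        self.entries[url] = {"body": body, "keys": set(surrogate_keys), "stale": False}

    def soft_purge(self, surrogate_key):
        """Mark every entry tagged with the key as stale; nothing is deleted."""
        purged = []
        for url, entry in self.entries.items():
            if surrogate_key in entry["keys"]:
                entry["stale"] = True
                purged.append(url)
        return sorted(purged)

    def lookup(self, url):
        """Return (body, needs_refresh); stale entries can still be served."""
        entry = self.entries.get(url)
        if entry is None:
            return None, True
        return entry["body"], entry["stale"]


cache = SurrogateCache()
cache.store("/", "front page", ["frontpage"])
cache.store("/?edition=uk", "uk front page", ["frontpage"])
cache.store("/search", "search page", ["search"])

purged_urls = cache.soft_purge("frontpage")
print(purged_urls)        # ['/', '/?edition=uk']
print(cache.lookup("/"))  # ('front page', True)
```

A single purge by key reaches both front page URLs, and each one can still be answered from the stale copy while a fresh version is fetched.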

Fastly’s points of presence around the planet.

The Fastly Black Box

The normal Fastly stack consists of a two layer system: first they have an HTTP proxy called h2o, used to manage TLS termination, and then a popular caching HTTP reverse proxy called Varnish.

Fastly maintain their own fork of Varnish and have heavily modified it to suit their platform, so while this means we define our logic in Varnish Configuration Language (VCL) we must refer to the Fastly documentation more than Varnish’s.

The Fastly black box.

For www.ft.com however we are not using the h2o part of the Fastly black box. In order to support TLS 1.0 and 1.1 for IE 10, we instead point at a different part of their infrastructure to handle the TLS termination.

Decorating Requests

An important part of what we do to a request in Fastly is decorating it with a whole bunch of metadata (e.g. session state, A/B test groups, etc.). This is handled by our Preflight application.

There’s a complex bit of VCL that passes the request to Preflight, takes the response and enriches the original request, then restarts the Varnish state machine to either serve the request from cache or fetch a fresh response from our applications.

What makes us Platinum?

Simplifying what happens in Fastly, we allow their platform to do a bit of caching, and every now and then ask our applications for new content.

To be platinum we must be able to serve requests from two regions, to cater for an outage of a whole region. For us that means we run in Heroku’s EU and US regions.

When Fastly does talk to our applications it actually runs through a snippet of VCL that determines which region should serve the request. Ideally this is the closest region to our visitor (e.g. a request from New York should be served by the US Heroku region). However if a region is unhealthy, which we continually monitor for, our Fastly service will fall back to the other, hopefully healthy, region.
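
The fallback rule can be expressed in a few lines. The sketch below is written in Python rather than VCL, and the function name and the shape of the health data are assumptions made for illustration.

```python
def choose_region(preferred, healthy):
    """Pick the preferred region if it is healthy, otherwise any healthy one."""
    if healthy.get(preferred):
        return preferred
    for region, ok in healthy.items():
        if ok:
            return region
    return None  # no healthy region left


print(choose_region("US", {"EU": True, "US": True}))   # US
print(choose_region("US", {"EU": True, "US": False}))  # EU
```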

Preflight

This is a Heroku application, which lives at https://github.com/Financial-Times/next-preflight.

Preflight forwards the user’s request for a web page to several other FT APIs in order to decorate the request with various properties.

Preflight gathers test information from our Ammit service, vanity URLs from our URL management service, subscription information from membership’s Access service, barrier page information from our Barrier Guru service, and finally session information from membership’s Session service.

By doing this in Preflight, in combination with Fastly, we avoid having to do all this work in each of our applications; they can just make use of the decorated request.
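
In outline, decoration means collecting properties from several lookups and attaching them to the request. The Python sketch below shows only that shape. The two lookup functions are hypothetical stand-ins for the services named above, and apart from the FT-Flags header name, every header name and value in it is invented for illustration.

```python
def decorate(request_headers, lookups):
    """Merge the headers returned by each lookup into a copy of the request headers."""
    decorated = dict(request_headers)
    for lookup in lookups:
        decorated.update(lookup(request_headers))
    return decorated


# Hypothetical stand-in for an A/B test lookup; the value format is made up.
def ab_test_lookup(headers):
    return {"FT-Flags": "exampleTest:on"}


# Hypothetical stand-in for a session lookup; the header name is made up.
def session_lookup(headers):
    return {"X-Example-Session-Valid": "true" if "Cookie" in headers else "false"}


decorated = decorate({"Host": "www.ft.com"}, [ab_test_lookup, session_lookup])
print(decorated)
```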

Router

This is another Heroku application, but it is a little different from our typical Express.js applications.

It lives at https://github.com/Financial-Times/next-router.

The router is a simple streaming HTTP proxy that takes a request and passes it on to the correct application. We define where requests should be sent in our service registry. For example, requests to ^/search are directed to the search page Heroku application.

The ft.com router.

Service Registry

Our service registry is a basic JSON document that is hosted as a platinum service. It’s stored in S3 across two regions, and uses a similar setup to our ft.com Fastly service to serve from both regions.

Here’s a little example snippet with some extra details removed. You should be able to spot a path, ^/__foo-bar, and a Heroku app foo-bar-eu.

[
  {
    "name": "foo-bar",
    "description": "An example service.",
    "host": "www.ft.com",
    "tier": "bronze",
    "paths": [
      "^/__foo-bar"
    ],
    "nodes": [
      {
        "region": "EU",
        "url": "https://foo-bar-eu.herokuapp.com"
      }
    ],
    "repository": "https://github.com/Financial-Times/next-foo-bar"
  }
]
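
To show how a router could use a document like this, here is a minimal Python sketch that matches a request path against the paths patterns and returns the node for a region. The registry is a trimmed copy of the snippet above; the function name and matching rules are assumptions, not the real router’s code.

```python
import json
import re

# Trimmed copy of the example registry entry above.
REGISTRY = json.loads("""
[
  {
    "name": "foo-bar",
    "paths": ["^/__foo-bar"],
    "nodes": [{"region": "EU", "url": "https://foo-bar-eu.herokuapp.com"}]
  }
]
""")


def find_node(path, region, registry):
    """Return the node URL of the first matching service that runs in the region."""
    for service in registry:
        if any(re.search(pattern, path) for pattern in service["paths"]):
            for node in service["nodes"]:
                if node["region"] == region:
                    return node["url"]
    return None


print(find_node("/__foo-bar/health", "EU", REGISTRY))  # https://foo-bar-eu.herokuapp.com
print(find_node("/search", "EU", REGISTRY))            # None
```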

Heroku and the Host Header

As an aside, it is worth discussing how Heroku knows where to send requests.

Heroku is a platform that only supports HTTP/1.1 requests, as it depends on the Host header to know which application should receive a request.

This is why we have applications called foo-bar-eu.herokuapp.com and foo-bar-us.herokuapp.com so that we can set the Host header in the router and send them requests accordingly.

While you can add custom domains, for the reasons above you cannot set the same custom domain on two different Heroku apps.
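
The Host rewrite described in this section can be sketched with Python’s standard library. The function name is hypothetical, and the app URL is the example one from the service registry snippet.

```python
from urllib.parse import urlsplit


def upstream_headers(node_url, original_headers):
    """Copy the request headers, replacing Host with the Heroku app's hostname."""
    headers = dict(original_headers)
    headers["Host"] = urlsplit(node_url).hostname
    return headers


headers = upstream_headers(
    "https://foo-bar-eu.herokuapp.com",
    {"Host": "www.ft.com", "Accept": "*/*"},
)
print(headers["Host"])  # foo-bar-eu.herokuapp.com
```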

Applications

This is our standard Heroku application, for example the front page or stream page. These are typical Node.js based applications running on Heroku, built with the Express.js framework.

We use components to share common functionality between all our applications, some examples being n-express and n-ui.

Typically the data sources for these applications will either be our Elasticsearch clusters, or the Next API.

Once our application has handled the request, it’ll travel all the way back through the stack, hopefully be cached by Fastly, and then sent on to our browser 🙌.

Going Platinum

With our microservice based setup, no two applications are the same. Because of this, while www.ft.com as a whole is a platinum service, we don’t offer 24/7 support for every part of the site. Our range of service “metals” is either bronze or platinum, though you may see gold and silver mentioned around the rest of the company.

For example, www.ft.com/search is a bronze service, but www.ft.com/?edition=international is platinum.

The main difference between bronze and platinum is that a bronze service only needs to run in a single region, while a platinum service, as discussed previously, must operate in two regions.

Elasticsearch

We run a platinum tier Elasticsearch endpoint, using two highly available clusters in two distinct AWS regions.

These clusters are our store of all content for www.ft.com, and are addressed using a single DNS record.

How does it work? We use a service provided by Dyn called Traffic Director, which gives us routing results similar to what we achieve in Fastly for www.ft.com.

The domain has two pools of addresses: one points at the US Elasticsearch cluster, the other at the EU cluster. If everything is healthy then Dyn advertises the closest pool to the request. If a pool is unhealthy then Dyn will not advertise it, falling back to the other healthy pool.

The difference between this and how we achieve platinum in Fastly is that this setup is entirely DNS based, and so when issues occur we will be advertising a different CNAME record (whereas in Fastly this all happens inside Varnish).

The End Result

What follows is a simplified overview of our stack.

Engine Room Live 2017 – The Low Down

This year we held our third ‘Engine Room Live’ conference for the Product & Technology teams at the FT. As this was the third time we had held it, we had some previous learnings to bear in mind. The ‘original’ Engine Room committee decided it was time for ‘Gen 2’ to have a go at organising the event, for a fresh take on some hardy matters. So, with minimal hand holding and a solid process in mind, 12 people raised their hands...

Step 1. Make a plan

Our new planning committee held its first meeting all the way back in June. The first thing we did was pick a date. We scanned our diaries and set our sights on a time post summer holidays and the mad rush month that is September at the FT. We stumbled upon Friday 13th October. Were we asking for bad luck? Could this be a complete disaster? Never ones to be swayed by superstition, we settled on it. Having four months to plan ahead, we kicked back on our metaphorical laurels, safe in the knowledge we had more than enough time to plan every minor detail.

Then summer happened. Our team of around 12 helpers steadily diminished as people went on holiday or were pulled into pressing projects, and one volunteer even went to the extreme length of pregnancy to avoid further involvement (just kidding, that would be terrible grounds for creating a new life).

We sent out a Google Form with a few suggested topics and asked people in Product & Technology teams to pick the subjects that appealed the most to them.

Step 2. Easy pickin’s

Post-summer break the ‘survivors’, now a measly 4-5 people, reconvened to discuss next steps and pick our panel topics. The favourite topics, by a landslide, were product goals, Agile project management, what we choose to measure and tech culture at the FT. One topic which was a close runner up was ‘How can we learn from failure?’ which is good food for thought. Maybe this is a topic we can pick up at next year’s Engine Room Live...

It was settled. We had our panels and now looked to the task at hand: finding willing panelists and panel moderators. We sent out a call to arms and were lucky enough to receive some replies. With a bit of prodding several more volunteers appeared from the woodwork. Good stuff. We had everything in place panel-wise.

Step 3. Don’t forget the snacks

The most vital part of planning any event is providing a delicious incentive for guests to attend. Conferences have T-shirts. The Oscars have lavish goodie bags. We had PIZZA and BEER. Two traditional tech staples. This year’s Engine Room Live also included highly requested soft drinks and some lighter snacks so as to be inclusive for those who do not drink alcohol or would prefer a healthier option.

Step 4. Audience participation on the sly + the best quotes of the day

We wanted our audience to feel included in our panels without the interruption and hassle of microphones or catch boxes. Nobody likes microphones. The poor mic runners have to dash to make sure questions are heard without having to be repeated, and the microphone will inevitably squeak and crackle for the first 5 seconds of use, leaving the speaker overly self-aware of their own voice, so they start using a warped tone and begin to audibly question their whole existence. Not fun for anyone. To avoid this shy introvert’s nightmare we used Slido, which allows audience members to ask questions anonymously, or by name, from their phones or laptops.

Our panellists and moderators were all excellent. Here are some of the top quotes of the day:

  • “I read a blog post on how to be a moderator so that’s why I’m so great at this”
  • “Instagram’s that photo app.. Right?”
  • “We’re a news company.. In case you didn’t know”
  • “It was the hoodies in the garage, not the suits in a meeting room!”
  • “You could say that a group of 12 men could have figured that out but actually, they didn’t”
  • “It’s not offensive because penguins aren’t a marginalised group”

Step 5. Humble brag

We had a great turnout with over 200 members of staff attending in person or via livestream throughout the day. This was an excellent example of grassroots engagement: staff were actively participating on stage, as audience members, or by asking questions to panels.

Step 6. What did everyone else think?

The week after the conference the committee sent out a form requesting feedback from attendees. 83% rated the event as 8/10 or higher on satisfaction level. Aim to please!

Lots of people complimented the frank, impassioned discussions that happened and how panels felt ‘honest’. A new joiner commented that they found the conference ‘refreshing’ for its openness. Another person noted the panel on tech culture was ‘one of the most interesting explorations of the subject I’ve experienced’ and they were happy to see debates not dominated by the ‘usual suspects’. Several people commented that they were pleased by the ‘inclusiveness’ and diverse perspectives showcased.

On the flip side one person thought the panels were too long and would’ve preferred more, shorter panels. One person felt there were too few senior faces in the crowd, although they applauded the senior team members who moderated or participated as panellists. Finally, one person’s only negative suggestion was to ‘be less nice to each other’, which I personally wouldn’t call a sign of defeat.

We also asked people if anything ‘unexpected’ happened. The responses were very interesting. Some people were pleasantly surprised at the discussions which took place. One other unexpected aspect which surfaced was the candidness of our panellists and their willingness to talk about deeply personal experiences within the workplace both at the FT and previous jobs.

Step 7. So, what did we learn?

Here are some takeaways from my perspective:

  • We have a great culture of respect, openness and honesty in FT Product & Technology
  • Some people here will go the extra mile to help others without expecting anything in return
  • Apps like Slido are a great way to encourage and enable smooth audience participation
  • People are motivated by a combination of product goals, their managers, teams, personal objectives and remuneration
  • It is really interesting to hear diverse viewpoints and learn about others’ take on subjects such as goals, how we work and what we choose to focus on
  • Inclusivity means including everyone in the conversation and the implementation of change

If you are a member of FT staff you can watch the panel recordings on Workplace by following the links below:

‘Are people motivated by product goals?’

‘Do we only measure things which are easy to measure?’

‘Are we actually ‘Agile’ and does it matter anyway?’

‘If you could change, and keep, one thing about FT tech culture, what would you choose?’

Until next time..

Tuning Varnish Cache

The FT recently sent me on a Varnish administration course run by Varnish Software, based just around the corner from our London office.

Varnishing the floor.

It was a brilliant two days of learning all about Varnish cache and the VCL language, making good use of The Varnish Book for course material.

Here are some tips on tuning Varnish cache that we discussed during the course.

The case for accessibility

FT.com for everyone. Always.

At the Financial Times we’ve recently released a new version of our website, FT.com. “Next FT”, as we’ve come to know it, is now the default experience for our users, and so far it’s proving to be a great one: It’s faster, it’s nicer, it’s better; a success across the board [1][2]. Yet there’s an aspect of our new site we have largely overlooked: accessibility (a11y).

The new FT.com

In this post we will explore what web accessibility is, why it’s important, the current state of accessibility at FT.com and the work we’re doing to improve it.

This will be the first of a series of posts that will document our progress on web accessibility at FT.com.

The Year of Lightning

Approximately a year has passed since Salesforce announced the new Lightning Experience. And what a year for Salesforce! At first I thought ‘this is going to take a while, there’s going to be a learning curve, probably known bugs to deal with’, so we tentatively started switching on the new Lightning Experience to play around with the new user interface. In a short while we tested some Visualforce pages embedded in the new Salesforce application. Finally, this summer we made the leap to building the first Lightning components and Lightning application.

The Lightning Components framework is a set of out-of-the-box components built on the open source Aura framework. Developers can utilise Aura to build their own custom components and extend the framework. The key here is that Lightning Components are client-side based. The Lightning Components framework has an event driven architecture and relies mostly on JavaScript on the client side to manage the UI and application data. Hence it performs much better than the Salesforce Classic technologies, which rely heavily on the server. You can find more information by visiting these links:

Lightning Components Framework: https://developer.salesforce.com/docs/atlas.en-us.lightning.meta/lightning/intro_framework.htm

Open source Aura Framework: http://documentation.auraframework.org/auradocs

One of my favourite series as a child was ‘The Flash’. He could miraculously get from his home, dressed in pyjamas, to the front of a shop window down the street within seconds. When I built my first Lightning app this year, the images of ‘The Flash’ running around with the speed of light immediately came to my mind. Three words: fast, simple, beautiful. No wonder they named it Lightning.

Adventures with Neo4j and Timetrees

Update:  Since writing this blog I’ve learnt that there may be a better approach to this problem. These days, Neo4j allows you to make indexes on numeric properties and run range queries that use the index. We can take advantage of this for dates by storing them as millisecond timestamps, allowing us to perform date range queries without the need to maintain a time tree.

If you’re aware of this and still vaguely interested in time trees from an academic point of view, by all means read on 😁.


The new and improved FT website, launching 5 October, has many exciting and engaging new features, one of which is the subject of my own team’s focus: myFT.


Putting Jetpacks On Our Membership Platforms; How the FT made message processing near real time in salesforce.com

In 2015 the FT replaced its monolithic subscription and entitlements system with a platform of microservices and APIs. This provided the FT with a modern, scalable platform for managing our users and subscriptions on FT.com.


Automated API testing

Developing microservices with RESTful APIs means a large amount of testing will involve hitting endpoints and checking the results. From the tester’s point of view, this is a lot of doing the same thing over and over, hitting the same endpoints over and over.

As a result we concluded that automated tests were the way to go. We collectively decided to go with BDD style testing to make it easy for anyone to understand the test output.

Salesforce at the FT – Orgs, Objects, and Runways

In 2011 the Financial Times made a strategic decision to use the Force.com platform for a number of key initiatives.

Salesforce was already embedded as a CRM (the ‘Sales Cloud’) for a subset of sales users. However, over an 18 month period the scope of this would be increased substantially, with all 2000+ employees having some level of access to Salesforce.

A suite of applications would be built on the Force.com platform supporting a broad church of business processes: from FT online subscriptions... to employee holiday requests... from print advertising bookings... to cataloging equipment for journalists (such as flak jackets).