What Happens When You Visit ft.com?

The Financial Times front page.

This is an overview of how the Financial Times serves requests to www.ft.com, starting with our domains, going all the way down to our Heroku applications, and covering everything in between.

Table of Contents

  1. Domain Name System
  2. Content Delivery Network
  3. Preflight
  4. Router
  5. Service Registry
  6. Applications
  7. Elasticsearch
  8. The End Result

Domain Name System (DNS)

We use Dyn as our name server provider. They are a single point of failure, but caching of our DNS records should help during short outages.

Two of our most important domains are www.ft.com and ft.com (which is also known as the apex domain). Both domains point to our content delivery network (CDN).

For the www.ft.com subdomain we have a CNAME record pointing to f3.shared.global.fastly.net. This delegates the DNS resolution to our CDN, Fastly.

Our apex record, however, cannot contain a CNAME, so we use four A records instead.

Typically the CNAME record for www.ft.com will resolve to the same four A records as ft.com.

;; ANSWER SECTION:
www.ft.com.  3205 IN CNAME f3.shared.global.fastly.net.
f3.shared.global.fastly.net. 14 IN A 151.101.2.109
f3.shared.global.fastly.net. 14 IN A 151.101.66.109
f3.shared.global.fastly.net. 14 IN A 151.101.130.109
f3.shared.global.fastly.net. 14 IN A 151.101.194.109
;; ANSWER SECTION:
ft.com.   13336 IN A 151.101.2.109
ft.com.   13336 IN A 151.101.66.109
ft.com.   13336 IN A 151.101.130.109
ft.com.   13336 IN A 151.101.194.109
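
The two answer sections above can be checked against each other. The following is a minimal sketch, assuming the records are copied by hand from the dig output rather than fetched live; the helper names `parse_answers` and `resolve_a` are illustrative and not part of any FT tooling.

```python
from collections import defaultdict

# Records copied from the two answer sections above (no network access needed).
DIG_ANSWERS = """\
www.ft.com.  3205 IN CNAME f3.shared.global.fastly.net.
f3.shared.global.fastly.net. 14 IN A 151.101.2.109
f3.shared.global.fastly.net. 14 IN A 151.101.66.109
f3.shared.global.fastly.net. 14 IN A 151.101.130.109
f3.shared.global.fastly.net. 14 IN A 151.101.194.109
ft.com.   13336 IN A 151.101.2.109
ft.com.   13336 IN A 151.101.66.109
ft.com.   13336 IN A 151.101.130.109
ft.com.   13336 IN A 151.101.194.109
"""


def parse_answers(text):
    """Group answer lines as {(name, record_type): [values]}."""
    records = defaultdict(list)
    for line in text.splitlines():
        parts = line.split()
        if line.startswith(";") or len(parts) != 5:
            continue  # skip comments and anything that is not a record
        name, _ttl, _cls, record_type, value = parts
        records[(name, record_type)].append(value)
    return records


def resolve_a(name, records, max_hops=8):
    """Follow CNAME records until A records are found; return sorted addresses."""
    for _ in range(max_hops):
        if (name, "A") in records:
            return sorted(records[(name, "A")])
        cnames = records.get((name, "CNAME"))
        if not cnames:
            return []
        name = cnames[0]
    return []


RECORDS = parse_answers(DIG_ANSWERS)

if __name__ == "__main__":
    print(resolve_a("www.ft.com.", RECORDS))
    print(resolve_a("ft.com.", RECORDS))
```

Both names end up at the same four addresses, one via the CNAME and one directly.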

Anycast routing.

Fastly maintain servers in over 50 locations around the world, but we only see 4 IP addresses in our DNS queries.

So how does our traffic end up talking with the closest available Fastly server?

Fastly manage traffic on their network using the Border Gateway Protocol (BGP) and Anycast routing, allowing them to send requests to the nearest point of presence while avoiding unplanned outages and locations that are down for maintenance.

Anycast is a network addressing and routing method in which datagrams from a single sender are routed to any one of several destination nodes, selected on the basis of which is the nearest, lowest cost, healthiest, with the least congested route, or some other distance measure.

Fastly route around outages in two ways. The first is at the DNS layer, updating their DNS records to avoid the problematic location. The second is at the network layer, broadcasting new routes using BGP, which alters the path that a request’s TCP packets will take between routers.
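
The selection idea behind Anycast can be illustrated with a toy model. This is only a sketch: real BGP path selection involves many more attributes, and the `Pop` type, its location names, and its cost values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pop:
    name: str      # hypothetical point-of-presence label
    cost: int      # illustrative route cost from the client; lower is nearer
    healthy: bool  # False when the location is down or withdrawn from routing


def nearest_pop(pops):
    """Return the lowest-cost healthy POP, or None if none is reachable."""
    candidates = [p for p in pops if p.healthy]
    return min(candidates, key=lambda p: p.cost, default=None)


if __name__ == "__main__":
    pops = [Pop("london", 5, True), Pop("frankfurt", 12, True), Pop("new-york", 80, True)]
    print(nearest_pop(pops).name)
```

Marking the nearest location as unhealthy makes the next cheapest one win, which is the effect of a route being withdrawn.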

At the end of all this we eventually connect to a Fastly server, so what happens next?

Content Delivery Network (CDN)

We use a CDN to reduce the number of requests made to our applications running in Heroku.

Much of our content is the same for all users, typically only a little different if you are logged in or not. If we cache these different versions in the CDN we can serve requests without even bothering the Heroku applications.

Caching

Our setup allows us to cache ~94% of all requests, with a cache hit rate of ~90%. So if we see something like 9,000,000 requests during a morning peak, by using the CDN’s cache we only pass on ~900,000 requests to our Heroku applications.
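
The arithmetic behind that figure is simple enough to write down. A minimal sketch, using integer percentages to avoid floating-point rounding; the function name is illustrative.

```python
def origin_requests(total_requests, hit_rate_percent):
    """Requests that miss the cache and are passed on to the origin."""
    return total_requests * (100 - hit_rate_percent) // 100


if __name__ == "__main__":
    print(origin_requests(9_000_000, 90))
```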

Fastly respect the Cache-Control or Surrogate-Control headers that our applications include in their response, as defined in the HTTP Caching Specification and the Edge Architecture Specification.

Let’s take a look at the caching headers for our home page (add a Fastly-Debug: 1 header to your request to see all these response headers).

GET / HTTP/1.1
Accept: */*
Host: www.ft.com
Fastly-Debug: 1
HTTP/1.1 200 OK
Age: 76
Content-Length: 41742
Content-Type: text/html; charset=utf-8
Date: Fri, 24 Nov 2017 09:24:39 GMT
Etag: W/"4fe7d-l04bmzZM7z5hmNTtslNXHn0d9L0"
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Surrogate-Control: max-age=86400, stale-while-revalidate=86400, stale-if-error=86400
Surrogate-Key: frontpage
Vary: Accept, Accept-Encoding

Here the main response headers we’re interested in are Age, Cache-Control, and Surrogate-Control.

Age states, in seconds, how long this response has been cached by Fastly. It also helps to indicate that the response was successfully served from the cache.

Cache-Control defines several directives, but in summary is saying this response should not be cached.

Surrogate-Control, however, states in its max-age directive that the response can be cached for 24 hours, the directive’s value being defined in seconds. In Fastly, this header takes precedence over any Cache-Control header. This allows us to define different caching rules for browsers and the CDN, as browsers ignore the Surrogate-Control header.
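
The precedence described above can be sketched as two small functions. This is a simplification under stated assumptions: it ignores s-maxage and Expires, the function names are illustrative, and it is not Fastly’s implementation.

```python
# Caching headers copied from the home page response above.
HOME_PAGE_HEADERS = {
    "Cache-Control": "max-age=0, no-cache, no-store, must-revalidate",
    "Surrogate-Control": "max-age=86400, stale-while-revalidate=86400, stale-if-error=86400",
}


def parse_directives(header_value):
    """Parse 'max-age=0, no-cache' into {'max-age': '0', 'no-cache': None}."""
    directives = {}
    for part in header_value.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, value = part.partition("=")
        directives[name.strip().lower()] = value.strip() if sep else None
    return directives


def browser_ttl(headers):
    """Seconds a browser may reuse the response; only Cache-Control is considered."""
    directives = parse_directives(headers.get("Cache-Control", ""))
    if "no-store" in directives or "no-cache" in directives:
        return 0
    return int(directives.get("max-age") or 0)


def cdn_ttl(headers):
    """Seconds the CDN may cache the response; Surrogate-Control max-age wins if set."""
    surrogate = parse_directives(headers.get("Surrogate-Control", ""))
    if surrogate.get("max-age") is not None:
        return int(surrogate["max-age"])
    return browser_ttl(headers)


if __name__ == "__main__":
    print(browser_ttl(HOME_PAGE_HEADERS), cdn_ttl(HOME_PAGE_HEADERS))
```

For the home page headers, the browser lifetime is zero while the CDN lifetime is a full day.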

Serving Stale

We also define stale-while-revalidate and stale-if-error directives, which tell Fastly that we are happy to serve responses from the cache, even if the cached object’s Age has exceeded what’s defined in max-age.

stale-while-revalidate allows us to respond using a stale response while grabbing a fresh copy in the background, ensuring we’re responding to requests as quickly as possible.

The difference between cache hit time and miss time shows why serving stale is so beneficial to our users.

stale-if-error is critical to how we deal with outages and errors. It tells Fastly to serve from the cache, including stale responses, if the backend is responding with errors. This gives us time to fix issues while reducing the impact on our users when things go wrong.
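
How the three lifetimes interact can be sketched as a single decision function. This is a simplified model of the behaviour described above, not Fastly’s actual logic; the function name and the returned labels are illustrative.

```python
def cache_decision(age, max_age, stale_while_revalidate, stale_if_error, backend_healthy):
    """Decide how a cached object should be used; all durations are in seconds."""
    if age < max_age:
        return "serve-fresh"
    staleness = age - max_age
    if backend_healthy:
        if staleness <= stale_while_revalidate:
            # Respond immediately from cache and refresh in the background.
            return "serve-stale-and-revalidate"
        return "fetch"
    if staleness <= stale_if_error:
        # The backend is failing, so a stale copy is better than an error.
        return "serve-stale-on-error"
    return "error"


if __name__ == "__main__":
    # The home page response above: Age 76 with a max-age of 86400.
    print(cache_decision(76, 86400, 86400, 86400, backend_healthy=True))
```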

A quirk of Fastly means you must specify a max-age of over 61 minutes to ensure your response is cached on disk, and therefore available for a longer period of time in the CDN to serve stale. A cached object that’s only in memory can be removed for several reasons well before it’s deemed stale.

These two Cache-Control directives are part of an extension to the original Caching specification and are also supported in modern browsers.

Vary

The Vary header in the response is another part of the Caching specification.

It allows us to store different versions of a response depending on headers in the request.

Take the Accept-Encoding header in a request, and let’s say we make two requests to /search, the first with Accept-Encoding: gzip and the second with no such header.

We will actually serve two different responses. The first will come back with a header of Content-Encoding: gzip and will be compressed using gzip. The second will not contain a Content-Encoding header and will be uncompressed.

It would be pretty bad for us to serve a compressed version of the page if the client does not ask for it. For this reason we must cache these responses separately, and this is where the Vary header comes in.

In this example we would respond with Vary: Accept-Encoding. This indicates that caches should store a separate version of the response depending on the value of the Accept-Encoding header in the request. Such caches include a client’s browser and our Fastly service.
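
One way to picture this is as a cache key that includes the request’s value for every header named in Vary. A minimal sketch, assuming a simple tuple key; real caches normalise header values more carefully, and `cache_key` is an illustrative name.

```python
def cache_key(url, request_headers, vary):
    """Build a cache key from the URL plus the request value of each Vary header."""
    lowered = {name.lower(): value for name, value in request_headers.items()}
    names = sorted(h.strip().lower() for h in vary.split(",") if h.strip())
    return (url,) + tuple((name, lowered.get(name, "")) for name in names)


if __name__ == "__main__":
    print(cache_key("/search", {"Accept-Encoding": "gzip"}, "Accept-Encoding"))
    print(cache_key("/search", {}, "Accept-Encoding"))
```

The two requests to /search produce different keys, so the compressed and uncompressed responses are stored separately. Headers not listed in Vary do not affect the key.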

For the website we take this a step further within Fastly and vary on several request headers that are decorated in Preflight (as discussed later). This means that when we serve different responses, for A/B tests for example (see Vary: FT-Flags), we are still able to cache them in the CDN.

Purging

Given we tell Fastly to cache our front page for a whole day, how are we able to serve the latest version of the page to all our users?

By using the Fastly API we are able to purge the cached content. We also have an event-driven system (using AWS Kinesis) that knows when content has changed. We use this information to issue purge requests and serve the very latest news to our users.

Purge requests issued to Fastly on a typical Monday.

Fastly supports several types of purging. The simplest method is to issue a hard purge by URL, but this may result in a slower response for a few users.

Our autonomous systems make heavy use of soft purging by surrogate key, as this should result in no end user impact, and ensures all related content is purged, even if it exists on multiple URLs (e.g. /, /?edition=uk, and /?edition=international).

How does soft purging result in no end user impact? It is very similar to what we discussed earlier in our use of stale-while-revalidate. Soft purging in essence marks cached responses with the given surrogate key as stale, even if they are still fresh according to their max-age value. This then allows Fastly to serve the stale response until they’ve fetched a fresh version in the background.
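
The difference between the two purge types can be shown with a small in-memory model. This is a sketch of the concept only: it does not call the Fastly API, and the class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CachedObject:
    body: str
    surrogate_keys: frozenset
    stale: bool = False


class SurrogateKeyCache:
    """Toy cache that tags each URL with surrogate keys."""

    def __init__(self):
        self._objects = {}

    def store(self, url, body, surrogate_keys):
        self._objects[url] = CachedObject(body, frozenset(surrogate_keys))

    def soft_purge(self, key):
        """Mark every object tagged with key as stale; return how many were marked."""
        marked = 0
        for obj in self._objects.values():
            if key in obj.surrogate_keys:
                obj.stale = True
                marked += 1
        return marked

    def hard_purge_url(self, url):
        """Remove one URL outright; the next request for it is a cache miss."""
        return self._objects.pop(url, None) is not None

    def lookup(self, url):
        """Return (body, is_stale), or None on a miss."""
        obj = self._objects.get(url)
        return None if obj is None else (obj.body, obj.stale)
```

After a soft purge the object is still available to serve, flagged as stale, whereas a hard purge leaves nothing to serve until the origin responds.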

Fastly’s points of presence around the planet.

The Fastly Black Box

The normal Fastly stack is a two-layer system: first an HTTP proxy called h2o, which manages TLS termination, and then a popular caching HTTP reverse proxy called Varnish.

Fastly maintain their own fork of Varnish and have heavily modified it to suit their platform, so while this means we define our logic in Varnish Configuration Language (VCL) we must refer to the Fastly documentation more than Varnish’s.

The Fastly black box.

For www.ft.com, however, we are not using the h2o part of the Fastly black box. In order to support TLS 1.0 and 1.1 for IE 10, we instead point at a different part of their infrastructure to handle the TLS termination.

Decorating Requests

An important part of what we do to a request in Fastly is decorating it with a whole bunch of metadata (e.g. session state, A/B test groups, etc.). This is handled by our Preflight application.

There’s a complex bit of VCL that passes the request to Preflight, takes the response and enriches the original request, then restarts the Varnish state machine to either serve the request from cache or fetch a fresh response from our applications.

What Makes Us Platinum?

Simplifying what happens in Fastly, we allow their platform to do a bit of caching, and every now and then ask our applications for new content.

To be platinum we must be able to serve requests from two regions, to cater for an outage of a whole region. For us that means we run in Heroku’s EU and US regions.

When Fastly does talk to our applications it runs through a snippet of VCL that determines which region should serve the request. Ideally this is the closest region to our visitor (e.g. a request from New York should be served by the US Heroku region). However, if a region is unhealthy, which we continually monitor for, our Fastly service will fall back to the other, hopefully healthy, region.
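
The region choice can be sketched in a few lines. The real logic lives in VCL; this Python version is only an illustration, and the function name and health map are assumptions.

```python
REGIONS = ("EU", "US")


def choose_region(nearest, health):
    """Prefer the visitor's nearest region; fall back to another healthy one."""
    ordered = [nearest] + [region for region in REGIONS if region != nearest]
    for region in ordered:
        if health.get(region, False):
            return region
    return None  # no healthy region; serving stale is the remaining option


if __name__ == "__main__":
    print(choose_region("US", {"EU": True, "US": True}))
    print(choose_region("US", {"EU": True, "US": False}))
```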

Preflight

This is a Heroku application, which lives at https://github.com/Financial-Times/next-preflight.

Preflight forwards the user’s request for a web page to several other FT APIs in order to decorate the request with various properties.

Preflight gathers test information from our Ammit service, vanity URLs from our URL management service, subscription information from membership’s Access service, barrier page information from our Barrier Guru service, and finally session information from membership’s Session service.

By doing this in Preflight, in combination with Fastly, we avoid having to do all this work in each of our applications. They can just make use of the decorated request.

Router

This is another Heroku application, but it is a little different from our typical Express.js applications.

It lives at https://github.com/Financial-Times/next-router.

The router is a simple streaming HTTP proxy that takes a request and passes it on to the correct application. We define where requests should be sent to in our service registry, for example requests to ^/search are directed to the search page Heroku application.

The ft.com router.

Service Registry

Our service registry is a basic JSON document that is hosted as a platinum service. It’s stored in S3 across two regions, and uses a similar setup to our ft.com Fastly service to serve from both regions.

Here’s a little example snippet with some extra details removed. You should be able to spot a path, ^/__foo-bar, and a Heroku app foo-bar-eu.

[
  {
    "name": "foo-bar",
    "description": "An example service.",
    "host": "www.ft.com",
    "tier": "bronze",
    "paths": [
      "^/__foo-bar"
    ],
    "nodes": [
      {
        "region": "EU",
        "url": "https://foo-bar-eu.herokuapp.com"
      }
    ],
    "repository": "https://github.com/Financial-Times/next-foo-bar"
  }
]
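
To show how a router might use an entry like this, here is a minimal sketch that matches a request path against the registry’s path patterns and picks a node by region. It reuses the example document above; `find_service` and `node_url` are hypothetical helpers, not the router’s actual code.

```python
import json
import re

# The example registry entry from above.
REGISTRY_JSON = """
[
  {
    "name": "foo-bar",
    "description": "An example service.",
    "host": "www.ft.com",
    "tier": "bronze",
    "paths": ["^/__foo-bar"],
    "nodes": [
      {"region": "EU", "url": "https://foo-bar-eu.herokuapp.com"}
    ],
    "repository": "https://github.com/Financial-Times/next-foo-bar"
  }
]
"""


def find_service(path, registry, host="www.ft.com"):
    """Return the first service whose host matches and whose pattern matches path."""
    for service in registry:
        if service["host"] != host:
            continue
        if any(re.search(pattern, path) for pattern in service["paths"]):
            return service
    return None


def node_url(service, region):
    """Return the service's URL in the given region, or None if it has no node there."""
    for node in service["nodes"]:
        if node["region"] == region:
            return node["url"]
    return None


if __name__ == "__main__":
    registry = json.loads(REGISTRY_JSON)
    service = find_service("/__foo-bar/health", registry)
    print(service["name"], node_url(service, "EU"))
```

Because this example service has a single EU node, asking for a US node returns nothing, which matches its bronze tier.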

Heroku and the Host Header

As an aside, it is worth discussing how Heroku knows where to send requests.

Heroku is a platform that only supports HTTP/1.1 requests, as it depends on the Host header to know which application should receive a request.

This is why we have applications called foo-bar-eu.herokuapp.com and foo-bar-us.herokuapp.com so that we can set the Host header in the router and send them requests accordingly.

While you can add custom domains, for the reasons above you cannot set the same custom domain on two different Heroku apps.
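
A toy model of host-based routing makes the constraint easier to see. This is an illustration only, not how Heroku is implemented; the `HostRouter` class and the example.com domain are hypothetical.

```python
class HostRouter:
    """Toy model of a platform that picks an app from the Host header."""

    def __init__(self):
        self._apps = {}

    def add_domain(self, domain, app):
        """Attach a domain to an app; a domain can belong to one app only."""
        domain = domain.lower()
        if domain in self._apps and self._apps[domain] != app:
            raise ValueError(f"{domain} is already attached to {self._apps[domain]}")
        self._apps[domain] = app

    def route(self, headers):
        """Return the app for the request's Host header, or None if unknown."""
        host = headers.get("Host", "").split(":")[0].lower()
        return self._apps.get(host)


if __name__ == "__main__":
    router = HostRouter()
    router.add_domain("foo-bar-eu.herokuapp.com", "foo-bar-eu")
    router.add_domain("foo-bar-us.herokuapp.com", "foo-bar-us")
    print(router.route({"Host": "foo-bar-us.herokuapp.com"}))
```

Since the Host value is the only lookup key, attaching one domain to two apps would make the choice ambiguous, so the second attempt is rejected.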

Applications

This is our standard Heroku application, for example the front page or the stream page. These are typical Node.js-based applications running on Heroku, built with the Express.js framework.

We use components to share common functionality between all our applications, some examples being n-express and n-ui.

Typically the data sources for these applications will either be our Elasticsearch clusters, or the Next API.

Once our application has handled the request, the response travels all the way back through the stack, is hopefully cached by Fastly, and is then sent on to your browser 🙌.

Going Platinum

With our microservice-based setup, no two applications are the same. Because of this, while www.ft.com is a platinum service, we don’t offer 24/7 support for the whole site. Our range of service “metals” is either bronze or platinum, though you may see gold and silver mentioned around the rest of the company.

For example, www.ft.com/search is a bronze service, but www.ft.com/?edition=international is platinum.

The main difference between bronze and platinum is that a bronze service only needs to run in a single region, while a platinum service, as discussed previously, must operate in two regions.

Elasticsearch

We run a platinum tier Elasticsearch endpoint, using two highly available clusters in two distinct AWS regions.

These clusters are our store of all content for www.ft.com, and are addressed using a single DNS record.

How does it work? We use a service provided by Dyn called Traffic Director, which allows us to achieve routing results similar to what we do in Fastly for www.ft.com.

The domain has two pools of addresses: one points at the US Elasticsearch cluster, the other at the EU cluster. If everything is healthy then Dyn advertises the closest pool to the request. If a pool is unhealthy then Dyn will not advertise it, falling back to the other healthy pool.

The difference between this and how we achieve platinum in Fastly is that this setup is entirely DNS based, and so when issues occur we will be advertising a different CNAME record (whereas in Fastly this all happens inside Varnish).

The End Result

What follows is a simplified overview of our stack.