Adventures with Neo4j and Timetrees

The new and improved FT website, launching 5 October, has many exciting and engaging new features, one of which is the subject of my own team’s focus: myFT.

screen-shot-2016-09-29-at-20-15-43

myFT offers a means of curating a personalised FT newsfeed and email digest of specific topics chosen by the subscriber. One of the ways in which subscribers can find their preferred topics is via our recommendation engine, which we’ve developed by harnessing the power of the Neo4j graph database.

Neo4j is wonderfully suited to this sort of problem domain and we’ve thoroughly enjoyed using it.

The problem: travelling through time with Neo4j

Soon enough, we found ourselves in a situation where we needed to be able to select a set of nodes on a particular day or between two points in time. This proved to be surprisingly tricky using Neo4j.

Thus far, our approach to recording time had been to add a millisecond timestamp property to nodes and relationships. As explained in this excellent blog by GraphAware, when taking this naive approach, selecting events that occurred within a set time range requires a full scan of all candidate nodes. When the nodes in question represent FT article read events this can be a very expensive scan.

The solution: timetrees

The solution, as explained in GraphAware’s blog, is to introduce a timetree structure into the database which allows units of time to be selected as nodes. The timetree nodes can act as the start of a localised search that takes full advantage of the strengths of a graph database.

After first experimenting with GraphAware’s instant timetree plugin to generate the timetree on demand, we found our needs were more suited to building the timetree structure in advance. Following a blog by Mark Needham, we wrote a query to generate a timetree to the granularity of one day, including NEXT relationships representing the sequence of the days. You can see our finished query here. We can then represent the times at which events occur by adding relationships between the event and day nodes.

This is great! We can now select whole swathes of results between two dates with refreshing ease and efficiency. For example…

Selecting events by day

To get hold of all the events that happened on a particular day in our brave new timetree world, we can do something like this:


// returns all event that happened on 14/3/2016
MATCH (:Year {value: 2016})–(:Month {value: 3})–(d:Day {value: 14})<-[:at_time]-(e:Event)
RETURN e

I’ll dissect this a bit. On the left-hand side of the MATCH clause we have this snippet:

MATCH (:Year {value: 2016})–(:Month {value: 3})–(d:Day {value: 14})

Starting from the Year node, this snippet traverses the timetree to select our desired Day node.

The rest of this query is concerned with matching and returning all event notes that have an at_time relationship with this Day node.

screen-shot-2016-10-03-at-14-22-37

Selecting events within a range of time

To bring up data of the events that happened over a period of 3 days, we can do this:

// returns all events that happened between 12/3/2016 and 14/3/2016 (inclusive)
MATCH (:Year {value: 2016})–(:Month {value: 3})–(:Day {value: 15})<-[:NEXT*..3]-(days:Day)<-[:at_time]-(e:Event)
RETURN e

The first snippet of the query is similar to the start of the single-day query above:

MATCH (:Year {value: 2016})–(:Month {value: 3})–(:Day {value: 15})

We traverse the timetree to find the node for the day after our desired range.

The next snippet uses this day-after node as a starting point to find our range:

<-[:NEXT*..3]-(days:Day)

I mentioned earlier that our timetree has NEXT relationships linking its Day nodes. These allow the timetree to be traversed horizontally across days. The [:NEXT*..3] snippet uses variable-length relationship matching to select a three-deep chain of Day nodes that matches our desired range.

The rest of the query matches all event nodes that have an at_time relationship with any of the three nodes matched in the days parameter.

screen-shot-2016-10-03-at-14-27-43

Room for improvement in performance and aesthetics

The timetree solution allows us to get hold of all of our events pretty darn speedily. However, there are drawbacks.

First of all, it asks us to compromise on our readability and DRYness principals within our timetree-using queries, in that we are required to repeat the sizeable year-month-day traversal for each one. (That’s this bit: MATCH (:Year {value: 2016})–(:Month {value: 3})–(:Day {value: 15})).

We also took exception to the small but niggling overhead of the timetree-traversal query that had to take place for every execution, which we calculated as roughly 150 database hits per query.

Elegant time travel: the date string index trick

We came up with a solution to these problems that we’re fairly pleased with. We indexed all of the Day nodes by a string representation of that particular date, allowing us to perform what are effectively the same queries using the following pattern:


// returns all event that happened on 14/3/2016
MATCH (e:Event)-[:at_time]->(:Day {dayString:’14/3/2016′})
RETURN e

// returns all events that happened between 12/3/2016 and 14/3/2016 (inclusive)
MATCH(:Day {dayString:’15/3/2016′})<-[:NEXT*..3]-(days:Day)<-[:at_time]-(e:Event)
RETURN e

Year-month-day traversal boilerplate? Gone. Readability of query? Through the roof. Unnecessary timetree-related DB hits? Poof!

Wow! Tell me how

We want our day nodes to look something like this:

screen-shot-2016-10-03-at-14-32-27

And to do this we’ve written a query that you can take a look at here: https://github.com/Financial-Times/next-myft-api/blob/master/cypher/time-tree-index.cql

You might notice that in this query we add a similar property to the Month nodes. This allows us to use the same approach to select all events that occur on particular months, as well as days.

Once this step is done, we make the indexes, and we’re ready for time travel:

CREATE INDEX ON :Day(uuid)

The benefit of hindsight

There are of course, parts of this solution that we wish we had done differently, and certainly will do in the future.

‘Order by’-friendly date format

One of these days we may go back and retrofit an opportunity that, unfortunately, we didn’t spot at the time. You might notice that we’ve stored our date index property in little-endian format (https://en.wikipedia.org/wiki/Calendar_date#Date_format), e.g. 14/3/2016.

This makes the property really nice and readable (particularly if you’re not from the United States). However, if we had stored the date in big-endian format with zero padding, e.g. 2016/03/16, we would have created a situation wherein it was very easy to order query results by date, with only a very slight compromise to readability.

Doing it wrong

Before the date string idea came along, we trialled a much less palatable solution to a situation that required us to make lots of queries starting from the Day node representing the actual current day.

Willfully flouting the recommendations laid out in the Neo4J documentation (http://neo4j.com/docs/developer-manual/current/cypher/#match-node-by-id), our approach was to query and cache the internal Neo4j ID of the current Day node once an hour and use this to look up that node to start each query.

In our defense, this was just as fast as our final solution, but it was also very fiddly and really rather messy, so we were pleased when a better idea came along and we were able to sweep this one under the rug.

Controversial property names

Finally, I do feel a twinge of regret when I remember that our date string properties all have the key ‘uuid’, as, clearly, they do not contain UUIDs (https://en.wikipedia.org/wiki/Universally_unique_identifier).

This is a convention we adopted to fit our own identifiers around Neo4j’s reserved id property, which makes a great deal of sense for our nodes that are indexed against UUIDs, but less sense here. We made the decision to keep the uuid property as a convention across nodes in our graph for consistency’s sake.

Regardless, hooray!

You can sample the fruits of our labours on myFT at the new FT.com, and take comfort in the knowledge that the timetree-based queries under the hood read like poetry.