Splunk HTTP Event Collector: Direct pipe to Splunk

In August 2016 the FT switched from on-premises Splunk to Splunk Cloud (SaaS). Since then we have seen big improvements in the service:

  1. Searches are faster than ever before
  2. Uptime is near 100%
  3. New features and security updates are deployed frequently

One interesting new feature of Splunk Cloud is the HTTP Event Collector (HEC). HEC is an API that enables applications to send data directly to Splunk without having to rely on intermediate forwarder nodes. Token-based authentication and SSL encryption ensure that communication between peers is secure.

HEC supports raw and JSON-formatted event payloads. JSON-formatted payloads make it possible to batch multiple events into a single JSON document, which makes data delivery more efficient because multiple events can be delivered within a single HTTP request.
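To make the batching idea concrete, here is a minimal Go sketch of how a batch body could be built. It assumes that HEC accepts several event objects written one after another in the same request body (rather than wrapped in a JSON array). The `hecEvent` type and the `encodeBatch` function are illustrative names, not taken from forwarder.go.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// hecEvent is a hypothetical wrapper type: each log message is placed
// under the "event" key of its own JSON object.
type hecEvent struct {
	Event string `json:"event"`
}

// encodeBatch serialises every message as a separate JSON object and writes
// the objects one after another (newline separated), so that the whole batch
// can travel in a single HTTP request body.
func encodeBatch(messages []string) (string, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	for _, m := range messages {
		if err := enc.Encode(hecEvent{Event: m}); err != nil {
			return "", fmt.Errorf("encoding event: %w", err)
		}
	}
	return buf.String(), nil
}

func main() {
	body, err := encodeBatch([]string{"first log line", "second log line"})
	if err != nil {
		panic(err)
	}
	fmt.Print(body)
}
```

Two log lines end up in one body, so they cost one HTTP round trip instead of two.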

Time before HEC

Before I dive into the technical details, let’s look at what motivated us to start looking at HEC.

I’m a member of the Integration Engineering team and I’m currently embedded in the Universal Publishing (UP) team. The problem that I was recently asked to investigate relates to log delivery to Splunk Cloud. Logs sent from UP clusters took several hours to appear in Splunk. This caused various issues with Splunk dashboards and alerts, and slowed down the troubleshooting process because the data was not immediately available in Splunk.

The following screenshot highlights the issue: an event that was logged at 7:45am (see Real Timestamp) appears in Splunk 8 hours and 45 minutes later, at 4:30pm (see Splunk Timestamp).

Logs arriving to Splunk several hours late

The original log delivery pipeline included the following components.

  1. Journald – a system service that collects and stores logging data
  2. forwarder.go – a Go application with a single worker that receives events from journald and sends them to the splunk-forwarder cluster
  3. splunk-forwarder cluster – a cluster of four EC2 instances and a load balancer that receives events from the Go application and forwards them to Splunk Cloud

The following diagram illustrates the log delivery pipeline back then.

Original log delivery pipeline

The initial investigation focused on the splunk-forwarder cluster. The logs in the cluster suggested that event timestamps were already lagging behind by the time events arrived at the cluster. This indicated that the Go application, with its single worker, was not able to handle the volume of events it received from journald. So we started planning iteration 1 of forwarder.go.

Iteration 1: An event queue, parallel workers and Grafana metrics

The new forwarder.go release introduced an event queue that buffers events while the workers are busy sending events to the splunk-forwarder cluster. The number of workers was also increased, which enabled events to be delivered to the splunk-forwarder cluster in parallel. The number of workers was made configurable so that we could easily add more if there were not enough to process events from the queue. To gain visibility into the internals of forwarder.go, a few metrics were introduced and delivered to Grafana for graphing.
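The queue-and-workers pattern described above can be sketched in Go with a buffered channel and a pool of goroutines. This is a simplified illustration, not the forwarder.go source: `runWorkers`, `deliverAll` and the counting `send` callback are hypothetical names, and the real application sends events over the network instead of counting them.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runWorkers starts the given number of goroutines. Each one takes events
// off the queue and hands them to send. It returns once the queue has been
// closed and fully drained.
func runWorkers(queue <-chan string, workers int, send func(string)) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for ev := range queue {
				send(ev)
			}
		}()
	}
	wg.Wait()
}

// deliverAll is a hypothetical driver: it pushes total events through a
// buffered queue and returns how many events the workers handled.
func deliverAll(total, queueSize, workers int) int64 {
	queue := make(chan string, queueSize) // len(queue) is the "queue size" metric
	var delivered int64
	done := make(chan struct{})
	go func() {
		defer close(done)
		runWorkers(queue, workers, func(string) { atomic.AddInt64(&delivered, 1) })
	}()
	for i := 0; i < total; i++ {
		queue <- fmt.Sprintf("event %d", i) // blocks while the queue is full
	}
	close(queue)
	<-done
	return atomic.LoadInt64(&delivered)
}

func main() {
	// 256 and 8 mirror the queue size and default worker count in this post.
	fmt.Println(deliverAll(1000, 256, 8))
}
```

Because the producer blocks when the buffered channel is full, a queue that sits at its maximum is a sign that the workers, or something downstream of them, cannot keep up.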

After Iteration 1 the log delivery pipeline diagram began to evolve.

Forwarder.go with event queue, workers and Grafana integration

After deploying the new release to production it was disappointing to notice that the delay in log delivery had not been fully eliminated. On a positive note, we now had better visibility into what was happening inside the Go application, thanks to Grafana.

One of the metrics we introduced was event queue size. The following screenshot from Grafana, taken after the deployment, shows that the queue (which holds up to 256 events) was maxing out on most of the nodes in the cluster.

Event queue size metrics in Grafana

As mentioned earlier, the number of workers was made configurable in this release, but increasing it from the default of 8 to 12 didn’t have much impact on the queue size. This was a strong indication that the bottleneck was somewhere other than the forwarder.go application.

A closer look at the splunk-forwarder cluster revealed that a few of the nodes in the cluster were struggling to process incoming messages fast enough. After these nodes were resized (adding CPUs and memory) the log delay was reduced significantly, but the queue size within the forwarder.go process still stayed at the maximum of 256.

Iteration 2: Splunk HTTP Event Collector with event batching

It was time to have a fresh think about the current setup and look at alternatives to sending logs to Splunk Cloud via the splunk-forwarder cluster.

I discovered a blog post about Splunk HTTP Event Collector and I decided to give it a try.

Getting started with HEC

To get started with HTTP Event Collector you will need an endpoint URL and an authentication token. You can request an authentication token from Splunk Support.

Testing the token on command line

Once you have the token you should verify that it works and you are able to send data to Splunk. The easiest way to test the token is to use curl.

All you’ll need is the endpoint URL, token and some data to send to Splunk.

Here is an example command that sends the JSON document {"event": "Splunk HTTP Collector event"} to the HEC endpoint with the token in the Authorization header.

Testing authorisation token using curl

When the request is successful it returns the response: {"text": "Success", "code": 0}
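The same check can be made from Go, which is a useful stepping stone towards a HEC client. The sketch below is not the forwarder.go implementation. It assumes that the token is sent as `Authorization: Splunk <token>` and that the endpoint path is `/services/collector/event`. The `sendEvent` and `fakeHEC` names and the `example-token` value are made up for illustration, and `fakeHEC` is a local stand-in server so the example runs without a real Splunk endpoint.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// sendEvent posts one JSON payload to a HEC endpoint with the token in the
// Authorization header and returns the response body on HTTP 200.
func sendEvent(client *http.Client, url, token, payload string) (string, error) {
	req, err := http.NewRequest(http.MethodPost, url, strings.NewReader(payload))
	if err != nil {
		return "", err
	}
	req.Header.Set("Authorization", "Splunk "+token)
	req.Header.Set("Content-Type", "application/json")
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("HEC returned %s: %s", resp.Status, body)
	}
	return string(body), nil
}

// fakeHEC is a hypothetical stand-in for a HEC endpoint. It accepts a single
// token and answers with the success document quoted in this post.
func fakeHEC(token string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("Authorization") != "Splunk "+token {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		io.WriteString(w, `{"text":"Success","code":0}`)
	}))
}

func main() {
	srv := fakeHEC("example-token")
	defer srv.Close()
	out, err := sendEvent(srv.Client(), srv.URL+"/services/collector/event",
		"example-token", `{"event": "Splunk HTTP Collector event"}`)
	fmt.Println(out, err)
}
```

To try it against a real endpoint, replace the stand-in server's URL and token with the ones you received from Splunk Support.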

Splunk HEC client and event batching

Implementing the HEC client required only a small amount of effort, and it simplified the delivery process because the splunk-forwarder cluster was no longer part of the pipeline.

Forwarder.go connecting directly to Splunk HTTP Event Collector

We also introduced a configurable batch size which enables forwarder.go to batch events before sending them to Splunk Cloud.
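A configurable batch size could look roughly like the following Go sketch, which groups events read from the queue into batches and hands each full batch to a flush function. `batchEvents` and `batchSizes` are hypothetical names, not taken from forwarder.go. The sketch also leaves out something a production forwarder would probably want: flushing a partly filled batch after a timeout, so that quiet periods do not delay events.

```go
package main

import "fmt"

// batchEvents reads events from the queue and groups them into batches of at
// most batchSize (which must be greater than zero). Each full batch is passed
// to flush; the remainder is flushed when the queue is closed.
func batchEvents(queue <-chan string, batchSize int, flush func([]string)) {
	batch := make([]string, 0, batchSize)
	for ev := range queue {
		batch = append(batch, ev)
		if len(batch) == batchSize {
			flush(batch)
			// Start a fresh slice so the flushed batch is not overwritten.
			batch = make([]string, 0, batchSize)
		}
	}
	if len(batch) > 0 {
		flush(batch)
	}
}

// batchSizes is a hypothetical helper that pushes total events through
// batchEvents and records how many events each flushed batch contained.
func batchSizes(total, batchSize int) []int {
	queue := make(chan string, total)
	for i := 0; i < total; i++ {
		queue <- fmt.Sprintf("event %d", i)
	}
	close(queue)
	var sizes []int
	batchEvents(queue, batchSize, func(b []string) { sizes = append(sizes, len(b)) })
	return sizes
}

func main() {
	fmt.Println(batchSizes(25, 10)) // prints [10 10 5]
}
```

With a batch size of 10, 25 queued events need three HTTP requests instead of 25, which is why the queue drains so much faster.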

After deploying this release to live we could see a big drop in event queue size in Grafana.

Event queue size metrics in Grafana after iteration 2

At the 17:30 mark in the graph above, the new release was promoted to production with the default batch size of 10, which resulted in the queue size falling below 100 events. At the 17:40 mark the batch size was reconfigured to 20, which made the queue size drop below 50 events across all nodes.

After introducing Splunk HEC and event batching, the forwarder.go application has much more headroom in the event queue to store events from journald.

We no longer have to wait for hours for logs to appear in Splunk. Instead we can monitor logs in real-time with latency down to ~100ms.

Splunk real-time log ingestion
Splunk real-time log view

I strongly recommend HEC for any application that currently uses a splunk-forwarder cluster.

A reference implementation of the HEC client, written in Go, can be found on GitHub: https://github.com/Financial-Times/coco-splunk-http-forwarder.

 

Author: Jussi Heinonen

Senior Integration Engineer at Financial Times