Logging, Alerting and Recovery on the Force.com Platform

At the FT in 2015 an initiative took place to retire our legacy fulfilment platform.

This article describes the challenges around logging and alerting from a Salesforce.com perspective as we migrated to the new architecture.

The new architecture

Although the previous system had served us well, customer service agents often had to log into multiple applications to support FT.com customers, and updates would typically flow to downstream systems slowly, via ETL data syncs.

Therefore the previous self-contained legacy application was replaced with a series of microservices, each dealing with a single business process. The aim was then for Salesforce.com to be the single entry point for customer service agents to manage FT.com customers – with updates flowing to and from the backend APIs.

An event-based message broker (Apache Kafka) was implemented for the microservices to publish and consume messages – with a bespoke Java app (the Salesforce Events Bridge) developed to orchestrate messages between Salesforce.com and the message queue.

A scheduled job within Salesforce.com would be responsible for processing the inbound messages and passing them to relevant handlers.
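As a rough sketch, such a job might look like the following – the Inbound_Message__c object, its fields and the MessageHandlerFactory are illustrative names only, not the actual implementation:

```apex
// Minimal sketch of the scheduled processor. Inbound_Message__c, its fields
// and MessageHandlerFactory are hypothetical names for illustration.
global class InboundMessageProcessorJob implements Schedulable {
    global void execute(SchedulableContext sc) {
        // Pick up messages delivered by the Salesforce Events Bridge
        List<Inbound_Message__c> pending = [
            SELECT Id, Message_Type__c, Payload__c
            FROM Inbound_Message__c
            WHERE Status__c = 'Pending'
            LIMIT 200
        ];
        for (Inbound_Message__c msg : pending) {
            // Route each message to the handler registered for its type
            MessageHandlerFactory.getHandler(msg.Message_Type__c).handle(msg);
        }
    }
}
```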

Salesforce.com would therefore be processing a high volume of messages from a variety of sources. The ability for Salesforce.com to process these messages efficiently and without error would be vital to the accuracy of the data presented to our customer service agents.

With Salesforce.com automatically processing so many message types, logging would be vital in diagnosing issues in production.

Standard Salesforce.com logging

The standard logging tools provided by Salesforce.com are fairly limited. In order to have a debug log produced for a transaction, a developer has to specify a given user to trace and a given time period for which the log will run (the maximum being 24 hours).

What results is a fairly detailed log, but not one that is easy to search or perform aggregation and analysis on.

Salesforce.com also limits log sizes to 2MB. So for a transaction with a large number of script statements you will hit the dreaded…

*** MAXIMUM DEBUG LOG SIZE REACHED ***

… and your debug log will be unapologetically truncated.

The out-of-the-box debugging tools provide a convenient way for a developer to interactively debug code they are working on – but are not adequate for post-mortem style debugging in a production environment.

Creating detailed, persistent logs all of the time

In order to go back and effectively diagnose any live issues, we wanted to be able to create customised, meaningful, easily searchable logs that would be persisted over the long term.

The de facto tool for many teams at the FT has been Splunk. Splunk captures, indexes and correlates data in a searchable repository. Having Salesforce.com continually log key information to Splunk would allow developers to diagnose and analyse the processing of messages to and from the Kafka message queue.

To do this a generic component was created within our Salesforce org that would allow developers to easily write logs to a custom logging object within Salesforce.com. A simple example:

CustomLogger.info(CustomLogger.ORDER_MANAGEMENT, 'Licence Seat Allocation Request Creation Failed. error ' + err.getMessage());
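A minimal sketch of what such a component might look like – the Log__c object and its fields are illustrative names, not the actual implementation:

```apex
// Sketch of a generic logging component. Each call inserts a record into a
// hypothetical Log__c custom object, later shipped to Splunk and purged.
public class CustomLogger {
    public static final String ORDER_MANAGEMENT = 'ORDER_MANAGEMENT';
    public static final String APEX_MONITORING  = 'APEX_MONITORING';

    public static void info(String category, String message) {
        log(category, 'INFO', message);
    }

    public static void error(String category, String message) {
        log(category, 'ERROR', message);
    }

    private static void log(String category, String level, String message) {
        insert new Log__c(
            Category__c = category,
            Level__c    = level,
            // Truncate to stay within the long text field's length
            Message__c  = message.left(32768)
        );
    }
}
```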

A simple Java app was created that would periodically pull (every 15 minutes) from the custom object and persist the records to Splunk. As each batch completed, the logs would be deleted from the custom object to prevent us from hitting storage limits in Salesforce.com.

Using this we are able to log the stack traces of exceptions, check method calls and keep an audit of when script statements, SOQL queries and other resources approach governor limits.

Monitoring and alerting

With the above logging in place we were then able to build out monitoring for our automated processes.

The inbound messages from the Kafka queue are processed by an Apex scheduled job that executes every 5 minutes, passing the messages to the relevant handler for processing.

If this job fails to execute, Salesforce.com data will become out of sync and actions raised by Salesforce.com end users will not be fulfilled downstream.

By creating a simple custom object that describes the scheduled jobs we expect to run – along with the frequency at which they should run – we can compare the schedulable class name against the Apex job queue history (querying the ApexClassId on the AsyncApexJob object).

If there is no history of the job executing within the specified period then the job has failed to execute and we can log this to our custom logger to be passed on to Splunk.

CustomLogger.info(CustomLogger.APEX_MONITORING, 'Job failed to execute within Frequency Hours. Apex Class Id : ' + apexClass.Id);
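Putting the check together, a sketch might look like the following – the Scheduled_Job_Monitor__c object and its Frequency_Hours__c field are illustrative names assumed for this example:

```apex
// Sketch of the monitoring check. Scheduled_Job_Monitor__c and its fields
// are hypothetical names; AsyncApexJob and ApexClass are standard objects.
for (Scheduled_Job_Monitor__c job : [
        SELECT Id, Schedulable_Class_Name__c, Frequency_Hours__c
        FROM Scheduled_Job_Monitor__c]) {

    ApexClass apexClass = [SELECT Id FROM ApexClass
                           WHERE Name = :job.Schedulable_Class_Name__c
                           LIMIT 1];

    // Has the job run within its expected frequency window?
    Integer recentRuns = [
        SELECT COUNT() FROM AsyncApexJob
        WHERE ApexClassId = :apexClass.Id
        AND CreatedDate >= :System.now()
                .addHours(-job.Frequency_Hours__c.intValue())
    ];

    if (recentRuns == 0) {
        CustomLogger.info(CustomLogger.APEX_MONITORING,
            'Job failed to execute within Frequency Hours. Apex Class Id : '
            + apexClass.Id);
    }
}
```

Queries inside the loop are acceptable here as the set of monitored jobs is small.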

A record of this failure now exists in Splunk. Searches with alerts can then be configured to notify the relevant teams via email, Slack etc that there is a problem – “messages from the Kafka Queue are currently not being processed by Salesforce.com.”

For a new Apex scheduled job to be covered by this monitoring, a developer simply has to add a record to the above object and an alert will be raised if the job fails to execute within the stipulated time period.

Recovery – automatically restarting Apex jobs

Although the above approach will flag problems within a short time period it still relies on a manual step to resolve the issue.

In order for a support team to handle this manually they would require:

  1. A full Salesforce.com licence each ($ – costly)
  2. A powerful set of permissions to schedule the job as a generic user*

*Our scheduled jobs all run as a generic user to keep control over the permissions required and to ensure the created by/last modified fields are easy to identify.

To support this automated process would therefore be expensive (shared Salesforce.com logins are not an option!).

After investigating different areas of the platform we found an approach that would allow the restart of a job as a generic user without having to log into Salesforce.com.

This was achieved by leveraging the out-of-the-box email service functionality.

Email services can be configured to trigger a piece of Apex code upon the receipt of an email to a specified (Salesforce.com) email address, from a specified (FT.com) email address.

But also – and more importantly – the Apex code that is triggered can be set to “run as” a specific Salesforce user – in our case, our generic user. This in turn means that the restarted scheduled job will run within the context of the generic user (FT SF Support).

By utilizing methods in the System namespace and having the name of the scheduled class stored in our monitoring record (Schedulable_Class_Name__c) we can restart the job programmatically.

The Type class contains methods for getting the Apex type that corresponds to an Apex class and for instantiating new types.

We then cast it to the Schedulable interface, since newInstance() returns a generic Object.

Schedulable schedulableClass = (Schedulable)Type.forName(scheduledJob.Schedulable_Class_Name__c).newInstance();

System.schedule() is then used to schedule the class using a cron expression.

System.schedule(scheduledJob.Scheduable_Job_Name__c, scheduledJob.Scheduable_Expression__c, schedulableClass);
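Putting the pieces together, an email service handler along these lines could perform the restart. The Scheduled_Job_Monitor__c object name and the subject-based lookup are assumptions made for illustration; the field names are those referenced above:

```apex
// Sketch of an email service handler that restarts a monitored job.
// Scheduled_Job_Monitor__c and the subject-line lookup are illustrative.
global class RestartScheduledJobHandler implements Messaging.InboundEmailHandler {
    global Messaging.InboundEmailResult handleInboundEmail(
            Messaging.InboundEmail email, Messaging.InboundEnvelope envelope) {

        // Assume the email subject names the job to restart
        Scheduled_Job_Monitor__c scheduledJob = [
            SELECT Scheduable_Job_Name__c, Scheduable_Expression__c,
                   Schedulable_Class_Name__c
            FROM Scheduled_Job_Monitor__c
            WHERE Scheduable_Job_Name__c = :email.subject.trim()
            LIMIT 1];

        // Instantiate the schedulable class by name and reschedule it
        Schedulable schedulableClass = (Schedulable)
            Type.forName(scheduledJob.Schedulable_Class_Name__c).newInstance();

        System.schedule(scheduledJob.Scheduable_Job_Name__c,
                        scheduledJob.Scheduable_Expression__c,
                        schedulableClass);

        Messaging.InboundEmailResult result = new Messaging.InboundEmailResult();
        result.success = true;
        return result;
    }
}
```

Because the email service is configured to run as the generic user, System.schedule() registers the job under that user's context.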

This keeps the solution very simple. The code does not have to be extended to cater for new scheduled jobs, and no schedulable class names need to be hard-coded.

Now that a scheduled job can be restarted via email, without having to log in to Salesforce.com itself, we can go full circle and have this triggered from Splunk. The result: if a scheduled job fails to execute, a log will be created and the job will be restarted within minutes, running under the context of a specified user.

Conclusion

The majority of Force.com applications and automated processes would not require the above investment.

However, when Salesforce.com sits within an architecture comprising microservices – and acts as the single entry point to this functionality – it is vital that we have quality logging, that we are alerted to issues before our users, and that automated processes are restarted in minutes rather than hours.

Hopefully this article demonstrates that, with a bit of upfront work, logging can be implemented that provides real value and helps developers to improve the robustness of an implementation.

Also, by leveraging existing libraries provided by Salesforce.com, code can be kept to a minimum and the end solution is configurable by an administrator rather than relying on amending code with every new scheduled job.