Logging Errors vs Incident Management

TL;DR

One of the common questions that pops up in operational logging is what the various logging levels mean, and if you should alert operations when an error is logged in an application. This question conflates error logging with incident management, and can cause significant problems for operations in the form of alert fatigue.

Instead, use logging levels to indicate an operation's interruption or failure, and use error codes and structured logging to provide high cardinality events. Write a draft runbook to present to operations showing possible failures and their impact on the business.

From there, operations can create an incident management policy that can take logging as a telemetry source and combine that with metrics and past experience to determine the severity of an incident dynamicially.

Logging Errors vs Incident Management

The logging level is the importance of the failure to the application. An error is more important than a warning. A warning is more important than an information statement. Typically, errors and warnings are reserved for unrecoverable and recoverable problems, respectively. A warning may indicate a database connection was lost, which can happen on network timeout. An error will indicate that a database connection could not be regained. This is something that's known at compile time, and doesn't change.

The importance of those failures to the organization is context dependent. If it rises to the level of a problem that needs to be managed, it is called an incident. According to the severity of the incident, on-call people may be alerted and the incident may be subject to possible escalation. Entire careers are devoted to managing incidents effectively.

A logging event isn't an incident, and logging level and incident severity are different. Programs generate log events with given levels according how it impacts application performance, and the organization processes those log events and determines severity.

If an application sends warnings but sends them repeatedly on a high profile client or during specific procedures, that may warrant a very severe incident. If an application sends errors but the downstream cause is already known or is not significant, it may only qualify as a trivial incident.

This is all doubly true in a microservices environment, where multiple systems may have downstream dependencies and everything is intertwingled. In this kind of environment, developers for a single service simply don't have the overview of the system necessary to know what's the end experience for users, or how an application failure impacts other services

As Ryan Kitchens reports in How Did Things Go Right? Learning More from Incidents, severe incidents typically when there is a “perfect storm” of events. There is no single root cause.

Alerting on a single signal without context means false positives and one-offs. It means alerts that are easily disregarded. Repeated alerts cause alarm fatigue and burnout.

Logging something at error level doesn't – shouldn-t – mean an automatic page to someone at 3 am. It just doesn't work.

Managing Incidents in Production

Charity Majors advocates for an SLO based approach to alerting instead:

Instead of paging alerts for every specific failure scenario, the technically correct answer is to define your SLOs (service level objectives) and page only on those, i.e. when you are going to run out of budget ahead of schedule.

If you haven’t yet caught the SLO religion, the alternate answer is that “you should only page on high level end-to-end alerts, the ones which traverse the code paths that make you money and correspond to user pain”. Alert on the three golden signals: request rate, latency, and errors, and make sure to traverse every shard and/or storage type in your critical path.

Love (and Alerting) in the Time of Cholera (and Observability)

Defining a service level objective depends on a service level indicator (SLI) being met. There's an error budget, which is the total number of failure-to-meet-the-SLI tolerated by your SLO. And then there's the burn down, which is how quickly you're running through the budget. If the burn down is notable, then a burn alert may be issued.

Honeycomb lets you define an SLO using a derived column, using the SLI cookbook. It's pretty self explanatory.

A Case Study

I had to think long and hard about directly criticising another blog post, but I think it's important in this case. It's all reasonable advice, and I used to think exactly like this, so I want to point out exactly how and where it goes wrong.

The point is made here that logging levels are the wrong level of abstraction, and an API should target this directly:

  • DEBUG: information that is useful during development. Usually very chatty, and will not show in production.
  • INFO: information you will need to debug production issues.
  • WARN: someone in the team will have to investigate what happened, but it can wait until tomorrow.
  • ERROR: Oh-oh, call the fireman! This needs to be investigated now!

With this usage in mind, the log level question becomes: “Do I want to wake up in the middle of the night to investigate this, or can it wait until tomorrow morning?”, which is very concrete and a question that even your Business Manager/Production Owner should be able to answer.

Of course this convention needs to be taught to new team members, and constantly reminded to existing ones. And as much as we love wikis, code reviews and email reminders, having to do the mental mapping from the log level to how the log statement is going to be used is not a natural one, and it is going to be forgotten.

So, after one of those code reviews where I asked several times “Why do you want to wake me up in the middle of the night for this exception?” I decided that we would switch the logger interface from SLF4J to an experimental “IGLogger” one which looks like this:

public interface IGLogger {
  void inTestEnv(String msg, Object... args);
  void inProdEnv(String msg, Object... args);
  void toInvestigateTomorrow(String msg, Object... args);
  void wakeMeInTheMiddleOfTheNight(String msg, Object... args);
}

Logging levels: the wrong abstraction

There are two issues here. The first issue is that in the blog post, the logging level definitions have severity baked into them, which leads to many discussions about their appropriate use. If the production owner means "the ops manager" then it makes sense to have them in the room, but the only time a business manager should be involved in that call is when an unexpected failure is also a business event that requires special handling, i.e. an auditing or security event.

The second issue is that even if the ops manager could make that call, severity is too dynamic to be baked into an API. If the database connection fails at 3 am but it's only being used during office hours, it's important to the application but not important to the organization.

You cannot know what corrective action to take at development time while you're logging. The most you can do is provide as much context as possible about the problem so that operations can manage it appropriately, and can determine what should and should not be paged on.

Logging Failures with Error Codes

Essentially, you want to log error codes. For every failure case that you log, you provide a specific error code as part of the structured context, and you provide that to operations. From there, operations and the business at large can determine a mapping of error codes to severity.

Error codes can be specific to business use cases, and you can have as many as you like. In particular, the same exception may have several different error codes depending on the business context: database connection failure to the "Witty Quips" database and database connection failure to the user authentication database may fail with the same exception, but should contain different error codes because the business use case is different.

Because error codes are inherently domain oriented, they are usually defined as part of an error handling framework, rather than being generated and logged directly in a catch block. Error handling frameworks vary widely in complexity, and can often be tied to the underlying framework – for example, Play has an error handler that can map source information to a UsefulException that it logs under the hood.

Working with an error handling framework that's been handed to you can be convenient, but you should keep your error code and error domain distinct from the framework itself, so that it's portable and has fewer dependencies. Keeping it in a small subproject is usually fine, but if you want to define your error handling in a completely framework-agnostic context, then I recommmend Backstopper:

Backstopper guarantees that a consistent error contract which you can define will be sent to callers no matter the source of the error when properly integrated into your API framework. No chance of undesired information leaking to the caller like stack traces, and callers can always rely on an error contract they can programmatically process. – Backstopper Overview

To implement error codes, you define an enum extends ApiError:

public enum MyProjectError implements ApiError {
    // Constructor args for this example are: errorCode, message, httpStatusCode
    EMAIL_CANNOT_BE_EMPTY(42, "Email must be specified", 400),
    INVALID_EMAIL_ADDRESS(43, "Invalid email address format", 400),
    // -- SNIP --
}

And then generate and throw an exception from a builder which requires the API error:

public class SampleResource {    

    @POST
    @Produces(MediaType.APPLICATION_JSON)
    @Consumes(MediaType.APPLICATION_JSON)
    public Response postSampleModel(SampleModel model) throws JsonProcessingException {
        // If the caller didn't pass us a payload we throw a MISSING_EXPECTED_CONTENT error.
        if (model == null) {
            throw ApiException.newBuilder()
                              .withExceptionMessage("Caller did not pass any request payload")
                              .withApiErrors(SampleCoreApiError.MISSING_EXPECTED_CONTENT)
                              .build();
        }
        // Run the object through our JSR 303 validation service, which will automatically throw an appropriate error
        //      if the validation fails that will get translated to the desired error contract.
        validationService.validateObjectsFailFast(model);

        // Manually check the throwManualError query param (normally you'd do this with JSR 303 annotations on the
        // object, but this shows how you can manually throw exceptions to be picked up by the error handling system).
        if (Boolean.TRUE.equals(model.throw_manual_error)) {
            throw ApiException.newBuilder()
                              .withExceptionMessage("Manual error throw was requested")
                              .withApiErrors(SampleProjectApiError.MANUALLY_THROWN_ERROR)
                              .withExtraDetailsForLogging(Pair.of("rgb_color_value", model.rgb_color))
                              .withExtraResponseHeaders(
                                  Pair.of("rgbColorValue", singletonList(model.rgb_color)),
                                  Pair.of("otherExtraMultivalueHeader", Arrays.asList("foo", "bar"))
                              )
                              .build();
        }

        return Response.status(201).entity(model).build();
    }
}

This is a lovely way to handle exceptions and give some immediate insight into the business impact of the error. Backstopper comes with a pretty complete user guide and has a minimum of dependencies – it defines an HTTP error code, but doesn't pull in the Servlet API into core.

Assemble a Draft Runbook

As development assembles the list of error codes, it should serve as the basis for a draft runbook written together with operations. This provides the starting point for winston-like runbook automation.

Matthew Skelton writes:

If we rely on run books to help us actually operate a system in 2013 or later, we have likely not automated or monitored enough. The key point about run book collaboration is to get Dev and Ops talking to each other about operational features during development, rather than leaving those conversations to a failure-laden ‘Production-ization’ phase.

The ‘run book’ (sometimes called the ‘system operation manual’, or just ‘operation manual’) is a collection of procedures and steps for operations teams to follow (either manually, or through run book automation) in order to enable the software to run effectively in Production. A run book includes details of how the operations team should deal with things like daylight saving time changes, data cleardown, recovery from failover, server patching, troubleshooting, and so on. Historically, it was the Ops team that wrote the run book, based on chance conversations with the Dev team, sketchy documentation, and much trial and error. However, by turning around this situation and giving responsibility for the first few drafts of the run book to the Dev team, substantial improvements to the operability of the software system can result.

Operability can Improve if Developers Write a Draft Run Book

I like this approach because it makes development aware that operations has a hand in development when it comes to ensuring that observability is built into the software. Skelton also has some excellent quotes as well:

Encourage and persuade Ops teams to engage with Dev teams during the development phase. This means Ops folk doing CRAZY STUFF like pairing on logging implementations, and attending stand-ups, planning meetings, and retrospectives.

Ops writing logging code to help them resolve production issues… Interesting concept…

Conclusion

Logging levels mark the level of failure of an application level operation. A logging event can contain an operation specific error code to provide more context than a raw exception.

An individual logging event may or may not constitute an incident. Monitoring and observability will determine what constitutes an incident using metrics and processed log information as telemetry data, based on criteria defined by operations and the business team.

As the development team puts together the error codes, it should work with operations and business teams to detail the business consequences of the error cases. Using this data, the organization can assemble a draft runbook for handling errors in production.

More …