Targeted Diagnostic Logging in Production

TL;DR Diagnostic logging is typically not available in production, because of concerns that logging information at DEBUG level is indiscriminate. This blog post shows how to combine diagnostic logging with feature flag management to provide targeted debug information in production only for specific groups, users, or sessions.

The Definition

Diagnostic logging is defined here as "debug logging statements with an audience." In contrast to debug logging statements that are used for internal debugging and are for the developer alone, diagnostic logging is written and maintained for the development team, for operations, or both.

A diagnostic logging statement should be as useful in production as code that records a metric or sets up a distributed tracing span. These systems produce operational data that make systems observable. If business logic and data are blood, then observability is lymph. Imagine the circulatory system and the lymphatic system, running in parallel to each other. Lymph is never seen, but it monitors and protects at all times. Likewise, diagnostic logging is usually not visible until there's a problem.

UPDATE: I have a new blog post with a full definition of diagnostic logging wherein I cite my sources.

The Problem

You're running a complex stateful web application, or a set of them, all talking to each other using gRPC, Kafka, Akka Artery, Aeron, etc. Real cattle-not-pets-k8s-in-the-cloud type stuff. Something is going wrong in production, and you don't know why. Some data comes into the system, is processed, and produces the wrong result, like a negative price. But it only happens in Europe, and only for users who use Firefox.

There is a specific set of tools available in production, but none of them will help in this circumstance. Dropwizard Metrics will tell you how many times it happened, but it won't tell you why it happened. The OpenTracing API will tell you the trace and the spans and you can log information in a particular span, but it works on a sampling basis, and this happens rarely enough that you can't catch it. Using an SLF4J logging framework at the standard INFO level won't help: there are no INFO-level logs telling you what happened. The exception that is thrown happens long after the calculation has taken place, and can't tell you the internal state and flow leading up to that point.

You can argue that testing should have caught it. But tests have problems accounting for concurrency, and distributed architectures are complex concurrent systems by design. Tests can only deal with the code paths and the interactions that you as a programmer have considered and accounted for, which is a problem when you have multiple systems interacting in a concurrent manner. Most problems in distributed systems involve high cardinality: it's Zebras All The Way Down. Tests can catch many use cases, but you will also be testing in production.

The Dichotomy

If this were happening in development, this would be easy. Start up the server on your local machine, remote attach through IntelliJ IDEA's debugger, and set a breakpoint where you think things are going wrong.

Or maybe it's a bit more complex than that, and you've got several services you need to run through, or want to compare logs. You'd add a bunch of debugging statements, turn the logger level up to DEBUG, and poke around.

The point here is that debugging involves exposing some internal state, which increases observability.

Using Charity Majors' definition:

Observability is how well you can understand the inner workings of your system simply by observing its outputs. Look it up. This MATTERS, because observability is how you debug and understand unknown-unknown states, ones you've never seen before and could not have predicted.

Following on from this, the goal of observability:

The holy grail of observability is the ability to be able to ask any question, understand any previously unseen state your system may get itself into; without having to ship new code to handle that state (bc that implies you knew enough to predict it)

Diagnostic logging is a safe, performant, domain-aware way of making internal state observable. But diagnostic logging is typically not available in production.

Why is that?

Indiscriminate Diagnostic Logging

Diagnostic logging is not used in production because the operational tools available to manage logging are crude and indiscriminate.

In many cases, logging levels cannot be changed in a running system: to change them, the service must be restarted with new configuration. Even systems that do have the facility to change logging levels on the fly, like Twitter Server, will only change the logger level on an individual server, which is little help when dealing with distributed systems.

Even in cases where logging levels can be changed on the fly, logging levels are directly tied to the logger, which in SLF4J is typically keyed by class name. As such, turning the logger up to DEBUG means that all interactions with every instance of that class are logged at DEBUG level, typically at high volume.
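
For a sense of that granularity, here is a minimal sketch of an on-the-fly level change in Logback (the LevelChanger class is illustrative, not a library API):

import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;
import org.slf4j.LoggerFactory;

public class LevelChanger {
    // Note the granularity: every instance behind the named logger in this
    // one JVM now logs at the new level, for every user, until changed back.
    public static void setLevel(String loggerName, Level level) {
        Logger underlying = (Logger) LoggerFactory.getLogger(loggerName);
        underlying.setLevel(level);
    }
}

Even wired up to an admin endpoint, this is still a per-logger, per-server, all-users switch.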

A high volume of logs does two things. First, it exhausts the logging budget: both the cost of storage and the cost of indexing go up dramatically. Second, it risks causing non-linear performance problems in the server, as the overhead of logging at high rates can stress the IO and network capacity of instances.

As Cindy Sridharan says in Logs and Metrics:

By far, the biggest drawbacks of logs is how operationally and monetarily expensive they are to process and store. Albeit structured logging doesn’t suffer from some of the downsides inherent in pure string-based log manipulation, it’s still far more voluminous to store as well as slower to query and process, as evident from the pricing models of log (event) based monitoring vendors.

Because indiscriminate logging has both high known costs and unknown risks, and logging frameworks provide few out-of-the-box mitigations such as budget-aware logging, operations will sometimes simply not allow diagnostic logging in production, because they don't know how to do it safely.

Effective Diagnostic Logging

But diagnostic logging is vital in production. You simply must have logs in production. Log Everything All the Time puts it very succinctly:

To see why the typical logging approach is broken, imagine this scenario: Your site has been up and running great for weeks. No problems. A foreshadowing beeper goes off at 2AM. It seems some users can no longer add comments to threads. Then you hear the debugging deathknell: it's an intermittent problem and customers are pissed. Fix it. Now.

So how are you going to debug this? The monitoring system doesn't show any obvious problems or errors. You quickly post a comment and it works fine. This won't be easy. So you think. Commenting involves a bunch of servers and networks. There's the load balancer, spam filter, web server, database server, caching server, file server, and a few networks switches and routers along the way. Where did the fault happen? What went wrong?

All you have at this point are your logs. You can't turn on more logging because the Heisenberg already happened. You can't stop the system because your system must always be up. You can't deploy a new build with more logging because that build has not been tested and you have no idea when the problem will happen again anyway. Attaching a debugger to a process, while heroic sounding, doesn't help at all.

What you need to be able to do is trace though all relevant logs, pull together a time line of all relevant operations, and see what happened. And this is where trace/info etc is useless. You don't need function/method traces. You need a log of all the interesting things that happened in the system. Knowing "func1" was called is of no help. You need to know all the parameters that were passed to the function. You need to know the return value from the function. Along with anything else interesting it did.

A brief survey of Stack Overflow answers supports the idea that debug statements are, in fact, meant for production scenarios:

Martin Ba:

DEBUG logging must be able to be used in production (but tends to still be off normally)
TRACE logging is allowed to be such that using it in production (other than for a specific / short session) is infeasible

user22815:

DEBUG: used for dumping variable state, specific error codes, etc. For example: a web service might return error code 809214, which could be logged while the application tells the user "communication failed." Imagine a developer receiving a log from a user's system long after the error occurred and wondering "why did the failure occur?" that is a good thing to log at the debug level. Another example might be if a bug keeps occurring in production but is hard to reproduce, debug log certain variables or events in the troublesome module to help tell developers the program state when the error occurs to help troubleshooting.

GrayWizardx:

Debug - Information that is diagnostically helpful to people more than just developers (IT, sysadmins, etc.).

The usual compromise to indiscriminate debug logging is distributed tracing, but that is subject to sampling, can only provide the data that instrumentation libraries expose (unless you manually control it yourself), and doesn't take into account the information that we do have: we already know it happens in a particular region, with a particular browser. The issue for effective diagnostic logging is that we'd like to be discriminating about which debug statements show up. We want something more powerful than searching through whatever made it through random sampling. Rather than logging by class name or by log level, we'd like to say "for the specific users we think are affected by this error, log everything that they do at DEBUG level" and still keep everyone else at INFO level. In short, we want targeting.

Targeted Diagnostic Logging using Logback and LaunchDarkly

Logback has the idea of turbo filters, which are filters that determine whether a logging event should be created or not. They are not appender-specific in the way that normal filters are, and so can be used to override logger levels on a per-call basis. Turbo filters can be combined with marker-based logic that allows the application to decide when a logging statement can be generated.
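
To make the mechanism concrete, here is a minimal sketch of a marker-sensitive turbo filter (the MarkerTurboFilter class is illustrative; the turbomarker module described below is a real implementation). Registered through a turboFilter element in logback.xml, it runs on every logging call:

import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;
import ch.qos.logback.classic.turbo.TurboFilter;
import ch.qos.logback.core.spi.FilterReply;
import org.slf4j.Marker;

public class MarkerTurboFilter extends TurboFilter {
    @Override
    public FilterReply decide(Marker marker, Logger logger, Level level,
                              String format, Object[] params, Throwable t) {
        // Accept events carrying the matching marker even when the logger
        // level would reject them; otherwise defer to the normal level check.
        if (marker != null && marker.contains("debug.flag")) {
            return FilterReply.ACCEPT;
        }
        return FilterReply.NEUTRAL;
    }
}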

This works especially well with a feature flagging service like LaunchDarkly, where you can target particular users and set up logging based on the user variation (thanks to Felix Sargent for making the connection here). This gives us high-cardinality targeting rules for free.
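
The decision behind the marker can then be as simple as a per-user flag evaluation. This sketch is illustrative rather than the turbomarker internals, using the LaunchDarkly SDK's boolVariation call:

import com.launchdarkly.client.LDClientInterface;
import com.launchdarkly.client.LDUser;

public class DebugFlagCheck {
    private final LDClientInterface client;

    public DebugFlagCheck(LDClientInterface client) {
        this.client = client;
    }

    // True only for users matched by the flag's targeting rules, e.g.
    // country in Europe and browser equal to "Firefox".
    public boolean isDebugEnabledFor(LDUser user) {
        return client.boolVariation("debug.flag", user, false);
    }
}

The targeting rules themselves live in the LaunchDarkly dashboard, so they can be changed without a deploy.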

Using the debug statement is a matter of adding the user-specific marker, ensuring that the user has the attributes you need to filter on:

import java.io.IOException;

import com.launchdarkly.client.LDClient;
import com.launchdarkly.client.LDUser;
import com.tersesystems.logback.turbomarker.LDMarkerFactory;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class Main {
  public static void main(String[] args) throws IOException {
    LDClient client = new LDClient("sdk-key");
    LDMarkerFactory markerFactory = new LDMarkerFactory(client);

    // In a real application these come from the HTTP request.
    String country = "FR";
    String userAgent = "Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0";

    LDUser ldUser = new LDUser.Builder("http-session-id")
            .country(country)
            .customString("userAgent", userAgent)
            .customString("browser", uaDetection(userAgent))
            .build();
    LDMarkerFactory.LDMarker ldMarker = markerFactory.create("debug.flag", ldUser);

    Logger logger = LoggerFactory.getLogger(Main.class);
    logger.debug(ldMarker, "Hello world, I debug when the debug flag is enabled for this user");

    client.close();
  }

  // Stub: a real implementation would use a user agent parsing library.
  private static String uaDetection(String userAgent) {
    return userAgent.contains("Firefox") ? "Firefox" : "Other";
  }
}

Now, by setting the country attribute to countries in Europe and setting the browser attribute to "Firefox", you can get fine-grained diagnostic information that is targeted only to those users who are experiencing the problem.

The implementation is available on GitHub at https://github.com/tersesystems/terse-logback, under the turbomarker module.

Once you've got that information, you need to process it. The best way to work with these logs is to use Honeycomb, leveraging dynamic sampling to always sample debug logs, and then using Honeycomb's Query Builder to see all the logs associated with that group.

Poor Man's Observability

Charity Majors defines logs vs structured events and makes the point that an event can be described as a single, very big, high-cardinality log statement. Targeted operational logging (defined as INFO level) with high-cardinality events is useful for observability, because it can help you nail down where a problem is happening.

To create poor man's observability, you create a feature flag which can be on or off at a certain rate, and use it in your request filter pipeline. Then, if the flag is active, you start accumulating context using event-specific custom fields. Finally, at the end of the request, you write the request out as a single operational log entry, as described by Anton Drukh.

public class HttpApp {
    private final Logger logger = LoggerFactory.getLogger(HttpApp.class);

    // ldMarker is the user-specific marker created earlier in the request pipeline.
    public void possiblyLogRequest(LDMarkerFactory.LDMarker ldMarker, int statusCode, long duration) {
        LogstashMarker aggregateMarker = Markers.aggregate(ldMarker /*, other request markers */);
        if (is2xx(statusCode) && logger.isInfoEnabled(aggregateMarker)) {
            logger.info(aggregateMarker, "request", kv("statusCode", statusCode), kv("duration", duration));
        } else if (is4xx(statusCode) && logger.isWarnEnabled(aggregateMarker)) {
            logger.warn(aggregateMarker, "request", kv("statusCode", statusCode), kv("duration", duration));
        } else if (is5xx(statusCode) && logger.isErrorEnabled(aggregateMarker)) {
            logger.error(aggregateMarker, "request", kv("statusCode", statusCode), kv("duration", duration));
        }
    }

    private static boolean is2xx(int status) { return status >= 200 && status < 300; }
    private static boolean is4xx(int status) { return status >= 400 && status < 500; }
    private static boolean is5xx(int status) { return status >= 500 && status < 600; }
}

You can change the sampling rate on the feature flag on the fly, and add special processing if you want to treat some requests differently, as sketched below.
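
Here is a hypothetical sketch of that sampling decision, assuming a "sample.request" boolean flag configured with a percentage rollout; keying the LaunchDarkly user by request id makes the decision stable for a given request:

import com.launchdarkly.client.LDClientInterface;
import com.launchdarkly.client.LDUser;

public class RequestSampler {
    private final LDClientInterface client;

    public RequestSampler(LDClientInterface client) {
        this.client = client;
    }

    // The rollout percentage can be changed from the LaunchDarkly dashboard
    // on the fly, with no deploy.
    public boolean isSampled(String requestId) {
        LDUser requestUser = new LDUser.Builder(requestId).build();
        return client.boolVariation("sample.request", requestUser, false);
    }
}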

After that, if you're not appending directly to Honeycomb, you can send log data (either directly or streamed through a log aggregator) to TimescaleDB, storing events as JSONB. You can expire old data with drop_chunks, and run continuous aggregate queries against JSONB. If you need to do more complex data processing, you can use Apache Spark.

If you're asking yourself whether this will scale, then my answer is to use Honeycomb first, and only then build a personalized system you can query.

Note that this is operational logging: events at INFO level or above that show the result of input, for the purposes of observability by operations. Diagnostic logging is different: it is typically verbose, messy, and can contain details that operations absolutely doesn't care about.

Designing for Diagnostic Logging

Good diagnostic logging must be treated as a system design concern. It means thinking about writing logs as structured logging, rather than as unstructured messages. It means passing a context object through basically everything. It means thinking about cases where a system had to reconnect, and adding a debug statement just in case of gray failure. It means using diagnostic logging when testing in production.

Diagnostic logging is useful for documentation. Unlike comments, diagnostic logging shows the compiled flow with real data, and is "living" in a way that comments are not, because developers read logs but skip over comments.

Diagnostic logging is useful for bug fixing and maintenance, especially when log output is included with a bug. The first thing that a programmer will do when working with an unfamiliar codebase is add in debugging statements to figure out what's going on. By establishing a diagnostic logging framework and decent logs, it's that much easier to pick up the flow of events from the bug and tie it back to the code.

Diagnostic logging is useful for integration testing. Using structured logging, the entire flow can be verified end to end after the integration test has completed, even if the codebase is not technically a "stream events to Kafka and do event sourcing" type of system. No special listeners or mocks need to be injected into the system, as the diagnostic logging is in itself an output.
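
As a minimal sketch of what that verification can look like, assuming JUnit 4 and an illustrative logger name, Logback's ListAppender can capture the diagnostic events so the flow can be asserted on after the test completes:

import static org.junit.Assert.assertTrue;

import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;
import ch.qos.logback.classic.spi.ILoggingEvent;
import ch.qos.logback.core.read.ListAppender;
import org.junit.Test;
import org.slf4j.LoggerFactory;

public class OrderFlowTest {
    @Test
    public void checkoutEmitsDiagnosticEvents() {
        // Capture logging events in memory for the duration of the test.
        ListAppender<ILoggingEvent> appender = new ListAppender<>();
        appender.start();
        Logger logger = (Logger) LoggerFactory.getLogger("com.example.Order");
        logger.setLevel(Level.DEBUG);
        logger.addAppender(appender);

        // ...exercise the order flow end to end here...

        // The structured diagnostic log is itself the output under test.
        assertTrue(appender.list.stream()
                .anyMatch(event -> event.getFormattedMessage().contains("checkout")));
    }
}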

Diagnostic logging does not mean that developers can't write some debug statements specifically to fix an issue, and then remove them. It does mean that there can be debug statements which are specifically there for production. It does mean that, unlike in local debugging, developers are not the only audience for diagnostic logging, and must treat operations as part of the audience. Diagnostic logging must capture important decision points in the flow and expose only relevant state, as a part of domain-oriented observability.

Don't worry about getting everything right at once. As soon as you have diagnostic logging as an option, you'll be able to see where to spend your time and effort improving it.

As Nextdoor Engineering puts it:

Being able to control the logging levels on running servers has been very useful at Nextdoor, both because we can turn logging levels up in order to debug hard-to-reproduce issues on production, and turn them down to allow us to focus on just some logs. Also, knowing that you can log at a DEBUG level in production makes DEBUG logs more valuable and encourages engineers to write more DEBUG logging.

If you prefer to keep diagnostic logging segmented away from raw logger.debug statements intended for internal debugging, then you may want to consider writing a helper to make it clearer. For example, using LaunchDarkly and logstash-logback-encoder, you can do the following:

import static java.util.Collections.singletonList;
import static net.logstash.logback.argument.StructuredArguments.kv;

import com.fasterxml.jackson.annotation.JsonProperty;
import com.launchdarkly.client.LDClient;
import com.launchdarkly.client.LDClientInterface;
import com.launchdarkly.client.LDUser;
import com.tersesystems.logback.turbomarker.LDMarkerFactory;
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;
import net.logstash.logback.argument.StructuredArgument;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class DiagnosticLoggingExample {

    public static void main(String[] args) {
        Config config = ConfigFactory.load();
        LDClientInterface client = new LDClient(config.getString("launchdarkly.sdkkey"));
        Logger logger = LoggerFactory.getLogger(Order.class);
        LDMarkerFactory markerFactory = new LDMarkerFactory(client);
        LDUser ldUser = new LDUser.Builder("abc123").customString("groups", singletonList("diagnostic-group")).build();
        OrderDiagnosticLogging diagnostics = new OrderDiagnosticLogging(logger, markerFactory, ldUser);
        Order order = new Order("id1337", diagnostics);
        order.addToCart(new LineItem());
        order.checkout();
    }

    // Minimal stand-in so the example compiles; matches the empty
    // LineItem() shown in the output below.
    static class LineItem {
        @Override
        public String toString() {
            return "LineItem()";
        }
    }

    static class Order {
        private final OrderDiagnosticLogging diagnostics;

        @JsonProperty("id") // Make available to logstash-logback-encoder
        private final String id;

        public Order(String id, OrderDiagnosticLogging diagnostics) {
            this.id = id;
            this.diagnostics = diagnostics;
        }

        public String getId() {
            return id;
        }

        public void addToCart(LineItem lineItem) {
            diagnostics.reportAddToCart(this, lineItem);
            // ...normal business logic...
        }

        public void checkout() {
            diagnostics.reportCheckout(this);
            // ...normal business logic...
        }

        @Override
        public String toString() {
            return String.format("Order(id = %s)", id);
        }
    }

    static class OrderDiagnosticLogging {
        private final Logger logger;
        private final LDMarkerFactory.LDMarker ldMarker;

        OrderDiagnosticLogging(Logger logger, LDMarkerFactory markerFactory, LDUser ldUser) {
            this.logger = logger;
            this.ldMarker = markerFactory.create("diagnostics-order", ldUser);
        }

        void reportAddToCart(Order order, LineItem lineItem) {
            reportArg("addToCart", order, kv("lineItem", lineItem));
        }

        void reportCheckout(Order order) {
            report("checkout", order);
        }

        private void reportArg(String methodName, Order order, StructuredArgument arg) {
            if (logger.isDebugEnabled(ldMarker)) {
                logger.debug(ldMarker, "{}: {}, {}", kv("method", methodName), kv("order", order), arg);
            }
        }

        private void report(String methodName, Order order) {
            if (logger.isDebugEnabled(ldMarker)) {
                logger.debug(ldMarker, "{}: {}", kv("method", methodName), kv("order", order));
            }
        }
    }
}

This gives the following output iff the user is a member of diagnostic-group:

{
  "@timestamp": "2019-07-21T13:11:15.642-07:00",
  "@version": "1",
  "message": "method=addToCart: order=Order(id = id1337), lineItem=LineItem()",
  "logger_name": "com.tersesystems.logback.turbomarker.DiagnosticLoggingExample$Order",
  "thread_name": "main",
  "level": "DEBUG",
  "level_value": 10000,
  "tags": [
    "diagnostics-order"
  ],
  "method": "addToCart",
  "order": {
    "id": "id1337"
  },
  "lineItem": {}
}

Note that the order's attributes are represented both in structured form (the id attribute) and in the traditional line-oriented toString representation shown in the message. Also note that I'm not showing any "context" here, which would normally be user or request info, as it would just be confusing in this example.

Conclusion

Diagnostic logging provides extra instrumentation in production to explain unknown events, works as documentation explaining the code flow in the source code, and provides much needed context when fixing bugs. Concerns about volume and utility of logging in production can be addressed by precisely modulating and targeting diagnostic logging using feature flags.

Post Script

After writing this blog post, I inevitably found that putting feature flags together with verbose logging is a known practice, although it looks like it's not widespread.

Christian Meléndez mentions "3. Verbose Logging for Troubleshooting" in 2018:

There have been times when I wish I’d enabled the verbose logging in my code—especially when I receive reports that some users have been experiencing errors in code. The first thing I do is try to reproduce the error. If I’m lucky enough and succeed, the next step is to see logs. The sad part of the story is that, at this step, there’s nothing more than “something unexpected happened here.”

It’s understandable if you don’t want to have verbose logging because disks could run out of space. No one wants to be awakened in the middle of the night with this news. But what if you could launch your code with verbose logging turned off and only enabled when you need it? Yes, with feature flags, it’s possible. You could activate all different types of logging that you previously configured but released them deactivated. Everyone will be happy with this approach: you get the information only when you need it, and no one has to worry about being paged about a low space problem or purchasing more space just to avoid it.

The Top 5 Use Cases for Feature Flags

and Willy Schaub writes about feature flags and verbose logging in Azure DevOps, from 2017:

You would like to include hidden features in your release and enable them for all users in production. For example, you want to be able to collect verbose logging data for troubleshooting. Using a feature flag, you can enable and disable verbose logging as needed.

Explore how to progressively expose your features in production for some or all users