Targeted Diagnostic Logging in Production

TL;DR Diagnostic logging is typically not available in production, because of concerns that logging information at DEBUG level is indiscriminate. This blog post shows how to combine diagnostic logging with feature flag management to provide targeted debug information in production only for specific groups, users, or sessions.

The Definition

Diagnostic logging is defined here as "debug logging statements with an audience." In constrast to debug logging statements that are used for internal debugging and are for the developer alone, diagnostic logging is written and maintained for the development team, or for operations, or both.

A diagnostic logging statement should be as useful in production as code that records a metric, or sets up a distributed tracing span. These systems produce operational data that make systems observable. If business logic and data is blood, then observability is lymph. Imagine circulatory system and the lymphatic system, running in parallel to each other. Lymph is hidden and never seen, but it monitors and protects at all times. Likewise, diagnostic logging is usually not visible until there's a problem.

The Problem

You're running a complex stateful web application, or a set of them, all talking to each other using gRPC, Kafka, Akka Artery, Aeron, etc. Real cattle-not-sheep-k8s-in-the-cloud type stuff. Something is going wrong in production, and we don't know why. Some data comes into the system, is processed, and produces the wrong result, like a negative price. But it only happens in Europe, and only for users who use Firefox.

There are a specific set of tools that are available in production, but none of them will help in this circumstance. Metrics won't tell you why it happened. Distributed tracing will tell you the spans and you can log information them, but works on a sampling basis and this happens rarely enough that you can't catch it. There's no INFO level logs telling you what happened. The exception that is thrown happens long after the calculation has taken place, and can't tell you the internal state and flow leading up to that point.

You can argue that testing should have caught it. But tests have problems accounting for concurrency, and distributed architectures are complex concurrent systems by design. Tests can only deal with the code paths and the interactions that you as a programmer have considered and accounted for, which is a problem when you have multiple systems interacting a concurrent manner. Most problems in distributed systems are ones involving high-cardinality: it's Zebras All The Way Down. Tests can catch many use cases, but you will also be testing in production.

The Dichotomy

If this were happening in development, this would be easy. Start up the server on your local machine, remote attach through IntelliJ IDEA's debugger, and set a breakpoint where you think things are going wrong.

Or maybe it's a bit more complex than that, and you've got several services you need to run through, or want to compare logs. You'd add a bunch of debugging statements, turn the logger level up to DEBUG, and poke around.

The point here is that debugging involves exposing some internal state, which increases observability.

Using Charity Majors' definition:

Observability is how well you can understand the inner workings of your system simply by observing its outputs. Look it up. This MATTERS, because observability is how you debug and understand unknown-unknown states, ones you've never seen before and could not have predicted.

Following on from this, the goal of observability:

The holy grail of observability is the ability to be able to ask any question, understand any previously unseen state your system may get itself into; without having to ship new code to handle that state (bc that implies you knew enough to predict it)

Diagnostic logging is a safe, performant, domain aware way of making internal state observable. But diagnostic logging is typically not available in production.

Why is that?

Indiscriminate Diagnostic Logging

Diagnostic logging is not used in production because the operational tools available to manage logging are crude and indiscriminate.

In many cases, logging levels cannot be changed in a running system. To change the logging levels, the service must be restarted with new configuration. In most systems, there is no facility to change logging levels on the fly. Even systems that do have this capability like Twitter Server will only change the logger level by an individual server, which is little help when dealing with distributed systems.

Even in cases where logging levels can be changed on the fly, logging levels are directly tied to the logger – in SLF4J, this is typically the class name. As such, turning the logger to DEBUG level means that all interaction with any instances of that class are logged at debug level, typically at high volume.

When there is a high volume of logs, it does two things. First, it exhausts the logging budget: both the cost of storage and the cost of indexing go up dramatically. Second, it has the risk of causing non-linear performance problems in the server, as the overhead of logging at high rates can stress the IO and network capacity of instances.

As Cindy Sridharan says in Logs and Metrics says:

By far, the biggest drawbacks of logs is how operationally and monetarily expensive they are to process and store. Albeit structured logging doesn’t suffer from some of the downsides inherent in pure string-based log manipulation, it’s still far more voluminous to store as well as slower to query and process, as evident from the pricing models of log (event) based monitoring vendors.

Because indiscriminate logging has both high known costs and unknown risks, and logging frameworks provide few out of the box mitigations such as budget aware logging, sometimes operations will simply not allow diagnostic logging in production, because they don't know how to do it safely.

Effective Diagnostic Logging

But diagnostic logging is vital in production. You simply must have logs in production. Log Everything All the Time puts it very succinctly:

To see why the typical logging approach is broken, imagine this scenario: Your site has been up and running great for weeks. No problems. A foreshadowing beeper goes off at 2AM. It seems some users can no longer add comments to threads. Then you hear the debugging deathknell: it's an intermittent problem and customers are pissed. Fix it. Now.

So how are you going to debug this? The monitoring system doesn't show any obvious problems or errors. You quickly post a comment and it works fine. This won't be easy. So you think. Commenting involves a bunch of servers and networks. There's the load balancer, spam filter, web server, database server, caching server, file server, and a few networks switches and routers along the way. Where did the fault happen? What went wrong?

All you have at this point are your logs. You can't turn on more logging because the Heisenberg already happened. You can't stop the system because your system must always be up. You can't deploy a new build with more logging because that build has not been tested and you have no idea when the problem will happen again anyway. Attaching a debugger to a process, while heroic sounding, doesn't help at all.

What you need to be able to do is trace though all relevant logs, pull together a time line of all relevant operations, and see what happened. And this is where trace/info etc is useless. You don't need function/method traces. You need a log of all the interesting things that happened in the system. Knowing "func1" was called is of no help. You need to know all the parameters that were passed to the function. You need to know the return value from the function. Along with anything else interesting it did.

A brief overview of Stack Overflow supports that debug statements are, in fact, meant for production scenarios:

Martin Ba:

DEBUG logging must be able to be used in production (but tends to still be off normally) TRACE logging is allowed to be such that using it in production (other than for a specific / short session) is infeasible

user22815:

DEBUG: used for dumping variable state, specific error codes, etc. For example: a web service might return error code 809214, which could be logged while the application tells the user "communication failed." Imagine a developer receiving a log from a user's system long after the error occurred and wondering "why did the failure occur?" that is a good thing to log at the debug level. Another example might be if a bug keeps occurring in production but is hard to reproduce, debug log certain variables or events in the troublesome module to help tell developers the program state when the error occurs to help troubleshooting.

GrayWizardx:

Debug - Information that is diagnostically helpful to people more than just developers (IT, sysadmins, etc.).

The usual compromise to indiscriminate debug logging is to have distributed tracing, but that is subject to sampling, and doesn't take into account the information that we do have: we already know it happens in a particular region, with a particular browser. For effective diagnostic logging issue is that we'd like to be discriminating about what debug statements show up. We want something more powerful than searching through whatever made it through random sampling. Rather than logging by the class name or by log level, we'd like to say "for the specific users we think are affected by this error, log everything that they do at DEBUG level" and still keep everyone else at INFO level. In short, we want targeting.

Targeted Diagnostic Logging using Logback and LaunchDarkly

Logback has the idea of turbo filters, which are filters that determine whether a logging event should be created or not. They are not appender specific in the way that normal filters are, and so are used to override logger levels on a per call basis. Turbo filters can be combined with marker-based logic that allows the application to decide when a logging statement can be generated.

This works especially well with a feature flagging service like Launch Darkly, where you can target particular users and set up logging based on the user variation – thanks to Felix Sargent for making the connection here. This gives us high cardinality targeting rules for free.

Using the debug statement is a matter of adding the user-specific marker, ensuring that it has the elements you need to filter:

public class Main {
public static void main(String[] args) throws IOException {
LDClient client = new LDClient("sdk-key");
LDMarkerFactory markerFactory = new LDMarkerFactory(client);
LDUser ldUser = new LDUser.Builder("http-session-id")
.country(country)
.customString("userAgent", userAgent)
.customString("browser", uaDetection(userAgent))
.build();
LDMarkerFactory.LDMarker ldMarker = markerFactory.create("debug.flag", ldUser);
Logger logger = LoggerFactory.getLogger(this.getClass());
logger.info(ldMarker, "Hello world, I am info, I output always");
logger.debug(ldMarker, "Hello world, I debug when the debug flag is enabled for this user");
client.close();
}
}

Now, by setting the country attribute to countries in Europe and setting the browser attribute to "Firefox", you can get fine-grained diagnostic information that is targeted only to those users who are experiencing the problem.

The implementation is available on github at https://github.com/tersesystems/terse-logback, under the turbomarker module.

Once you've got that information, you need to process it. The best way to work with logs is to use Honeycomb and leverage the dynamic sampling to alway sample debug logs, and then leverage Honeycomb's Query Builder to see all the logs associated with that group.

Designing for Diagnostic Logging

Good diagnostic logging must be considered as a system design consideration. It means thinking about writing logs as structured logging, rather than as unstructured messages. It means passing a context object through basically everything. It means thinking about cases where a system had to reconnect, and adding a debug statement just in case of gray failure. It means using diagnostic logging when testing in production.

Diagnostic logging is useful for documentation. Unlike comments, diagnostic logging contains compiled flow using existing data, and is "living" in a way that comments are not, because developers read logs but skip over comments.

Diagnostic logging is useful for bug fixing and maintainance, especially when log output is included with a bug. The first thing that a programmer will do when working with an unfamiliar codebase is add in debugging statements to figure out what's going on. By establishing a diagnostic logging framework and decent logs, it's that much easier to pick up the flow of events from the bug and tie it back to the code.

Diagnostic logging is useful for integration testing. Using structured logging, the entire flow can be verified end to end after the integration test has completed, even if the codebase is not technically a "stream events to kafka and do event sourcing" type of system. No special listeners or mocks need to be injected into the system, as the diagnostic logging is in itself an output.

Diagnostic logging does not mean that developers can't write some debug statements specifically to fix an issue, and then remove it. It does mean that there can be debug statements which are specifically there for production. It does mean that contrary to local debugging, developers are not the only audience for diagnostic logging, and must treat operations as the audience. Diagnostic logging must capture important decision points in the flow and expose only relevant state, as a part of domain-oriented observability.

Don't worry about getting everything right at once. As soon as you have diagnostic logging as an option, you'll be able to see where to spend your time and effort improving it.

As NextDoor Engineering puts it:

Being able to control the logging levels on running servers has been very useful at Nextdoor, both because we can turn logging levels up in order to debug hard-to-reproduce issues on production, and turn them down to allow us to focus on just some logs. Also, knowing that you can log at a DEBUG level in production makes DEBUG logs more valuable and encourages engineers to write more DEBUG logging.

If you prefer to keep diagnostic logging segmented away from raw logger.debug statements intended for internal debugging, then you may want to consider writing a helper to make it clearer. For example, using LaunchDarkly and logstash-logback-encoder, you can do the following

public class DiagnosticLoggingExample {
public static void main(String[] args) {
Config config = ConfigFactory.load();
LDClientInterface client = new LDClient(config.getString("launchdarkly.sdkkey"));
Logger logger = LoggerFactory.getLogger(Order.class);
LDMarkerFactory markerFactory = new LDMarkerFactory(client);
LDUser ldUser = new LDUser.Builder("abc123").customString("groups", singletonList("diagnostic-group")).build();
OrderDiagnosticLogging diagnostics = new OrderDiagnosticLogging(logger, markerFactory, ldUser);
Order order = new Order("id1337", diagnostics);
order.addToCart(new LineItem());
order.checkout();
}
static class Order {
private final OrderDiagnosticLogging diagnostics;
@JsonProperty("id") // Make available to logstash-logback-encoder
private final String id;
public Order(String id, OrderDiagnosticLogging diagnostics) {
this.id = id;
this.diagnostics = diagnostics;
}
public String getId() {
return id;
}
public void addToCart(LineItem lineItem) {
diagnostics.reportAddToCart(this, lineItem);
// ...normal business logic reporting
}
public void checkout() {
diagnostics.reportCheckout(this);
// ...normal business logic reporting
}
@Override
public String toString() {
return String.format("Order(id = %s)", id);
}
}
static class OrderDiagnosticLogging {
private final Logger logger;
private final LDMarkerFactory.LDMarker ldMarker;
OrderDiagnosticLogging(Logger logger, LDMarkerFactory markerFactory, LDUser ldUser) {
this.logger = logger;
this.ldMarker = markerFactory.create("diagnostics-order", ldUser);
}
void reportAddToCart(Order order, LineItem lineItem) {
reportArg("addToCart", order, kv("lineItem", lineItem));
}
void reportCheckout(Order order) {
report("checkout", order);
}
private void reportArg(String methodName, Order order, StructuredArgument arg) {
if (logger.isDebugEnabled(ldMarker)) {
logger.debug(ldMarker, "{}: {}, {}", kv("method", methodName), kv("order", order), arg);
}
}
private void report(String methodName, Order order) {
if (logger.isDebugEnabled(ldMarker)) {
logger.debug(ldMarker, "{}: {}", kv("method", methodName), kv("order", order));
}
}
}
}

Gives the following iff the user is a member of diagnostic-group:

{
"@timestamp": "2019-07-21T13:11:15.642-07:00",
"@version": "1",
"message": "method=addToCart: order=Order(id = id1337), lineItem=LineItem()",
"logger_name": "com.tersesystems.logback.turbomarker.DiagnosticLoggingExample$Order",
"thread_name": "main",
"level": "DEBUG",
"level_value": 10000,
"tags": [
"diagnostics-order"
],
"method": "addToCart",
"order": {
"id": "id1337"
},
"lineItem": {}
}

Note that the order's attributes are represented in both structured logging representation (the id attribute) and the traditional the line oriented toString representation shown in the message. Also note I'm not showing any "context" here, which would normally be user or request info, as it would just be confusing in this example.

Conclusion

Diagnostic logging provides extra instrumentation in production to explain unknown events, works as documentation explaining the code flow in the source code, and provides much needed context when fixing bugs. Concerns about volume and utility of logging in production can be addressed by precisely modulating and targeting diagnostic logging using feature flags.

Comments