Since I've been writing about logging, and specifically about targeted diagnostic logging in production, I think I should clarify my terms and cite my sources and inspirations in one blog post.
Clarification
Diagnostic Logging isn't another term for observability. Diagnostic logging is typically a sequence of time-series based low cardinality crappy events scoped to a unit of work or application transaction, because the primary function is as a live debugger in prod. It is possible to embed diagnostic logs in a single structured event on exception, but the diagnostic logs themselves are still time-series based.
Diagnostic logging is a specifically verbose, developer oriented style of logging. I can't point to the etymology of the phrase, but a random google sampling implies that diagnostic logging provides detailed information, is used for troubleshooting by developers, and must be specifically enabled because it is verbose. In SLF4J, this implies logging at a DEBUG or TRACE level. Diagnostic logging is also called "verbose" or "debug" logging, but the general concept is the same.
Diagnostic logging is interested in answering "why" and "how" questions. It is primarily message-focused, because a developer must decide which decisions and which state are significant in which logs, and write a meaningful explanation in the logs. Diagnostic logging can include TRACE level logging statements, but is distinct from tracing. Tracing will show entry and exit points and internal state changes, but does not have a developer in the loop explaining a decision or state change in the context of the domain.
I refer to logging at levels higher than DEBUG as "operational logging", since it reflects normal system operations and tends to produce the logs most often seen by operations. Operational logging has several shades, but generally indicates the sickness and health of a system. Component lifecycle events are rendered at INFO level, as are periodic healthchecks and status reports. Indications of recoverable system problems, such as a lost database connection, are rendered at WARN level, while unrecoverable problems, such as the inability to connect to the database at all, are rendered at ERROR and can result in alerts and pages.
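As a sketch of how these shades look in SLF4J (the class and messages here are my own illustration, not from any particular codebase):

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class DatabaseConnectionPool {
    private static final Logger logger = LoggerFactory.getLogger(DatabaseConnectionPool.class);

    public void start() {
        // Component lifecycle event: normal operation, rendered at INFO.
        logger.info("Connection pool started with {} connections", 10);
    }

    public void onConnectionLost(Exception e) {
        // Recoverable problem: the pool will retry, so WARN is appropriate.
        logger.warn("Lost a database connection, retrying", e);
    }

    public void onPoolExhausted(Exception e) {
        // Unrecoverable problem: rendered at ERROR, may trigger alerts and pages.
        logger.error("Cannot connect to database, pool exhausted", e);
    }
}
```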
Targeting is a marketing term for applying selection criteria to a group of customers. Feature flags are one way of doing targeting, but it's possible to select on any criteria, e.g. specific fields in a Kafka message coupled with the time of day. Most targeted logging is scoped to a particular user session, or sampled randomly per request, and it is common: many HTTP frameworks, such as Finagle, support request logging out of the box, or you can use a library like Logbook. The only real innovation in targeted diagnostic logging in production is outsourcing the selection criteria to a feature flag system so that you can select your criteria (say, a particular user) at runtime in production.
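One way to wire this up in Logback is a TurboFilter that consults the flag system before the normal level checks run. This is a minimal sketch: `FeatureFlagClient`, the flag name, and the `userId` MDC key are all my own assumptions, standing in for whatever flag system and request pipeline you actually have.

```java
import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;
import ch.qos.logback.classic.turbo.TurboFilter;
import ch.qos.logback.core.spi.FilterReply;
import org.slf4j.MDC;
import org.slf4j.Marker;

// Hypothetical interface standing in for your feature flag system's client.
interface FeatureFlagClient {
    boolean isEnabled(String flagName, String userId);
}

// Lets DEBUG/TRACE through for users selected by a feature flag, regardless
// of the configured logger level; everything else falls through as normal.
public class TargetedDiagnosticFilter extends TurboFilter {
    private final FeatureFlagClient flagClient;

    public TargetedDiagnosticFilter(FeatureFlagClient flagClient) {
        this.flagClient = flagClient;
    }

    @Override
    public FilterReply decide(Marker marker, Logger logger, Level level,
                              String format, Object[] params, Throwable t) {
        boolean diagnostic = (level == Level.DEBUG || level == Level.TRACE);
        String userId = MDC.get("userId"); // set earlier in the request pipeline
        if (diagnostic && userId != null
                && flagClient.isEnabled("targeted-diagnostics", userId)) {
            return FilterReply.ACCEPT;
        }
        return FilterReply.NEUTRAL; // defer to the normal level checks
    }
}
```

The filter would be registered on the `LoggerContext` at startup; because the flag is evaluated per event, flipping it in the flag system changes logging behavior at runtime without a redeploy.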
Charity Majors defines logs vs structured events and makes the point that an event can be described as a single, very big, high cardinality log statement. Targeted operational logging with high cardinality events is useful for observability, because it can help you nail down where a problem is happening. Targeted diagnostic logging is useful after that point, when it comes down to figuring out the decisions and state changes that could have caused the problem.
I use the term "diagnostic logging in production" because saying "debug logging" implies that this is logging specifically for debugging in a development environment, which is the exact opposite of the point. The reason I say "diagnostic logging is debug logging with an audience" is that too often, developers write logging statements at debug level as if they were scrawled notes, without putting thought into how clear or appropriate the debug statement would be to someone else.
I tend to think of requirements such as "audit logging" and "security logging" as business events that should be handled separately from operational/diagnostic logging – diagnostic/operational logs can be lossy and disposable, whereas auditing and security logs may need to be kept for much longer, and have specific features to prevent tampering and deletion. Also, auditing and security logging requirements typically do not change in production, so using log levels for auditing and security is typically unnecessary. In these situations, it's better to break out or decorate the business logic.
Finally, and most importantly: diagnostic logging is not a new practice, and I can cite my sources. I will also show that there is a broad body of developers who deliberately leave DEBUG and TRACE statements in code for diagnostic purposes, not only in production, but in testing environments as well.
Diagnostic Logging in Production
Strictly speaking, diagnostic logging is as old as software itself. Erik Dietrich points out that verbose logging is widely used in operating systems, installers, and command line tools:
Perhaps the simplest, quickest example of finding verbose information happens when you go to the command line. For instance, take the gzip command in a Linux environment. This is one of the simplest commands imaginable: give gzip a file and it compresses or decompresses it.
But if you add the “-v” flag to it, you tell it to run in verbose mode. It then obliges by showing you additional information that includes the compression/decompression percentage.
– Verbose Logging: Your Magnifying Glass for Bad Application Behavior
However, in this context, I'm talking about diagnostic logging in a long running server or daemon with the following qualities:
- For purposes other than debugging during development,
- In environments other than development,
- For users other than the author of the logging statement.
Let's start with the literature. These are the books I have on hand, so if you have older books on logging practices I'd be interested in picking them up.
We'll start with The Pragmatic Programmer (1999), whose section "Build Code That's Easy To Test" in effect advocates for testing in production:
"You'll often need to test a piece of software once it has been deployed. […] We can provide various views into the internal state of a module without using the debugger. […] Log files containing trace messages are one such mechanism." (Build a Test Window, p196)
The Pragmatic Programmer also discusses instrumenting methods with logging at TRACE level using AspectJ:
"[…] Log messages are normally generated by sprinkling explicit calls to some log function throughout your source. With APO, you implement logging orthogonally to the things being logged. […] If you weave this aspect into your code, trace messages will be generated. If you don't, you'll see no messages. Either way, your original source is unchanged." (p39)
The implementation I've gone with in terse-logback uses Java instrumentation and configuration in HOCON, so recompilation is not required, but the approach is the same.
Beyond Software Architecture (2004) has a chapter on logging. The chapter defines the different audiences for logging, discusses the need to provide context and correlation, and discusses good logging practices in the "Log Management" section. Of particular note is the discussion of dynamically changing log levels in a production context:
"Complex systems benefit from the ability to configure logging parameters during operation (such as level of detail or kind of operations to log) during operation. This allows customer support to work with customers in real time to resolve key issues. An even more sophisticated approach is to construct your system so that it can decide if you need to log data. If your fraud detector notices that something is amiss, have it log additional information for a while." (p241)
Later, when discussing the use of logging levels and markers:
"Debug: usually reserved for extended data: turned on when conducting error analysis." (p241)
Between these two passages, I think the book makes clear that debug level statements can be turned on when conducting error analysis during operation.
Continuous Delivery (2011) has a small section on logging that covers the basics:
"Your applications also need to produce good quality logs. In particular, it's important to pay attention to log levels. Most logging systems have several levels such as DEBUG, INFO, WARNING, ERROR, and FATAL. By default, your application should only show WARNING, ERROR and FATAL messages, but be configurable at run time or deploy time to show other levels when debugging is necessary. […] recoverable application errors, such as a user being unable to log in, should not belong anywhere above DEBUG level, whereas a timeout on an external system your application depends on should be at ERROR or FATAL level." (p320)
The book also mentions the importance of working with operations on logging:
"Logging, which is part of auditability, should be treated as a first-level set of requirements, the same as any other nonfunctional requirements. Talk to your operations team to work out what they need, and build these requirements in from the beginning." (p320)
I've also found a couple of interviews and blog posts about diagnostic logging in production. Andrew Rivers discusses using debug and verbose (aka trace) logging in production contexts dynamically:
"Most logging frameworks support multiple levels of details. Typically, you’ll have error, warning, info, debug, and verbose as the default logging levels available to you, and your code should be instrumented appropriately. In production, you’ll probably log info and the above, but if you have problems in specific components, then you should be able to change the tracing to debug or verbose in order to capture the required diagnostic information. If you’re getting problems, you should be able to make this change to logging levels on the fly while your systems are running, diagnose the issue, then return the logging back to normal afterward."
Being able to change logging levels on the fly without disturbing the rest of the application is an important feature, and one that has been an issue in the past. Many Unix daemons allow for diagnostic logging only by changing the configuration file and rereading it on SIGHUP, which can cause interrupted connections in the case of nginx and other high traffic servers.
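In a JVM application using Logback, by contrast, levels can be changed programmatically at runtime with no restart or signal; wiring this up to an admin endpoint or JMX is left to the application. A minimal sketch:

```java
import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.LoggerContext;
import org.slf4j.LoggerFactory;

public final class LogLevels {
    // Changes the effective level of a logger (and its children) on the fly.
    // Assumes Logback is the SLF4J backend.
    public static void setLevel(String loggerName, Level level) {
        LoggerContext context = (LoggerContext) LoggerFactory.getILoggerFactory();
        context.getLogger(loggerName).setLevel(level);
    }
}

// Usage: turn diagnostics on for one component, then restore it afterward.
//   LogLevels.setLevel("com.example.payments", Level.DEBUG);
//   ... diagnose the issue ...
//   LogLevels.setLevel("com.example.payments", Level.INFO);
```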
Pierre Vincent has a great talk on using logging in building observable distributed systems, and discusses logging in production as a "better debugging" system in this interview with InfoQ:
"Vincent: Investing in observability means to be prepared to spend the time on instrumenting systems, giving ourselves the tools to cope for the unknowns that come in production. We can have different ways to give ourselves that information in production. It can be very simple at the start, such as some basic health-checks. There’s a lot of tools to do powerful things with time series metrics. Metrics, tracing, logging, correlations, structured logging, events; there’s not one solution for everything but combined together it just brings a really powerful solution. We have to admit things may go wrong but when they go wrong we have all of these things that help us as detectives, help us like to react and recover faster. […] If I had to pick one thing, I think logging is going to get a lot more love in the next while. Even though it has its limits right now, there is a lot of potential in structured logging: provide more detailed context and enable better debugging."
One aspect of diagnostic logging is the ability to select a single request and write out diagnostic information for it in production without affecting the rest of the system, making it look almost like a trace. I've called this tracer bullet logging, and have found other instances of the same term, although I can't track down the etymology.
Yan Cui brings up tracer bullet logging in the context of AWS Lambda:
Having debug logs for a small percentage of invocations is great. But when you’re dealing with microservices you need to make sure that your debug logs cover an entire call chain.
That is the only way to put together a complete picture of everything that happened on that call chain. Otherwise, you will end up with fragments of debug logs from many call chains but never the complete picture of one.
You can do this by forwarding the decision to turn on debug logging as a correlation ID. The next function in the chain would respect this decision, and pass it on. See this post for more detail.
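A rough sketch of the same idea outside Lambda, using SLF4J's MDC to carry the decision; the header names and the 1% sampling rate are illustrative, not from Yan Cui's post:

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ThreadLocalRandom;
import org.slf4j.MDC;

public class DebugDecisionPropagation {
    // Illustrative header names.
    static final String CORRELATION_ID = "X-Correlation-Id";
    static final String DEBUG_ENABLED = "X-Debug-Log-Enabled";

    // On receiving a request: honor the upstream decision if present,
    // otherwise sample a new one (here, 1% of requests).
    public static void onRequest(Map<String, String> headers) {
        String debug = headers.getOrDefault(DEBUG_ENABLED,
                ThreadLocalRandom.current().nextDouble() < 0.01 ? "true" : "false");
        MDC.put("debugEnabled", debug);
        MDC.put("correlationId",
                headers.getOrDefault(CORRELATION_ID, UUID.randomUUID().toString()));
    }

    // On calling downstream: forward the decision with the correlation ID,
    // so the whole call chain logs at debug, not just a fragment of it.
    public static void onDownstreamCall(Map<String, String> outgoingHeaders) {
        outgoingHeaders.put(DEBUG_ENABLED, MDC.get("debugEnabled"));
        outgoingHeaders.put(CORRELATION_ID, MDC.get("correlationId"));
    }
}
```

A filter like the TurboFilter sketched earlier can then check `MDC.get("debugEnabled")` to decide whether DEBUG events pass through.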
Julia Evans reports on the experience being a developer on call:
There’s nothing quite like a system breaking, being in charge of fixing it, and having no way of seeing what’s wrong to convince you of the value of building monitoring and logging into your application.
Being on call will teach you quickly what kinds of diagnostics you need to debug your application. If you get paged because your application is taking an abnormally long time to make database queries, you can start monitoring how long your database queries take! Then next time it’ll be much easier for the person on call to see if that’s the problem.
The great thing about this is that these lessons last even beyond your current on-call rotations – you can notice “hey, every time I write a program I end up logging how long its database queries take, I’ll just put that in at the beginning this time!”
Mikael Krief details diagnostic logging and telemetry using a feature flag system:
Many times, we collect data that would be invaluable to have when investigating issues but would otherwise be undesirable to display always. For example, logging verbose information to a console window or displaying the Application Insights telemetry could all be useful. Flags would allow us to choose when to enable that behavior or not - without redeploying code.
Furthermore, feature flags allow us to control the configuration in each environment independently. For example, if telemetry in the canary and early adopter deployment rings were overloading/blurring our metrics, we could decide to disable the behavior without redeploying.
It’s invaluable to have the ability to enable detailed Application Insights and console logging in production when investigating an issue. Similarly, we recommend to separate telemetry on canary, early adopter, and production environments, or use feature flags to control the metrics in production. We’re looking at disabling telemetry in canary and early adopter deployment rings, so as not to overload or blur the production metrics.
Matthew Skelton sums up diagnostic logging very nicely:
Summary: As a Developer, you cannot attach the debugger to your application in Production, but you can use logging in a way that helps you easily diagnose problems in both development AND Production. You also get to make friends with Operations people – win! The applications we’re developing and running are becoming increasingly distributed, with requests crossing several service or container boundaries. The traditional debugger is of limited use in these scenarios, but with a solid, easy-to-use log aggregation capability in place, we can begin to explore ways of using logging to improve our understanding of how the software behaves in Production and enhance our ability to diagnose problems rapidly. Logging becomes not just a ‘dump’ of exception or error details, but a rich trace of application execution.
Diagnostic Logging For Tests
Diagnostic Logging is not only used in production environments. Anthony Vallone describes diagnostic logging at Google in a testing context:
Log quality is equally important in test and production code. When a test fails, the log should clearly show whether the failure was a problem with the test or production system. If it doesn't, then test logging is broken.
But there's more that can be done with logging than just reporting on the test failure. Logs can be an essential part of tests themselves.
Diagnostic logging is particularly useful for integration testing, because the logs themselves can verify the correct flow of data through the system. Artur Śmiarowski:
There is another interesting use case for logs that I stumbled upon a few months back. I had to implement an algorithm determining advertisement emission. Which ad, at which place should be displayed, given many different conditions. Even with multiple unit tests covering the codebase ensuring specific calculations return proper results, there is a risk that a whole system combining these components will not work properly.
In such case, especially when writing TDD, it's possible to test if logs are created exactly as we expect, given a specific input. It allows us to implement big features partially while testing along the way. To do this, we need to parse logs created in the test, finding if message exists.
The same solution can be applied when maintaining a monolithic project that doesn't work as intended and is difficult to unit test because of too many dependencies. Adding logging and some automated feature tests to old codebase can simplify it and serve as a data flow explanation.
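In a JVM codebase, this kind of log assertion can be sketched with Logback's ListAppender; the `AdSelector` class and its message are invented for the example, and the cast assumes Logback is the SLF4J backend:

```java
import static org.junit.jupiter.api.Assertions.assertTrue;

import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;
import ch.qos.logback.classic.spi.ILoggingEvent;
import ch.qos.logback.core.read.ListAppender;
import org.junit.jupiter.api.Test;
import org.slf4j.LoggerFactory;

// A made-up class under test that logs its decision.
class AdSelector {
    private static final org.slf4j.Logger logger = LoggerFactory.getLogger(AdSelector.class);

    String selectAd(String slot) {
        logger.debug("selected ad for slot {}", slot);
        return "ad-1";
    }
}

class AdSelectorTest {
    @Test
    void logsTheExpectedFlow() {
        // Capture everything logged by the class under test.
        Logger logger = (Logger) LoggerFactory.getLogger(AdSelector.class);
        logger.setLevel(Level.DEBUG); // ensure DEBUG events reach the appender
        ListAppender<ILoggingEvent> appender = new ListAppender<>();
        appender.start();
        logger.addAppender(appender);

        new AdSelector().selectAd("frontpage");

        // Assert on the trace of execution, not just the return value.
        assertTrue(appender.list.stream().anyMatch(
                e -> e.getFormattedMessage().contains("selected ad for slot frontpage")));
    }
}
```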
Kartik Agaram discusses the same approach to testing:
I've been experimenting with a new approach to keep the organization of code more fluid, and to keep tests from ossifying it. Rather than pass in specific inputs and make assertions on the outputs, I modify code to judiciously print to a trace and make assertions on the trace at the end of a run. As a result, tests no longer need call fine-grained helpers directly.
Diagnostic Logging for Developers
All of the above posts are written from the perspective of diagnosing production or testing problems, and are not focused on the advantages of diagnostic logging in a developer context. However, logging is an effective debugging tool for developers, and there are many useful resources available specifically for developers.
Effective Debugging (2016) discusses logging statements explicitly in "Item 41: Add Logging Statements":
"Logging statements allow you to follow and comprehend the program's execution. […] Things you can log include the entry and exit to key routines, contents of key data structures, state changes, and responses to user interaction. To avoid the performance hit of extensive logging, you don't want to have it enabled in a normal production setting. […] Implementing in your application a debug mode allows you to increase the logging verbosity only when needed (see Item 40: "Add Debugging Functionality"). You can also configure several levels or areas of logging to fine tune what you want to see." (p109)
and makes the case for why DEBUG / TRACE statements should be kept in code even after debugging is complete:
"When you (think) you're done with a logging statement, resist the temptation to delete it or put it in a comment. If you delete, you lose the work you put to create it. If you comment it out, it will no longer be maintained, and as the code changes it will decay and become useless. Instead, place the logging statement in a conditional." (p111)
The author, Diomidis Spinellis, discusses using logging as opposed to using a debugger:
"Instructions in the program's code that generate logging and debug messages allow us to inspect a program's behavior without using a debugger. Some believe that logging statements are only employed by those who don't know how to use a debugger. There may be cases where this is true, but it turns out that logging statements offer a number of advantages over a debugger session, and therefore the two approaches are complimentary.
First of all, the location and output of a logging statement is program-specific. Therefore, it can be permanently placed at a strategic location, and will output exactly the data we require. A debugger, as a general purpose tool, requires us to follow the program's control flow, and manually unravel complex data structures.
Furthermore, the work we invest in a debugging session only has ephemeral benefits. Even if we save our set-up for printing a complex data structure in a debugger script file, it would still not be visible or easily accessible to other people maintaining the code. I have yet to encounter a project that distributes debugger scripts together with its source code. On the other hand, because logging statements are permanent we can invest more effort than we could justify in a fleeting debugging session to format their output in a way that will increase our understanding of the program's operation and, therefore, our debugging productivity.
Finally, logging statements are inherently filterable. Many logging environments, such as the Unix syslog library, Java's util.logging framework, and the log4j Apache logging services, (http://logging.apache.org/) offer facilities for identifying the importance and the domain of a given log message. More impressively, Apple's OS X logging facility stores log results in a database and allows us to run sophisticated queries on them. We can thus filter messages at runtime to see exactly those that interest us. Of course, all these benefits are only available to us when we correctly use an organized logging framework, not simple println statements."
Spinellis is an excellent resource in general. For example, he discusses debugging using a winnowing and sifting approach leveraging data processing tools:
"I was recently debugging some failures in the processing of data files close to a terabyte in size whose processing took days. To solve them, I first added to the processing program many debugging options that generated detailed logs. I also added many internal consistency checks and corresponding reporting. I was thus able to narrow down the problem to about a gigabyte of data that would fail in less than a minute of processing. Then, I wrote a small filter that would isolate from the data leading to the failure, the records likely to be involved in it. This further reduced the suspect data to a few kilobytes in size, which I could manually inspect to pinpoint the problem. To ensure that my solution was correct and would stay so in the future, I also added a test case modeling the error and verified that the program handled it correctly."
Henrik Warne, like Spinellis, writes about using diagnostic logging as a way to make code debuggable by default, and using logging as a preemptive technique:
"All programs need some form of logging built in to them, so we can observe what it is doing. This is especially important when things go wrong. One of the differences between a great programmer and a bad programmer is that a great programmer adds logging and tools that make it easy to debug the program when things fail. When the program works as expected, there is often no difference in the quality of the logging. However, as soon as the program fails, or you get the wrong result, you can almost immediately tell the good programmers from the bad."
Håvard Kvam Buvik writes about the different audiences in logging and how to target logs appropriately:
One of the main goals with logging in a production environment is to have buggy code make a lot of 'noise' in your log, and quality code to be 'quiet'. This could be achieved by:
- Finding out who the users are and what functionality is important for them, regarding logging.
- Carefully considering WHEN to use WHAT logging level. Especially consider reserving a specific level (e.g. INFO) for lifecycle events, and configure the logging framework for running with that level in the production environment. All debugging events should be logged at a lower level, to keep the production log small and clean under normal operation.
- Implement buffering of logging events and have them flushed to the log only when a logging event at a configured log level is received
- In this way the logs will contain a lot of information from the code that is buggy.
- Considering whether audit logging should be solved as an application feature and not by the logging framework at all.
This post is also notable for being almost the only blog post I can find that provides an end-to-end worked example in SLF4J with correct information, discussing the use of INFO for lifecycle notification, DEBUG for state changes, TRACE for enter/exit information, and the correct use of markers for auditing.
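For reference, marker-based audit logging in SLF4J looks roughly like this; the marker name and the `RoleService` example are my own illustration:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.Marker;
import org.slf4j.MarkerFactory;

public class RoleService {
    private static final Logger logger = LoggerFactory.getLogger(RoleService.class);
    private static final Marker AUDIT = MarkerFactory.getMarker("AUDIT");

    public void changeRole(String userId, String newRole) {
        // ... business logic ...

        // The marker, not the level, identifies this as an audit event; the
        // logging configuration can then route AUDIT events to a separate,
        // tamper-resistant destination regardless of level settings.
        logger.info(AUDIT, "user {} changed role to {}", userId, newRole);
    }
}
```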
Erik Hazzard makes the point that diagnostic logging serves as communication, by providing a "living comment" that shows up both in the source code and in a running program:
To me, the ideal commenting system involves small functions with a sentence or two high level description at the top of the function describing its purpose. This, coupled with well named variables, is expressive (e.g., "var x=4" doesn't count - this is position, some random variable, what?). Code should describe itself, but those descriptions often aren't enough to give a high level overview of what's happening. In reality, functions often times get to be larger than anticipated, and due to various non-programming constraints (e.g., time), we often don't have the luxury (or are too lazy) to go back and write proper comments or cleaner code.
Logging helps solve this problem. By sprinkling a function with log messages, you're implicitly providing a comment about what the code should be doing at that point. When the program runs, it explicitly logs what it is doing. Reading and understanding a function's behavior is easier when there are comments, but if you can replace a lot of comments with log messages, you get both the benefits of documentation and explicit information when a program runs.
Hazzard also discusses the benefits of logging as a "persistent breakpoint" that can show significant state and behavior changes in the flow of code for other developers:
"With logging, it's almost as if I set breakpoints everywhere, so I can trace the flow of execution (and a lot of state) for every call after they've happened. This, of course, isn't convenient just for me."
Finally, there's good old Software Engineering StackExchange:
"A good logging scheme should allow you to diagnose the root cause of any problem without recompiling code. Obviously, just logging the stack trace isn't going to give you much context. On the other hand, logging every logical step is going to kill application performance and eat up a ton of disk space. The trick is to default your logs to the minimal amount of detail, but allow yourself the ability to turn up the logging level in the event of a problem."
"Ideally, you would be able to enable / disable on-the-fly as the application is running."
"Keep all logging code in your production code. Have an ability to enable more/less detailed logging in production, preferably per subsystem and without restarting your program."
– What are some patterns and anti-patterns of application logging
"DEBUG logging must be able to be used in production (but tends to still be off normally) / TRACE logging is allowed to be such that using it in production (other than for a specific / short session) is infeasible."
– Why does the TRACE level exist, and when should I use it rather than DEBUG?
"Note: You want at least TWO log files. One at DEBUG level - providing all the information you can possibly imagine you will ever need (including INFO and above). If that is much too much, then consider having an additional TRACE level. One at INFO level - providing the 10000 meter overview of the system. Important milestones go here. The INFO one is always on. The DEBUG one is on when needed. Write logging code anticipating that you may one day have to debug a situation where ALL you have is the log file, and doing it right may be the difference between getting fired and getting promoted.
Diagnostic Logging on Exception
I mentioned in an earlier blog post the history of ring buffer logging, which goes back to Brian Marick in Using Ring Buffer Logging to Help Find Bugs (2000). It turns out this pattern is also mentioned several times in the above blog posts.
Håvard Kvam Buvik discusses keeping diagnostic logging in a ring buffer and only dumping it when there is a problem, a pattern I call "triggered diagnostic logging":
Implement buffering of logging events and have them flushed to the log only when a logging event at a configured log level is received. In this way the logs will contain a lot of information from the code that is buggy.
Anthony Vallone also mentions it:
"When errors occur, the log should contain a lot of detail. Unfortunately, detail that led to an error is often unavailable once the error is encountered. Also, if you’ve followed advice about not logging too much, your log records prior to the error record may not provide adequate detail. A good way to solve this problem is to create temporary, in-memory log queues. Throughout processing of a transaction, append verbose details about each step to the queue. If the transaction completes successfully, discard the queue and log a summary. If an error is encountered, log the content of the entire queue and the error. This technique is especially useful for test logging of system interactions."
As does Tomasz Nurkiewicz:
As you already know, one of the biggest benefits of using logging frameworks are logging levels. By carefully choosing levels for each logging statement we can easily filter which logs should be present in our log files and which shouldn't. Also we can apply different logging strategies for different environments. This is in theory. In practice we often face the choice between: log everything just in case and handle gigabytes of meaningless log files or log only warnings and errors but when they actually occur, they are meaningless as well, lacking important debugging context. The idea isn't new (see [1], [2] and [3] for example), but somehow decent implementation is missing in both Log4J and Logback. And the idea is simple – as long as there is nothing wrong happening with the system: do not log anything or log very little – but silently memorize all debug logs in some cyclic buffer. And whenever disaster occurs (any log with ERROR level, probably an exception), dump the buffer first to provide meaningful context.
And finally, Henrik Warne discusses triggered diagnostic logging in the context of logging that is targeted to a session:
Session-based logging means that you get very detailed information on everything that happens in a certain session. There is no need to grep through lots of log statements, looking for the parts you are interested in. Conversely, there are no missing logging statements that would help understand what happened – everything is included. […] In our system at work, we have a useful variation of the session logging called Trace On Error. When activated, if an exception (e.g. a NullPointerException) is thrown, the complete trace of the session up to that point is output, in addition to the stack trace. It works by starting session logging for every session, but only outputting the result if an exception occurs. If not, all the collected session information (basically a bunch of Strings) is discarded. This puts some pressure on the heap and the garbage collection, but not an awful lot. We have used this mechanism on live system several times in order to get more information on difficult to solve bugs.
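Pulling these together, here is a minimal sketch of triggered diagnostic logging, assuming a simple bounded per-request buffer rather than any particular framework's implementation:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Buffer debug lines in memory, discard them on success, and dump them
// only when an error occurs, so the error arrives with its full context.
public class RingBufferLog {
    private static final Logger logger = LoggerFactory.getLogger(RingBufferLog.class);

    private final Deque<String> buffer = new ArrayDeque<>();
    private final int capacity;

    public RingBufferLog(int capacity) {
        this.capacity = capacity;
    }

    public void debug(String message) {
        if (buffer.size() == capacity) {
            buffer.removeFirst(); // ring behavior: drop the oldest entry
        }
        buffer.addLast(message);
    }

    // Called on success: the diagnostic detail was never needed.
    public void discard() {
        buffer.clear();
    }

    // Called on exception: replay the buffered context, then log the error.
    public void dump(Throwable error) {
        for (String line : buffer) {
            logger.error("[replay] {}", line);
        }
        logger.error("Request failed", error);
        buffer.clear();
    }
}

// Usage per request:
//   RingBufferLog ring = new RingBufferLog(512);
//   ring.debug("validated input for order " + orderId);
//   ... on success: ring.discard();  on exception: ring.dump(e);
```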
Conclusion
Diagnostic logging is a well-known technique for resolving issues in production, testing, and development environments. Logging levels may also be changed dynamically in production environments to surface diagnostic logs without restarting the system.
Diagnostic logging may be targeted to a particular scope, e.g. a request or session, and may be written to an in-memory ring buffer and dumped on exception or on unusual system behavior.
Diagnostic logging is one possible instantiation of observability, but is not a replacement for observability as a whole, and is oriented towards a particular use case.