It's the calm after the storm. After a year working on logging and observability things, I gave a talk at Scalacon 2021 about debugging and observing your scala code and there will be another talk at o11ycon on conditional distributed tracing on June 9th.
Let's say you're running services in the cloud. You've got multiple hosts running multiple services, and at any given time they may be scaling up and down automatically, restarting, and generally being ephemeral computing resources.
Suddenly a wild exception occurs!
AskTimeoutException. The stack trace will tell you only that it came from a dispatcher. There's no other context associated with the ask timeout.
Observability will detect the exception. But in practical terms, even observable systems don't care about an individual exception. It cares about the health of the system, defined in terms of availability and throughput. If the system is processing a normal volume of events and all the hosts are up and there are "few enough" errors, then the exception itself isn't enough to trigger an action.
So say you're a developer, looking at an unreproducible exception with no context or clues.
What do you do?
Needle In A Haystack
This is debugging, but not in a nice safe development or staging environment. Many bugs will only exist in an sufficiently complex environment, where there's enough of a tire fire that emergent behavior comes into play. I talk about this more in developing in production.
Modern debugging cannot assume that a production system doesn't need debugging support. It's the reverse: you must assume that a production system will require debugging support, because that's the only place the bugs are.
Spinellis sums this up nicely in Modern Debugging: The Art of Finding a Needle in a Haystack:
Modern computing rarely involves an isolated process running on a system that matches a programmer's particular development environment. In many cases, the programmer is dealing with tens to thousands of processes, often distributed around the world and with diverse hardware ranging from resource-constrained Internet of Things (IoT) devices, to smartphones, to experimental and specialized platforms. While these systems are the fuel powering the modern economy, they also present programmers with special challenges. According to the insightful analysis by Ivan Beschastnikh and colleagues at the University of British Columbia, these are heterogeneity, concurrency, distributing state, and partial failures.4 Moreover, following my own experience, add the likely occurrence of events that would be very rare on an isolated machine, the difficulty of correlating logs across several hosts,2 and replicating failures in the programmer's development environment.
Common observability tools such as distributed tracing and event logging can aid in showing when a particular request errored, and are absolutely useful to developers.
Distributed tracing is typically implemented through instrumentation of libraries and frameworks. At most, spans will be rendered at a component level, rather than at method level (as is common in trace logging). Distributed tracing is commonly sampled, and while there is "tail based sampling" that increases the possibility of a trace happening on a request which has an error, tail based sampling only kicks in when an exception can be connected back to a particular request.
Event logging is a single log entry with a timestamp that contains relevant fields from the request and the response. If there's an exception, the event has enough information to record the stack trace. But again, if there's no exception, the event logging will not have enough information to show the bug.
Distributed tracing and event logging are request/response driven. They are very good at determining when an input produced unexpected output. They are not so good at determining when an input caused unexpected output in the next response. Internal state in particular is a hidden menace. In a stateless system, there are no places for logic to hide. When writing to a database or to an in-memory data structure, there are opportunities to cause corruption and infections of the system that may cause exceptions and failures that can appear completely disconnected to the actual cause.
In addition, they do not deal well with things that don't happen that should have happened. If a user registration should fail from a constraint but doesn't, distributed tracing or event logging will only record a successful result. This can make it difficult to track down, because often the happy flow is more aggressively sampled.
Developers Want Logging
Meanwhile, developers think in terms of printf debugging and logging statements.
printf debugging may not be the most fun at a party but they're always there when you need them.— Three macros in a trench-coat 💉💉🎉 (@mvsamuel) May 12, 2021
Developers know they think in terms of printf debugging.
Phrase of the day: "Caveman Debugging". There is no software issue so profound, complex, or inexplicable that it cannot be resolved via a surfeit of print statements.— ericasadun (@ericasadun) July 4, 2018
So much so that it's a trope:
What's your theory for why programmers reach for printf-debugging more frequently than other modes (like step-debugging)?— Amjad Masad ⠕ (@amasad) April 9, 2021
So much so that in a study of debugging, this showed up as a survey question:
The best invention in debugging was still printf debugging.
The survey found
SR1: Most developers use IDE debuggers in conjunction with logfiles and print statements. […] Besides using the IDE debugger, respondents indicated they examine log files (72.2%), followed closely by using print statements (71.6%).
SR2: Developers not using the IDE debugging find external programs, tests, print statements, or other techniques more effective or efficient.
And this mentality is reflected in production. In Debugging Incidents in Google's Distributed Systems, a difference in SWE (software engineers) vs SRE (site reliability engineers) is that given the option SWEs will gravitate to the logs first:
SWEs are more likely to consult logs earlier in their debugging workflow, where they look for errors that could indicate where a failure occurred.
There is even analysis on how programmers debug, based on information foraging. Broadly put, programmers sniff out cues in the environment to find where a bug. When there is not enough information to be useful, programmers add information through enrichment. Logging and printf statements are examples of enrichment.
In short, the vast majority of developers reach first for logging. Any other tools are mainly fallbacks when logging is insufficient.
Giving Developers What They Want
If developers start with printf debugging, think in terms of logging, and rarely use debuggers or fancy tools, then let's lean into that. Let's start by building tooling and infrastructure that support the way developers actually debug.
Making life easier for developers isn't just scratching a developer itch. Engineering time is expensive, and anything that reduces the amount of engineering time spent on debugging is a productivity win. Reducing the time to debug is an upstream prerequisite for code-review latency, number of rollouts per day, project velocity, and a number of other common project health metrics.
The answer here is to log everything in production, down to method level tracing, in the application code that we control.
But we're going to be smart about it.
The Diagnostic Pipeline
There's an assumption that logging is expensive. The truth is that if you look at the costs, it's mostly storage of logs on disk and indexing that form the vast majority of the costs. Logging is cheap; it's logs that are expensive.
For logging at a
DEBUG or below level, there is a limited window in which that data is useful, and it's rare that we'll want to look at it. Rather than sending logs through the observability pipeline and incurring the costs of aggregating, indexing, and storing logs for days or weeks, we can set up a diagnostic pipeline where a bounded store is created and stored locally on the machine and can be exported on demand. This gives you a pool of debugging information on every server that may range from minutes to hours, depending on your logging volume – more than enough time to capture relevant data when a bug is detected.
The computational act of logging itself is inexpensive. The CPU overhead involved in logging is trivial. Most logging frameworks use an asynchronous appender with a dedicated thread that removes blocking. Advancements in newer JVMs have made complaints about memory allocation pressure and the corresponding costs in garbage collection obsolete.
Ah, but what about the I/O constraints? Given that a cloud instance can only write 240MiB/second to disk, won't you run out of throughput?
And the answer to this is that we don't need to log debug information to disk at all. Logging to memory is fast. The only problem with logging to memory is that generally memory is private to the application – this is the big drawback to using a private ring buffer. We need to write to a region of memory that can be accessible to multiple applications, so that we can read and tail various logs outside of the application.
Given that most cloud instances have gigabytes of memory that are simply unused, we can write to a bounded memory mapped file mounted on tmpfs. When the file reaches its maximum size, then the oldest entries are deleted, like a ring buffer. Using a file that is backed by memory provides the same speed of ring buffers, but makes the logs externally available to outside processes and the operating system, so it's easier to query and export data. The memory bandwidth napkin math estimates 1 GB/s of memory bandwidth at a lower limit, with 5 GB/s as a reasonable estimate.
Setting up a log file on
ramdisk is something that has to be done manually, and you'll have to serialize the logs to disk if you're shutting the server down, but depending on the distribution this may be automatic. It has the advantage that it will work without modifying your application, but it is a bit more fiddly.
Having the application do this automatically is better. We need an appender that supports bounded memory mapped files, writes out in a format that can be easily queried (avoiding the costs of scanning and parsing a linear file containing JSON), is cross platform, and has excellent reliability and a solid ecosystem. Fortunately, there's a tool that is perfect for the job: SQLite. By appending to an appropriately configured SQLite database using Blacklite, log entries can be inserted at a rate of 803K inserts per second.
Once you have this pool of data available, you can have your application query it for information.
SQLite makes this particularly easy as there's a JSON API with useful methods like
And SQLite has the ability to generate columns and manage views that can present that JSON appropriately. If there's an exception, you can have the application query SQLite, get a backtrace of everything relevant up to that point and save it to S3 with the Sentry error report.
Exporting data with SQLite is straightforward, because SQLite is well nigh ubiquitous.
SQLite information can be backed up using the backup API or a
VACUUM INTO, or pulled using sqlite-utils, or can be replicated to S3 or Azure using litestream.
Once you've got an SQLite file, you can work with it directly. If you want to look at several databases at once, you can
ATTACH and create a
UNION view across databases for up to 10 databases. You can use Pandas and convert the JSON to dataframes.
Producing Good Debugging Logs
At this point, we've got logging data we can work with. But we need to ensure that what is in the logging statement itself is also something we can worth with. The importance of providing good structured data in event logging is well-known, but structured data is also important in logging debugging information. If the goal is to log everything from the application into SQLite, then we have to change the way that we think about debugging data and go from looking at a flat log file containing JSON to querying a database with time ordered data. Debugging information is complex, and has to be structured. To do this, structured logging has to be more than possible – it has to be easy.
The issue here is that SLF4J predates structured logging. While
log4j dates from 1999, structured logging is first mentioned in 2006, and is only implemented in Java with
logstash-logback-encoder in 2013.
- 1999: log4j
- 2005: SLF4J
- 2006: Logback
- 2006: Logging in JSON
- 2011: Logs for Machines in JSON
- 2012: Log4J 2
- 2013: logstash-logback-encoder
As a result, it's very easy to get things wrong when using structured logging with Logback:
This is where Blindsight comes in. Blindsight is a Scala library that makes it easy to do structured logging the right way, extracting rich information out of objects that can be queried and analyzed later.
You can read about Blindsight's design choices here. I'll keep it simple and walk through a project.
DSL and Type Classes
Blindsight doesn't do
toString. Instead, there's a DSL to provide structured arguments.
Building up Context
Loggers can build up context and log with additional context:
Loggers can decide to log on condition. Conditions can be set on a logger using
withCondition, or on a statement using
Conditions are powerful. In particular, one of the great things about conditions is that they can incorporate application specific logic and can change the way that logging operates in a way that opens doors that
isLoggingDebug cannot. Consider logging by feature flag, by an operational "tracer bullet" flag, by a circuit breaker, by absolute or relative time, or by scripting.
Each of these is a blog post in itself, but I'll dig into scripting as an example as it shows what Blindsight is capable of.
Scripting Dynamic Conditions
Microsoft mentions dynamic conditions in The Bones of The System: A Case Study of Logging and Telemetry at Microsoft:
What’s really for us is to be able to real-time turn on and off of what log stuff you’re collecting at a pretty granular level.
Grub mentions the same thing in Structured Logging: The Best Friend You’ll Want When Things Go Wrong:
This makes the log levels actually more meaningful now - developers can now deploy with the usual WARN or INFO log levels, and when things go wrong, just with a configuration change they can update the log level to DEBUG and make their services output more logs when debugging. This also helps us keep our logging costs in check.
Blindsight takes this to the next level by enabling the evaluation of scripts that control logging, potentially logging by file name or method! For example:
when given the script:
Will render a debugging statement. This script can be pulled in from a string based feature flag and changed on the fly.
Flow Based Logging
Flow Based Logging is commonly used for tracing methods on entry/exit. This can be done by providing a behavior for the arguments as an implicit.
Because the behavior can be swapped out, this means you can do fun things and implement your own tracing if desired:
If developers use printf debugging rather than logging, it's usually because it's more work to type
logger.debug than it is to type
printf. To encourage logging, we need to make it easier to use than
printf. There needs to be less typing and less thinking involved in doing the right thing.
What do people actually do in printf debugging? From Industry Practices and Event Logging: Assessment of a Critical Software Development Process:
Finding 5: Event logging serves three major purposes in the reference domain, i.e., state dump, execution tracing and event reporting.
Since most debugging statements are conceptually simple, Blindsight can take over the work of constructing debugging statements, using macro-based inspections.
For example, we can decorate a block so that any declared vals will have a
logger.debug statement after the declaration:
We can do the same with if statements:
And given that public fields of a class are accessible, we can also dump these fields via a macro:
There are more intentions packaged, and I'm sure you can think of more commonly used intentions yourself. I've searched the literature but haven't found a body of work targeting these kinds of constructs – please let me know if you are aware of any papers or projects!
Blindsight covers the case where you own the code and can add logging, but often there are situations where the logging that you need to do is inside a library itself. We can do this by injecting logging with instrumentation, the same way that we do for profiling or tracing.
For example, in tersesystems/logging-instrumentation-example we log entries and exit of
and the output will contain:
This can be combined with flow tracing to produce instrumented tracing.
Instrumentation is particularly helpful when you are dealing with a dispatcher or Akka Stream which simply swallows the exception rather than propagating it. For example, we can instrument
failStage in Akka, or we can instrument
Where Do You Start?
The next question is where to start.
There's some general advice that is good for any application. Some of this runs counter to common wisdom, as you do have the ability to change logging levels at runtime.
For libraries, you should set the logging level to
INFO out of the box, or
WARN if they are very noisy (like Hibernate). You do have the option to change the logging level using ChangeLogLevel or your own tooling, but the caveat here is that libraries may have expensive operations and rendering in their debug output, and that may impact runtime operations.
For your own application, set the logging level for your base package at
TRACE level, and set a filter that stops anything below
INFO from heading to your normal ELK stack. If you have any expensive debugging operations that you want to run or or you want to dump some giant objects, use dynamic conditions so you can turn them on and off when you need them.
There may be situations where you want to suppress exceptions or errors because you are deploying or bouncing the database. Conditions can be useful here, but don't go overboard, and keep suppression temporary by tying it to a duration or deadline.
Improving the software logging practices in DevOps talks in terms of what-to-log, where-to-log, and how-to-log. Let's dig into that.
How Do You Log?
There are some cases where adding logging is either difficult, because there's limited access to the data structures or context.
One of the challenges in Scala is that logging through functional programming can be awkward because there's no facility for managing immutable data through a chain of logic. You can look for methods like
wireTap to inject logging into a functional chain of logic, but it can be hard to pry apart code that maximizes for cyclomatic complexity. treelog can integrate immutable operations into a tree structure that shows the results of computation, so it's a great option if you can build FP chains around it.
In addition, concurrency constructs can be an issue in logging. Logging, particularly structured logging, has to be aware of context. In Scala code, it can be disturbingly easy to lose context by passing through an async boundary, by passing data off to a
CompletionStage or just calling Akka Streams. There are ways to pass message context through Akka Streams but again, this requires being aware that logging is a concern and ensuring that context is handled appropriately.
Both Log4J 2 and Logback were designed before concurrency really got under way, and so any library that leverages MDC has to be managed carefully. There's any number of stack overflow posts for getting MDC to work with thread pools, as child threads do not inherit the MDC map. It's unfortunately not Logback specific – Log4J 2 also requires custom programming to get the thread local map working with thread pools, something that specifically came up with Finagle and is still pretty confusing. This problem only gets worse when using the
CompletionStage async API, which can bounce between different executors at will, at which point you need a
MdcCopyingCallableWrapper, or using compositional tracing techniques.
Ultimately, the best way to ensure that logging is easy is to be ensure that good logging is an explicit requirement and is baked into any new feature. Ideally, you can prove that you can trace everything by showing the control flow and actual vs expected results in your tests; tracking your logging through tests will also show when there are regressions and lacunae in your observability.
What To Log?
The next question is what should go into logging, especially debug and trace logging. Konrad has put together an excellent guide to logging in libraries in Swift. The situation is a little different in applications and services; because we know the audience, we can be more domain specific and determine the areas where additional logging is appropriate.
There's a temptation to be dogmatic and add hard and fast rules for logging. But often, logging isn't like that. Logging accumulates.
Like a desire path, good logging shows where people have gone before you.
Like witness marks, logging also shows the repairs and work that has gone into a section of code.
Logging starts with "printf debugging" in rough notes:
When you (think) you're done with a logging statement, resist the temptation to delete it or put it in a comment. If you delete, you lose the work you put to create it. If you comment it out, it will no longer be maintained, and as the code changes it will decay and become useless. Instead, place the logging statement in a conditional. (p111) – Effective Debugging
But it also incorporates the idea of a journal:
Log as if your program was writing a journal of its execution: major branching points, processes starting, etc., errors and other unusual events. – Logging Wisdom: How To Log
It should be possible to determine the internal state of the system at each point:
Logs should contain sufficient information to help with the reconstruction of critical state transitions. – Neal Hu, Logging Like a Pro
It should be debuggable rather than clean:
It is not to say that clean code is bad, but sometimes the practice of clean coding is more akin to sweeping problems under the rug. Debuggable code isn’t necessarily clean, and code that’s littered with checks or error handling rarely makes for pleasant reading. – Write code that’s easy to delete, and easy to debug too.
It should anticipate problems:
To put the conclusion in general form, we posit that the most important guideline to win against the debugging monster that is silently devouring our productivity is anticipation. Debugging should no more be looked at as an anomaly or an exception but rather as a significant part of the development effort. A more defensive approach toward debugging that anticipates and prepares to minimize the cost of debugging very early is needed. In mature hardware systems, an integral part of the system design is the diagnosis capabilities, which assume that any part of the system may fail. Likewise, software systems should have similar pessimistic approach that makes debugging an important part of the design. As Knuth put it: “The most effective debugging techniques seem to be those which are designed and built into the program itself”. – Towards an Understanding of the Causes of Difficulties in Debugging
And logging should pay particular attention to the critical path:
None of these tools are […] good in pointing out what critical paths and dependencies require the MOST attention. This is how things look like in the wild world. And the blue line is what a true critical path looks like. User requests come to the load balancer, goes through several servers all the way down to low-level disks. – Critical Path Driven Development
I think of the above concepts as elements of diagnostic logging.
I've called diagnostic logging "debug logs with an audience" but maybe a better phrase is "how would you guide someone else through this code and be a light in the darkness." As a concept, diagnostic logging has been around for a long time, but it has only become more important.
Good diagnostic logging adds waypoints and feedback so that a developer isn't starting from zero when looking at a potential bug, but it's more than just a mechanical process. Rather than "name=value" debug statements, diagnostic logging uses language specific to the domain in the message, and maps relevent data into the structure so it can be queried later. Diagnostic logging doesn't simply say what is happening, but also proactively suggests "if there were a bug, where would it be?" And where tasks are passed through asynchronous boundaries, logging explicitly manages context so that a task can be traced back to its original operation.
When a developer has a deep understanding of the code, they develop an intuition that can be difficult to put into words. Diagnostic logging is a place to put that intuition and that guidance and point out what objects and what states are significant. No debugger or instrumentation can do that – only a developer who cares.
The next steps are going to be about leveraging one of the unique strengths of structured logging: showing the change in state of a data structure over time. This is an emergent property of logging, because all you have to do is log a mutable object repeatedly and you have two log entries that you can compare for differences.
There is no general purpose tool for identifying and isolating changes in state. If you have explicit event driven state with deterministic logic that's ideal is easy, but you can't get that for free. Tools like diffx can isolate and render only the changed properties of a case class, but we can actually do much better on standardized JSON.
For example, we can isolate a particular branch as referring to an object:
We can then aggregate the patches to create a sequence of events over time. From a structural perspective, rendering a data structure changing over time is an example of a temporal network and can be managed with tools like Gephi that make it very easy to work with dynamic networks as a stream.
But we're not limited to visualizing just data structures. With a rich enough stream of events, we can also analyse and visualize data flow on the fly with logging as well. For example, we can visualize data flows using marble diagrams, shiviz, or dynamic object process graphs. We can feed tracing information into lttng or Tracecompass or Chrome Tracing and take advantage of their visualization and analysis.
Beyond that, there is a wealth of semantic information we can extract by enriching the semi structured data with semantic information from SemanticDB or Tasty. We can dig into Why-Across-Time Provenance and trace bad data back to its original calculation. We can start treating logs as knowledge graphs and inference engines and note unusual flows and edge cases automatically.
We are barely at the infancy of what we can do with logging, and I think that we can significantly improve the way we write and debug code; by providing a richer stream of structured logging data and bringing existing tools to enrich the feedback loop in ways we can't even imagine.