I've talked about logging with ring buffers before, in Triggering Diagnostic Logging on Exception. The basic idea is that you log DEBUG and TRACE statements to some in-memory data store, and only keep the last 1000 records. If there's an exception, you dump those records to the persistent logs so you can debug what led up to the exception.
At that time, it was still pretty rough. I was still hampered by the append-only nature of files, and so I put the stored diagnostic logging into a nested JSON field. I also didn't have a good way to query and filter on records so that only relevant contextual diagnostic information made it into the event.
I'm happy to say that those problems are solved, and I've got a showcase application that dumps records to an in-memory SQL database on exception, then queries the database for records with a correlation id. On top of that, it also integrates with Sentry and Honeycomb to display the diagnostic statements as breadcrumbs or traces, respectively. I am calling this feature "backtracing" because
TRACE statements can go away and then comb back. I think this is safe – I don't think anyone's used "backtrace" and "stacktrace" as synonyms for years now, and I think this makes more sense anyway. (I did find a support article from Veritas that also uses this terminology, so I am vindicated.)
As an example, I've got two screenshots here, one showing a trace with diagnostic information rendered as span events:
and another one showing diagnostic logging as just plain spans:
On top of that… because it's an SQL database, I can expose those logs directly through the web application using bog-standard JDBC. See demo instance on Heroku, click on the flaky link, and then follow the links to see the records with that correlation id.
Then click on the logs link to see all the logs, not filtered by correlation id.
You can also see the detail page, which will show you the JSON fields, and a link to Honeycomb. No link to Sentry yet, I'm still trying to figure out how to construct a URL that points to a specific Sentry issue, but I do get the Sentry email.
The Technical Details
There were several interesting/fun things involved in doing this.
The first problem was setting up ring buffers to be more flexible, and upgrading them to be a first class construct.
First: give up on the idea of ring buffers containing processed bytes. Different appenders contain different encoders, so from now on all ring buffers contain instances of
ILoggingEvent. Also, upgrade the ring buffer implementation to use a lock free queue.
Second: set up a
RingBufferAction and a
RingBufferRefAction in Joran, which gave the ability to reference specific ring buffers by reference and hook them into appenders.
Third: create a
RingBufferAwareAppender which takes an
appender-ref and a
ringbuffer-ref and a filter. If a record passes the filter, it goes into the appender. If not, it goes into the ring buffer.
Now we've got the ring buffer populated, but we still need to be able to dump when necessary.
Fourth step: create an appender which does not actually append anything directly, but dumps the records from the ring buffer into another appender, and create a custom logger that is going to be our handle to it.
Okay, so far so good. Now we need to make sure that the JDBC appender knows that the correlation id is special and can be queried.
Setting up JDBC
Fifth step: Break down the PostgreSQL appender down to a JDBC appender that we can extend from, so we can use it with H2 without using any PGObject code. Change the implementation so that subclasses can add extra columns to the database and define indices on startup. Make it use hikaricp as an underlying connection pool.
Sixth: Add a correlationid module to terse-logback. This is a standalone, vertically oriented set of Logback classes that only deals with correlation id:
CorrelationIdMarker etc. Set this up in the showcase:
Now we're done. Set up the showcase with a
LogEntryFinder to query for records with a correlation id and present them.
Setting up nanoseconds
First problem: records show up out of order. Is it the JDBC timestamp? No, that has enough resolution. It's Logback: it uses
System.currentTimeMillis and so several records show up with the same timestamp. No problem, I'll sort by the H2 serial primary key.
Problem: records show out of order even with the serial id. Right – because HikariCP has a connection pool, insertion order isn't guaranteed. No problem, there's a
UniqueIdLoggingEvent that decorates Logback's
LoggingEvent and provides a flake id. I'll sort by that.
Second problem: records show up in the right order, but we still don't know how much time elapsed between the two records. I'll know, I'll use
System.nanoTime(), and create an event that decorates
LoggingEvent with a relative nanotime – now i know that the time between two events is 11981656 - 11808036 = 0.17362 milliseconds.
Third problem: Now I have two classes that decorate
LoggingEvent. And there might be more. What I really want are mixins.
Now I can do
event.getComponent(NanoTimeSupplier.class).getNanoTime() and only have one wrapper around the
LoggingEvent – plus, now I can now add as many components to the event as I want, without having to go through markers or structured arguments.
Setting up Play
Now to create spans in Honeycomb. This requires… a start time. Let's add an
Instant to the request as a typed attribute in a Play Filter. Then when an error happens, we can capture it in the error handler, and pull it out of the request.
First problem: start times aren't there in the request. It looks like the error handler returns the original, unmodified request, rather than the modified request with the start time. Okay… we can still do this. There's a place where all requests start from, the
RequestFactory. We'll just override Play's application loader to use a modified request factory with a start time. Done.
Oh, and tie JMX beans directly to the controller so we can change log levels without having to type out the full logging path. Okay, now done.
I'm really happy with this. Getting good debugging information out of an application has bugged me for years, and so finally dealing with this makes me feel quite chuffed. I started this almost a year ago now, and I'm still not done. But next up is building a decent Scala Logging API.