Queryable Logging with Blacklite

TL;DR I've published a logging appender called Blacklite that writes logging events to SQLite databases. It has good throughput and low latency, and comes with archivers that can delete old entries, roll over databases, and compress log entries using dictionary compression, which looks for common elements across small messages and extracts them into a shared dictionary.

Overview

Blacklite is intended for diagnostic logging in applications running in staging and production. When things go wrong and you need more detail than an error log entry can provide, Blacklite can reveal all of the messy detail in your application. Because Blacklite uses SQLite as a backend, the application can use SQLite's JSON API to search for relevant debugging and trace entries – typically by correlation id – and decide how to respond.

For example, you may forward diagnostic logs to centralized logging, attach them to a distributed trace as log events, or even just send an email – see the showcase application for details.

Diagnostic logging has no operational costs, because it's local and on demand. There are no centralized logging costs involved – no indexing, no storage costs, no Splunk licenses or Elasticsearch clusters. Logging at a DEBUG level in your application using Blacklite has minimal overhead on the application itself. Archiving ensures that the database size is bounded and will never fill up a partition.

SQLite databases are also better log files in general. Queries are faster than parsing through flat files, with all the power of SQL. A vacuumed SQLite database is only barely larger than flat file logs. They are as easy to store and transport as flat file logs, but work much better when merging out-of-order or interleaved data from two logs. Unlike flat file JSON, additional information can be added as ancillary tables in the database without disturbing the main log file, allowing queries by reference outside the original application context. And of course, SQLite is supported on practically every platform and works very well within the larger data-processing ecosystem.

So that's the sales pitch. Now let's do the fun question – how and why did it get here?

I started off this blog post by writing out the requirements for a forensic logger as if I had total knowledge of what the goal was. But that's not what happened. The real story is messy, involves discovering the requirements piecemeal, and includes lots of backtracking over several months. I think it's more interesting and human to tell it that way.

The first part is dissatisfaction.

Dissatisfaction with Ring Buffer Logging

Blacklite is the second try at writing a diagnostic appender. The first try was ring buffer logging, appending log events to an in-memory circular buffer using JCTools. This was very fast and simple to implement, but there were some subtle and not so subtle issues with it.

The main issue was that I was using the raw input from the framework – the log event – rather than running the logging event through an encoder and producing a serializable output, the log entry. This avoided the overhead of encoding every event, but also caused a number of downstream issues.

There were memory issues: the logging event could hold references to arbitrary objects passed in as input, meaning there could be memory leaks. I was using an on-heap memory queue, and in a large queue objects were more likely to be tenured and go through full garbage collection.

There were serialization issues: the logging event had no serializable format, so it couldn't be compressed, encoded, or transformed in any meaningful way.

There were querying issues: because an event's markers and arguments can be of any type, there was no common format to read elements and properties.

There were persistence issues: If the application crashed, the logs were gone. If there was a long running operation that exceeded the size of the in-memory buffer, the logs were gone. If you wanted to open up the logs in an editor or save them off for later, the logs were gone.

I worked around the event/entry dichotomy by "dumping" events into another appender with a JSON encoder on request, and doing querying and serialization from there. This… sort of worked. But it required the application to know what was going on.

Binary Encoding

I wish I could say that I was aware of all the issues at once, but the first thing I thought about was saving memory. If I encoded events, I could produce a stream of serialized entries, compress them, and send them to off-heap memory where the garbage collector wouldn't have to worry about them. I'd need a binary encoding that was concise and fast.

I will frankly admit that there are several cool hacks involved in binary encoding that I wanted to try. For example, the single fastest logger is nanolog. Nanolog does 80 million ops/s, with a median latency of 7 ns. The author wrote his doctoral thesis on the implementation.

If you look behind the scenes of Java Flight Recorder, there are commonalities with nanolog. It writes a binary encoding of events (with fun things like LEB128 integer encoding and constant pool strings) to thread-local buffers, then to a global ring buffer, with periodic flushing to memory-mapped JFR files. Marcus Hirt has some great blog posts about JFR events and streaming in general, and they're worth a read.
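
As an aside, LEB128 is a simple trick: small integers take fewer bytes on the wire. Here's a minimal sketch of the unsigned variant in Python – my own illustration, not JFR's code:

def encode_uleb128(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128: seven bits per byte,
    with the high bit set on every byte except the last."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)          # final byte
            return bytes(out)

# Small values stay small: 300 encodes to two bytes instead of four or eight.
assert encode_uleb128(300) == bytes([0xAC, 0x02])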

For what I'm doing, the bedrock assumption is a transformation from an SLF4J logging event to bytes. In Logback and Log4J2, the assumption is that the encoder is responsible for the output format, so I looked to see if there was a binary encoding that could capture the essence of structured logging: essentially the JSON infoset model. Sadly, JFR itself has a strict model of what types of fields are allowed in an event, so I had to keep looking.

I tried using several binary encodings. Jackson provides binary encoding formats, so it was trivial to run through Avro, CBOR, Ion, Protobuf, and SMILE. I also tried FlatBuffers and SBE. Other options that looked interesting, such as Karl Stenerud's Concise Encoding, did not have Java support and so were inaccessible.

All the projects talk about speed, efficiency, binary compatibility, and cross-platform support. Despite that, they're really designed for different use cases. Avro was designed for data processing in the Hadoop ecosystem. Protobuf was designed for inter-machine communication at Google. SBE was designed for garbage-free, low-latency applications as part of the Aeron ecosystem. FlatBuffers was designed at Google for serializing state in Android game development. Ion was designed at Amazon for reliable, document-based, semi-structured data storage.

I quickly found out that binary encodings have implications based on their assumptions. Many binary encodings – Protobuf, Avro, FlatBuffers – use a predefined schema and require generated code. Others, like Amazon Ion and SMILE, support ad-hoc properties and can handle JSON infoset features like nested objects and arrays.
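
To make the schema-free end of that spectrum concrete, here's a rough sketch using Python's cbor2 package – my own example, not a Blacklite component. A nested log event round-trips through CBOR with no predefined schema, and the binary form is only modestly smaller than the JSON:

#!/usr/bin/env python3

# sudo pip3 install cbor2

import json
import cbor2

# A structured log event with nested objects and arrays -- no schema required.
event = {
    "level": "DEBUG",
    "message": "order processed",
    "context": {"correlation_id": "abc-123", "tags": ["checkout", "retry"]},
}

encoded = cbor2.dumps(event)
assert cbor2.loads(encoded) == event

# The binary form is smaller than UTF-8 JSON, but not dramatically so.
print(len(json.dumps(event).encode("utf-8")), "bytes as JSON")
print(len(encoded), "bytes as CBOR")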

Initially, I went with FlatBuffers, using a predefined schema for the logging event and the FlexBuffers option for the content. This wasn't so great, because I still had to read the binary data and convert it back into something I could effectively query. Using Amazon Ion's binary format was more effective, but still awkward due to API issues such as not supporting nanoseconds, and some odd encoding decisions.

Off Heap Memory

My second attempt came from looking at off heap memory buffers for ring buffer logging. My hypothesis was that writing to off-heap memory would reduce the overhead of garbage collection, and be more effective than allocating multiple objects.

There are various tools for doing this, notably Direct ByteBuffer, Chronicle-Bytes, Agrona, larray, and low-gc-membuffers. Java 14 also provides direct support for off-heap memory access with Project Panama.

There are some interesting things you can do with off-heap memory, but there are also many gotchas, ranging from out-of-memory errors and JVM crashes to sizing entries correctly and exceeding MaxDirectMemorySize. I did play with low-gc-membuffers, but I ultimately passed on this after looking at memory mapped persistence options.

Memory Mapped I/O

In-memory ring buffers write directly to memory, but do not write to a file. Memory mapped I/O looked like the best of both worlds: write directly to a file, but have the operating system treat it like memory. Using a memory mapped file would be more reliable and just as performant.
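
Here's the idea in miniature, using Python's mmap module just to show the concept (Blacklite itself is Java, so this is purely illustrative): the file is pre-sized and mapped, and writes become memory operations that the operating system flushes to disk.

#!/usr/bin/env python3

import mmap

SIZE = 1024 * 1024

# Pre-size the file, then map it so writes go through the page cache
# rather than through explicit write() calls.
with open("/tmp/mmap-log.bin", "wb") as f:
    f.truncate(SIZE)

with open("/tmp/mmap-log.bin", "r+b") as f:
    mm = mmap.mmap(f.fileno(), SIZE)
    entry = b'{"level":"DEBUG","message":"hello"}\n'
    mm[0:len(entry)] = entry   # "write to memory"; the OS persists it for us
    mm.flush()                 # or leave it to the OS to decide when to sync
    mm.close()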

There are issues where a memory-mapped file will block until disk I/O completes. There are various reasons why writes can stall, and some better APIs are coming that may help, but files are fraught with peril in general, and have proven to be the worst thing ever.

There are two ways to handle this, from the outside or from the inside.

From the outside, the file can be written to a tmpfs filesystem, which does not use fsync. I don't see a way to mandate that files only exist on a tmpfs filesystem, although Alluxio does this. The disadvantage of using tmpfs is that it is transient, but there are ways of transferring logs to regular storage on shutdown. The simplest thing would be to write to /tmp itself, but this is also a surprisingly complex and mindbendingly fiddly topic.

From the inside, the easiest (and looking at JFR, probably fastest) thing to do is to write entries to a queue first, and then periodically consume entries from the queue to write to the file. So now we have an in-memory buffer even with memory mapping!
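
A minimal sketch of that shape, with hypothetical names and nothing from Blacklite's actual implementation: producers push encoded entries onto a bounded queue, and a single background thread drains it to the file in batches.

#!/usr/bin/env python3

import queue
import threading

entries = queue.Queue(maxsize=100_000)   # bounded in-memory buffer

def flusher(path, interval_secs=0.1):
    """Drain whatever has accumulated and write it to the file in one batch."""
    with open(path, "ab") as f:
        while True:
            batch = []
            try:
                batch.append(entries.get(timeout=interval_secs))
                while True:
                    batch.append(entries.get_nowait())
            except queue.Empty:
                pass
            if batch:
                f.writelines(batch)
                f.flush()

threading.Thread(target=flusher, args=("/tmp/app.log",), daemon=True).start()

# Producers never touch the file directly; they just enqueue encoded entries.
entries.put(b'{"level":"DEBUG","message":"hello"}\n')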

Rethinking Binary Encoding

When I started writing entries to a filesystem, reading and querying revealed themselves to be much larger issues.

I now needed a tool that would read the binary file and turn it into something that I could read, query and export. I needed to be able to transport the binary file and have remote applications be able to read it. It needed to be developer friendly. And it wasn't.

It's easy to forget that there's no such thing as "text" in the abstract. Instead, there's a binary format "UTF-8" that editors will helpfully show to you as text, and will let you perform seemingly simple operations like "search and replace" and "jump to line". None of these features exist for an arbitrary binary format. These were all things that I had taken for granted, and the last thing I wanted to do when debugging a problem was to deal with a substandard way of navigating logs. Bottom line: it's nice to be fast and efficient, but it must be diagnosable. If it's not convenient for developers to use, it doesn't matter how fast it is.

The other surprise was that JSON isn't all that bad, and performance depends more on the code than on the format itself. For example, looking at jvm-serializers, many JSON libraries are actually faster than flatbuffers! Looking at a more recent JMH benchmark, Jackson is slower than dsl-json and jsoniter, but still faster than some binary encodings.

I was surprised by how space-efficient JSON is when it comes to heterogeneous data. SMILE, CBOR and MessagePack would yield incremental improvements, but nothing groundbreaking. Where binary encodings did offer greater efficiency, they assumed large documents with repeating entries, relying on back references and symbol tables that look for common symbols within an entry, not across entries – Ion was more advanced in offering shared symbol tables, but only with some work.

I was also surprised by the maturity of JSON querying. Using JsonPath or JSON Pointer made querying much easier, and I hadn't appreciated how much development effort has gone into parsing and querying JSON efficiently. It is possible to parse gigabytes of JSON per second. It is possible to query a JSON document even faster than that by skipping over non-matching input.
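
For a flavour of the addressing syntax, here's a small sketch using the Python jsonpointer and jsonpath-ng packages – any JsonPath or JSON Pointer implementation would do:

#!/usr/bin/env python3

# sudo pip3 install jsonpointer jsonpath-ng

from jsonpointer import resolve_pointer
from jsonpath_ng import parse

event = {
    "level": "DEBUG",
    "message": "order processed",
    "context": {"correlation_id": "abc-123", "tags": ["checkout", "retry"]},
}

# JSON Pointer addresses exactly one location...
assert resolve_pointer(event, "/context/correlation_id") == "abc-123"

# ...while JsonPath expressions can match several.
tags = [m.value for m in parse("$.context.tags[*]").find(event)]
assert tags == ["checkout", "retry"]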

In retrospect, looking at the encoder was starting in the middle. I was concerned about the overhead of encoding, but I had not thought about the complexities of reading and querying encoded entries. I was trying to save memory and reduce garbage collection, but I had not yet done the benchmarking using newer garbage collectors that would have told me that garbage-free code is not as critical as it used to be.

I had also made assumptions about the input stream of events. I assumed that debugging would be enabled at the root, and that the application would have no control over logging after startup. As a result, I imagined that the appender would be "drinking from a firehose" and would have to use every trick in the book to minimize processing. Those are good assumptions for business-critical data processing and networking, or for operational and audit logging. But diagnostic logging is best-effort: it can sample significant flows based on input conditions like feature flags and tracer bullets, and it doesn't need 100% coverage.

I came to the conclusion that the main advantage of using a binary encoding for logging was that it works well as an intermediate format for data that must be written, but may not be read. As soon as the log entry becomes visible outside the application, you either need a globally viable format, or a fully fledged application like Java Mission Control that can display the binary format appropriately.

I gave up on binary encoding, and went back to JSON.

Dictionary Compression

Some of the binary encoding work did turn out to be useful, because I had a general codec API.

I already knew that unconstrained diagnostic logging to a file accumulated data very quickly, up to 56 GB in five minutes, or 184.32 MiB per second. I wanted to add an archiver that would apply zstandard dictionary compression and turn that into a still large but more reasonable 46 MB/s.

Zstandard is a great compression format overall, and dictionary compression targets a use case that is ideal for logs – small amounts of repetitive data. Using a dictionary improved compression ratios by 2x to 5x over compression without a dictionary. The gating factor is creating the dictionary to begin with.

Implementing this turned out to be very easy thanks to zstd-jni's API. All I had to do was feed samples to the trainer until it said it was done, and then set the dictionary on the compression and decompression contexts.
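
The Java side goes through zstd-jni, but the flow is easier to show with the Python zstandard package. This is a sketch of the same train-then-reuse pattern, not Blacklite's code:

#!/usr/bin/env python3

# sudo pip3 install zstandard

import json
import zstandard as zstd

# Train on a corpus of individual log entries (small, repetitive samples).
samples = []
for i in range(20_000):
    entry = {"level": ["DEBUG", "INFO", "WARN"][i % 3],
             "logger": "com.example.Order",
             "message": "order %d processed" % i}
    samples.append(json.dumps(entry).encode("utf-8"))

dict_data = zstd.train_dictionary(16_384, samples)

# Reuse the dictionary on both the compression and decompression side.
cctx = zstd.ZstdCompressor(level=9, dict_data=dict_data)
dctx = zstd.ZstdDecompressor(dict_data=dict_data)

entry = samples[0]
compressed = cctx.compress(entry)
assert dctx.decompress(compressed) == entry
print(len(entry), "->", len(compressed), "bytes")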

SQLite

I was still struggling with querying and rollover, though. Complex JSON queries are no fun (especially when they involve date logic), and because I was treating the store as a circular buffer, I wanted a more graceful rollover that would move entries rather than whole files.

The final breakthrough came when I was looking at SQL databases with JSON support, like Postgres, and saw that SQLite has JSON support out of the box, can index JSON expressions, and can expose JSON paths as generated columns. This immediately made querying very simple because the application could simply use JDBC, without having to install and maintain a Postgres database.
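
Here's a quick sketch of what that buys you, using Python's built-in sqlite3 module against a toy table rather than JDBC. It assumes a SQLite build with the JSON1 functions, and generated columns need SQLite 3.31+; it is not Blacklite's actual schema:

#!/usr/bin/env python3

import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE entries (content TEXT)")
db.execute("INSERT INTO entries VALUES (?)",
           (json.dumps({"level": "DEBUG", "message": "order processed",
                        "correlation_id": "abc-123"}),))

# Query into the JSON body directly...
row = db.execute(
    "SELECT json_extract(content, '$.message') FROM entries "
    "WHERE json_extract(content, '$.correlation_id') = ?", ("abc-123",)).fetchone()
print(row[0])  # order processed

# ...index the expression so the lookup doesn't have to scan the table...
db.execute("CREATE INDEX ix_correlation "
           "ON entries (json_extract(content, '$.correlation_id'))")

# ...or expose a JSON path as a generated column (SQLite 3.31+).
db.execute("ALTER TABLE entries ADD COLUMN message TEXT "
           "GENERATED ALWAYS AS (json_extract(content, '$.message')) VIRTUAL")
print(db.execute("SELECT message FROM entries").fetchone()[0])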

SQLite has Java support through sqlite-jdbc. It requires no configuration at all, and ships the native libraries inside the JAR. That makes it a drop-in replacement for Java databases like H2, which is likewise embedded but doesn't have the same ecosystem support.

SQLite has excellent ecosystem support, so much so that the SQLite database file is a recommended storage format of the US Library of Congress. The guidelines on appropriate uses for SQLite also read like a description of log files in general.

The next question was memory mapping. SQLite supports memory mapping out of the box with a pragma.

Okay, but what about transaction performance?

Italo Baeza Cabrera wrote a great blog post, Laravel: Optimizing SQLite to dangerous speeds, that settled any performance concerns. By enabling write-ahead logging, setting synchronous=NORMAL, and batching commits, you can get insert performance comparable to a direct flat-file append in the general case. Eric Draken's SQLite Performance Testing was also useful, as was The Dangers and Complexities of SQLite Benchmarking – the slides are more fun, though.
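
Translated into pragmas and a batched insert, the recipe looks roughly like this – a sketch against a hypothetical live.db, not Blacklite's exact settings, with the memory-mapping pragma from the previous section included:

#!/usr/bin/env python3

import sqlite3
import time

db = sqlite3.connect("live.db")
db.execute("PRAGMA journal_mode=WAL")      # readers don't block the single writer
db.execute("PRAGMA synchronous=NORMAL")    # fsync at checkpoints, not on every commit
db.execute("PRAGMA mmap_size=268435456")   # memory-map up to 256 MB of the file

db.execute("""CREATE TABLE IF NOT EXISTS entries
              (epoch_secs LONG, nanos INTEGER, level INTEGER, content BLOB)""")

# Batch many inserts into one transaction instead of committing per entry.
batch = [(int(time.time()), 0, 10, b'{"message":"hello"}') for _ in range(1000)]
with db:   # one transaction for the whole batch
    db.executemany("INSERT INTO entries VALUES (?, ?, ?, ?)", batch)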

Once I moved to SQLite as the backing store, a number of issues just fell away. All of the logic for handling files and direct buffers just vanished. Instead, what I was left with was an entries table, with the number of seconds since epoch, a nanosecond component, a numeric log level, and the content as a stream of bytes.

CREATE TABLE IF NOT EXISTS entries (
  epoch_secs LONG,
  nanos INTEGER, 
  level INTEGER,  
  content BLOB
);

I found some additional performance wins by relying on the internal _rowid_ that SQLite tables use to determine insert order, as opposed to an explicit COUNT(*) query (which does a full table scan). I could leverage _rowid_ as a circular buffer and avoid an index on the timestamp.
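
The circular-buffer behaviour then comes down to trimming by _rowid_. Roughly, and assuming the entries table above in a hypothetical live.db – this is my sketch of the idea, not the archiver's actual SQL:

#!/usr/bin/env python3

import sqlite3

db = sqlite3.connect("live.db")

# Keep only the newest `capacity` rows. _rowid_ increases monotonically for
# plain INSERTs, so no timestamp index or COUNT(*) scan is needed to find
# the oldest entries.
capacity = 5000
with db:
    db.execute("DELETE FROM entries WHERE _rowid_ <= "
               "(SELECT max(_rowid_) FROM entries) - ?", (capacity,))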

This gave me the following benchmark:

DefaultEntryStoreBenchmark.benchmark         thrpt   20  803206.197 ± 54816.464  ops/s

That's roughly 803 inserts per millisecond. For comparison, a file appender with immediateFlush=false runs at roughly 1789 ops/ms.

I had issues with random crashes in the JVM until I realized that SQLite believes threads are evil and will take down the JVM with SIGABRT if connections wander across threads via an ExecutorService or CompletableFuture. As such, there's one thread that opens a connection and does the inserts, and another thread that periodically opens a second connection to the same database to do the archiving, and then closes it. Because a JDBC connection in this context is just a file open, this is actually the right thing to do, rather than fancy HikariCP connection pooling.

The next step was integrating the zstandard dictionary compression and adding "rolling" of SQLite databases so that older log entries could be aged out. Compression turned out to be as simple as adding SQLite custom functions. The rolling code came down to eyeballing what the Logback and Log4J code did for file rollover, then doing the same thing.

The end result looks something like this:

<configuration>
    <shutdownHook class="ch.qos.logback.core.hook.DelayingShutdownHook">
        <delay>1000</delay>
    </shutdownHook>

    <appender name="BLACKLITE" class="com.tersesystems.blacklite.logback.BlackliteAppender">
        <file>/var/log/blacklite/live.db</file>

        <archiver class="com.tersesystems.blacklite.archive.RollingArchiver">
            <archiveAfterRows>5000</archiveAfterRows>
                    
            <file>/var/log/blacklite/archive.db</file>
                        
            <codec class="com.tersesystems.blacklite.codec.zstd.ZstdDictCodec">
              <level>9</level>
              <repository class="com.tersesystems.blacklite.codec.zstd.ZstdDictFileRepository">
                <file>/var/log/blacklite/dictionary</file>
              </repository>
            </codec>

            <triggeringPolicy class="com.tersesystems.blacklite.archive.RowBasedTriggeringPolicy">
                <maximumNumRows>500000</maximumNumRows>
            </triggeringPolicy>
        
            <rollingStrategy class="com.tersesystems.blacklite.logback.TimeBasedRollingStrategy">
                <fileNamePattern>/var/log/blacklite/archive.%d{yyyyMMdd'T'hhmm,utc}.db</fileNamePattern>
                <maxHistory>20</maxHistory>
            </rollingStrategy>
        </archiver>

        <encoder class="net.logstash.logback.encoder.LogstashEncoder">
        </encoder>
    </appender>

    <root level="DEBUG">
        <appender-ref ref="BLACKLITE"/>
    </root>
</configuration>

Here, only 5000 entries are kept in the live database. When another entry is inserted, the oldest entries are run through dictionary compression and moved to the archive database. After 500000 entries accumulate, the archive.db file is renamed to archive.<timestamp>.db until there are 20 database files, at which point the oldest archive is deleted.

Querying

Finally, querying. I wasn't happy with SQLite's handling of timestamps and date ranges, and figured I could probably do better.
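
Out of the box, querying by date means converting human-readable timestamps into an epoch range by hand, something like this – my own illustration, not the blacklite-reader tool:

#!/usr/bin/env python3

import sqlite3
from datetime import datetime, timezone

db = sqlite3.connect("archive.db")

# Convert human-readable bounds into epoch seconds before querying.
after = datetime(2020, 11, 3, 19, 22, 9, tzinfo=timezone.utc).timestamp()
before = datetime(2020, 11, 3, 19, 22, 11, tzinfo=timezone.utc).timestamp()

rows = db.execute(
    "SELECT epoch_secs, level, content FROM entries "
    "WHERE epoch_secs >= ? AND epoch_secs < ?", (after, before)).fetchall()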

Natty is a natural language date parser that can handle formal dates (02/28/1979), relaxed dates (oct 1st), and relative dates (the day before next thursday). There are other date parsing libraries out there as well.

I put together a command-line tool using Picocli and Natty.

./blacklite-reader \
  --after="2020-11-03 19:22:09" \
  --before="2020-11-03 19:22:11" \
  --timezone=PST \
  /tmp/blacklite/archive.2020-11-03-07-22.669.db

I also checked how hard it was to use Python to query the compressed entries, to make sure I could handle more complex processing:

#!/usr/bin/env python3

# sudo pip3 install sqlite-utils
# sudo pip3 install zstandard

from sqlite_utils import Database
import time
import json
import zstandard as zstd

db = Database("archive.db")

epoch_time = int(time.time())

# Load the zstd dictionary that the archiver trained, then decompress each entry.
dict_row = db.execute("select dict_bytes from zstd_dicts LIMIT 1").fetchone()
dict_data = zstd.ZstdCompressionDict(dict_row[0])
dctx = zstd.ZstdDecompressor(dict_data=dict_data)

for row in db["entries"].rows_where("epoch_secs < ? limit 1", [epoch_time]):
    epoch_secs = row['epoch_secs']
    level = row['level']
    content = json.loads(dctx.decompress(row['content']))
    print("epoch_secs = ", epoch_secs, "level = ", level, "message = ", content['message'])

or use SQLite custom functions to do the decompression inside the query:

#!/usr/bin/env python3

from sqlite_utils import Database
import time
import json
import zstandard as zstd

db = Database("archive.db")

epoch_time = int(time.time())

# Load the trained dictionary, then expose decompression as a SQL function.
dict_row = db.execute("select dict_bytes from zstd_dicts LIMIT 1").fetchone()
dict_data = zstd.ZstdCompressionDict(dict_row[0])
dctx = zstd.ZstdDecompressor(dict_data=dict_data)

@db.register_function
def decode(s):
    # Return text rather than bytes so json_extract can parse it.
    return dctx.decompress(s).decode("utf-8")

sql = """select
timestamp_utc,
json_extract(decode(content),'$.message') as message,
decode(content) as decoded
from entries_view limit 1
"""

row = db.execute(sql).fetchone()

ts = row[0]
message = row[1]
decoded = row[2]
print("ts = ", ts, "message = ", message, "decoded = ", decoded)

All of the regular database tools are available as well. There's a GUI tool, DB Browser for SQLite, and I added some notes on using the sqlite3 command line just in case.

Next Steps

I'm tremendously pleased with Blacklite. Using SQLite opens up possibilities that I still haven't really explored.

One of the biggest things I'm interested in for diagnostic logging is visualization, and with SQLite that almost comes for free. DBHub lets you upload SQLite files directly and run visualization queries against them. Even fancier is ObservableHQ. I've played with this in the past using PostgreSQL, but it's so much simpler with SQLite.

Likewise, the ability to query data local to the machine is something that I've seen done with ctlstore, but it's another thing to be able to expose that data and share it immediately.

For example, Datasette lets you expose logs and run SQL queries without handing out direct access to the database. You can use a JSON API or even provide direct links to individual rows. Pointing Datasette at a SQLite log file gives you remote access to your logs immediately, without having to set up an ELK stack.

I'm still playing with the idea of a hybrid approach that encodes entries into a ring buffer using a dictionary-compressed (fast zstd or LZ4) binary format, and only transforms them to a text format when dumping to SQLite on an error.

Finally, there's the temptation to create a Voltron-like conglomeration of diagnostic logs by adding a layer on top of Datasette that would send queries to individual nodes and aggregate them. This is probably never going to happen, but it's fun to think about.
