Ad-hoc structured log analysis with SQLite and DuckDB

Structured logging and databases are a natural match – there's easily consumed structured data on one side, and tools for querying and presenting data on the other. I've written a bit about querying structured logging with SQLite, and about the power of data science applied to logging with Apache Spark. Using SQL has a number of advantages over JSON processing tools or log viewers: you can progressively build up views while filtering and querying, timestamps are better supported, and you can do aggregate query logic.

But structured logging isn't what most databases are used to. The de-facto standard for structured logging is newline-delimited JSON (NDJSON), and there is only a loose concept of a "schema" – structured logging can have high cardinality, and there are usually only a few guaranteed common fields such as timestamp, level and logger_name. Getting an actual schema so that you can load NDJSON into a database is still a somewhat manual process compared to CSV. Spark is great at NDJSON dataframes, but it's a heavyweight solution that we can't just install on a host. What we really want is an in-process "no dependencies" database that understands NDJSON.

TL;DR: With NDJSON support, slurping structured logs into a "no dependencies" database like SQLite or DuckDB is easier than ever.

sqlite-lines

Alex Garcia released sqlite-lines in June specifically to read NDJSON.

Using sqlite3 can be more convenient than jq or other JSON processing command line tools for digging around in logs. Adding the sqlite-lines extension is as simple as downloading the loadable extension:

$ wget https://github.com/asg017/sqlite-lines/releases/download/v0.1.1/lines0.so
$ sqlite3
sqlite> .load ./lines0
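
If you prefer a one-liner, the sqlite3 shell can also load the extension at startup with the -cmd option (same extension file as above):

$ sqlite3 -cmd '.load ./lines0'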

Processing can be done with the lines_read table-valued function, which provides a table with a single line column:

sqlite> select line from lines_read('application.json') limit 1;

This will produce JSON output like:

{
  "id": "FtYkeclkrh8dHaINzdiAAA",
  "relative_ns": -295200,
  "tse_ms": 1645548338725,
  "start_ms": null,
  "@timestamp": "2022-02-22T16:45:38.725Z",
  "@version": "1",
  "message": "Database [logging] initialized",
  "logger_name": "play.api.db.DefaultDBApi",
  "thread_name": "play-dev-mode-akka.actor.default-dispatcher-7",
  "level": "INFO",
  "level_value": 20000
}

We don't want to call lines_read all the time, so we'll import into a local table:

create table logs as select line from lines_read('application.json');
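
If the log file is large, it may also be worth adding an expression index on the fields you filter on most, which can speed up repeated queries. A minimal sketch, assuming level is the usual filter:

sqlite> create index logs_level on logs(line->>'$.level');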

Combined with the -> and ->> JSON operators added in SQLite 3.38.0, we can filter using JSON path expressions:

select line from logs where line->>'$.level' = 'ERROR';

This produces a JSON result that contains a giant stacktrace, and I only want the message. What makes SQLite so effective as a query tool is that it's very easy to progressively stack views to get only the data I want:

sqlite> create view shortlogs as 
  select line->'$.@timestamp' as timestamp, line->'$.message' as message, line->>'$.level' as level
from logs;
sqlite> select * from shortlogs where level = 'ERROR';
"2022-02-22T16:45:50.900Z"|"Internal server error for (GET}) [/flaky}]"|ERROR

Saving the table and exporting it to your local desktop is also very simple, and gives you the option of using a database GUI like DB Browser for SQLite.
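
One way to do that: since the sqlite3 shell above was started without a file argument, everything so far lives in an in-memory database, and the shell's .backup command writes it out to a file (logs.db is just an example name):

sqlite> .backup logs.db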

Interestingly, sqlite-lines can be used with Datasette via datasette data.db --load-extension ./lines_nofs0, which would provide a web application UI for SQLite, but I haven't tried this.

DuckDB

SQLite does have some disadvantages: it processes rows sequentially, so aggregate or analytical questions like "what are the 10 most common user agent strings" can take a while on large datasets. DuckDB is like SQLite, but built for analytics – it processes entire columns at once, rather than a row at a time.

I hadn't used DuckDB extensively, as it required a schema to be defined before importing. That's not a problem with the new version: as of DuckDB 0.7.0, DuckDB can read NDJSON files and infer a schema from the values. There's a blog post with examples – let's try it out on logs and see what happens.

Releases are available on GitHub. Installation is a single binary in a zip file:

$ wget https://github.com/duckdb/duckdb/releases/download/v0.7.1/duckdb_cli-linux-amd64.zip
$ unzip duckdb_cli-linux-amd64.zip
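
The duckdb CLI opens an in-memory database by default; pass a file name if you want the tables you create to survive the session (logs.duckdb is just an example name):

$ ./duckdb logs.duckdb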

We can import the JSON into a DuckDB table and save on the repeated processing, using read_ndjson_auto as it lets DuckDB parallelize better. The blog post says "DuckDB can also detect a few different DATE/TIMESTAMP formats within JSON strings, as well as TIME and UUID" – while it did see UUID, it did not see "@timestamp" as RFC 3339. Not a huge deal, as we can use the REPLACE clause to manually cast to a timestamp on import:

CREATE TABLE logs AS
  SELECT * REPLACE (CAST("@timestamp" AS TIMESTAMP) AS "@timestamp")
FROM read_ndjson_auto('application.json');

And because DuckDB infers the schema, when we do a describe logs it shows us all the JSON attributes as columns!

│ id              │ VARCHAR                                                      
│ relative_ns     │ BIGINT                                                       
│ tse_ms          │ UBIGINT                                                      
│ start_ms        │ UBIGINT                                                      
│ @timestamp      │ TIMESTAMP                                                    
│ @version        │ BIGINT                                                       
│ message         │ VARCHAR                                                      
│ logger_name     │ VARCHAR                                                      
│ thread_name     │ VARCHAR                                                      
│ level           │ VARCHAR                                                      
│ level_value     │ UBIGINT                                                      
│ correlation_id  │ BIGINT                                                       
│ stack_hash      │ VARCHAR                                                      
│ name            │ VARCHAR                                                      
│ trace.span_id   │ UUID                                                         
│ trace.parent_id │ INTEGER                                                      
│ trace.trace_id  │ UUID                                                         
│ service_name    │ VARCHAR                                                      
│ duration_ms     │ UBIGINT                                                      
│ request.method  │ VARCHAR                                                      
│ request.uri     │ VARCHAR                                                      
│ response.status │ UBIGINT                                                      
│ exception       │ STRUCT("name" VARCHAR, properties STRUCT(message VARCHAR))[] 
│ stack_trace     │ VARCHAR                                                      
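
DuckDB's SUMMARIZE command is also a handy first look at a freshly imported log table, profiling every column at once:

-- min, max, approximate unique count and null percentage for each column
SUMMARIZE logs;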

The exception column is especially interesting, as DuckDB was able to infer name and properties inside it. The column is a list of structs, and DuckDB lists are 1-based, so the query to match on a specific exception message is exception[1].properties.message:

D select exception[1].properties.message from logs;
Execution exception[[IllegalStateException: Who could have foreseen this?]]
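
Since the exception name is also a struct field, we can get a rough breakdown of which exceptions show up most often – a minimal sketch against the schema inferred above:

-- count occurrences of each exception type, skipping rows without an exception
SELECT exception[1].name AS exception_name, count(*) AS occurrences
FROM logs
WHERE exception IS NOT NULL
GROUP BY exception_name
ORDER BY occurrences DESC;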

DuckDB's analytical focus means that we can do back-of-the-envelope queries to see hidden patterns in logs, making use of its aggregate and window functions.

We can start off with the average response time:

select avg(duration_ms) from logs;
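
Column names containing dots, like "request.uri", need double quotes. As an illustrative sketch against the schema above, a per-URI breakdown looks something like this:

-- top 10 request paths by volume, with average latency
SELECT "request.uri",
       count(*) AS requests,
       avg(duration_ms) AS avg_ms
FROM logs
WHERE "request.uri" IS NOT NULL
GROUP BY "request.uri"
ORDER BY requests DESC
LIMIT 10;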

We can also break the average down over a 24 hour window, to see if responses are slower when there's more load:

SELECT AVG("duration_ms") OVER (
  ORDER BY "timestamp" ASC
  RANGE BETWEEN INTERVAL 12 HOURS PRECEDING
        AND INTERVAL 12 HOURS FOLLOWING)
FROM shortlogs;

It's even possible to do box and whisker queries:

SELECT "@timestamp",
    MIN("duration_ms") OVER day AS "Min",
    QUANTILE_CONT("duration_ms", [0.25, 0.5, 0.75]) OVER day AS "IQR",
    MAX("duration_ms") OVER day AS "Max"
FROM logs
WINDOW day AS (
    ORDER BY "@timestamp" ASC
    RANGE BETWEEN INTERVAL 12 HOURS PRECEDING
          AND INTERVAL 12 HOURS FOLLOWING)
ORDER BY 1, 2;

While DuckDB has advantages over SQLite, its storage format is not yet stable – newer versions of DuckDB cannot read old database files and vice versa. This has an immediate impact: Tad is not capable of loading newer DuckDB files.

In addition, support for the STRUCT data type in external tools is iffy. DBeaver is capable of loading the database, but will not render the exception field, instead throwing SQL Error: Unsupported result column type STRUCT("name" VARCHAR, properties STRUCT(message VARCHAR))[]. Tad does support Parquet's struct format, but it is a viewer only, so that's not very useful.

Best advice: use DuckDB only for computation and analytics, and use EXPORT DATABASE to snapshot your work, writing data out as JSON or to an attached SQLite database if you want long-term portable storage.
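
As a minimal sketch of a snapshot (the directory and file names are just examples):

-- write the schema plus data for every table to a directory (CSV by default)
EXPORT DATABASE 'logs_export';
-- or keep a single columnar file around for later analysis
COPY logs TO 'logs.parquet' (FORMAT PARQUET);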

Conclusion

If you want to quickly dig into structured logs, consider using SQLite or DuckDB over trying to process the JSON by hand – they take zero time to install and are vastly more powerful than anything you can do using jq or a log viewer.