Applying Data Science to Logs for Developer Observability

TL;DR

By logging in a structured format like JSON and including high-cardinality attributes, logs are created as semi-structured data. As such, logs can be processed the same way as any other data, using data science tools like Apache Spark. Data science tools can be used by developers as part of the debugging and exploratory testing cycle – in particular, Spark can generate metrics from events, trace aggregate control flow, categorize logging events, and aggregate logging events to observe statistical trends and outliers.

Introduction

As a developer, the toolkit that for debugging hasn't really changed in twenty years. There's a debugger, a logging framework, and the program output itself. You can instrument with a profiler like Yourkit or attach Java Flight Recorder, but it's still pretty much the raw events and logs. If you want metrics, then it's DropWizard Metrics and you have gauges, counters, histograms, meters and timers. And that is pretty much it as far as summarizing program state goes. You can't really ask "What is my program doing when it runs" because it could be doing lots of things at once and if you turn on tracing everywhere then you get smooshed with a firehose of data because you're looking at them up close, line by line.

As Charity Majors said on Twitter:

Logs smoosh together two very different actions and degrees of scale: debugging code and debugging systems. For one you need a microscope, for the other you need a telescope.

The first time I sat down with data scientists, I saw logs from a different point of view. Data scientists have some really amazing telescopes.

If looking at logs is like looking at a series of individual trees, then working with their toolkit was like zooming out and seeing the forest. Rather than poke at individual events, they would look at generalized numbers of events in and out, and compare the rates. If they wanted a closer look, they'd hunt for outliers and odd distributions before drilling into individual events. If they wanted metrics, they'd generate them from the raw events.

Don't get me wrong, I wasn't ignorant. I knew that Apache Spark was a thing. I knew that Jupyter Notebooks were a thing. And yet, I completely didn't get the power and flexibility available through these tools. It's the same leap up that you get with SQL and Excel Spreadsheets. You'll wonder why you would care about outer joins or pivot tables. And then you have data that you care about, and suddenly outer joins are the answer to your prayers.

Well, I'm a developer and I have logs. These logs are data that I care about. Now these logs are just plain JSON, I can start doing exploratory data analysis – I can install Apache Spark and Jupyter Notebook, import all my logs, and use this data to answer questions, find outliers and unusual datapoints, and present information in multiple ways. If I want metrics, I log the raw value. If I want tracing, I add trace, span, and parent span ids to the log. If I want to visualize a component, I announce the component lifecycle in the logs and give it a unique ID that I can refer to from other logs. It all gets turned into JSON and sucked into Spark: anything I can point to and declare, I can observe.

Ultimately, the draw of Apache Spark is that it's a general purpose tool that can do things outside of the vendor-provided feature-set. Colin Breck points this out in Observations on Observability:

The tools, while improving, remain nascent, narrowly focused, and somewhat naive. I believe there must be people experimenting with more sophisticated techniques, but these are not the people that are getting the attention of the observability space dominated by logging, metrics, and tracing. At a small scale, like when we are tracing a single request, or interpreting logs from a single container, these tools are necessarily powerful. Understanding the rich detail gives us a sense of control. But as we start to zoom out, at scale, trying to understand all of the detail is impossible — details do not compose. Our understanding of the system and our sense of control erodes.

These are problems that data scientists wrestle with every day: they take a firehose of data, and use Spark to break it down into streams.

Apache Spark

Apache Spark's data processing API is based around Spark SQL. There's two parts to the API: a DataSet is a strongly typed collection of data, and a Dataframe are Dataset with a generic row data type, literally a Dataset[Row].

Apache Spark is a great match for JSON logs because it supports NDJSON through DataFrameReader:

import org.apache.spark.sql.SparkSession
import spark.implicits._

val spark = SparkSession.builder().getOrCreate()
val fileDf = spark.read.json("application.ndjson")

You can also import JSON from PostgresSQL JSONB column, although it's a map since Spark does its own JSON parsing:

// Get the dataframe to the table
val df = spark.read.jdbc("jdbc:postgresql://127.0.0.1:5432/logback?user=logback", "logback")
// pull out the JSONB field as a string 
val stringDf = df.map((value: Row) => value.getString("evt"), Encoders.STRING)
// convert the string to JSON
val jsonDf = spark.read.json(stringDf)

Because Spark understands semi-structured data very well, there's a rich API for [transforming complex data formats] (https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html).

Jupyter Notebooks

Data Scientists have raw data, but they don't show the raw data to people. Instead, they use a notebook interface . Notebooks have become super popular, because they're essentially fancy web pages.

The great thing about a is that it allows you to show your work, and build on calculations. Rather than being a series of rows in a table, notebooks can make an argument and present data in multiple ways. They allow visualization with diagrams, and the addition of text blocks in between calcuations. The datasources are available directly from the notebook, meaning there's absolutely no setup, and you can write code in the notebook itself. They allow telling a story. Notebooks can be viewed and worked on by multiple people, and can even be versioned.

Basically, it's Google Docs if you could write SQL in some boxes and have it show you the query results.

This is pretty much perfect for developers, because if you want to discuss some code behavior to someone, you're almost certainly showing one or more of the following:

Logs of the application.
Visualization of the design
Pictures and sketches of UI, where appropriate.
References to source code URLs in Github.
Code snippets in Markdown.
Gists of random text and notes.

All of this would be perfect in a notebook type interface, because rather than creating a bug in JIRA and filling out a dozen extra fields and using a workflow, you can create a notebook page and put something together collaboratively. If you've got the application and you've got logs with tracing and component lifecycle information, you can essentially create visualizations on the fly. You can specify date ranges, sort according to multiple criteria, and generally process the hell out of your data.

Implementation

I think there's a common misconception that you need a monster machine to play with Big Data Tools, or that you need a massively complex Kubernetes install with lots of knobs and dials. The good news is, other people have already done the hard work. You can run jupyter/all-spark-notebook docker image, or you can run the Community Edition of Databricks, which will let you run a hosted Spark instance for free.

I use a Docker container on my local development laptop using Gary Stafford's Getting Started with PySpark for Big Data Analytics using Jupyter Notebooks and Jupyter Docker Stacks, which is fine for the datasets and the calculations I'm running. Just to give you an idea of how easy it is to get started, the setup is essentially the following:

docker pull jupyter/all-spark-notebook:latest
git clone \
  --branch master --single-branch --depth 1 --no-tags \
  https://github.com/garystafford/pyspark-setup-demo.git
cd pyspark-setup-demo
docker swarm init
docker stack deploy -c stack.yml pyspark

I highly recommend reading through the tutorial in its entirety, as it's both detailed and comprehensive. It also has pretty pictures.

There's a couple of things you should do to make logs easier to work with. You should stay clear of Spark's Timestamp type because it is only accurate to the second. You should also map to a defined Dataset so you can have strongly typed queries in Scala.

I am not a master of Spark by any means: I lean heavily on Mastering Spark SQL and have a copy of Spark: The Definitive Guide on hand.

Exploration

Once we've got the data in Spark, we can start asking questions about the program that goes beyond individual logs. Analyzing Log Data With Apache Spark does a great job of discussing log data analysis in a production context.

The obvious thing we can do is generate metrics. When we have individual logging events, we can generate an aggregate of that data. If it's a numeric value, we can also find the average and calculate statistical outliers. This means we can look at the average duration of a request, sum the numbers of exceptions over time, and so on.

When looking at trace data, the norm is to click on a single trace and examine it. But instead of looking at a single span, we can look at the trace in the context of the average duration, and see the minimum, maximum and outliers of the span duration. Daniela Miao has a good talk on gathering data on trace aggregates that points out the advantages of showing a trace in context.

We can also look at logs as telemetry data from a developer perspective. For example, we can look at debugging data flow in a reactive program such as Akka Streams, and break down the flow of a graph by stages, and seeing where elements spend the most time.

We can also look at implicit relationships between components of the system. Say that we have two components which may appear independent, but appear to be oddly synchronous. Using Spark, we can run regression analysis and determine how often method "X" is called immediately before method "Y".

Conclusion

Most of the blog posts and code I've written has been about generating logs as JSON and incorporating tracing and traditional metric data into the context of logging events. Using Spark, you can see relationships and test hypotheses that would awkward or impossible for other developer tools, and observe program state and flow from a broader point of view.

TL;DR

Introduction

Apache Spark

Jupyter Notebooks

Implementation

Exploration

Conclusion

Comments

Related Posts

Useful LLM Agent Tools 21 Jun 2025

Making An LLM That Just Works For My Brother 13 Apr 2025

Step by Step RAG 24 Mar 2025