Semantic Logging with JSON-LD

TL;DR I've implemented JSON-LD binding in Blindsight. I think that JSON-LD is a better logging format than raw JSON, because it solves some existing problems with JSON and enables some exciting possibilities in diagnostic logging and debugging.

Issues

The first step was admitting that there were some issues, both in JSON and in Blindsight's DSL binding. I'll go over the issues first, then describe how JSON-LD is a better format for logging, and how a better binding DSL came out of it.

Issues in JSON

The state of play in logging has advanced from unstructured text to semi-structured data in JSON. This is great. I'm all for it.

However, there are still issues with logging data in JSON. Many of these issues can be traced back to JSON's data representation and lack of context.

JSON does not understand unit types. For example, how do you represent miles vs kilometers in JSON? How do you parse dates or represent a duration? These are things that JSON cannot tell you.

JSON does not have a representation of properties. A JSON object may contain anything or nothing. Knowing what properties can be found in a JSON object requires outside context. This also applies when determining if one property is an alias for another property.

JSON can represent a collection as an array, but does not indicate whether the order of elements is significant. Is it a list, or a set? This gets especially messy when looking at complex data with nested arrays.

JSON has no way of referencing itself; there's no way of linking to another node or object inside the JSON document, or outside the JSON document for that matter. There is no "URL" in JSON, only strings.

These are limitations of the format, but they show up as specific problems in logging.

Pretty much the first issue everyone runs into in logging is rendering complex data as JSON objects. Rendering date information almost always means ISO 8601, but rendering an Event that has a time, a place, and an organizer requires a JSON object with various properties. Those properties must be tracked, must have a certain type, and must have a certain meaning.

This becomes worse when rendering a program's internal state at a DEBUG logging level. Internal state typically means some kind of map or list in memory that may contain small blobs of data if you're lucky, but typically contains references to great big mounds of data that may also need manual rendering. There's no standard way of serializing a reference to another object. Even worse, some internal state may consist of graphs that point back to themselves in a cycle. JSON alone doesn't have a model for this.

Issues in Binding

While I was looking at issues in JSON, I was also taking a fresh look at the DSL representation for complex data in Blindsight. I found that some of the same issues in JSON also percolated through Blindsight.

Blindsight is implemented in Scala, which means that it's possible to leverage Scala's implicit support to say that there's a BObject representation that will let you log it:

case class Lotto(
  id: Long,
  winningNumbers: List[Int],
  winners: List[Winner],
  drawDate: Option[java.util.Date]
) {
  lazy val asBObject: BObject = "lotto" ->
    ("lotto-id"        -> id) ~
    ("winning-numbers" -> winningNumbers) ~
    ("draw-date"       -> drawDate.map(_.toString)) ~
    ("winners"         -> winners.map(w => w.asBObject))
}

object Lotto {
  implicit val toArgument: ToArgument[Lotto] = ToArgument { lotto => Argument(lotto.asBObject) }
}

And then Blindsight will render the Lotto instance:

val lotto: Lotto = Lotto(5, List(2, 45, 34, 23, 7, 5, 3), winners, None)
logger.info("message {}", lotto)

There are two great things about this as far as I'm concerned. The first thing is that conversion is automatic; no need for logger.info(message, wrapper(lotto)). The second thing is that a converter is mandatory and there's no toString implication. If there's no ToArgument in implicit scope, then the compiler will fail with an error, and implicit resolution allows you to swap out converters or define priorities for multiple converters.
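
The compile-time guarantee comes from the type class pattern. Here is a minimal plain-Scala sketch of that mechanism, not Blindsight's implementation: Renderer, Ticket, and info are hypothetical stand-ins for ToArgument, a domain class, and the logger method.

```scala
object ToArgumentSketch {
  // Hypothetical stand-in for a ToArgument-style type class.
  trait Renderer[A] {
    def render(a: A): String
  }

  final case class Ticket(id: Long)

  object Ticket {
    // Picked up automatically because it lives in the companion object.
    implicit val ticketRenderer: Renderer[Ticket] = new Renderer[Ticket] {
      def render(t: Ticket): String = s"""{"ticket-id":${t.id}}"""
    }
  }

  // Hypothetical logger method: it only accepts an argument whose type
  // has a Renderer in implicit scope, so there is no toString fallback.
  def info[A](message: String, arg: A)(implicit r: Renderer[A]): String =
    message.replace("{}", r.render(arg))

  // info("message {}", new java.util.Date) is rejected by the compiler,
  // because no Renderer[java.util.Date] has been defined.
}
```

Calling ToArgumentSketch.info("message {}", ToArgumentSketch.Ticket(5)) needs no wrapper at the call site; the converter is resolved from the companion object.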

However, there is a flaw. While the logger.info call is typesafe, there's no such limitation on the definition itself. Leaving aside that the fields are just raw string literals, nothing in the code ensures that lotto-id is a Long or that draw-date contains a date in a particular format. This compiles just as happily:

("lotto-id"        -> "not a number")

If some other object wanted to define a winners field and tie that to a string value, there would be nothing stopping it, and there wouldn't be a distinction between my winners field and the other winners field. In summary, I wanted to be able to bind fields together with particular types, and organically grow a schema from common fields.
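
As a rough illustration of that goal, here is a plain-Scala sketch in which a field name carries the type of value it accepts. Field, Bound, and bind are hypothetical names made up for this example; they are not Blindsight's API.

```scala
object TypedFieldSketch {
  // Hypothetical: a field name tied to the type of value it accepts.
  final case class Field[A](name: String) {
    def bind(value: A): Bound[A] = Bound(this, value)
  }

  // A field together with a value of the matching type.
  final case class Bound[A](field: Field[A], value: A)

  // Fields are declared once and shared, instead of repeating string literals.
  val lottoId: Field[Long] = Field[Long]("lotto-id")
  val winningNumbers: Field[List[Int]] = Field[List[Int]]("winning-numbers")

  val ok: Bound[Long] = lottoId.bind(5L)
  // lottoId.bind("not a number") does not compile: a String is not a Long.
}
```

Because the fields are ordinary values, a set of common fields can be collected in one place and reused, which is the "organically grown schema" part of the wish.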

The Solution

I was interested in finding a better logging format when looking at binary serialization options, but realized that any solution would have to come as a "semantically-aware" JSON. This led to the idea of using JSON-LD as a logging format. Rendering JSON-LD would not only be nearly transparent to existing logging, but it was also a great match on many other levels – in particular, JSON-LD can be converted to RDF triples. This means Semantic Web technology is available, and we can have true semantic logging.

Ironically, the idea of using RDF for logging isn't new at all. One of the first (documented) structured logging attempts in 2006 started out with N3 statements, before moving to JSON as a way to efficiently represent structures. The issues in using RDF and semantic web technologies are better covered in JSON-LD and Why I Hate the Semantic Web, but suffice to say that JSON-LD makes a point of being clear and concise and cares about communicating with its audience at every step.

I've written up the full documentation for JSON-LD binding, but I'd like to break down specifically how JSON-LD solves the issues mentioned above.

JSON-LD Improvements

In JSON-LD, all values have types. These types can be implicit, as in the case of native types like strings and booleans, or they can be explicit. Explicit types can be represented in a context definition, or can be defined inline using a value object with an IRI. This means there's no ambiguity on the meaning of a date, because the IRI determines the format.

{
  "modified": {
    "@value": "2010-05-29T14:17:39+02:00",
    "@type": "http://www.w3.org/2001/XMLSchema#dateTime"
  }
}

This can be represented in Blindsight with a custom value mapping:

implicit val localDateMapper: ValueMapper[LocalDate] = ValueMapper { date =>
  Value(DateTimeFormatter.ISO_DATE.format(date), xsdDate)
}

val dateCreated = yourSchema("dateCreated").bindValue[LocalDate]
val abridgedMobyDick = NodeObject(
  `@type` -> "Book",
  name -> "Moby Dick",
  dateCreated -> LocalDate.of(2020, 1, 1)
)

This applies for all unit types. If you're filing an expense report and had two trips, one in miles and another in kilometers, that's doable:

{
  "trips": [
    { "@value": 11, "@type": "units:mile" },
    { "@value": 15, "@type": "units:kilometer"}
  ]
}

This can be represented using Coulomb and custom value mapping:

import coulomb._
import coulomb.si._

// assume we have ValueMapper here
val firstTrip = 11.withUnit[Mile]
val secondTrip = 15.withUnit[Kilometer]
val tripsNode = NodeObject(trips -> Seq(firstTrip, secondTrip))

Next, there's the question of how to represent properties. In JSON-LD, specifying the type of a node object allows for an inference that properties associated with the type can be found in the node object. For example, given a type of Person, an application can look for a givenName property in the node object. Aliasing properties and keywords can be done in the context definition.

If there are two fields with the same name, compact IRIs can be used to disambiguate the field. If I have a field winners and another domain also has a field winners, the JSON-LD document can have terms lotto and olympic that point to IRIs, and then render lotto:winners and olympic:winners with no ambiguity.
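
To make the disambiguation concrete, here is a simplified plain-Scala sketch of how a compact IRI expands against prefix definitions. The two prefix IRIs are made-up placeholders, and the function skips the special cases a real JSON-LD processor handles (blank node identifiers, suffixes starting with "//", and so on).

```scala
object CompactIriSketch {
  // Hypothetical prefix terms, as they might be declared in a context definition.
  val prefixes: Map[String, String] = Map(
    "lotto"   -> "https://example.com/lotto#",
    "olympic" -> "https://example.org/olympic#"
  )

  // Expands "prefix:suffix" by concatenating the prefix's IRI with the suffix.
  // Anything without a known prefix is returned unchanged.
  def expand(term: String): String =
    term.split(":", 2) match {
      case Array(prefix, suffix) if prefixes.contains(prefix) =>
        prefixes(prefix) + suffix
      case _ =>
        term
    }
}
```

With these definitions, lotto:winners and olympic:winners expand to two different IRIs, so the two fields can no longer be confused even though they share the short name winners.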

JSON-LD leverages JSON arrays where possible, but extends JSON with the concepts of ordered lists, unordered sets, and indexed data aka maps. This can be extended to the concepts of "lists of lists" which can be useful for GeoJSON coordinates:

{
  "@context": {
    "@vocab": "https://purl.org/geojson/vocab#",
    "type": "@type",
    "bbox": {"@container": "@list"},
    "coordinates": {"@container": "@list"}
  },
  "type": "Feature",
  "bbox": [-10.0, -10.0, 10.0, 10.0],
  "geometry": {
    "type": "Polygon",
    "coordinates": [
      [
        [-10.0, -10.0],
        [10.0, -10.0],
        [10.0, 10.0],
        [-10.0, -10.0]
      ]
    ]
  }
}

Finally, JSON-LD has built-in self-referencing in the form of node identifiers. Node identifiers can be explicit IRI values that give a "stable" identifier such as a URL or URI, or they can be blank node identifiers that are only good locally in the context of the document.
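
Node identifiers are what make cyclic internal state renderable. The following plain-Scala sketch shows the idea under some loud assumptions: Task is a made-up in-memory structure, the output is hand-built JSON text with no string escaping, and the flat layout is only loosely modeled on how JSON-LD lets one node point at another by @id.

```scala
object NodeReferenceSketch {
  // Hypothetical in-memory structure; `next` may point back and form a cycle.
  final class Task(val name: String) {
    var next: Option[Task] = None
  }

  // Renders each task exactly once in a flat array. Links become
  // {"@id": ...} references rather than nested objects, so a cycle
  // cannot cause infinite recursion.
  def render(tasks: Seq[Task]): String = {
    // Blank node identifiers: only meaningful inside this one document.
    val ids: Map[Task, String] =
      tasks.zipWithIndex.map { case (t, i) => t -> s"_:b$i" }.toMap

    tasks.map { t =>
      // Links to tasks outside `tasks` are omitted in this sketch.
      val link = t.next
        .flatMap(ids.get)
        .map(id => s""","next":{"@id":"$id"}""")
        .getOrElse("")
      s"""{"@id":"${ids(t)}","name":"${t.name}"$link}"""
    }.mkString("[", ",", "]")
  }
}
```

Two tasks that point at each other render as two flat entries, each holding a reference to the other's blank node identifier.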

Binding Improvements

Because JSON-LD has well known semantics, Blindsight can provide stronger bindings. There's already been a demo of value binding, but stronger typing goes on through the rest of JSON-LD. Play's form binding implementation was a big inspiration here.

For example, say you want to use a node identifier, but you want to ensure it can only be used with a given UserID type. Using the JSON-LD binding, you can ensure that only valid UserID instances can be used:

case class UserID(id: String)
implicit val userIdMapper: IRIValueMapper[UserID] = IRIValueMapper[UserID](uid => IRI(uid.id))
val userId: IRIBinding[UserID] = Keyword.`@id`.bindIRI[UserID]

val node = NodeObject(userId -> UserID("12345")) // IRI("12345") won't compile

The same principle also applies to node object bindings:

case class MonetaryAmount(currency: Currency, value: Int)

object MonetaryAmount {
  implicit val monetaryAmountMapper: NodeObjectMapper[MonetaryAmount] = NodeObjectMapper { ma =>
    NodeObject(
      `@type` -> monetaryAmountType,
      currency -> ma.currency,
      value -> ma.value
    )
  }
}

val occupationType = schemaOrg("Occupation")
val monetaryAmountType = schemaOrg("MonetaryAmount")
val estimatedSalary = schemaOrg("estimatedSalary").bindObject[MonetaryAmount]

val occupation = NodeObject(
  `@type` -> occupationType,
  name -> "Code Monkey",
  estimatedSalary -> MonetaryAmount(USD, 1)
)

And lists also allow defining the type of element:

trait MyGeoContext {
  val vocab = IRI("https://purl.org/geojson/vocab#")
  val geometry = geoJson("geometry").bindObject[Geometry]

  implicit def seqMapper: NodeMapper[Seq[Double]] =
    NodeMapper { iter =>
      val mapper = implicitly[NodeMapper[Double]]
      ListObject(iter.map(mapper.mapNode))
    }

  val coordinates = geoJson("coordinates").bindList[Seq[Double]]
}

final case class Geometry(`@type`: String, coords: Seq[Seq[Double]])

object Geometry extends MyGeoContext {
  implicit val nodeMapper: NodeObjectMapper[Geometry] = NodeObjectMapper { geo =>
    val `@type` = Keyword.`@type`.bindIRI
    NodeObject(
      `@type` -> geo.`@type`,
      coordinates -> geo.coords
    )
  }
}

Maps are different, and don't have tight bindings the same way. I may add them in, but the complexities of map inputs make it harder to get something clean. Here's an id map example with just plain node objects as entries:

val post = schemaOrg("post").bindIdMap
val exampleCom = IRI("http://example.com/")
val baseExampleCom = exampleCom.base
val node = NodeObject(
  `@id` -> exampleCom,
  `@type` -> schemaOrg("Blog"),
  name -> "World Financial News",
  post -> Map(
    baseExampleCom("1/en") -> NodeObject(
      body -> "World commodities were up today with heavy trading of crude oil...",
      words -> 1539
    ),
    baseExampleCom("1/de") -> NodeObject(
      body -> "Die Werte an Warenbörsen stiegen im Sog eines starken Handels von Rohöl...",
      words -> 1204
    )
  )
)

All of the various bindings can be assembled in traits, and then collated together. IRIs and the various components (Base, Vocab, Term) can be centralized, and it should be intuitive to define JSON-LD context definitions just from looking at the traits and build up more structure than usually happens with raw JSON.

Conclusion

I'm pretty happy with the JSON-LD binding. It's optional and builds on functionality that's already there. It produces an AST that looks just like plain JSON. It's easily extended, and most importantly, it's unobtrusive. Users shouldn't need to think about how they are logging, but can set it up once and keep going.

I am still experimenting with rendering internal state effectively for debugging. Ideally, I'd like to use id: java:${jvmInstance}:${hashCode} as a default id for any given object, and then I can leverage that to render complex data effectively without having to do full traversal.
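
A minimal sketch of what such a default id could look like, with two assumptions of mine: a random UUID generated once per process stands in for ${jvmInstance}, and System.identityHashCode is used in place of hashCode so that classes overriding hashCode still get an identity-based value.

```scala
object ObjectIdSketch {
  // Assumption: one random identifier per JVM process stands in for ${jvmInstance}.
  val jvmInstance: String = java.util.UUID.randomUUID().toString

  // Builds an id of the form java:<jvmInstance>:<hex identity hash>.
  // Identity hash codes are not guaranteed to be unique, so two live
  // objects can occasionally share an id; this is a diagnostic aid only.
  def idFor(ref: AnyRef): String =
    s"java:$jvmInstance:${Integer.toHexString(System.identityHashCode(ref))}"
}
```

The same object yields the same id for as long as the process runs, which is enough to emit a reference instead of traversing the object again.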

There's a whole other blog post about what you can do with true semantic logging, but I'll save that for later.
