<p><strong>TL;DR</strong> Echopraxia is a structured logging API for Java and Scala. I've released Echopraxia 3.0, which has a number of new features, most notably more control over presentation logic, custom typed attributes, better exception handling, and removing hardcoded dependencies.</p>
<p>You can check it out at <a href="https://github.com/tersesystems/echopraxia/">https://github.com/tersesystems/echopraxia/</a> or check out the new <a href="https://tersesystems.github.io/echopraxia/">documentation site</a>.</p>
<p>This is going to be a development log going into technical details, explaining the why behind the how.</p>
<h2 id="presentation-logic">Presentation Logic</h2>
<p>Echopraxia's API is built around structured logging input. In practical terms, that means when you write:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">log</span><span class="o">.</span><span class="na">info</span><span class="o">(</span><span class="s">"{} logged in"</span><span class="o">,</span> <span class="n">fb</span> <span class="o">-></span> <span class="n">fb</span><span class="o">.</span><span class="na">user</span><span class="o">(</span><span class="s">"user"</span><span class="o">,</span> <span class="n">thisUser</span><span class="o">));</span>
</code></pre></div></div>
<p>Then you expect to see a line oriented output that looks something like:</p>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>INFO user={<some human readable data here>} logged in
</code></pre></div></div>
<p>And you expect that the JSON output will be:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="s2">"level"</span><span class="p">:</span><span class="w"> </span><span class="s2">"INFO"</span><span class="p">,</span><span class="w">
</span><span class="s2">"message"</span><span class="p">:</span><span class="w"> </span><span class="s2">"user={<some human readable data here>} logged in"</span><span class="p">,</span><span class="w">
</span><span class="s2">"user"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="err"><some</span><span class="w"> </span><span class="err">machine</span><span class="w"> </span><span class="err">readable</span><span class="w"> </span><span class="err">data</span><span class="w"> </span><span class="err">here></span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>The problem: the user may see irrelevant junk inside of <code class="language-plaintext highlighter-rouge">user={...}</code> when they really only care about <code class="language-plaintext highlighter-rouge">id</code> and <code class="language-plaintext highlighter-rouge">role</code>. The machine-readable side may also want JSON in a particular format – Elasticsearch wants stable field names and stable mappings, and has trouble understanding deeply nested objects and arrays. Or there may be additional data that isn't explicitly called out in the field: <a href="https://paulfrazee.medium.com/pauls-notes-on-how-json-ld-works-965732ea559d">JSON-LD</a> includes a <code class="language-plaintext highlighter-rouge">@type</code> field that is used in typed values, e.g. a timestamp may have a type of <code class="language-plaintext highlighter-rouge">http://www.w3.org/2001/XMLSchema#dateTime</code>, but that isn't relevant to the human.</p>
<p>The paradox here is that although structured logging involves packaging arguments into a structured format, the presentation of that data is very different between machine-readable format and "ergonomic" human-readable format. While logfmt is a recognizable and compact format, it's still machine based – it does not care about what is most relevant for a human to see. Meanwhile, from the end user's perspective, there's a loss of utility in rendering structured data: they used to be able to control the presentation exactly with <code class="language-plaintext highlighter-rouge">toString</code>, and now they can't.</p>
<p>This issue compounds when we start getting into complex objects and arrays. When rendering an AST, batches of paginated data, or encrypted data, there's an issue of presentation. Should the user see the entire AST, or only the relevant bits? Does the user care about the contents of the batch, or just that it's the right length? Should the user see the unencrypted data, or should it be filtered or invisible?</p>
<h3 id="presentation-hints">Presentation Hints</h3>
<p>The solution in 3.0 is to add typed attributes to fields. These attributes are used to add extra metadata to a field, so that a formatter has more to work with than just the name and value. Then we can add some <a href="https://tersesystems.github.io/echopraxia/3.0.0/usage/fieldbuilder/#field-presentation">presentation hints</a> to specialize fields so that a formatter can decide how to render this field in particular, and extend the <code class="language-plaintext highlighter-rouge">Field</code> type to <code class="language-plaintext highlighter-rouge">PresentationField</code> with some extra methods to provide those hints.</p>
<p>For example, one of the hints is <a href="https://tersesystems.github.io/echopraxia/3.0.0/usage/fieldbuilder/#ascardinal">asCardinal</a>, which renders a field as a <a href="https://en.wikipedia.org/wiki/Cardinal_number">cardinal number</a> in a line oriented format. This is most useful for very long strings and arrays:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">log</span><span class="o">.</span><span class="na">debug</span><span class="o">(</span><span class="s">"{}"</span><span class="o">,</span> <span class="n">fb</span> <span class="o">-></span> <span class="n">fb</span><span class="o">.</span><span class="na">array</span><span class="o">(</span><span class="s">"elements"</span><span class="o">,</span> <span class="mi">1</span><span class="o">,</span> <span class="mi">2</span><span class="o">,</span> <span class="mi">3</span><span class="o">).</span><span class="na">asCardinal</span><span class="o">());</span>
</code></pre></div></div>
<p>renders as:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>elements=|3|
</code></pre></div></div>
<p>Other useful presentation hints are <a href="https://tersesystems.github.io/echopraxia/3.0.0/usage/fieldbuilder/#aselided">asElided</a> which will "skip" a field so it doesn't show in the <code class="language-plaintext highlighter-rouge">toString</code> formatter, and <a href="https://tersesystems.github.io/echopraxia/3.0.0/usage/fieldbuilder/#abbreviateafter">abbreviateAfter</a> which truncates a string or array after a number of elements.</p>
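<p>As a sketch of how these hints combine in a single statement (the <code class="language-plaintext highlighter-rouge">sessionToken</code> string is hypothetical, and the exact rendering depends on your formatter):</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>log.debug("{} {}", fb -> fb.list(
    fb.string("sessionToken", sessionToken).asElided(), // hidden in line-oriented output
    fb.array("batch", 1, 2, 3, 4, 5, 6, 7).abbreviateAfter(5) // truncated after five elements
));
</code></pre></div></div>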
<h3 id="structured-format">Structured Format</h3>
<p>While it's nice to be able to customize values, there will be cases where the string we want a human to see is not the string we want the machine to see. Take the case of a duration:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">log</span><span class="o">.</span><span class="na">debug</span><span class="o">(</span><span class="s">"{}"</span><span class="o">,</span> <span class="n">fb</span> <span class="o">-></span> <span class="n">fb</span><span class="o">.</span><span class="na">duration</span><span class="o">(</span><span class="s">"duration"</span><span class="o">,</span> <span class="n">Duration</span><span class="o">.</span><span class="na">ofDays</span><span class="o">(</span><span class="mi">1</span><span class="o">)));</span>
</code></pre></div></div>
<p>We want to render this in a human readable format:</p>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"1 day"
</code></pre></div></div>
<p>But we want to see the <a href="https://en.wikipedia.org/wiki/ISO_8601#Durations">ISO duration</a> format in JSON:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{"duration": "PT24H"}
</code></pre></div></div>
<p>Simply rendering a <code class="language-plaintext highlighter-rouge">Value.string</code> won't work here, and overriding <code class="language-plaintext highlighter-rouge">toString</code> won't be enough. Instead, we have to provide both human and machine values. We can do that by passing a string with the human value, and using <code class="language-plaintext highlighter-rouge">withStructuredFormat</code> for the machine value:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">public</span> <span class="kd">class</span> <span class="nc">MyFieldBuilder</span> <span class="kd">extends</span> <span class="n">PresentationFieldBuilder</span> <span class="o">{</span>
<span class="kd">public</span> <span class="n">PresentationField</span> <span class="nf">duration</span><span class="o">(</span><span class="n">String</span> <span class="n">name</span><span class="o">,</span> <span class="n">Duration</span> <span class="n">duration</span><span class="o">)</span> <span class="o">{</span>
<span class="n">Field</span> <span class="n">structuredField</span> <span class="o">=</span> <span class="n">string</span><span class="o">(</span><span class="n">name</span><span class="o">,</span> <span class="n">duration</span><span class="o">.</span><span class="na">toString</span><span class="o">());</span>
<span class="k">return</span> <span class="nf">string</span><span class="o">(</span><span class="n">name</span><span class="o">,</span> <span class="n">duration</span><span class="o">.</span><span class="na">toDays</span><span class="o">()</span> <span class="o">+</span> <span class="s">" day"</span><span class="o">)</span>
<span class="o">.</span><span class="na">asValueOnly</span><span class="o">()</span>
<span class="o">.</span><span class="na">withStructuredFormat</span><span class="o">(</span><span class="k">new</span> <span class="n">SimpleFieldVisitor</span><span class="o">()</span> <span class="o">{</span>
<span class="nd">@Override</span>
<span class="kd">public</span> <span class="nd">@NotNull</span> <span class="n">Field</span> <span class="nf">visitString</span><span class="o">(</span><span class="nd">@NotNull</span> <span class="n">Value</span><span class="o"><</span><span class="n">String</span><span class="o">></span> <span class="n">stringValue</span><span class="o">)</span> <span class="o">{</span>
<span class="k">return</span> <span class="n">structuredField</span><span class="o">;</span>
<span class="o">}</span>
<span class="o">});</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p>This <code class="language-plaintext highlighter-rouge">withStructuredFormat</code> method adds an attribute that takes a <code class="language-plaintext highlighter-rouge">FieldVisitor</code> interface, following the <a href="https://en.wikipedia.org/wiki/Visitor_pattern">visitor pattern</a>. Here, we only care about swapping out the string value, so <code class="language-plaintext highlighter-rouge">visitString</code> is all that's required.</p>
<p>This also covers the case where we want to render extra information or do some transformation for the machine, so we could add <code class="language-plaintext highlighter-rouge">@type</code> information for JSON-LD:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">log</span><span class="o">.</span><span class="na">info</span><span class="o">(</span><span class="s">"{}"</span><span class="o">,</span> <span class="n">fb</span> <span class="o">-></span> <span class="n">fb</span><span class="o">.</span><span class="na">instant</span><span class="o">(</span><span class="s">"startTime"</span><span class="o">,</span> <span class="n">Instant</span><span class="o">.</span><span class="na">ofEpochMillis</span><span class="o">(</span><span class="mi">0</span><span class="o">)));</span>
</code></pre></div></div>
<p>in text format:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>startTime=1970-01-01T00:00:00Z
</code></pre></div></div>
<p>and in JSON:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="s2">"startTime"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="s2">"@type"</span><span class="p">:</span><span class="s2">"http://www.w3.org/2001/XMLSchema#dateTime"</span><span class="p">,</span><span class="w">
</span><span class="s2">"@value"</span><span class="p">:</span><span class="s2">"1970-01-01T00:00:00Z"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>and we can even cover this for the array case:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">log</span><span class="o">.</span><span class="na">info</span><span class="o">(</span><span class="s">"{}"</span><span class="o">,</span> <span class="n">fb</span> <span class="o">-></span> <span class="n">fb</span><span class="o">.</span><span class="na">instantArray</span><span class="o">(</span><span class="s">"instantArray"</span><span class="o">,</span> <span class="n">List</span><span class="o">.</span><span class="na">of</span><span class="o">(</span><span class="n">Instant</span><span class="o">.</span><span class="na">ofEpochMillis</span><span class="o">(</span><span class="mi">0</span><span class="o">))));</span>
</code></pre></div></div>
<p>produces:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="s2">"instantArray"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="s2">"@type"</span><span class="p">:</span><span class="s2">"http://www.w3.org/2001/XMLSchema#dateTime"</span><span class="p">,</span><span class="s2">"@value"</span><span class="p">:</span><span class="s2">"1970-01-01T00:00:00Z"</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>And here's the implementation:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">public</span> <span class="kd">class</span> <span class="nc">InstantFieldBuilder</span> <span class="kd">implements</span> <span class="n">PresentationFieldBuilder</span> <span class="o">{</span>
<span class="kd">private</span> <span class="kd">static</span> <span class="kd">final</span> <span class="n">FieldVisitor</span> <span class="n">instantVisitor</span> <span class="o">=</span> <span class="k">new</span> <span class="n">InstantFieldVisitor</span><span class="o">();</span>
<span class="kd">public</span> <span class="n">PresentationField</span> <span class="nf">instant</span><span class="o">(</span><span class="n">String</span> <span class="n">name</span><span class="o">,</span> <span class="n">Instant</span> <span class="n">instant</span><span class="o">)</span> <span class="o">{</span>
<span class="k">return</span> <span class="nf">string</span><span class="o">(</span><span class="n">name</span><span class="o">,</span> <span class="n">instant</span><span class="o">.</span><span class="na">toString</span><span class="o">()).</span><span class="na">withStructuredFormat</span><span class="o">(</span><span class="n">instantVisitor</span><span class="o">);</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="n">PresentationField</span> <span class="nf">instantArray</span><span class="o">(</span><span class="n">String</span> <span class="n">name</span><span class="o">,</span> <span class="n">List</span><span class="o"><</span><span class="n">Instant</span><span class="o">></span> <span class="n">instants</span><span class="o">)</span> <span class="o">{</span>
<span class="k">return</span> <span class="n">fb</span><span class="o">.</span><span class="na">array</span><span class="o">(</span><span class="n">name</span><span class="o">,</span> <span class="n">Value</span><span class="o">.</span><span class="na">array</span><span class="o">(</span><span class="n">i</span> <span class="o">-></span> <span class="n">Value</span><span class="o">.</span><span class="na">string</span><span class="o">(</span><span class="n">i</span><span class="o">.</span><span class="na">toString</span><span class="o">()),</span> <span class="n">instants</span><span class="o">))</span>
<span class="o">.</span><span class="na">withStructuredFormat</span><span class="o">(</span><span class="n">instantVisitor</span><span class="o">);</span>
<span class="o">}</span>
<span class="kd">class</span> <span class="nc">InstantFieldVisitor</span> <span class="kd">extends</span> <span class="n">SimpleFieldVisitor</span> <span class="o">{</span>
<span class="nd">@Override</span>
<span class="kd">public</span> <span class="nd">@NotNull</span> <span class="n">Field</span> <span class="nf">visitString</span><span class="o">(</span><span class="nd">@NotNull</span> <span class="n">Value</span><span class="o"><</span><span class="n">String</span><span class="o">></span> <span class="n">stringValue</span><span class="o">)</span> <span class="o">{</span>
<span class="k">return</span> <span class="nf">typedInstant</span><span class="o">(</span><span class="n">name</span><span class="o">,</span> <span class="n">stringValue</span><span class="o">);</span>
<span class="o">}</span>
<span class="n">PresentationField</span> <span class="nf">typedInstant</span><span class="o">(</span><span class="n">String</span> <span class="n">name</span><span class="o">,</span> <span class="n">Value</span><span class="o"><</span><span class="n">String</span><span class="o">></span> <span class="n">v</span><span class="o">)</span> <span class="o">{</span>
<span class="k">return</span> <span class="nf">object</span><span class="o">(</span><span class="n">name</span><span class="o">,</span> <span class="n">typedInstantValue</span><span class="o">(</span><span class="n">v</span><span class="o">));</span>
<span class="o">}</span>
<span class="n">Value</span><span class="o">.</span><span class="na">ObjectValue</span> <span class="nf">typedInstantValue</span><span class="o">(</span><span class="n">Value</span><span class="o"><</span><span class="n">String</span><span class="o">></span> <span class="n">v</span><span class="o">)</span> <span class="o">{</span>
<span class="k">return</span> <span class="n">Value</span><span class="o">.</span><span class="na">object</span><span class="o">(</span>
<span class="n">string</span><span class="o">(</span><span class="s">"@type"</span><span class="o">,</span> <span class="s">"http://www.w3.org/2001/XMLSchema#dateTime"</span><span class="o">),</span> <span class="n">keyValue</span><span class="o">(</span><span class="s">"@value"</span><span class="o">,</span> <span class="n">v</span><span class="o">));</span>
<span class="o">}</span>
<span class="nd">@Override</span>
<span class="kd">public</span> <span class="nd">@NotNull</span> <span class="n">ArrayVisitor</span> <span class="nf">visitArray</span><span class="o">()</span> <span class="o">{</span>
<span class="k">return</span> <span class="k">new</span> <span class="nf">InstantArrayVisitor</span><span class="o">();</span>
<span class="o">}</span>
<span class="kd">class</span> <span class="nc">InstantArrayVisitor</span> <span class="kd">extends</span> <span class="n">SimpleArrayVisitor</span> <span class="o">{</span>
<span class="nd">@Override</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">visitStringElement</span><span class="o">(</span><span class="n">Value</span><span class="o">.</span><span class="na">StringValue</span> <span class="n">stringValue</span><span class="o">)</span> <span class="o">{</span>
<span class="k">this</span><span class="o">.</span><span class="na">elements</span><span class="o">.</span><span class="na">add</span><span class="o">(</span><span class="n">typedInstantValue</span><span class="o">(</span><span class="n">stringValue</span><span class="o">));</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<h3 id="field-creation">Field Creation</h3>
<p>This covers the cases I can think of, but to make it really work, it has to be extensible so that users can add their own custom methods, e.g. an <code class="language-plaintext highlighter-rouge">asDecrypted()</code> hint that controls how an encrypted value is rendered. So instead of returning a <code class="language-plaintext highlighter-rouge">PresentationField</code>, we need to work with the field as a generic user-defined type extending <code class="language-plaintext highlighter-rouge">Field</code>.</p>
<p>Field creation in Echopraxia comes down to a factory method:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Field</span> <span class="n">field</span> <span class="o">=</span> <span class="n">Field</span><span class="o">.</span><span class="na">keyValue</span><span class="o">(</span><span class="n">name</span><span class="o">,</span> <span class="n">value</span><span class="o">);</span>
</code></pre></div></div>
<p>This has to be changed so that it takes a <code class="language-plaintext highlighter-rouge">Class<T></code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>PresentationField field = Field.keyValue(name, value, PresentationField.class);
</code></pre></div></div>
<p>This also means that users can modify field creation in general, so it can be extended with metrics, validation, caching, etc.</p>
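<p>As a rough sketch of what this enables, assuming nothing beyond the three-argument factory method above (the <code class="language-plaintext highlighter-rouge">fieldCounter</code> metric is hypothetical), a field builder could funnel all creation through one place:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import java.util.concurrent.atomic.AtomicLong;

public class MetricsFieldBuilder implements PresentationFieldBuilder {
  // hypothetical metric: counts every field this builder creates
  private static final AtomicLong fieldCounter = new AtomicLong();

  public PresentationField keyValue(String name, Value<?> value) {
    fieldCounter.incrementAndGet();
    return Field.keyValue(name, value, PresentationField.class);
  }
}
</code></pre></div></div>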
<h2 id="field-builders">Field Builders</h2>
<p>The other change to Echopraxia is the removal of <code class="language-plaintext highlighter-rouge">FieldBuilder</code> as the upper bound on the loggers' type parameter.</p>
<p>Before, you could do the following using <code class="language-plaintext highlighter-rouge">Logger<?></code> and it would act like it was <code class="language-plaintext highlighter-rouge">Logger<FieldBuilder></code>:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Logger</span><span class="o"><?></span> <span class="n">logger</span> <span class="o">=</span> <span class="n">LoggerFactory</span><span class="o">.</span><span class="na">getLogger</span><span class="o">(</span><span class="n">getClass</span><span class="o">());</span>
</code></pre></div></div>
<p>This no longer works in 3.0 and the default is <code class="language-plaintext highlighter-rouge">PresentationFieldBuilder</code>, not <code class="language-plaintext highlighter-rouge">FieldBuilder</code>:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Logger</span><span class="o"><</span><span class="n">PresentationFieldBuilder</span><span class="o">></span> <span class="n">logger</span> <span class="o">=</span> <span class="n">LoggerFactory</span><span class="o">.</span><span class="na">getLogger</span><span class="o">(</span><span class="n">getClass</span><span class="o">());</span>
</code></pre></div></div>
<p>If you still want to use <code class="language-plaintext highlighter-rouge">FieldBuilder</code> or your own custom instance, then you can now pass a field builder as an argument (instead of having to call <code class="language-plaintext highlighter-rouge">withFieldBuilder</code>):</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Logger</span><span class="o"><</span><span class="n">MyFieldBuilder</span><span class="o">></span> <span class="n">logger</span> <span class="o">=</span> <span class="n">LoggerFactory</span><span class="o">.</span><span class="na">getLogger</span><span class="o">(</span><span class="n">getClass</span><span class="o">(),</span> <span class="n">MyFieldBuilder</span><span class="o">.</span><span class="na">instance</span><span class="o">());</span>
</code></pre></div></div>
<p>There were two justifications initially for using <code class="language-plaintext highlighter-rouge">Logger<FB extends FieldBuilder></code>: minimizing verbosity and providing a minimal set of functionality for building and extending fields. The first justification is weak, and the second is offset by the assumptions that <code class="language-plaintext highlighter-rouge">FieldBuilder</code> makes for the user.</p>
<h3 id="minimizing-verbosity">Minimizing Verbosity</h3>
<p>At the time that Echopraxia was first sketched out, JDK 1.8 was much more popular. This is no longer the case – JDK 11 is long in the tooth now, and JDK 17 is the standard. The language has evolved, and now has type inference.</p>
<p>This means that if you're deriving loggers in a method, you'll use <code class="language-plaintext highlighter-rouge">var</code>:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">public</span> <span class="kd">class</span> <span class="nc">Foo</span> <span class="o">{</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">doStuff</span><span class="o">(</span><span class="n">Instant</span> <span class="n">startTime</span><span class="o">)</span> <span class="o">{</span>
<span class="n">var</span> <span class="n">log</span> <span class="o">=</span> <span class="n">logger</span><span class="o">.</span><span class="na">withFields</span><span class="o">(</span><span class="n">fb</span> <span class="o">-></span> <span class="n">fb</span><span class="o">.</span><span class="na">instant</span><span class="o">(</span><span class="s">"startTime"</span><span class="o">,</span> <span class="n">startTime</span><span class="o">))</span>
<span class="n">log</span><span class="o">.</span><span class="na">info</span><span class="o">(</span><span class="s">"doStuff: make things happen"</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p>And if you're defining a static final logger, you're not going to be bothered because <code class="language-plaintext highlighter-rouge">private static final</code> is already an up front cost:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">public</span> <span class="kd">class</span> <span class="nc">Foo</span> <span class="o">{</span>
<span class="kd">private</span> <span class="kd">static</span> <span class="kd">final</span> <span class="n">Logger</span><span class="o"><</span><span class="n">PresentationFieldBuilder</span><span class="o">></span> <span class="n">logger</span> <span class="o">=</span>
<span class="n">LoggerFactory</span><span class="o">.</span><span class="na">getLogger</span><span class="o">(</span><span class="n">Foo</span><span class="o">.</span><span class="na">class</span><span class="o">);</span>
<span class="o">}</span>
</code></pre></div></div>
<p>So this is a moot point.</p>
<h3 id="hidden-assumptions">Hidden Assumptions</h3>
<p>The other problem with <code class="language-plaintext highlighter-rouge"><FB extends FieldBuilder></code> is that the <code class="language-plaintext highlighter-rouge">FieldBuilder</code> interface makes too many assumptions about what the statement writer should know, instead of putting that power in the hands of the developer writing the field builder.</p>
<p>Let's bring it down to a single statement:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">log</span><span class="o">.</span><span class="na">info</span><span class="o">(</span><span class="s">"{}"</span><span class="o">,</span> <span class="n">fb</span> <span class="o">-></span> <span class="n">fb</span><span class="o">.</span><span class="na">string</span><span class="o">(</span><span class="s">"operation"</span><span class="o">,</span> <span class="s">"add"</span><span class="o">));</span>
</code></pre></div></div>
<p>I'm still happy with the requirement of an <code class="language-plaintext highlighter-rouge">fb</code> handle for constructing arguments here. Ideally, I'd like a magic static import function for the handle:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">log</span><span class="o">.</span><span class="n">info</span><span class="o">(</span><span class="s">"{}"</span><span class="o">,</span> <span class="k">import</span> <span class="nn">_</span> <span class="o">-></span> <span class="n">string</span><span class="o">(</span><span class="s">"operation"</span><span class="o">,</span> <span class="s">"add"</span><span class="o">));</span>
</code></pre></div></div>
<p>Or some kind of magic tuple:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">log</span><span class="o">.</span><span class="n">info</span><span class="o">(</span><span class="s">"{} {}"</span><span class="o">,</span> <span class="s">"startTime"</span> <span class="o">-></span> <span class="n">startTime</span><span class="o">,</span> <span class="s">"endTime"</span> <span class="o">-></span> <span class="n">endTime</span><span class="o">);</span>
</code></pre></div></div>
<p>But for what Java is, adding an <code class="language-plaintext highlighter-rouge">fb.</code> prefix to everything is fine.</p>
<p>Everything after the <code class="language-plaintext highlighter-rouge">fb.</code> is not fine, because it makes three different assumptions.</p>
<h4 id="exposing-field">Exposing Field</h4>
<p>The first assumption <code class="language-plaintext highlighter-rouge">FieldBuilder</code> makes is to expose <code class="language-plaintext highlighter-rouge">Field</code> as the return type instead of <code class="language-plaintext highlighter-rouge"><F extends Field></code>.</p>
<p>I've already gone over the problem with hardcoding <code class="language-plaintext highlighter-rouge">Field</code>, but it's worth noting because this is baked into the Logger itself. There's no way a developer can swap that out using <code class="language-plaintext highlighter-rouge">withFieldBuilder</code> – it's part of the public API.</p>
<h4 id="exposing-primitives">Exposing Primitives</h4>
<p>The second assumption that <code class="language-plaintext highlighter-rouge">FieldBuilder</code> makes is to expose the infoset primitives (<code class="language-plaintext highlighter-rouge">string</code>, <code class="language-plaintext highlighter-rouge">boolean</code>, <code class="language-plaintext highlighter-rouge">number</code>) and the complex infoset types (<code class="language-plaintext highlighter-rouge">array</code>, <code class="language-plaintext highlighter-rouge">object</code>) as part of the API, <em>and</em> it also exposes <code class="language-plaintext highlighter-rouge">keyValue</code> and <code class="language-plaintext highlighter-rouge">value</code> for the underlying <code class="language-plaintext highlighter-rouge">Value</code> objects.</p>
<p>This is a problem on multiple levels. It puts the power of definition in the statement, rather than in the field builder. More importantly, it ties the hands of the developer because what's being passed in is a representation, rather than the object to represent.</p>
<p>Simply put, there's no such thing as a string. A string is a textual representation of something meaningful in the domain: a name, an address, a debug representation of a syntax tree. A boolean is a representation of a feature flag, and so on.</p>
<p>So rather than letting the user pass in a string:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">log</span><span class="o">.</span><span class="na">info</span><span class="o">(</span><span class="s">"{}"</span><span class="o">,</span> <span class="n">fb</span> <span class="o">-></span> <span class="n">fb</span><span class="o">.</span><span class="na">string</span><span class="o">(</span><span class="s">"operation"</span><span class="o">,</span> <span class="s">"add"</span><span class="o">));</span>
</code></pre></div></div>
<p>The developer could require the user to categorize the data as "program flow":</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>log.info("{}", fb -> fb.flow("operation", "add"));
</code></pre></div></div>
<p>or even better, expose a DSL:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>log.info("{}", fb -> fb.flow.operation("add"));
</code></pre></div></div>
<p>The point here is not that this is an ideal DSL, but that the developer should be able to decide how permissive or restrictive the field builder API is. Specifying that the logger extends <code class="language-plaintext highlighter-rouge">FieldBuilder</code> is removing that choice.</p>
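<p>As a sketch of that DSL, assuming only the standard <code class="language-plaintext highlighter-rouge">string</code> method (the <code class="language-plaintext highlighter-rouge">Flow</code> inner class is hypothetical):</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>public class MyFieldBuilder implements PresentationFieldBuilder {
  // fb.flow.operation("add") resolves through this inner class
  public final Flow flow = new Flow();

  public class Flow {
    public Field operation(String op) {
      // delegates to the enclosing builder's string() method
      return string("operation", op);
    }
  }
}
</code></pre></div></div>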
<h3 id="exposing-names">Exposing Names</h3>
<p>The third assumption is that users should provide field names – that is, <code class="language-plaintext highlighter-rouge">fb.kv("name", "value")</code> is reasonable.</p>
<p>This seems like a reasonable assumption, especially when you're trying to log several instances of the same type:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">log</span><span class="o">.</span><span class="na">info</span><span class="o">(</span><span class="s">"{} {}"</span><span class="o">,</span> <span class="n">fb</span> <span class="o">-></span> <span class="n">fb</span><span class="o">.</span><span class="na">list</span><span class="o">(</span>
<span class="n">fb</span><span class="o">.</span><span class="na">instant</span><span class="o">(</span><span class="s">"startTime"</span><span class="o">,</span> <span class="n">startTime</span><span class="o">),</span>
<span class="n">fb</span><span class="o">.</span><span class="na">instant</span><span class="o">(</span><span class="s">"endTime"</span><span class="o">,</span> <span class="n">endTime</span><span class="o">)</span>
<span class="o">));</span>
</code></pre></div></div>
<p>However, there are downsides to defining names directly in statements, especially when using centralized logging.</p>
<p>The first issue is that you may have to sanitize or validate the input name depending on your centralized logging. For example, Elasticsearch does not support <a href="https://www.elastic.co/blog/introducing-the-de_dot-filter">field names containing a . (dot) character</a>, so you must either convert invalid field names or reject them outright.</p>
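<p>One mitigation is to sanitize names in the field builder itself rather than trusting each statement; a sketch, again assuming the three-argument factory method from earlier:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>public class SanitizingFieldBuilder implements PresentationFieldBuilder {
  public PresentationField keyValue(String name, Value<?> value) {
    // Elasticsearch rejects dotted field names, so replace dots up front
    String safeName = name.replace('.', '_');
    return Field.keyValue(safeName, value, PresentationField.class);
  }
}
</code></pre></div></div>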
<p>A broader issue is that field names are not scoped by the logger name. Centralized logging does not know that in the <code class="language-plaintext highlighter-rouge">FooLogger</code> a field name may be a string, but in the <code class="language-plaintext highlighter-rouge">BarLogger</code>, the same field name will be a number.</p>
<p>This can cause issues in centralized logging – Elasticsearch will attempt to define a schema based on dynamic mapping, meaning that if two log statements in the same index have the same field name but different types, e.g. <code class="language-plaintext highlighter-rouge">"error": 404</code> vs <code class="language-plaintext highlighter-rouge">"error": "not found"</code>, then Elasticsearch will throw a <code class="language-plaintext highlighter-rouge">mapper_parsing_exception</code> and may reject log statements if you do not have <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/ignore-malformed.html">ignore_malformed</a> turned on.</p>
<p>Even if you turn <code class="language-plaintext highlighter-rouge">ignore_malformed</code> on or have different mappings, a change in a mapping across indexes will be enough to stop Elasticsearch from querying correctly. Elasticsearch will also flatten field names, which can cause more confusion, as conflicts only surface when objects share both a field name and a property: two fields that are both called <code class="language-plaintext highlighter-rouge">error</code> and are both objects may work fine, then fail when an optional <code class="language-plaintext highlighter-rouge">code</code> property is added to one of them.</p>
<p>Likewise, field names are not automatically scoped by context. You may have collision cases where two different fields have the same name in the same statement:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">logger</span>
<span class="o">.</span><span class="na">withFields</span><span class="o">(</span><span class="n">fb</span> <span class="o">-></span> <span class="n">fb</span><span class="o">.</span><span class="na">keyValue</span><span class="o">(</span><span class="s">"user_id"</span><span class="o">,</span> <span class="n">userId</span><span class="o">))</span>
<span class="o">.</span><span class="na">info</span><span class="o">(</span><span class="s">"{}"</span><span class="o">,</span> <span class="n">fb</span> <span class="o">-></span> <span class="n">fb</span><span class="o">.</span><span class="na">keyValue</span><span class="o">(</span><span class="s">"user_id"</span><span class="o">,</span> <span class="n">otherUserId</span><span class="o">));</span>
</code></pre></div></div>
<p>This will produce a statement that has two <code class="language-plaintext highlighter-rouge">user_id</code> fields with two different values – which is technically valid JSON, but may not be what centralized logging expects. Ideally, the backend should deal with this transparently rather than leaving it to the user.</p>
<p>In short, what the user adds as a field name should be more what you'd call a 'guideline' than an actual rule, and the field builder should provide defaults if the user doesn't provide one.</p>
<p>For example, if you specify an <code class="language-plaintext highlighter-rouge">Address</code> without a name, even if you're just overriding <code class="language-plaintext highlighter-rouge">keyValue</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>log.info("{}", fb -> fb.keyValue(address));
</code></pre></div></div>
<p>Then the field builder can use <code class="language-plaintext highlighter-rouge">address</code> as the default field name, or even query the <code class="language-plaintext highlighter-rouge">Address</code> to figure out if it's <code class="language-plaintext highlighter-rouge">home</code> or <code class="language-plaintext highlighter-rouge">work</code>.</p>
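<p>A sketch of that, assuming a simple <code class="language-plaintext highlighter-rouge">Address</code> class (its accessors and the <code class="language-plaintext highlighter-rouge">addressValue</code> helper are hypothetical):</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>public class AddressFieldBuilder implements PresentationFieldBuilder {
  public PresentationField keyValue(Address address) {
    // use "address" as the default, or query the object for a better name
    String name = address.isWork() ? "work" : "address";
    return keyValue(name, addressValue(address));
  }

  // hypothetical helper turning an Address into a structured value
  private Value.ObjectValue addressValue(Address address) {
    return Value.object(
        string("street", address.street()),
        string("city", address.city()));
  }
}
</code></pre></div></div>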
<h2 id="exception-handling">Exception Handling</h2>
<p>The internals of Echopraxia's error handling have also been improved.</p>
<p>One assumed rule about logging is that it shouldn't break the application if logging fails. This has never been true, and SLF4J makes no such promises:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">class</span> <span class="nc">MyObject</span> <span class="o">{</span>
<span class="kd">public</span> <span class="n">String</span> <span class="nf">toString</span><span class="o">()</span> <span class="o">{</span>
<span class="k">throw</span> <span class="k">new</span> <span class="nf">Exception</span><span class="o">();</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="n">slf4jLogger</span><span class="o">.</span><span class="na">info</span><span class="o">(</span><span class="s">"{}"</span><span class="o">,</span> <span class="k">new</span> <span class="n">MyObject</span><span class="o">());</span>
</code></pre></div></div>
<p>This will throw an exception that will blow through an appender in Logback, because it calls toString. It's up to the application to wrap arguments in a <a href="https://stackoverflow.com/questions/11295764/slf4j-without-tostring">custom converter</a>.</p>
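<p>A sketch of such a wrapper, guarding the <code class="language-plaintext highlighter-rouge">toString</code> call so that a misbehaving argument can't take down the appender:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>public final class SafeArg {
  private final Object delegate;

  public SafeArg(Object delegate) {
    this.delegate = delegate;
  }

  @Override
  public String toString() {
    try {
      return String.valueOf(delegate);
    } catch (RuntimeException e) {
      // render a placeholder instead of blowing through the appender
      return "<toString failed: " + e.getClass().getName() + ">";
    }
  }
}

// usage: slf4jLogger.info("{}", new SafeArg(new MyObject()));
</code></pre></div></div>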
<p>The problem in Echopraxia is that it can be astonishingly hard to figure out what values are null.</p>
<p>For example, in the AWS Java API, you'll get an <a href="https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/S3Object.html">S3Object</a>:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">class</span> <span class="nc">MyFieldBuilder</span> <span class="kd">extends</span> <span class="n">PresentationFieldBuilder</span> <span class="o">{</span>
<span class="kd">public</span> <span class="n">Value</span><span class="o"><?></span> <span class="n">s3ObjectValue</span><span class="o">(</span><span class="n">S3Object</span> <span class="n">s3Object</span><span class="o">)</span> <span class="o">{</span>
<span class="n">String</span> <span class="n">key</span> <span class="o">=</span> <span class="n">s3Object</span><span class="o">.</span><span class="na">getKey</span><span class="o">();</span>
<span class="kt">int</span> <span class="n">taggingCount</span> <span class="o">=</span> <span class="n">s3Object</span><span class="o">.</span><span class="na">getTaggingCount</span><span class="o">().</span><span class="na">toInt</span><span class="o">();</span>
<span class="k">return</span> <span class="n">Value</span><span class="o">.</span><span class="na">object</span><span class="o">(</span><span class="n">keyValue</span><span class="o">(</span><span class="s">"key"</span><span class="o">,</span> <span class="n">key</span><span class="o">),</span> <span class="n">keyValue</span><span class="o">(</span><span class="s">"taggingCount"</span><span class="o">,</span> <span class="n">taggingCount</span><span class="o">));</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p>This will fail with a <code class="language-plaintext highlighter-rouge">NullPointerException</code>, because <code class="language-plaintext highlighter-rouge">getTaggingCount()</code> returns a null <code class="language-plaintext highlighter-rouge">Integer</code>.</p>
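<p>A null-safe sketch of the same builder checks the boxed <code class="language-plaintext highlighter-rouge">Integer</code> before use; this assumes <code class="language-plaintext highlighter-rouge">Value.nullValue()</code> and <code class="language-plaintext highlighter-rouge">Value.number()</code> factory methods:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>public Value<?> s3ObjectValue(S3Object s3Object) {
  String key = s3Object.getKey(); // may be null
  Integer taggingCount = s3Object.getTaggingCount(); // may be null
  return Value.object(
      string("key", key == null ? "" : key),
      keyValue("taggingCount",
          taggingCount == null ? Value.nullValue() : Value.number(taggingCount)));
}
</code></pre></div></div>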
<p>Echopraxia makes a best effort and does <a href="https://tersesystems.github.io/echopraxia/3.0.0/usage/fieldbuilder/#exception-handling">point out bad ideas</a>, but now there's an actual exception handler for when things go south. The default implementation writes <code class="language-plaintext highlighter-rouge">e.printStackTrace()</code> to STDERR and does not log the statement, but you can replace that implementation with something that writes to Sentry or writes a "best effort" log statement out.</p>
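<p>For example, here is a sketch of a handler that reports to Sentry instead, assuming the handler interface exposes a single method taking the throwable (see the exception handling docs for the actual registration mechanism):</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>public class SentryExceptionHandler implements ExceptionHandler {
  @Override
  public void handleException(Throwable e) {
    // report to Sentry rather than printing a stack trace to STDERR
    Sentry.captureException(e);
  }
}
</code></pre></div></div>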
<h2 id="housekeeping">Housekeeping</h2>
<p>One housekeeping change is moving the internals of <code class="language-plaintext highlighter-rouge">Logger</code> and logging support out of the <code class="language-plaintext highlighter-rouge">api</code> package into an <code class="language-plaintext highlighter-rouge">spi</code> package.</p>
<p>There are two categories of user: the developers who put together the loggers, the field builders, and the framework itself, and the end users who just want log statements and maybe some custom conditions. The code needs to reflect that.</p>
<p>My codified knowledge of <a href="https://stackoverflow.com/questions/2954372/difference-between-spi-and-api">SPI vs API</a> is spotty, but the general rule I applied was "does the person writing a logging statement need to know about this?" Things like <code class="language-plaintext highlighter-rouge">Condition</code>, <code class="language-plaintext highlighter-rouge">LoggingContext</code>, and <code class="language-plaintext highlighter-rouge">Level</code> all qualify as API. More internal things like <code class="language-plaintext highlighter-rouge">CoreLogger</code>, <code class="language-plaintext highlighter-rouge">Filter</code>, and <code class="language-plaintext highlighter-rouge">DefaultMethodsSupport</code> do not.</p>
<h2 id="compile-only-dependencies">Compile-Only Dependencies</h2>
<p>Finally, 3.0 removes the transitive dependencies on the Logback and Log4J 2 framework implementations, so they will need to be explicitly declared as dependencies at <a href="https://tersesystems.github.io/echopraxia/3.0.0/installation/">installation</a>.</p>
<p>This is mostly because SLF4J 2.0.x and SLF4J 1.7.x do not mix. If you have code that uses SLF4J 2.0.x, it will see Logback 1.2.x and pointedly ignore it with a <a href="https://www.slf4j.org/codes.html#ignoredBindings">warning message</a>.</p>
<p>This didn't use to matter, because logstash-logback-encoder used SLF4J 1.7.x, but as of <a href="https://github.com/logfellow/logstash-logback-encoder/releases/tag/logstash-logback-encoder-7.4">7.4</a>, support for Logback 1.2 is <a href="https://github.com/logfellow/logstash-logback-encoder/pull/970">dropped</a>. Echopraxia makes no guarantees of backwards or forwards compatibility between versions: it is best effort at this point. If it gets to the point where Logback 1.2 and 1.4 are simply not reconcilable, I'll create different adapters for them, so there'll be <code class="language-plaintext highlighter-rouge">logstash1_2</code> and <code class="language-plaintext highlighter-rouge">logstash1_4</code> backends.</p>
<p>Likewise, while Log4J 2 doesn't depend on SLF4J 2, the <a href="https://www.cisa.gov/news-events/news/apache-log4j-vulnerability-guidance">Log4J vulnerability warnings</a> and the general attention given to analyzing <em>exactly</em> which version of Log4J 2 your framework depends on mean that the safest thing to do is to not specify any version at all.</p>
<h2 id="summary">Summary</h2>
<p>I hope this gives a good overview of the design decisions and thinking going into this release. I'm really happy with Echopraxia as a whole, and I keep being surprised at how much fun I have both writing it and finding new things I can do with it.</p>TL;DR Echopraxia is a structured logging API for Java and Scala. I've released Echopraxia 3.0 which has a number of new features, most notably more control over presentation logic, custom typed attributes, better exception handling, and removing hardcoded dependencies. You can check it out at https://github.com/tersesystems/echopraxia/ or check out the new documentation site. This is going to be a development log going into technical details, explaining the why behind the how. Presentation Logic Echopraxia's API is built around structured logging input. In practical terms, that means when you write: log.info("{} logged in", fb -> fb.user("user", thisUser)); Then you expect to see a line oriented output that looks something like: INFO user={<some human readable data here>} logged in And you expect that the JSON output will be: { "level": "INFO", "message": "user={<some human readable data here>} logged in", "user": { <some machine readable data here> } } The problem: the user may see irrelevant junk inside of user={...} and really only cares about id and role. The machine-readable data also may want JSON in a particular format – Elasticsearch wants stable field names, stable mappings, and has trouble understanding deeply nested objects and arrays. Or there may be additional data that isn't explicitly called out in the field JSON-LD includes a @type field that is used in typed values, i.e. a timestamp may have a type of http://www.w3.org/2001/XMLSchema#dateTime, but that isn't relevant to the human. The paradox here is that although structured logging involves packaging arguments into a structured format, the presentation of that data is very different between machine-readable format and "ergonomic" human-readable format. While logfmt is a recognizable and compact format, it's still machine based – it does not care about what is most relevant for a human to see. Meanwhile, from the end user's perspective, there's a loss of utility in rendering structured data: they used to be able to control the presentation exactly with toString, and now they can't. This issue compounds when we start getting into complex objects and arrays. When rendering an AST, batches of paginated data, or encrypted data, there's an issue of presentation. Should the user see the entire AST, or only the relevant bits? Does the user care about the contents of the batch, or just that it's the right length? Should the user see the unencrypted data, or should it be filtered or invisible? Presentation Hints The solution in 3.0 is to add typed attributes to fields. These attributes are used to add extra metadata to a field, so that a formatter has more to work with than just the name and value. Then we can add some presentation hints to specialize fields so that a formatter can decide how to render this field in particular, and extend the Field type to PresentationField with some extra methods to provide those hints. For example, one of the hints is asCardinal, which renders a field as a cardinal number in a line oriented format. 
This is most useful for very long strings and arrays: log.debug("{}", fb -> fb.array("elements", 1, 2, 3).asCardinal()); renders as: elements=|3| Other useful presentation hints are asElided which will "skip" a field so it doesn't show in the toString formatter, and abbreviateAfter which truncates a string or array after a number of elements. Structured Format While it's nice to be able to customize values, there will be cases where the string we want a human to see is not the string we want the machine to see. Take the case of a duration: log.debug("{}", fb -> fb.duration("duration", Duration.ofDays(1))); We want to render this in a human readable format: "1 day" But we want to see the ISO duration format in JSON: {"duration": "PT24H"} Simply rendering a Value.string won't work here, and overriding toString won't be enough. Instead, we have to provide both human and machine values. We can do that by passing a string with the human value, and using withStructuredFormat for the machine value: public class MyFieldBuilder extends PresentationFieldBuilder { public PresentationField duration(String name, Duration duration) { Field structuredField = string(name, duration.toString()); return string(name, duration.toDays() + " day") .asValueOnly() .withStructuredFormat(new SimpleFieldVisitor() { @Override public @NotNull Field visitString(@NotNull Value<String> stringValue) { return structuredField; } }); } } This withStructuredFormat method adds an attribute that takes a FieldVisitor interface, following the visitor pattern. Here, we only care about swapping out the string value, so visitString is all that's required. This also covers the case where we want to render extra information or do some transformation for the machine, so we could add @type information for JSON-LD: log.info("{}", fb -> fb.instant("startTime", Instant.ofEpochMillis(0))); in text format; startTime=1970-01-01T00:00:00Z and in JSON: { "startTime": { "@type":"http://www.w3.org/2001/XMLSchema#dateTime", "@value":"1970-01-01T00:00:00Z" } } and we can even cover this for the array case: log.info("{}", fb -> fb.instantArray("instantArray", List.of(Instant.ofEpochMillis(0)))); produces: { "instantArray": [ {"@type":"http://www.w3.org/2001/XMLSchema#dateTime","@value":"1970-01-01T00:00:00Z"} ] } And here's the implementation: public class InstantFieldBuilder implements PresentationFieldBuilder { private static final FieldVisitor instantVisitor = new InstantFieldVisitor(); public PresentationField instant(String name, Instant instant) { return string(name, instant.toString()).withStructuredFormat(instantVisitor); } public PresentationField instantArray(String name, List<Instant> instants) { return fb.array(name, Value.array(i -> Value.string(i.toString()), instants)) .withStructuredFormat(instantVisitor); } class InstantFieldVisitor extends SimpleFieldVisitor { @Override public @NotNull Field visitString(@NotNull Value<String> stringValue) { return typedInstant(name, stringValue); } PresentationField typedInstant(String name, Value<String> v) { return object(name, typedInstantValue(v)); } Value.ObjectValue typedInstantValue(Value<String> v) { return Value.object( string("@type", "http://www.w3.org/2001/XMLSchema#dateTime"), keyValue("@value", v)); } @Override public @NotNull ArrayVisitor visitArray() { return new InstantArrayVisitor(); } class InstantArrayVisitor extends SimpleArrayVisitor { @Override public void visitStringElement(Value.StringValue stringValue) { this.elements.add(typedInstantValue(stringValue)); } } } } Field 
Creation This covers the cases I can think about, but make it really work it has to be extensible so users can add their own custom methods, i.e asDecrypted() to strip decryption from a value. So instead of returning a PresentationField, we should be need to work with a field as a generic user defined type extending Field. Field creation in Echopraxia comes down to a factory method: Field field = Field.keyValue(name, value); This has to be changed so that it takes a Class<T>: PresentationField field = Field.keyValue(name, value, PresentationField.class); This also means that users can modify field creation in general, so it can be extended with metrics, validation, caching, etc. Field Builders The other change to Echopraxia is the removal of FieldBuilder as a lower bound on the loggers. Before, you could do the following using Logger<?> and it would act like it was Logger<FieldBuilder>: Logger<?> logger = LoggerFactory.getLogger(getClass()); This no longer works in 3.0 and the default is PresentationFieldBuilder, not FieldBuilder: Logger<PresentationFieldBuilder> logger = LoggerFactory.getLogger(getClass()); If you still want to use FieldBuilder or your own custom instance then you can now pass a field builder as argument (instead of having to calling withFieldBuilder): Logger<MyFieldBuilder> logger = LoggerFactory.getLogger(getClass(), MyFieldBuilder.instance()); There were two justifications initially for using Logger<FB extends FieldBuilder>: minimizing verbosity and providing a minimal set of functionality for building and extending fields. The first justification is weak, and the second is offset by the assumptions that FieldBuilder makes for the user. Minimizing Verbosity At the time that Echopraxia was first sketched out, JDK 1.8 was much more popular. This is no longer the case – JDK 11 is long in the tooth now, and JDK 17 is the standard. The language has evolved, and now has type inference. This means that if you're concatenating loggers in a method, you'll use var: public class Foo { public void doStuff(Instant startTime) { var log = logger.withFields(fb -> fb.instant("startTime", startTime)) log.info("doStuff: make things happen"); } } And if you're defining a static final logger, you're not going to be bothered because private static final is already an up front cost: public class Foo { private static final Logger<PresentationFieldBuilder> logger = LoggerFactory.getLogger(Foo.class); } So this is a moot point. Hidden Assumptions The other problem with <FB extends FieldBuilder> is that the FieldBuilder interface makes too many assumptions about what the statement writer should know, instead of putting that power in the hands of the developer writing the field builder. Let's bring it down to a single statement: log.info("{}", fb -> fb.string("operation", "add")); I'm still happy with the requirement of an fb handle for constructing arguments here. Ideally, I'd like a magic static import function for the handle: log.info("{}", import _ -> string("operation", "add")); Or some kind of magic tuple: log.info("{} {}", "startTime" -> startTime, "endTime" -> endTime); But for what Java is, adding an fb. prefix to everything is fine. Everything after the fb. is not fine, because it makes three different assumptions. Exposing Field The first assumption the FieldBuilder makes to expose Field as the return type instead of <F extends Field>. I've already gone over the problem with hardcoding Field, but it's worth noting because this is baked into the Logger itself. 
There's no way a developer can swap that out using withFieldBuilder – it's part of the public API. Exposing Primitives The second assumption that FieldBuilder makes is to expose the infoset primitives (string, boolean, number), and the infoset complex (array, object) as part of the API, and it also exposes keyValue and value for the underlying Value objects. This is a problem on multiple levels. It puts the power of definition in the statement, rather than in the field builder. More importantly, it ties the hands of the developer because what's being passed in is a representation, rather than the object to represent. Simply put, there's no such thing as a string. A string is a textual representation of something meaningful in the domain: a name, an address, a debug representation of a syntax tree. A boolean is a representation of a feature flag, and so on. So rather than letting the user pass in a string: log.info("{}", fb -> fb.string("operation", "add")); The developer could require the user to categorize the data as "program flow": log.info("{}", fb -> fb.flow("operation", "add")); or even better, expose a DSL: log.info("{}", fb -> fb.flow.operation("add")); The point here is not that this is an ideal DSL, but that the developer should be able to decide how permissive or restrictive the field builder API is. Specifying that the logger extends FieldBuilder is removing that choice. Exposing Names The third assumption is that users should provide field names – that is, fb.kv("name", "value") is reasonable. This seems like a reasonable assumption, especially when you're trying to log several instances of the same type: log.info("{} {}", fb -> fb.list( fb.instant("startTime", startTime), fb.instant("endTime", endTime) )); However, there are downsides to defining names directly in statements, especially when using centralized logging. The first issue is that you may have to sanitize or validate the input name depending on your centralized logging. For example, ElasticSearch does not support field names containing a . (dot) character, so if you do not convert or reject invalid field names. A broader issue is that field names are not scoped by the logger name. Centralized logging does not know that in the FooLogger a field name may be a string, but in the BarLogger, the same field name will be a number. This can cause issues in centralized logging – ElasticSearch will attempt to define a schema based on dynamic mapping, meaning that if two log statements in the same index have the same field name but different types, i.e. "error": 404 vs "error": "not found" then Elasticsearch will render mapper_parsing_exception and may reject log statements if you do not have ignore_malformed turned on. Even if you turn ignore_malformed on or have different mappings, a change in a mapping across indexes will be enough to stop ElasticSearch from querying correctly. ElasticSearch will also flatten field names, which can cause more confusion as conflicts will only come when objects have both the same field name and property, i.e. they are both called error and are objects that work fine, but fail when an optional code property is added. Likewise, field names are not automatically scoped by context. 
<h3 id="exposing-names">Exposing Names</h3>
<p>The third assumption is that users should provide field names – that is, that <code class="language-plaintext highlighter-rouge">fb.kv("name", "value")</code> is reasonable. This seems like a fair assumption, especially when you're trying to log several instances of the same type:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>log.info("{} {}", fb -> fb.list(
  fb.instant("startTime", startTime),
  fb.instant("endTime", endTime)
));
</code></pre></div></div>
<p>However, there are downsides to defining names directly in statements, especially when using centralized logging. The first issue is that you may have to sanitize or validate the input name, depending on your centralized logging. For example, ElasticSearch does not support field names containing a <code class="language-plaintext highlighter-rouge">.</code> (dot) character, so you must convert or reject invalid field names.</p>
<p>A broader issue is that field names are not scoped by the logger name. Centralized logging does not know that in the <code class="language-plaintext highlighter-rouge">FooLogger</code> a field name may be a string, while in the <code class="language-plaintext highlighter-rouge">BarLogger</code> the same field name will be a number. This can cause issues in centralized logging – ElasticSearch will attempt to define a schema based on dynamic mapping, meaning that if two log statements in the same index have the same field name but different types, e.g. <code class="language-plaintext highlighter-rouge">"error": 404</code> vs <code class="language-plaintext highlighter-rouge">"error": "not found"</code>, then ElasticSearch will throw a <code class="language-plaintext highlighter-rouge">mapper_parsing_exception</code> and may reject log statements if you do not have <code class="language-plaintext highlighter-rouge">ignore_malformed</code> turned on. Even if you turn <code class="language-plaintext highlighter-rouge">ignore_malformed</code> on or have different mappings, a change in a mapping across indexes will be enough to stop ElasticSearch from querying correctly. ElasticSearch will also flatten field names, which can cause more confusion, as conflicts only surface when objects have both the same field name and the same property – two fields that are both called <code class="language-plaintext highlighter-rouge">error</code> and are both objects will work fine, but fail when an optional <code class="language-plaintext highlighter-rouge">code</code> property is added to one of them.</p>
<p>Likewise, field names are not automatically scoped by context. You may have collision cases where two different fields have the same name in the same statement:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>logger
  .withFields(fb -> fb.keyValue("user_id", userId))
  .info("{}", fb -> fb.keyValue("user_id", otherUserId));
</code></pre></div></div>
<p>This will produce a statement that has two <code class="language-plaintext highlighter-rouge">user_id</code> fields with two different values – which is technically valid JSON, but may not be what centralized logging expects. The backend should be able to deal with this transparently.</p>
<p>In short, what the user adds as a field name should be more what you'd call a "guideline" than an actual rule, and the field builder should provide defaults if the user doesn't provide one. For example, you could allow an <code class="language-plaintext highlighter-rouge">Address</code> to be specified without a name, just by overriding <code class="language-plaintext highlighter-rouge">keyValue</code>:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>log.info("{}", fb -> fb.keyValue(address));
</code></pre></div></div>
<p>Then the field builder can use <code class="language-plaintext highlighter-rouge">address</code> as the default field name, or even query the <code class="language-plaintext highlighter-rouge">Address</code> to figure out if it's home or work.</p>
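<p>A hedged sketch of such an overload – the <code class="language-plaintext highlighter-rouge">Address</code> type and its accessors are assumptions made up for illustration:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// An overload that supplies a default name when the user doesn't give one.
// Address, isHome(), street() and city() are hypothetical.
public class MyFieldBuilder implements FieldBuilder {
  public Field keyValue(Address address) {
    // query the object itself to pick a more meaningful default name
    String name = address.isHome() ? "home_address" : "address";
    return keyValue(name, addressValue(address));
  }

  private Value<?> addressValue(Address address) {
    return Value.object(
      keyValue("street", address.street()),
      keyValue("city", address.city())
    );
  }
}
</code></pre></div></div>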
<h2 id="exception-handling">Exception Handling</h2>
<p>The internals of Echopraxia's error handling have also been improved. One assumed rule about logging is that it shouldn't break the application if logging fails. This has never been true, and SLF4J makes no such promises:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class MyObject {
  public String toString() {
    throw new RuntimeException();
  }
}

slf4jLogger.info("{}", new MyObject());
</code></pre></div></div>
<p>This will throw an exception that will blow through an appender in Logback, because the appender calls <code class="language-plaintext highlighter-rouge">toString</code>. It's up to the application to wrap arguments in a custom converter.</p>
<p>The problem in Echopraxia is that it can be astonishingly hard to figure out what values are null. For example, in the AWS Java API, you'll get an <code class="language-plaintext highlighter-rouge">S3Object</code>:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class MyFieldBuilder extends PresentationFieldBuilder {
  public Value<?> s3ObjectValue(S3Object s3Object) {
    String key = s3Object.getKey();
    int taggingCount = s3Object.getTaggingCount(); // unboxes a null Integer
    return Value.object(keyValue("key", key), keyValue("taggingCount", taggingCount));
  }
}
</code></pre></div></div>
<p>This will fail with a <code class="language-plaintext highlighter-rouge">NullPointerException</code>, because <code class="language-plaintext highlighter-rouge">getTaggingCount()</code> returns a null <code class="language-plaintext highlighter-rouge">Integer</code>. Echopraxia makes a best effort to point out bad ideas like this, but now there's an actual exception handler for when things go south. The default implementation writes <code class="language-plaintext highlighter-rouge">e.printStackTrace()</code> to STDERR and does not log the statement, but you can replace that implementation with something that writes to Sentry, or writes a "best effort" log statement out.</p>
<h2 id="housekeeping">Housekeeping</h2>
<p>Another housekeeping arrangement is moving the internals of <code class="language-plaintext highlighter-rouge">Logger</code> and logging support out of the <code class="language-plaintext highlighter-rouge">api</code> package into an <code class="language-plaintext highlighter-rouge">spi</code> package. There are two categories of user: the developer who puts together the loggers, the field builders, and the actual framework itself; and the end users who just want the log statements and maybe some custom conditions. The code needs to reflect that. My codified knowledge of SPI vs API is spotty, but the general rule I applied was "does the person writing a logging statement need to know about this" – things like <code class="language-plaintext highlighter-rouge">Condition</code>, <code class="language-plaintext highlighter-rouge">LoggingContext</code>, and <code class="language-plaintext highlighter-rouge">Level</code> all qualify as API. More internal things like <code class="language-plaintext highlighter-rouge">CoreLogger</code>, <code class="language-plaintext highlighter-rouge">Filter</code>, and <code class="language-plaintext highlighter-rouge">DefaultMethodsSupport</code> do not.</p>
<h2 id="compile-only-dependencies">Compile-Only Dependencies</h2>
<p>Finally, 3.0 removes the transitive dependencies on the Logback and Log4J 2 framework implementations, so they will need to be explicitly defined as dependencies at installation. This is mostly because SLF4J 2.0.x and SLF4J 1.7.x do not mix: if you have code that uses SLF4J 2.0.x, it will see Logback 1.2.x and pointedly ignore it with a warning message. This didn't use to matter, because logstash-logback-encoder used SLF4J 1.7.x, but as of 7.4, support for Logback 1.2 is dropped.</p>
<p>Echopraxia makes no guarantees of backwards or forwards compatibility between versions: it is best effort at this point. If it gets to the point where Logback 1.2 and 1.4 are simply not reconcilable, I'll create different adapters for them, so there'll be <code class="language-plaintext highlighter-rouge">logstash1_2</code> and <code class="language-plaintext highlighter-rouge">logstash1_4</code> backends. Likewise, while Log4J 2 doesn't depend on SLF4J 2, the Log4J vulnerability warnings and the general attention given to analyzing exactly which version of Log4J 2 your framework depends on mean that the safest thing to do is to not specify any version at all.</p>
<h2 id="summary">Summary</h2>
<p>I hope this gives a good overview of the design decisions and thinking that went into this release. I'm really happy with Echopraxia as a whole, and I keep being surprised at how much fun I have both writing it and finding new things I can do with it.</p>Bootstrapping Boxes Into Tailscale With 1Password2023-05-25T19:31:39-07:002023-05-25T19:31:39-07:00https://tersesystems.com/blog/2023/05/25/bootstrapping-boxes-into-tailscale-with-1password<p>This is a follow-on from <a href="https://tersesystems.com/blog/2023/05/03/disposable-cloud-environments-with-virtualbox-and-tailscale/">Disposable Cloud Environments With Vagrant and Tailscale</a>. The summary is that I've worked out how to get new boxes up and integrated with Tailscale with a small bootstrap Ansible playbook and some 1Password integration.</p>
<p>This is going to be short and direct, with the goal of showing how to repeat this and never have to think about how to manage secrets. Credit to <a href="https://github.com/kaushikchandrashekar/developer-vagrant/tree/master">kaushikchandrashekar/developer-vagrant</a> for shortcutting much of this process with a github project showing Vagrant leveraging Ansible Galaxy.</p>
<p>The source code is available at <a href="https://github.com/wsargent/vagrant-tailscale-example">https://github.com/wsargent/vagrant-tailscale-example</a>.</p>
<h2 id="the-problem">The Problem</h2>
<p>The previous post used Vagrant's inline script to set up Tailscale and everything else:</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="no">Vagrant</span><span class="p">.</span><span class="nf">configure</span><span class="p">(</span><span class="s2">"2"</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">config</span><span class="o">|</span>
<span class="n">config</span><span class="p">.</span><span class="nf">env</span><span class="p">.</span><span class="nf">enable</span>
<span class="c1"># vm parameters</span>
<span class="n">config</span><span class="p">.</span><span class="nf">vm</span><span class="p">.</span><span class="nf">provision</span> <span class="s2">"tailscale-install"</span><span class="p">,</span> <span class="ss">type: </span><span class="s2">"shell"</span> <span class="k">do</span> <span class="o">|</span><span class="n">s</span><span class="o">|</span>
<span class="n">s</span><span class="p">.</span><span class="nf">inline</span> <span class="o">=</span> <span class="s2">"curl -fsSL https://tailscale.com/install.sh | sh"</span>
<span class="k">end</span>
<span class="n">config</span><span class="p">.</span><span class="nf">vm</span><span class="p">.</span><span class="nf">provision</span> <span class="s2">"tailscale-up"</span><span class="p">,</span> <span class="ss">type: </span><span class="s2">"shell"</span> <span class="k">do</span> <span class="o">|</span><span class="n">s</span><span class="o">|</span>
<span class="n">s</span><span class="p">.</span><span class="nf">inline</span> <span class="o">=</span> <span class="s2">"tailscale up --ssh --operator=vagrant --authkey </span><span class="si">#{</span><span class="no">ENV</span><span class="p">[</span><span class="s1">'TAILSCALE_AUTHKEY'</span><span class="p">]</span><span class="si">}</span><span class="s2">"</span>
<span class="k">end</span>
<span class="c1"># ...yet more script...</span>
<span class="k">end</span>
</code></pre></div></div>
<p>Inline scripts in Vagrant aren't great. Every Vagrantfile is different, and the failure behavior is unpredictable. In addition, doing the work of <code class="language-plaintext highlighter-rouge">op run -- vagrant up</code> to integrate with <a href="https://developer.1password.com/docs/cli/secrets-environment-variables#export-environment-variables">1Password CLI</a> was awkward.</p>
<p>Let's take the opposite approach. Bootstrap Vagrant into a Tailscale host with as little manual work as possible, then provision using Ansible through Tailscale, bypassing Vagrant.</p>
<h2 id="vagrant-with-ansible">Vagrant with Ansible</h2>
<p>The first step was to use the <a href="https://developer.hashicorp.com/vagrant/docs/provisioning/ansible">Ansible Provisioner</a> in Vagrant, and put as few things into the Vagrant file as possible.</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="no">Vagrant</span><span class="p">.</span><span class="nf">configure</span><span class="p">(</span><span class="s2">"2"</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">config</span><span class="o">|</span>
<span class="n">config</span><span class="p">.</span><span class="nf">vm</span><span class="p">.</span><span class="nf">box</span> <span class="o">=</span> <span class="s2">"ubuntu/jammy64"</span>
<span class="n">config</span><span class="p">.</span><span class="nf">vm</span><span class="p">.</span><span class="nf">hostname</span> <span class="o">=</span> <span class="s2">"vagrant-docker"</span>
<span class="n">config</span><span class="p">.</span><span class="nf">vm</span><span class="p">.</span><span class="nf">provision</span> <span class="ss">:ansible</span> <span class="k">do</span> <span class="o">|</span><span class="n">ansible</span><span class="o">|</span>
<span class="n">ansible</span><span class="p">.</span><span class="nf">compatibility_mode</span> <span class="o">=</span> <span class="s2">"2.0"</span>
<span class="n">ansible</span><span class="p">.</span><span class="nf">playbook</span> <span class="o">=</span> <span class="s2">"playbook.yml"</span>
<span class="n">ansible</span><span class="p">.</span><span class="nf">galaxy_role_file</span> <span class="o">=</span> <span class="s2">"requirements.yml"</span>
<span class="n">ansible</span><span class="p">.</span><span class="nf">galaxy_roles_path</span> <span class="o">=</span> <span class="s2">"/etc/ansible/roles"</span>
<span class="n">ansible</span><span class="p">.</span><span class="nf">galaxy_command</span> <span class="o">=</span> <span class="s2">"sudo ansible-galaxy install --role-file=%{role_file} --roles-path=%{roles_path} --force"</span>
<span class="k">end</span>
<span class="n">config</span><span class="p">.</span><span class="nf">trigger</span><span class="p">.</span><span class="nf">before</span> <span class="ss">:destroy</span> <span class="k">do</span> <span class="o">|</span><span class="n">trigger</span><span class="o">|</span>
<span class="n">trigger</span><span class="p">.</span><span class="nf">run_remote</span> <span class="o">=</span> <span class="p">{</span><span class="ss">inline: </span><span class="s2">"tailscale logout"</span><span class="p">}</span>
<span class="n">trigger</span><span class="p">.</span><span class="nf">on_error</span> <span class="o">=</span> <span class="ss">:continue</span>
<span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<p>Python package management is slightly terrifying, so I opted for <code class="language-plaintext highlighter-rouge">apt</code> to install Ansible:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">sudo </span>add-apt-repository <span class="nt">--yes</span> <span class="nt">--update</span> ppa:ansible/ansible
<span class="nv">$ </span><span class="nb">sudo </span>apt install ansible
</code></pre></div></div>
<h2 id="using-ansible-galaxy">Using Ansible Galaxy</h2>
<p>There are two Ansible packages I needed to get Tailscale set up: <code class="language-plaintext highlighter-rouge">artis3n.tailscale</code> and <code class="language-plaintext highlighter-rouge">community.general.onepassword</code>.</p>
<p>These can be set up in <code class="language-plaintext highlighter-rouge">requirements.yml</code>:</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="na">roles</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">artis3n.tailscale</span>
<span class="na">collections</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">community.general</span>
</code></pre></div></div>
<p>It's easiest to install these pre-emptively rather than have them pop up in the middle of the install:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ansible-galaxy install artis3n.tailscale
$ ansible-galaxy collection install community.general
</code></pre></div></div>
<h2 id="running-the-playbook">Running the Playbook</h2>
<p>After the packages are installed, the only thing needed in the playbook is to set up the <code class="language-plaintext highlighter-rouge">tailscale</code> role and look up the authkey from 1Password using <a href="https://docs.ansible.com/ansible/latest/collections/community/general/onepassword_info_module.html#ansible-collections-community-general-onepassword-info-module"><code class="language-plaintext highlighter-rouge">community.general.onepassword</code></a>:</p>
<div class="language-yml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="pi">-</span> <span class="na">hosts</span><span class="pi">:</span> <span class="s">all</span>
  <span class="na">become</span><span class="pi">:</span> <span class="no">true</span>
  <span class="na">tasks</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">install-tailscale</span>
      <span class="na">import_role</span><span class="pi">:</span>
        <span class="na">name</span><span class="pi">:</span> <span class="s">artis3n.tailscale</span>
      <span class="na">vars</span><span class="pi">:</span>
        <span class="na">tailscale_authkey</span><span class="pi">:</span> <span class="s2">"</span><span class="s">{{</span><span class="nv"> </span><span class="s">lookup('community.general.onepassword',</span><span class="nv"> </span><span class="s">'vagrant-tailscale',</span><span class="nv"> </span><span class="s">field='credential',</span><span class="nv"> </span><span class="s">vault='will-connect-vault')</span><span class="nv"> </span><span class="s">}}"</span>
        <span class="na">tailscale_args</span><span class="pi">:</span> <span class="s2">"</span><span class="s">--ssh"</span>
</code></pre></div></div>
<p>I did have to go into Tailscale and change the <a href="https://tailscale.com/kb/1193/tailscale-ssh/#configure-tailscale-ssh-with-check-mode">SSH Check mode</a> from <code class="language-plaintext highlighter-rouge">action: check</code> to <code class="language-plaintext highlighter-rouge">action: accept</code> so it didn't keep asking me to click on URLs.</p>
<p>The only thing I need to do is make sure I'm signed into 1Password, and after that the host will pop up:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ eval $(op signin) # sign into 1p CLI
$ vagrant up
# ...tailscale status shows new host on tailnet...
</code></pre></div></div>
<h2 id="integrating-tailscale-with-ansible">Integrating Tailscale with Ansible</h2>
<p>From there, it's now a question of how to install software other than Tailscale on the box. Ansible can do it, but first Ansible has to know about it.</p>
<p>The first thing to do is set up dynamic inventory with Tailscale using the <a href="https://github.com/freeformz/ansible#tailscale-inventory-plugin">Tailscale Inventory Plugin</a>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ansible-galaxy collection install freeformz.ansible
</code></pre></div></div>
<p>And then the <a href="https://docs.ansible.com/ansible/latest/reference_appendices/config.html">configuration</a> in <code class="language-plaintext highlighter-rouge">ansible.cfg</code>:</p>
<div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">[inventory]</span>
<span class="py">enable_plugins</span> <span class="p">=</span> <span class="s">freeformz.ansible.tailscale</span>
<span class="nn">[defaults]</span>
<span class="py">inventory</span> <span class="p">=</span> <span class="s">$HOME/tailscale.yaml</span>
<span class="py">remote_user</span> <span class="p">=</span> <span class="s">vagrant</span>
<span class="py">host_key_checking</span> <span class="p">=</span> <span class="s">False</span>
<span class="nn">[ssh_connection]</span>
<span class="py">pipelining</span><span class="p">=</span><span class="s">true</span>
<span class="py">retries</span><span class="p">=</span><span class="s">10</span>
</code></pre></div></div>
<p>And now the Vagrant boxes can get software installed the same way that any other host would. For example, to install Docker:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ansible-playbook playbooks/docker.yml
</code></pre></div></div>
<p>Where <code class="language-plaintext highlighter-rouge">docker.yml</code> contains:</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Configure with Docker</span>
  <span class="na">hosts</span><span class="pi">:</span> <span class="s">vagrant-docker</span>
  <span class="na">become</span><span class="pi">:</span> <span class="no">true</span>
  <span class="na">tasks</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">apt-update</span>
      <span class="na">apt</span><span class="pi">:</span>
        <span class="na">update_cache</span><span class="pi">:</span> <span class="s">yes</span>
    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">install-docker</span>
      <span class="na">import_role</span><span class="pi">:</span>
        <span class="na">name</span><span class="pi">:</span> <span class="s">geerlingguy.docker</span>
      <span class="na">vars</span><span class="pi">:</span>
        <span class="na">docker_edition</span><span class="pi">:</span> <span class="s1">'</span><span class="s">ce'</span>
        <span class="na">docker_package</span><span class="pi">:</span> <span class="s2">"</span><span class="s">docker-"</span>
        <span class="na">docker_package_state</span><span class="pi">:</span> <span class="s">present</span>
        <span class="na">docker_install_compose</span><span class="pi">:</span> <span class="no">true</span>
        <span class="na">docker_users</span><span class="pi">:</span>
          <span class="pi">-</span> <span class="s">vagrant</span>
</code></pre></div></div>
<h2 id="next-steps">Next Steps</h2>
<p>I am aware of <a href="https://docs.ansible.com/ansible/latest/network/getting_started/first_inventory.html#protecting-sensitive-variables-with-ansible-vault">Ansible Vault</a> but want to use 1Password in part because the next step is to extend it to <a href="https://developer.1password.com/docs/connect/">1Password Connect</a> for use with <a href="https://github.com/1Password/onepassword-operator">Kubernetes</a> and <a href="https://github.com/1Password/terraform-provider-onepassword">Terraform</a>. With any luck, this should make working with API keys and tokens much easier – copy and paste them into 1Password and be done. And then, just maybe, never think about secrets management again.</p>
Disposable Cloud Environments With Vagrant and Tailscale2023-05-03T14:10:21-07:002023-05-03T14:10:21-07:00https://tersesystems.com/blog/2023/05/03/disposable-cloud-environments-with-virtualbox-and-tailscale<p>There's a lot in this blog post, so I'll summarize it first, and then tell you a horrible joke that got me a content warning from Slack (but it won't make sense unless you're a functional programming nerd).</p>
<ul>
<li><em>Goal #1</em>: I want to build out an ELK cluster and Do Science To it.</li>
<li><em>Goal #2</em>: I want to not start from scratch or figure out how to undo things when I screw up.</li>
<li><em>Goal #3</em>: I want to be able to keep working on it from different computers.</li>
<li><em>Goal #4</em>: I want to work through <a href="https://github.com/kelseyhightower/kubernetes-the-hard-way">Kubernetes the Hard Way</a> and set up a cloud environment.</li>
</ul>
<p>The solution to #1 is containerization. Run <a href="https://docs.docker.com/compose/compose-file/">Docker Compose</a> and set up a <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html#docker-compose-file">multi-node ElasticSearch cluster</a>.</p>
<p>The solution to #2 is virtualization. Create a virtual machine using <a href="https://www.vagrantup.com/">Vagrant</a>, install Docker on it, then do #1. Now I can take VM snapshots before config changes and rollback if I screwed up, and I can dispose of the boxes when I'm done.</p>
<p>The solution to #3 is to build out a homelab server (incredibly cheap at $391). Install Ubuntu, then #2.</p>
<p>The solution to #4 is to throw more memory at the homelab server, then install <a href="https://kind.sigs.k8s.io/">Kind</a> after #2. Because of #2, I can now also mess with Terraform state and get away with it.</p>
<p><em>Pause to build and install everything…</em></p>
<p><em>Problem</em>: I want to see Kibana from my laptop browser. <a href="https://docs.docker.com/compose/">Docker Compose</a> forwards everything to localhost, and then the VM also requires <a href="https://developer.hashicorp.com/vagrant/docs/networking">networking magic</a> to expose it to the host. I have a box containing a box, containing a box, and I don't want to have to port forward all the things.</p>
<p><em>Solution</em>: Install <a href="https://tailscale.com/">Tailscale</a> on the VM, exposing it as a host on the network (tailnet in Tailscale parlance).</p>
<p><em>Problem</em>: Kubernetes is an orchestration layer, so now there are many boxes and port forwarding is impossible.</p>
<p><em>Solution</em>: Set up Tailscale as a subnet router inside the VM, using <a href="https://raesene.github.io/blog/2022/06/11/escaping-the-nested-doll-with-tailscale/">Escaping the Nested Doll with Tailscale</a> as a guide. Now I have infinite hosts on the network, and if I want a different configuration I can roll back to a base k8s state, or even set them up side by side.</p>
<p>Now I'm going to tell you the horrible joke.</p>
<p>Imagine that we're defining processes as code, so a physical server is a container for processes:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// http://devserver:80 from nginx -p 80 (the trivial case)
</span><span class="k">val</span> <span class="n">devServer</span><span class="k">:</span> <span class="kt">Server</span> <span class="o">=</span> <span class="nc">Server</span><span class="o">(</span><span class="nc">Process</span><span class="o">(</span><span class="s">"nginx"</span><span class="o">,</span> <span class="mi">80</span><span class="o">))</span>
</code></pre></div></div>
<p>We can describe <code class="language-plaintext highlighter-rouge">Docker</code> and <code class="language-plaintext highlighter-rouge">VirtualMachine</code> the same way:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// docker run -p 80:80 nginx
</span><span class="k">val</span> <span class="n">docker</span><span class="k">:</span> <span class="kt">Docker</span> <span class="o">=</span> <span class="nc">Docker</span><span class="o">(</span><span class="nc">Process</span><span class="o">(</span><span class="s">"nginx"</span><span class="o">,</span> <span class="mi">80</span><span class="o">),</span> <span class="nc">PortForward</span><span class="o">(</span><span class="n">guest</span> <span class="k">=</span> <span class="mi">80</span><span class="o">,</span> <span class="n">host</span> <span class="k">=</span> <span class="mi">80</span><span class="o">))</span>
<span class="c1">// config.vm.provision "shell", inline: "nginx -p 80"
// config.vm.network "nginx-port", guest: 80, host: 80
</span><span class="k">val</span> <span class="n">standardVM</span><span class="k">:</span> <span class="kt">VirtualMachine</span> <span class="o">=</span>
<span class="nc">VirtualMachine</span><span class="o">(</span><span class="nc">Process</span><span class="o">(</span><span class="s">"nginx"</span><span class="o">,</span> <span class="mi">80</span><span class="o">),</span> <span class="nc">PortForward</span><span class="o">(</span><span class="n">guest</span> <span class="k">=</span> <span class="mi">80</span><span class="o">,</span> <span class="n">host</span> <span class="k">=</span> <span class="mi">80</span><span class="o">))</span>
</code></pre></div></div>
<p>And we can build this up by putting Docker inside of a VM:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// config.vm.provision "shell", inline: "docker run -p 80:80 nginx"
// config.vm.network "nginx-port", guest: 80, host: 80
</span><span class="k">val</span> <span class="n">vagrantDocker</span><span class="k">:</span> <span class="kt">VirtualMachine</span> <span class="o">=</span>
  <span class="nc">VirtualMachine</span><span class="o">(</span><span class="n">docker</span><span class="o">,</span> <span class="nc">PortForward</span><span class="o">(</span><span class="n">guest</span> <span class="k">=</span> <span class="mi">80</span><span class="o">,</span> <span class="n">host</span> <span class="k">=</span> <span class="mi">80</span><span class="o">))</span>
</code></pre></div></div>
<p>And see how Kubernetes is a bit different:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// podIP: 10.244.0.6:80, 10.244.0.7:80
</span><span class="k">val</span> <span class="n">kubernetes</span><span class="k">:</span> <span class="kt">Kubernetes</span> <span class="o">=</span> <span class="nc">Kubernetes</span><span class="o">(</span>
<span class="nc">Set</span><span class="o">(</span>
<span class="nc">Pod</span><span class="o">(</span><span class="nc">Process</span><span class="o">(</span><span class="s">"nginx"</span><span class="o">,</span> <span class="mi">80</span><span class="o">)),</span>
<span class="nc">Pod</span><span class="o">(</span><span class="nc">Process</span><span class="o">(</span><span class="s">"nginx"</span><span class="o">,</span> <span class="mi">80</span><span class="o">))</span>
<span class="o">)</span>
<span class="o">)</span>
<span class="c1">// Port mapping breaks down when we have multiple pods on a single VM :-(
</span><span class="k">val</span> <span class="n">vagrantKubernetes</span><span class="k">:</span> <span class="kt">VirtualMachine</span> <span class="o">=</span>
<span class="nc">VirtualMachine</span><span class="o">(</span><span class="n">kubernetes</span><span class="o">,</span> <span class="o">???)</span>
</code></pre></div></div>
<p>From this, we can infer that <code class="language-plaintext highlighter-rouge">Server</code>, <code class="language-plaintext highlighter-rouge">Docker</code> and <code class="language-plaintext highlighter-rouge">Pod</code> are all <code class="language-plaintext highlighter-rouge">Container</code> types:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">trait</span> <span class="nc">Container</span><span class="o">[</span><span class="kt">T</span><span class="o">]</span>
<span class="nc">trait</span> <span class="nc">Server</span> <span class="k">extends</span> <span class="nc">Container</span><span class="o">[</span><span class="kt">Set</span><span class="o">[</span><span class="kt">Process</span><span class="o">]]</span>
<span class="k">trait</span> <span class="nc">Docker</span> <span class="k">extends</span> <span class="nc">Container</span><span class="o">[</span><span class="kt">Set</span><span class="o">[</span><span class="kt">Process</span><span class="o">]]</span> <span class="k">with</span> <span class="nc">Process</span>
<span class="k">trait</span> <span class="nc">Pod</span> <span class="k">extends</span> <span class="nc">Container</span><span class="o">[</span><span class="kt">Set</span><span class="o">[</span><span class="kt">Process</span><span class="o">]]</span> <span class="k">with</span> <span class="nc">Process</span>
</code></pre></div></div>
<p>And that <code class="language-plaintext highlighter-rouge">VirtualMachine</code> and <code class="language-plaintext highlighter-rouge">Kubernetes</code> are also instances of <code class="language-plaintext highlighter-rouge">Container</code>:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">trait</span> <span class="nc">VirtualMachine</span> <span class="k">extends</span> <span class="nc">Container</span><span class="o">[</span><span class="kt">Server</span><span class="o">]</span>
<span class="k">trait</span> <span class="nc">Kubernetes</span> <span class="k">extends</span> <span class="nc">Container</span><span class="o">[</span><span class="kt">Set</span><span class="o">[</span><span class="kt">Pod</span><span class="o">]]</span> <span class="k">with</span> <span class="nc">Process</span>
</code></pre></div></div>
<p>And Tailscale creates a <code class="language-plaintext highlighter-rouge">Server</code> from <code class="language-plaintext highlighter-rouge">VirtualMachine</code>:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">val</span> <span class="n">vagrantDocker</span> <span class="k">=</span> <span class="nc">VirtualMachine</span><span class="o">(</span><span class="n">docker</span><span class="o">,</span> <span class="n">portMapping</span><span class="o">))</span>
<span class="c1">// http://vagrant-docker:80 on the tailnet
</span><span class="k">val</span> <span class="n">exposedNginxHost</span><span class="k">:</span> <span class="kt">Server</span> <span class="o">=</span> <span class="nc">Tailscale</span><span class="o">(</span><span class="n">vagrantDocker</span><span class="o">)</span>
</code></pre></div></div>
<p>But if <code class="language-plaintext highlighter-rouge">VirtualMachine</code> is a <code class="language-plaintext highlighter-rouge">Container[Container[Set[Process]]]</code> and <code class="language-plaintext highlighter-rouge">Server</code> is a <code class="language-plaintext highlighter-rouge">Container[Set[Process]]</code>:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">val</span> <span class="nc">Tailscale</span><span class="k">:</span> <span class="kt">Container</span><span class="o">[</span><span class="kt">Container</span><span class="o">[</span><span class="kt">Set</span><span class="o">[</span><span class="kt">Process</span><span class="o">]]]</span> <span class="k">=></span> <span class="nc">Container</span><span class="o">[</span><span class="kt">Set</span><span class="o">[</span><span class="kt">Process</span><span class="o">]]</span>
</code></pre></div></div>
<p><strong>Tailscale is <code class="language-plaintext highlighter-rouge">flatMap</code> for the Containerization monad.</strong></p>
<p>Still here? Let's dig into the details and I'll show all the setup steps.</p>
<h2 id="putting-the-machine-together">Putting The Machine Together</h2>
<p>Developing on my laptop has issues.</p>
<p>I installed <a href="https://elementary.io/">elementary 5</a> on my laptop a while ago. It's based on Ubuntu 18.04, and I've started running into "version 'GLIBC_2.28' not found" errors more and more as it gets further behind. There's no way to upgrade Elementary 5 – the upgrade path is to <a href="https://elementaryos.stackexchange.com/questions/28297/how-to-upgrade-from-elementary-os-5-to-6">reinstall the operating system from scratch</a>. And… well, it has all kinds of cruft on it from various docker/k8s/cluster management tools. It works fine as a laptop, but as a development environment it's not great. And trying to use Windows with WSL2 was even worse.</p>
<p>The easiest thing to do – obviously – is to take a week off work, put together a cheap headless machine as a homelab server, stick it in the basement, move everything to that box, and then connect the laptop remotely.</p>
<p>I went with the <a href="https://arstechnica.com/gadgets/2023/04/ars-technica-system-guide-four-pc-builds-for-spring-2023/">Ars Technica System Guide</a> base specs, with a couple of changes: I added 64GB of memory, and I picked out an AMD Ryzen 5 5600X instead of the 5600G. (This was a mistake – the 5600X doesn't have an integrated GPU, leading to a frantic moment trying to figure out why the BIOS wouldn't come up on the HDMI port.) After <a href="https://www.gigabyte.com/Motherboard/B450M-DS3H-WIFI-rev-10-11-12-13/support#support-dl-bios">upgrading the BIOS</a>, staring at the <a href="https://www.gigabyte.com/Motherboard/B450M-DS3H-WIFI-rev-10-11-12-13/support#support-manual">manual for pins</a>, and enabling <a href="https://virtualbill.wordpress.com/2020/06/23/enabling-amd-ryzen-virtualization-functions/">virtualization by turning on SVM Mode</a>, it was finally ready for a minimal Ubuntu install, using <a href="https://learn.microsoft.com/en-us/azure/virtual-machines/linux/use-remote-desktop?tabs=azure-cli#install-and-configure-a-remote-desktop-server">xrdp</a> and <a href="https://remmina.org/how-to-install-remmina/#snap">Remmina</a> to connect remotely.</p>
<p>I named it <code class="language-plaintext highlighter-rouge">devserver</code>.</p>
<h2 id="using-tailscale-for-server-in-the-basement">Using Tailscale for "Server in the Basement"</h2>
<p>The first thing to do was to install <a href="https://tailscale.com/download/">Tailscale</a> on absolutely everything and enable every single feature, especially <a href="https://tailscale.com/kb/1081/magicdns/">DNS</a>.</p>
<p>Tailscale is good at the core use case, but does have some client-side issues. For WSL, it won't recognize the tailscale client in the main Windows app, so you have to run <code class="language-plaintext highlighter-rouge">tailscaled</code> explicitly and distinguish it from the Windows host:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>nohup tailscaled &
<span class="nb">sudo </span>tailscale up <span class="nt">--hostname</span> windows-wsl
</code></pre></div></div>
<p>With the laptop, I also sometimes had to do <code class="language-plaintext highlighter-rouge">tailscale down/up</code> or <code class="language-plaintext highlighter-rouge">--reset</code> in order to get the mappings to resolve correctly.</p>
<p>There are a couple of things to be aware of when setting up Tailscale for a server. The first one is <a href="https://tailscale.com/kb/1028/key-expiry/">disabling key expiry</a> for the server, since it's going to be hanging around for a while. The second is that Tailscale provides its own <a href="https://tailscale.com/kb/1193/tailscale-ssh/">SSH</a>, which requires its own parameters:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>tailscale up <span class="nt">--ssh</span> <span class="nt">--operator</span><span class="o">=</span><span class="nv">$USER</span>
</code></pre></div></div>
<p>Once SSH was up, it was time to futz with configuration files. I like to use Visual Studio Code with <a href="https://code.visualstudio.com/docs/remote/ssh">SSH remote development</a>, which comes with a secret command line tool for connecting to any host on the tailnet:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>code <span class="nt">--folder-uri</span> <span class="s2">"vscode-remote://ssh-remote+devserver/home/wsargent/"</span>
</code></pre></div></div>
<p>There are also some utilities that Tailscale provides for ad-hoc port forwarding. For example, I can run <code class="language-plaintext highlighter-rouge">jekyll serve</code> on <code class="language-plaintext highlighter-rouge">devserver</code> and it will start a server on port 4000 – I can see how that looks on my phone by using <code class="language-plaintext highlighter-rouge">tailscale serve</code>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>tailscale serve tcp:4000 tcp://localhost:4000
</code></pre></div></div>
<p>And then I can go to <code class="language-plaintext highlighter-rouge">http://devserver:4000</code> on my phone and see how the blog post looks from there. The <a href="https://login.tailscale.com/admin/services">services page</a> on Tailscale shows a list of ports open on all the machines, so it's easy to see what services are active and how to get to them.</p>
<h2 id="adding-tailscale-to-vagrant">Adding Tailscale to Vagrant</h2>
<p>Adding Tailscale to Vagrant is straightforward. Generate an <a href="https://tailscale.com/kb/1085/auth-keys/">authentication key</a>, make it reusable, and save it into <a href="https://developer.1password.com/">1Password</a> for provisioning. 1Password has a <a href="https://developer.1password.com/docs/cli/">CLI</a> that's very useful in <a href="https://developer.1password.com/docs/cli/secret-references">managing secrets</a> – this blog post is already too long, but there's an <a href="https://github.com/1Password/solutions">example repository</a> that shows how to provision secrets.</p>
<p>I started off with <a href="https://www.virtualbox.org/">Virtualbox</a>, but have been experimenting with <a href="https://ubuntu.com/server/docs/virtualization-libvirt">libvirt</a>. To use libvirt, add the <a href="https://vagrant-libvirt.github.io/vagrant-libvirt/#installation">vagrant-libvirt</a> plugin:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt install libvert-dev
<span class="nb">sudo </span>apt-get purge vagrant-libvirt
<span class="nb">sudo </span>apt-mark hold vagrant-libvirt
<span class="nb">sudo </span>apt-get update
<span class="nb">sudo </span>apt-get install <span class="nt">-y</span> qemu libvirt-daemon-system ebtables libguestfs-tools vagrant ruby-fog-libvirt
</code></pre></div></div>
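<p>With the plugin's dependencies in place, the provider can be selected per invocation (or via the <code class="language-plaintext highlighter-rouge">VAGRANT_DEFAULT_PROVIDER</code> environment variable):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># pick libvirt explicitly for this run
vagrant up --provider=libvirt
</code></pre></div></div>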
<p>Using <a href="https://github.com/gosuri/vagrant-env">vagrant-env</a> plugin, you can then set up Tailscale on startup and shutdown:</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="no">Vagrant</span><span class="p">.</span><span class="nf">configure</span><span class="p">(</span><span class="s2">"2"</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">config</span><span class="o">|</span>
<span class="n">config</span><span class="p">.</span><span class="nf">env</span><span class="p">.</span><span class="nf">enable</span>
<span class="n">config</span><span class="p">.</span><span class="nf">vm</span><span class="p">.</span><span class="nf">box</span> <span class="o">=</span> <span class="no">ENV</span><span class="p">[</span><span class="s1">'VM_BOX'</span><span class="p">]</span>
<span class="n">config</span><span class="p">.</span><span class="nf">vm</span><span class="p">.</span><span class="nf">hostname</span> <span class="o">=</span> <span class="no">ENV</span><span class="p">[</span><span class="s1">'VM_HOSTNAME'</span><span class="p">]</span>
<span class="n">config</span><span class="p">.</span><span class="nf">vm</span><span class="p">.</span><span class="nf">provider</span> <span class="no">ENV</span><span class="p">[</span><span class="s1">'VM_ENGINE'</span><span class="p">]</span> <span class="k">do</span> <span class="o">|</span><span class="n">v</span><span class="o">|</span>
<span class="n">v</span><span class="p">.</span><span class="nf">name</span> <span class="o">=</span> <span class="no">ENV</span><span class="p">[</span><span class="s1">'VM_HOSTNAME'</span><span class="p">]</span>
<span class="n">v</span><span class="p">.</span><span class="nf">memory</span> <span class="o">=</span> <span class="no">ENV</span><span class="p">[</span><span class="s1">'VM_MEMORY'</span><span class="p">]</span>
<span class="n">v</span><span class="p">.</span><span class="nf">cpus</span> <span class="o">=</span> <span class="no">ENV</span><span class="p">[</span><span class="s1">'VM_CPUS'</span><span class="p">]</span>
<span class="k">end</span>
<span class="n">config</span><span class="p">.</span><span class="nf">vm</span><span class="p">.</span><span class="nf">provision</span> <span class="s2">"tailscale-install"</span><span class="p">,</span> <span class="ss">type: </span><span class="s2">"shell"</span> <span class="k">do</span> <span class="o">|</span><span class="n">s</span><span class="o">|</span>
<span class="n">s</span><span class="p">.</span><span class="nf">inline</span> <span class="o">=</span> <span class="s2">"curl -fsSL https://tailscale.com/install.sh | sh"</span>
<span class="k">end</span>
<span class="n">config</span><span class="p">.</span><span class="nf">vm</span><span class="p">.</span><span class="nf">provision</span> <span class="s2">"tailscale-up"</span><span class="p">,</span> <span class="ss">type: </span><span class="s2">"shell"</span> <span class="k">do</span> <span class="o">|</span><span class="n">s</span><span class="o">|</span>
<span class="n">s</span><span class="p">.</span><span class="nf">inline</span> <span class="o">=</span> <span class="s2">"tailscale up --ssh --operator=vagrant --authkey </span><span class="si">#{</span><span class="no">ENV</span><span class="p">[</span><span class="s1">'TAILSCALE_AUTHKEY'</span><span class="p">]</span><span class="si">}</span><span class="s2">"</span>
<span class="k">end</span>
<span class="n">config</span><span class="p">.</span><span class="nf">trigger</span><span class="p">.</span><span class="nf">before</span> <span class="ss">:destroy</span> <span class="k">do</span> <span class="o">|</span><span class="n">trigger</span><span class="o">|</span>
<span class="n">trigger</span><span class="p">.</span><span class="nf">run_remote</span> <span class="o">=</span> <span class="p">{</span><span class="ss">inline: </span><span class="s2">"tailscale logout"</span><span class="p">}</span>
<span class="n">trigger</span><span class="p">.</span><span class="nf">on_error</span> <span class="o">=</span> <span class="ss">:continue</span>
<span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<p>To start it, I run <code class="language-plaintext highlighter-rouge">vagrant up</code>. To stop it, I run <code class="language-plaintext highlighter-rouge">vagrant halt</code>. When I'm done experimenting with the environment, I destroy it with <code class="language-plaintext highlighter-rouge">vagrant destroy</code>, and it removes itself from the tailnet automatically.</p>
<h2 id="adding-docker-to-vagrant">Adding Docker to Vagrant</h2>
<p>Now we have to add Docker to Vagrant:</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="no">Vagrant</span><span class="p">.</span><span class="nf">configure</span><span class="p">(</span><span class="s2">"2"</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">config</span><span class="o">|</span>
<span class="c1"># ...</span>
<span class="n">config</span><span class="p">.</span><span class="nf">vm</span><span class="p">.</span><span class="nf">provision</span> <span class="s2">"docker-install"</span><span class="p">,</span> <span class="ss">type: </span><span class="s2">"shell"</span> <span class="k">do</span> <span class="o">|</span><span class="n">s</span><span class="o">|</span>
<span class="n">s</span><span class="p">.</span><span class="nf">inline</span> <span class="o">=</span> <span class="o"><<-</span><span class="no">SCRIPT</span><span class="sh">
curl -fsSL https://get.docker.com -o get-docker.sh &&
sudo sh get-docker.sh &&
sudo adduser vagrant docker
</span><span class="no">SCRIPT</span>
<span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<p>Now we can have some fun – we'll run two docker compose instances side by side without port conflicts. Check out <a href="https://github.com/docker/awesome-compose">awesome-compose</a> so it shows up on the <code class="language-plaintext highlighter-rouge">/vagrant</code> mount:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>vagrant ssh
<span class="nv">$ </span><span class="nb">cd</span> /vagrant/awesome-compose/nginx-golang
<span class="nv">$ </span>docker compose up
</code></pre></div></div>
<p>And then again, only this time we have <code class="language-plaintext highlighter-rouge">vagrant-nginx-nodejs-redis</code> as the hostname:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>vagrant ssh
<span class="nv">$ </span><span class="nb">cd</span> /vagrant/awesome-compose/nginx-nodejs-redis
<span class="nv">$ </span>docker compose up
</code></pre></div></div>
<p>Now we've got two nginx instances, both running on port 80 – but they just appear as different hosts.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>curl vagrant-nginx-nodejs-redis
<span class="nv">$ </span>web1: Number of visits is: 1
</code></pre></div></div>
<p>and</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>curl vagrant-nginx-golang
<span class="c">## .</span>
<span class="c">## ## ## ==</span>
<span class="c">## ## ## ## ## ===</span>
/<span class="s2">"""""""""""""""""</span><span class="se">\_</span><span class="s2">__/ ===
{ / ===-
</span><span class="se">\_</span><span class="s2">_____ O __/
</span><span class="se">\ </span><span class="s2"> </span><span class="se">\ </span><span class="s2"> __/
</span><span class="se">\_</span><span class="s2">___</span><span class="se">\_</span><span class="s2">______/
Hello from Docker!
</span></code></pre></div></div>
<p>I also have a <code class="language-plaintext highlighter-rouge">vagrant-docker</code> box that I use for ad-hoc installations. From the laptop, I can install only the docker CLI and set DOCKER_HOST to point at the box:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt install docker-ce-cli
<span class="nb">export </span><span class="nv">DOCKER_HOST</span><span class="o">=</span>ssh://vagrant@vagrant-docker
ssh vagrant@vagrant-docker <span class="c"># or tailscale ssh vagrant@vagrant-docker</span>
docker ps <span class="c"># will work after ssh succeeded!</span>
</code></pre></div></div>
<p>And now I can run various services directly and hit them at <code class="language-plaintext highlighter-rouge">http://vagrant-docker:3000</code>.</p>
<h2 id="disposable-cloud-environments">Disposable Cloud Environments</h2>
<p>The limitation of using Docker Compose is that you're still referencing the Vagrant box, and picking out a service by port. If you have a more complex environment, you'll probably have several databases, a key/value store, several microservices and so on. Really, you'd like to be able to spin up Kubernetes inside a Vagrant Box and access all the pods through Tailscale automatically.</p>
<p>Setting up Kubernetes itself is surprisingly simple in a Vagrantfile. For example, setting up <a href="https://kind.sigs.k8s.io/">Kind</a> is as simple as a <code class="language-plaintext highlighter-rouge">vagrant up</code>:</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="no">Vagrant</span><span class="p">.</span><span class="nf">configure</span><span class="p">(</span><span class="s2">"2"</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">config</span><span class="o">|</span>
<span class="c1"># ...install Docker</span>
<span class="n">config</span><span class="p">.</span><span class="nf">vm</span><span class="p">.</span><span class="nf">provision</span> <span class="s2">"kind"</span><span class="p">,</span> <span class="ss">type: </span><span class="s2">"shell"</span> <span class="k">do</span> <span class="o">|</span><span class="n">s</span><span class="o">|</span>
<span class="n">s</span><span class="p">.</span><span class="nf">inline</span> <span class="o">=</span> <span class="o"><<-</span><span class="no">SCRIPT</span><span class="sh">
curl -Lo ./kind "https://kind.sigs.k8s.io/dl/v0.18.0/kind-$(uname)-amd64"
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind
</span><span class="no">SCRIPT</span>
<span class="k">end</span>
<span class="n">config</span><span class="p">.</span><span class="nf">vm</span><span class="p">.</span><span class="nf">provision</span> <span class="s2">"kubectl"</span><span class="p">,</span> <span class="ss">type: </span><span class="s2">"shell"</span> <span class="k">do</span> <span class="o">|</span><span class="n">s</span><span class="o">|</span>
<span class="n">s</span><span class="p">.</span><span class="nf">inline</span> <span class="o">=</span> <span class="s2">"sudo snap install kubectl --classic"</span>
<span class="k">end</span>
<span class="n">config</span><span class="p">.</span><span class="nf">vm</span><span class="p">.</span><span class="nf">provision</span> <span class="s2">"kubectl-completion"</span><span class="p">,</span> <span class="ss">type: </span><span class="s2">"shell"</span> <span class="k">do</span> <span class="o">|</span><span class="n">s</span><span class="o">|</span>
<span class="n">s</span><span class="p">.</span><span class="nf">inline</span> <span class="o">=</span> <span class="s1">'echo "source <(kubectl completion bash)" >> ~/.bashrc'</span>
<span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<p>Setting up Tailscale in Kubernetes is… not so simple. This took some trial and error, and I leaned heavily on <a href="https://raesene.github.io/blog/2022/06/11/escaping-the-nested-doll-with-tailscale/">Escaping the Nested Doll with Tailscale </a> when going through this.</p>
<p>I like to start fresh so I immediately wipe the cluster to clean everything:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>kind delete cluster
<span class="nv">$ </span>kind create cluster
</code></pre></div></div>
<p>Tailscale's <a href="https://tailscale.com/kb/1185/kubernetes/#subnet-router">Kubernetes subnet routing</a> section is a bit confusing and out of date, so I used the <a href="https://github.com/tailscale/tailscale/blob/main/docs/k8s/README.md">README.md</a> in <a href="https://github.com/tailscale/tailscale/tree/main/docs/k8s">https://github.com/tailscale/tailscale/tree/main/docs/k8s</a>, which is slightly different.</p>
<p>First we need to <a href="https://github.com/tailscale/tailscale/blob/main/docs/k8s/README.md#setup">set up</a> with an auth key and write it as a k8s <a href="https://kubernetes.io/docs/concepts/configuration/secret/#service-account-token-secrets">service account token secret</a> and <a href="https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#manually-create-a-long-lived-api-token-for-a-serviceaccount">pass it through</a>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>kubectl apply <span class="nt">-f</span> - <span class="o"><<</span><span class="no">EOF</span><span class="sh"> apiVersion: v1
kind: Secret
metadata:
name: tailscale-auth
stringData:
TS_AUTHKEY: <your-auth-key>
</span><span class="no">EOF
</span></code></pre></div></div>
<p>Then we check out the GitHub project and cd into the directory with the Makefile:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>git clone https://github.com/tailscale/tailscale
<span class="nv">$ </span><span class="nb">cd </span>tailscale/docs/k8s
</code></pre></div></div>
<p>And execute <code class="language-plaintext highlighter-rouge">make rbac</code> (installing <code class="language-plaintext highlighter-rouge">make</code> first if the box doesn't have it):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">sudo </span>apt install make
<span class="nv">$ </span><span class="nb">export </span><span class="nv">SA_NAME</span><span class="o">=</span>tailscale
<span class="nv">$ </span><span class="nb">export </span><span class="nv">TS_KUBE_SECRET</span><span class="o">=</span>tailscale-auth
<span class="nv">$ </span>make rbac
</code></pre></div></div>
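<p>If that applied cleanly, the RBAC objects should be visible – a sketch, assuming the <code class="language-plaintext highlighter-rouge">tailscale</code> service account name exported above:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ kubectl get serviceaccount,role,rolebinding | grep tailscale
</code></pre></div></div>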
<p>Next, we want to set up a <a href="https://github.com/tailscale/tailscale/blob/main/docs/k8s/README.md#subnet-router">subnet router</a>. We need the pod and service IP ranges to advertise. We can set up an nginx instance from <a href="https://kubernetes.io/docs/tasks/run-application/run-stateless-application-deployment/">the k8s docs</a>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>kubectl apply <span class="nt">-f</span> https://k8s.io/examples/application/deployment.yaml
</code></pre></div></div>
<p>From there, we can see the IP address of the pods:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>kubectl get pods <span class="nt">-o</span> wide
NAME                                READY   STATUS    RESTARTS   AGE     IP           NODE                 NOMINATED NODE   READINESS GATES
nginx-deployment-85996f8dbd-6clrn   1/1     Running   0          5h34m   10.244.0.6   kind-control-plane   <none>           <none>
nginx-deployment-85996f8dbd-jwk4q   1/1     Running   0          5h34m   10.244.0.5   kind-control-plane   <none>           <none>
</code></pre></div></div>
<p>and the service:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>kubectl get svc
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP   5h35m
</code></pre></div></div>
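<p>These ranges vary by cluster, so rather than hardcoding them you can fish the actual values out of the apiserver and controller-manager flags – a sketch that assumes a kubeadm-style cluster like kind:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ kubectl cluster-info dump | grep -m 1 service-cluster-ip-range
$ kubectl cluster-info dump | grep -m 1 cluster-cidr
</code></pre></div></div>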
<p>Then we set the <code class="language-plaintext highlighter-rouge">TS_ROUTES</code> and call <code class="language-plaintext highlighter-rouge">make subnet-router</code>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ SERVICE_CIDR</span><span class="o">=</span>10.96.0.0/16
<span class="nv">$ POD_CIDR</span><span class="o">=</span>10.244.0.0/15
<span class="nv">$ </span><span class="nb">export </span><span class="nv">TS_ROUTES</span><span class="o">=</span><span class="nv">$SERVICE_CIDR</span>,<span class="nv">$POD_CIDR</span>
<span class="nv">$ </span>make subnet-router
pod/subnet-router created
</code></pre></div></div>
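<p>Before approving anything, it's worth checking that the pod came up and authenticated – the logs should show it joining the tailnet:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ kubectl get pod subnet-router
$ kubectl logs subnet-router
</code></pre></div></div>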
<p>And finally I can see the subnet router defined in Tailscale:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ tailscale ping subnet-router
pong from subnet-router (100.96.237.125) via DERP(sfo) in 26ms
</code></pre></div></div>
<p>Next, we need to go to the <a href="https://login.tailscale.com/admin/machines">machines page</a>, where the subnet-router machine will have a little alert next to it saying "Unapproved subnet routes!" Go to the "Edit routes settings" menu option and click Approve All.</p>
<p>And now the pods are accessible via IP address through Tailscale and I can see them from my Windows machine and my iPhone:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>C:\Users\wsargent>curl 10.244.0.6
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
body {
width: 35em;
margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif;
}
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>
<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>
<p><em>Thank you for using nginx.</em></p>
</body>
</html>
</code></pre></div></div>
<p>Note that the pods do not show up in <code class="language-plaintext highlighter-rouge">tailscale status</code>, and they are not accessible by pod name – only IP address. There is a way to hook Tailscale into the k8s DNS, but I haven't dug into it. I'll abstract this into a Vagrantfile eventually, but this is a good place to stop.</p>
<h2 id="further-work">Further Work</h2>
<p>This approach works out of the box, but could use some optimization.</p>
<p>I need to set up a Docker registry to cache images, and an apt cache so that initializing Vagrant boxes doesn't go over the network. It might make sense to have them run on devserver specifically so they don't have to rely on specific vagrant instances being up (and I can't wipe them out by accident).</p>
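<p>The registry half is a one-liner with the stock <code class="language-plaintext highlighter-rouge">registry:2</code> image running as a pull-through cache – a sketch, with the port and container name picked arbitrarily:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ docker run -d --name registry-mirror -p 5000:5000 \
    -e REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io \
    registry:2
</code></pre></div></div>
<p>Each box's Docker daemon would then point at it with <code class="language-plaintext highlighter-rouge">"registry-mirrors": ["http://devserver:5000"]</code> in <code class="language-plaintext highlighter-rouge">/etc/docker/daemon.json</code>.</p>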
<p>I could also make a Vagrant basebox with Tailscale and Docker and remove that bit from initialization.</p>
Dynamic Logging With Ammonite and JBang2023-03-12T17:29:33-07:002023-03-12T17:29:33-07:00https://tersesystems.com/blog/2023/03/12/dynamic-logging-with-ammonite-script<p>It was fun putting together a <a href="https://tersesystems.com/blog/2022/04/09/dynamic-debug-logging-and-echopraxia-improvements/">proof of concept</a>, so I've been trying to see what the smallest possible dynamic logging demo can be.</p>
<p><a href="https://ammonite.io/">Ammonite</a> and <a href="https://www.jbang.dev/documentation/guide/latest/index.html">JBang</a> are tools that internalize the work of dependency management and build options so that you can run a script without going through a build tool. This is really useful if you want to just download and play with something, because you can just copy and paste text into a file and be done.</p>
<p>Here's the <a href="https://github.com/tersesystems/smallest-dynamic-logging-example">github repo</a>.</p>
<h2 id="ammonite">Ammonite</h2>
<p>First, the <a href="https://ammonite.io/">Ammonite</a> script; you can run it with <code class="language-plaintext highlighter-rouge">amm script.sc</code>:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">import</span> <span class="nn">$ivy.</span><span class="o">{</span>
<span class="n">`com.tersesystems.echopraxia.plusscala::logger:1.1.2`</span><span class="o">,</span>
<span class="n">`com.tersesystems.echopraxia:scripting:2.2.4`</span><span class="o">,</span>
<span class="n">`com.tersesystems.echopraxia:logstash:2.2.4`</span><span class="o">,</span>
<span class="n">`com.tersesystems.logback:logback-classic:1.2.0`</span><span class="o">,</span>
<span class="n">`com.lihaoyi::os-lib:0.9.1`</span>
<span class="o">}</span>
<span class="k">import</span> <span class="nn">com.tersesystems.echopraxia.plusscala._</span>
<span class="k">import</span> <span class="nn">com.tersesystems.echopraxia.plusscala.api._</span>
<span class="k">import</span> <span class="nn">com.tersesystems.echopraxia.scripting._</span>
<span class="k">import</span> <span class="nn">com.tersesystems.logback.classic.ChangeLogLevel</span>
<span class="k">case</span> <span class="k">class</span> <span class="nc">ScriptService</span><span class="o">(</span><span class="n">dir</span><span class="k">:</span> <span class="kt">os.Path</span><span class="o">)</span> <span class="o">{</span>
<span class="k">private</span> <span class="k">val</span> <span class="n">sws</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">ScriptWatchService</span><span class="o">(</span><span class="n">dir</span><span class="o">.</span><span class="n">toNIO</span><span class="o">);</span>
<span class="k">def</span> <span class="n">condition</span><span class="o">(</span><span class="n">path</span><span class="k">:</span> <span class="kt">os.Path</span><span class="o">)</span> <span class="k">=</span> <span class="o">{</span>
<span class="k">val</span> <span class="n">scriptHandle</span> <span class="k">=</span> <span class="n">sws</span><span class="o">.</span><span class="n">watchScript</span><span class="o">(</span><span class="n">path</span><span class="o">.</span><span class="n">toNIO</span><span class="o">,</span> <span class="k">_</span><span class="o">.</span><span class="n">printStackTrace</span><span class="o">)</span>
<span class="nc">ScriptCondition</span><span class="o">.</span><span class="n">create</span><span class="o">(</span><span class="n">scriptHandle</span><span class="o">).</span><span class="n">asScala</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="k">object</span> <span class="nc">TweakFlow</span> <span class="o">{</span>
<span class="k">val</span> <span class="n">default</span> <span class="k">=</span> <span class="s">"""
|library echopraxia {
| function evaluate: (string level, dict ctx) ->
| let {
| find_string: ctx[:find_string];
| }
| find_string("$.foo") == "bar";
|}
"""</span><span class="o">.</span><span class="n">stripMargin</span>
<span class="o">}</span>
<span class="nd">@main</span>
<span class="k">def</span> <span class="n">main</span><span class="o">()</span> <span class="k">=</span> <span class="o">{</span>
<span class="c1">// No logback.xml, we're doing it live
</span> <span class="k">val</span> <span class="n">changer</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">ChangeLogLevel</span>
<span class="n">changer</span><span class="o">.</span><span class="n">changeLogLevel</span><span class="o">(</span><span class="s">"ROOT"</span><span class="o">,</span> <span class="s">"INFO"</span><span class="o">)</span>
<span class="k">val</span> <span class="n">logger</span> <span class="k">=</span> <span class="nc">LoggerFactory</span><span class="o">.</span><span class="n">getLogger</span>
<span class="n">changer</span><span class="o">.</span><span class="n">changeLogLevel</span><span class="o">(</span><span class="n">logger</span><span class="o">.</span><span class="n">name</span><span class="o">,</span> <span class="s">"DEBUG"</span><span class="o">)</span>
<span class="c1">// Ensure a script exists and is watched
</span> <span class="k">val</span> <span class="n">dir</span> <span class="k">=</span> <span class="n">os</span><span class="o">.</span><span class="n">pwd</span>
<span class="k">val</span> <span class="n">service</span> <span class="k">=</span> <span class="nc">ScriptService</span><span class="o">(</span><span class="n">dir</span><span class="o">)</span>
<span class="k">val</span> <span class="n">tweakflowFile</span> <span class="k">=</span> <span class="n">dir</span> <span class="o">/</span> <span class="s">"tweakflow.tf"</span>
<span class="k">if</span> <span class="o">(!</span> <span class="n">os</span><span class="o">.</span><span class="n">isFile</span><span class="o">(</span><span class="n">tweakflowFile</span><span class="o">))</span> <span class="o">{</span>
<span class="n">os</span><span class="o">.</span><span class="n">write</span><span class="o">(</span><span class="n">tweakflowFile</span><span class="o">,</span> <span class="nc">TweakFlow</span><span class="o">.</span><span class="n">default</span><span class="o">)</span>
<span class="o">}</span>
<span class="c1">// now we're sure the file exists, set up a condition and run in a loop.
</span> <span class="k">val</span> <span class="n">condition</span> <span class="k">=</span> <span class="n">service</span><span class="o">.</span><span class="n">condition</span><span class="o">(</span><span class="n">tweakflowFile</span><span class="o">)</span>
<span class="k">while</span> <span class="o">(</span><span class="kc">true</span><span class="o">)</span> <span class="o">{</span>
<span class="k">try</span> <span class="o">{</span>
<span class="n">logger</span><span class="o">.</span><span class="n">debug</span><span class="o">(</span><span class="n">condition</span><span class="o">,</span> <span class="s">"{}"</span><span class="o">,</span> <span class="n">fb</span> <span class="k">=></span> <span class="n">fb</span><span class="o">.</span><span class="n">keyValue</span><span class="o">(</span><span class="s">"foo"</span> <span class="o">-></span> <span class="s">"bar"</span><span class="o">));</span>
<span class="o">}</span> <span class="k">finally</span> <span class="o">{</span>
<span class="nc">Thread</span><span class="o">.</span><span class="n">sleep</span><span class="o">(</span><span class="mi">2000L</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
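<p>The point of all this is that the condition is live: while the loop is printing debug lines, editing <code class="language-plaintext highlighter-rouge">tweakflow.tf</code> takes effect within a couple of iterations. A sketch using <code class="language-plaintext highlighter-rouge">sed</code>, though any editor works:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ amm script.sc &
$ sed -i 's/== "bar"/== "baz"/' tweakflow.tf    # debug lines stop
$ sed -i 's/== "baz"/== "bar"/' tweakflow.tf    # and start again
</code></pre></div></div>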
<h2 id="jbang">JBang</h2>
<p>And now <a href="https://www.jbang.dev/documentation/guide/latest/index.html">JBang</a>. Be sure you have something in <code class="language-plaintext highlighter-rouge">JAVA_HOME</code> or else it will <em>automatically download and install a JDK itself</em>. I usually point it at JDK 17 using the <a href="https://www.jbang.dev/documentation/guide/latest/javaversions.html#managing-jdks">jbang jdk</a> feature:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>jbang jdk install 17 `sdk home java 17.0.4.1-tem`
</code></pre></div></div>
<p>Here's the script: it needs JDK 15 or above, because it uses text blocks (which were preview-only in JDK 13 and 14):</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">///usr/bin/env jbang "$0" "$@" ; exit $?</span>
<span class="c1">//DEPS com.tersesystems.echopraxia:logger:2.2.4</span>
<span class="c1">//DEPS com.tersesystems.echopraxia:logstash:2.2.4</span>
<span class="c1">//DEPS com.tersesystems.echopraxia:scripting:2.2.4</span>
<span class="c1">//DEPS com.tersesystems.logback:logback-classic:1.2.0</span>
<span class="kn">import</span> <span class="nn">com.tersesystems.echopraxia.*</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">com.tersesystems.echopraxia.api.*</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">com.tersesystems.echopraxia.scripting.*</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">com.tersesystems.logback.classic.ChangeLogLevel</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.nio.*</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.nio.file.*</span><span class="o">;</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">Script</span> <span class="o">{</span>
<span class="kd">private</span> <span class="kd">static</span> <span class="kd">final</span> <span class="n">Logger</span><span class="o"><?></span> <span class="n">logger</span> <span class="o">=</span> <span class="n">LoggerFactory</span><span class="o">.</span><span class="na">getLogger</span><span class="o">(</span><span class="n">Script</span><span class="o">.</span><span class="na">class</span><span class="o">);</span>
<span class="kd">private</span> <span class="kd">static</span> <span class="kd">final</span> <span class="n">String</span> <span class="n">defaultScript</span> <span class="o">=</span> <span class="s">"""
import * as std from "</span><span class="n">std</span><span class="s">";
alias std.strings as str;
library echopraxia {
function evaluate: (string level, dict ctx) ->
let {
find_string: ctx[:find_string];
}
str.lower_case(find_string("</span><span class="err">$</span><span class="o">.</span><span class="na">foo</span><span class="s">")) == "</span><span class="n">bar</span><span class="s">";
}
"""</span><span class="o">;</span>
<span class="kd">public</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">main</span><span class="o">(</span><span class="n">String</span><span class="o">...</span> <span class="n">args</span><span class="o">)</span> <span class="kd">throws</span> <span class="n">java</span><span class="o">.</span><span class="na">io</span><span class="o">.</span><span class="na">IOException</span> <span class="o">{</span>
<span class="n">ChangeLogLevel</span> <span class="n">changer</span> <span class="o">=</span> <span class="k">new</span> <span class="n">ChangeLogLevel</span><span class="o">();</span>
<span class="n">changer</span><span class="o">.</span><span class="na">changeLogLevel</span><span class="o">(</span><span class="s">"ROOT"</span><span class="o">,</span> <span class="s">"INFO"</span><span class="o">);</span>
<span class="n">changer</span><span class="o">.</span><span class="na">changeLogLevel</span><span class="o">(</span><span class="n">logger</span><span class="o">.</span><span class="na">getName</span><span class="o">(),</span> <span class="s">"DEBUG"</span><span class="o">);</span>
<span class="n">Path</span> <span class="n">watchedDir</span> <span class="o">=</span> <span class="n">Paths</span><span class="o">.</span><span class="na">get</span><span class="o">(</span><span class="s">"."</span><span class="o">);</span>
<span class="n">ScriptWatchService</span> <span class="n">watchService</span> <span class="o">=</span> <span class="k">new</span> <span class="n">ScriptWatchService</span><span class="o">(</span><span class="n">watchedDir</span><span class="o">);</span>
<span class="n">Path</span> <span class="n">filePath</span> <span class="o">=</span> <span class="n">watchedDir</span><span class="o">.</span><span class="na">resolve</span><span class="o">(</span><span class="s">"tweakflow.tf"</span><span class="o">);</span>
<span class="k">if</span> <span class="o">(!</span> <span class="n">Files</span><span class="o">.</span><span class="na">exists</span><span class="o">(</span><span class="n">filePath</span><span class="o">))</span> <span class="o">{</span>
<span class="n">Files</span><span class="o">.</span><span class="na">writeString</span><span class="o">(</span><span class="n">filePath</span><span class="o">,</span> <span class="n">defaultScript</span><span class="o">);</span>
<span class="o">}</span>
<span class="n">ScriptHandle</span> <span class="n">watchedHandle</span> <span class="o">=</span> <span class="n">watchService</span><span class="o">.</span><span class="na">watchScript</span><span class="o">(</span><span class="n">filePath</span><span class="o">,</span> <span class="n">e</span> <span class="o">-></span> <span class="n">logger</span><span class="o">.</span><span class="na">error</span><span class="o">(</span><span class="s">"Script compilation error"</span><span class="o">,</span> <span class="n">e</span><span class="o">));</span>
<span class="n">Condition</span> <span class="n">condition</span> <span class="o">=</span> <span class="n">ScriptCondition</span><span class="o">.</span><span class="na">create</span><span class="o">(</span><span class="n">watchedHandle</span><span class="o">);</span>
<span class="n">logger</span><span class="o">.</span><span class="na">info</span><span class="o">(</span><span class="n">condition</span><span class="o">,</span> <span class="s">"{}"</span><span class="o">,</span> <span class="n">fb</span> <span class="o">-></span> <span class="n">fb</span><span class="o">.</span><span class="na">string</span><span class="o">(</span><span class="s">"foo"</span><span class="o">,</span> <span class="s">"BAR"</span><span class="o">));</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
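<p>Running it is the usual JBang invocation, assuming the file is saved as <code class="language-plaintext highlighter-rouge">Script.java</code>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ jbang Script.java
</code></pre></div></div>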
<p>And that's it!</p>
Blindsight vs Echopraxia2023-03-12T15:39:50-07:002023-03-12T15:39:50-07:00https://tersesystems.com/blog/2023/03/12/blindsight-vs-echopraxia<p>Very small blog post as I am noodling and working on a new release of <a href="https://github.com/tersesystems/blindsight">Blindsight</a>.</p>
<p>I've said before that <a href="https://github.com/tersesystems/blindsight">Blindsight</a> "supports" structured logging. <a href="https://github.com/tersesystems/echopraxia">Echopraxia</a> "requires" structured logging. It occurs to me that this is really a bit backwards.</p>
<p>Structured logging typically talks about the output of logging: mapping whatever data you have in the logging event into JSON. But this doesn't talk about the inputs – how you provide the JSON with something to chew on. More accurately, we should call most structured logging "structured output" because the output is structured even when most of the input isn't.</p>
<p>So what is a structured input? A structured input is a reliable key/value pair, where the key is typically a string.</p>
<p>For a long time in SLF4J, MDC was the only way to reliably establish a key/value pair in SLF4J, but it wasn't complete because it could only take a string as the value. Then <a href="https://github.com/logfellow/logstash-logback-encoder">logstash-logback-encoder</a> added <a href="https://github.com/logfellow/logstash-logback-encoder#event-specific-custom-fields">event specific custom fields</a>, but the value was still <code class="language-plaintext highlighter-rouge">java.lang.Object</code> and it is not a consistent structure – for example, you can't specify a <code class="language-plaintext highlighter-rouge">StructuredArgument</code> as the value of another <code class="language-plaintext highlighter-rouge">StructuredArgument</code>, and building up a complex semi-structured object is not possible.</p>
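<p>For instance, with raw MDC everything has to be flattened to strings up front – a minimal SLF4J sketch:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import org.slf4j.MDC;

public class MdcExample {
  public static void main(String[] args) {
    MDC.put("requestId", "abc-123");        // fine: the value is already a String
    MDC.put("userId", String.valueOf(42));  // everything else must be stringified first
    // MDC.put("user", new Object());       // won't compile: MDC values can only be Strings
  }
}
</code></pre></div></div>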
<p><a href="https://github.com/tersesystems/blindsight">Blindsight</a> gives you the option of providing structured input using an <code class="language-plaintext highlighter-rouge">Argument</code> with <a href="https://tersesystems.github.io/blindsight/usage/dsl.html">DSL</a>, and does have a consistent structure. But Blindsight doesn't require that of you. You can mix and match structured and unstructured input, and it's fine:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">import</span> <span class="nn">com.tersesystems.blindsight.DSL._</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="o">(</span><span class="s">"unstructured = {} structured = {}"</span><span class="o">,</span> <span class="s">"string"</span><span class="o">,</span> <span class="n">bobj</span><span class="o">(</span><span class="s">"instant"</span> <span class="o">-></span> <span class="nc">Instant</span><span class="o">.</span><span class="n">now</span><span class="o">))</span>
</code></pre></div></div>
<p><a href="https://github.com/tersesystems/echopraxia">Echopraxia</a> requires <em>all input</em> to have structure, by converting input into <code class="language-plaintext highlighter-rouge">Field</code> instances through a <code class="language-plaintext highlighter-rouge">FieldBuilder</code> and instead of varadic arguments, there's a <code class="language-plaintext highlighter-rouge">FieldBuilder => FieldBuilderResult</code> function. For the <a href="https://github.com/tersesystems/echopraxia-plusscala">Scala API</a>, it looks like this:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">logger</span><span class="o">.</span><span class="n">debug</span><span class="o">(</span><span class="s">"{}"</span><span class="o">,</span> <span class="k">_</span><span class="o">.</span><span class="n">keyValue</span><span class="o">(</span><span class="s">"foo"</span> <span class="o">-></span> <span class="s">"bar"</span><span class="o">))</span>
</code></pre></div></div>
<p>So why require structured input?</p>
<p>The big answer is that structured input is valuable for <a href="https://tersesystems.com/blog/2020/01/22/developing-in-production/">developing in the large</a>. By and large, structured formats are the norm in any kind of service: Protobuf, Avro, Parquet, HTTP parameters, and so on. Being able to carry structure over and through into logging adds coherence and allows logging-specific serialization of complex objects.</p>
<p>The more detailed answer is that once you can rely on structured input, you can query and filter your log events vastly more effectively. You can also choose how to render fields in the event, not just in JSON but also for line oriented encoders. For example, you can say <code class="language-plaintext highlighter-rouge">%fields{$.request_id}</code> and render only <code class="language-plaintext highlighter-rouge">request_id</code> value using a custom converter with a pattern encoder in Logback:</p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><configuration></span>
<span class="nt"><conversionRule</span> <span class="na">conversionWord=</span><span class="s">"fields"</span> <span class="na">converterClass=</span><span class="s">"com.tersesystems.echopraxia.logstash.FieldConverter"</span><span class="nt">/></span>
<span class="nt"><appender</span> <span class="na">name=</span><span class="s">"STDOUT"</span> <span class="na">class=</span><span class="s">"ch.qos.logback.core.ConsoleAppender"</span><span class="nt">></span>
<span class="nt"><encoder></span>
<span class="nt"><pattern></span>
%-4relative [%thread] %-5level [%fields{$.request_id}] %logger - %msg%n
<span class="nt"></pattern></span>
<span class="nt"></encoder></span>
<span class="nt"></appender></span>
<span class="nt"><root</span> <span class="na">level=</span><span class="s">"DEBUG"</span><span class="nt">></span>
<span class="nt"><appender-ref</span> <span class="na">ref=</span><span class="s">"STDOUT"</span><span class="nt">/></span>
<span class="nt"></root></span>
<span class="nt"></configuration></span>
</code></pre></div></div>
<p>Can you do this with unstructured input, or with a mix of structured and unstructured input? Sort of. Imagine your input is a list of random <code class="language-plaintext highlighter-rouge">Object</code> with the only real guarantee that <code class="language-plaintext highlighter-rouge">toString</code> will return a <code class="language-plaintext highlighter-rouge">String</code>. If you are given an object that contains an array, how do you query and filter on that component? You have to explicitly cast to the type, and then query on it. This is very difficult to do from inside a Logback filter, which is usually kept apart from the domain classes.</p>
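<p>To make that concrete, here's roughly what that looks like on the Logback side – a sketch, not code from either library, with a hypothetical <code class="language-plaintext highlighter-rouge">tags</code> key:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import ch.qos.logback.classic.spi.ILoggingEvent;
import ch.qos.logback.core.filter.Filter;
import ch.qos.logback.core.spi.FilterReply;
import java.util.Map;

// Filtering on unstructured arguments means instanceof checks and casts everywhere.
public class CastingFilter extends Filter<ILoggingEvent> {
  @Override
  public FilterReply decide(ILoggingEvent event) {
    Object[] args = event.getArgumentArray();
    if (args == null) {
      return FilterReply.NEUTRAL;
    }
    for (Object arg : args) {
      if (arg instanceof Map) {
        // "tags" is hypothetical -- we only discover the shape at runtime
        Object tags = ((Map<?, ?>) arg).get("tags");
        if (tags instanceof Object[] && ((Object[]) tags).length > 0) {
          return FilterReply.ACCEPT;
        }
      }
    }
    return FilterReply.NEUTRAL;
  }
}
</code></pre></div></div>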
<p>So my argument is that it's a trade off. Blindsight takes a permissive approach and requires type safety, but does not require structure. Echopraxia takes a stricter approach and requires both type safety and structure, with more control over output.</p>
Ad-hoc structured log analysis with SQLite and DuckDB2023-03-04T09:01:54-08:002023-03-04T09:01:54-08:00https://tersesystems.com/blog/2023/03/04/ad-hoc-structured-log-analysis-with-sqlite-and-duckdb<p>Structured logging and databases are a natural match – there's easily consumed structured data on one side, and tools for querying and presenting data on the other. I've written a bit about <a href="https://tersesystems.com/blog/2020/11/26/queryable-logging-with-blacklite/">querying structured logging with SQLite</a> and the power of data science when it comes to logging by using <a href="https://tersesystems.com/blog/2019/09/28/applying-data-science-to-logs-for-developer-observability/">Apache Spark</a>. Using SQL has a number of advantages over using JSON processing tools or log viewers, such as the ability to progressively build up views while filtering or querying, better timestamp support, and the ability to do aggregate query logic.</p>
<p>But structured logging isn't what most databases are used to. The de-facto standard for structured logging is newline-delimited JSON (NDJSON), and there is only a loose concept of a "schema" – structured logging can have high cardinality, and there's usually only a few guaranteed common fields such as <code class="language-plaintext highlighter-rouge">timestamp</code>, <code class="language-plaintext highlighter-rouge">level</code> and <code class="language-plaintext highlighter-rouge">logger_name</code>. Getting an actual schema so that you can get NDJSON into a database is still a somewhat manual process compared to CSV. Spark is great at NDJSON dataframes, but Spark is a heavyweight solution that we can't just install on a host. What we really want is an in-process "no dependencies" database that understands NDJSON.</p>
<p><strong>TL;DR</strong>: With NDJSON support, slurping structured logs into a "no dependencies" database like SQLite or DuckDB is easier than ever.</p>
<h2 id="sqlite-lines">sqlite-lines</h2>
<p>Alex Garcia released <a href="https://github.com/asg017/sqlite-lines">sqlite-lines</a> in June specifically to read NDJSON.</p>
<p>Using sqlite3 can be more convenient than using <code class="language-plaintext highlighter-rouge">jq</code> or other JSON processing command line tools for digging around in logs. Adding the sqlite-lines extension is as simple as getting the loadable library:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>wget https://github.com/asg017/sqlite-lines/releases/download/v0.1.1/lines0.so
<span class="nv">$ </span>sqlite3
sqlite> .load ./lines0
</code></pre></div></div>
<p>Processing can be done with the <code class="language-plaintext highlighter-rouge">lines_read</code> function which provides a table with a column <code class="language-plaintext highlighter-rouge">line</code>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sqlite> <span class="k">select </span>line from lines_read<span class="o">(</span><span class="s1">'application.json'</span><span class="o">)</span> limit 1<span class="p">;</span>
</code></pre></div></div>
<p>This will produce JSON output like:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="s2">"id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"FtYkeclkrh8dHaINzdiAAA"</span><span class="p">,</span><span class="w">
</span><span class="s2">"relative_ns"</span><span class="p">:</span><span class="w"> </span><span class="mi">-295200</span><span class="p">,</span><span class="w">
</span><span class="s2">"tse_ms"</span><span class="p">:</span><span class="w"> </span><span class="mi">1645548338725</span><span class="p">,</span><span class="w">
</span><span class="s2">"start_ms"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w">
</span><span class="s2">"@timestamp"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2022-02-22T16:45:38.725Z"</span><span class="p">,</span><span class="w">
</span><span class="s2">"@version"</span><span class="p">:</span><span class="w"> </span><span class="s2">"1"</span><span class="p">,</span><span class="w">
</span><span class="s2">"message"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Database [logging] initialized"</span><span class="p">,</span><span class="w">
</span><span class="s2">"logger_name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"play.api.db.DefaultDBApi"</span><span class="p">,</span><span class="w">
</span><span class="s2">"thread_name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"play-dev-mode-akka.actor.default-dispatcher-7"</span><span class="p">,</span><span class="w">
</span><span class="s2">"level"</span><span class="p">:</span><span class="w"> </span><span class="s2">"INFO"</span><span class="p">,</span><span class="w">
</span><span class="s2">"level_value"</span><span class="p">:</span><span class="w"> </span><span class="mi">20000</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>We don't want to call <code class="language-plaintext highlighter-rouge">lines_read</code> all the time, so we'll import into a local table:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">create</span> <span class="k">table</span> <span class="n">logs</span> <span class="k">as</span> <span class="k">select</span> <span class="n">line</span> <span class="k">from</span> <span class="n">lines_read</span><span class="p">(</span><span class="s1">'application.json'</span><span class="p">);</span>
</code></pre></div></div>
<p>Combined with the <a href="https://www.sqlite.org/json1.html#jptr">jpointer operators</a> added in <code class="language-plaintext highlighter-rouge">3.38.0</code>, we can filter using JSONPath:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">select</span> <span class="n">line</span> <span class="k">from</span> <span class="n">logs</span> <span class="k">where</span> <span class="n">line</span><span class="o">->></span><span class="s1">'$.level'</span> <span class="o">=</span> <span class="s1">'ERROR'</span><span class="p">;</span>
</code></pre></div></div>
<p>This produces a JSON result that contains a giant stacktrace, and I only want the message. What makes SQLite so effective as a query tool is that it's very easy to progressively stack views to get only the data I want:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sqlite> create view shortlogs as
<span class="k">select </span>line-><span class="s1">'$.@timestamp'</span> as timestamp, line-><span class="s1">'$.message'</span> as message, line->><span class="s1">'$.level'</span> as level
from logs<span class="p">;</span>
sqlite> <span class="k">select</span> <span class="k">*</span> from shortlogs where level <span class="o">=</span> <span class="s1">'ERROR'</span><span class="p">;</span>
<span class="s2">"2022-02-22T16:45:50.900Z"</span>|<span class="s2">"Internal server error for (GET}) [/flaky}]"</span>|ERROR
</code></pre></div></div>
<p>Saving the table and exporting it to your local desktop is also very simple, and gives you the option of using a database GUI like <a href="https://sqlitebrowser.org/">DB Browser for SQLite</a>.</p>
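<p>For example (a one-line sketch – the target filename is arbitrary), <code class="language-plaintext highlighter-rouge">vacuum into</code> writes a compact copy of the whole database to a new file that you can copy down and open locally:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vacuum into 'logs-export.db';
</code></pre></div></div>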
<p>Interestingly, <code class="language-plaintext highlighter-rouge">sqlite-lines</code> can be used with <a href="https://datasette.io/">Datasette</a> with <code class="language-plaintext highlighter-rouge">datasette data.db --load-extension ./lines_nofs0</code> which would provide a web application UI for SQLite, but I haven't tried this.</p>
<h2 id="duckdb">DuckDB</h2>
<p>SQLite does have some disadvantages: it processes rows sequentially, so aggregate or analytical questions like "what are the 10 most common user agent strings" can take a while on large datasets. <a href="https://duckdb.org/">DuckDB</a> is like <a href="https://www.sqlite.org/index.html">SQLite</a>, but focused on analytics – it processes entire columns at once, rather than a row at a time.</p>
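<p>To make that concrete, here's the shape of query in question against the SQLite <code class="language-plaintext highlighter-rouge">logs</code> table from earlier (a sketch counting the most requested URIs, assuming a <code class="language-plaintext highlighter-rouge">request.uri</code> field in the JSON) – SQLite has to parse every row's JSON to answer it:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>select line->>'$.request.uri' as uri, count(*) as hits
from logs
group by uri
order by hits desc
limit 10;
</code></pre></div></div>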
<p>I hadn't used DuckDB extensively, as it used to require a schema to be defined before import. That's no longer a problem: as of <a href="https://duckdb.org/2023/02/13/announcing-duckdb-070.html">DuckDB 0.7.0</a>, DuckDB can read <a href="http://ndjson.org/">NDJSON</a> files and infer a schema from the values. There's a <a href="https://duckdb.org/2023/03/03/json.html">blog post</a> with examples – let's try it out on logs and see what happens.</p>
<p>Releases are available on <a href="https://github.com/duckdb/duckdb/releases">Github</a>. Installation is a single binary zip file:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>wget https://github.com/duckdb/duckdb/releases/download/v0.7.1/duckdb_cli-linux-amd64.zip
<span class="nv">$ </span>unzip duckdb_cli-linux-amd64.zip
</code></pre></div></div>
<p>We can import the JSON into a DuckDB table and save on the repeated processing, using <code class="language-plaintext highlighter-rouge">read_ndjson_auto</code>, which lets DuckDB parallelize better. The blog post says "DuckDB can also detect a few different DATE/TIMESTAMP formats within JSON strings, as well as TIME and UUID" – while it did detect the UUIDs, it did not recognize "@timestamp" as <a href="https://www.rfc-editor.org/rfc/rfc3339">RFC 3339</a>. Not a huge deal, as we can use the <code class="language-plaintext highlighter-rouge">REPLACE</code> <a href="https://duckdb.org/docs/sql/query_syntax/select">clause</a> to manually cast it to a timestamp on import:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">logs</span> <span class="k">AS</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">REPLACE</span> <span class="k">CAST</span><span class="p">(</span><span class="nv">"@timestamp"</span> <span class="k">AS</span> <span class="k">TIMESTAMP</span><span class="p">)</span> <span class="k">as</span> <span class="nv">"@timestamp"</span>
<span class="k">FROM</span> <span class="n">read_ndjson_auto</span><span class="p">(</span><span class="s1">'application.json'</span><span class="p">);</span>
</code></pre></div></div>
<p>And because DuckDB infers schema, when we run a <a href="https://duckdb.org/docs/guides/meta/describe.html"><code class="language-plaintext highlighter-rouge">describe logs</code></a> it shows us all the JSON attributes as columns!</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>│ id │ VARCHAR
│ relative_ns │ BIGINT
│ tse_ms │ UBIGINT
│ start_ms │ UBIGINT
│ @timestamp │ TIMESTAMP
│ @version │ BIGINT
│ message │ VARCHAR
│ logger_name │ VARCHAR
│ thread_name │ VARCHAR
│ level │ VARCHAR
│ level_value │ UBIGINT
│ correlation_id │ BIGINT
│ stack_hash │ VARCHAR
│ name │ VARCHAR
│ trace.span_id │ UUID
│ trace.parent_id │ INTEGER
│ trace.trace_id │ UUID
│ service_name │ VARCHAR
│ duration_ms │ UBIGINT
│ request.method │ VARCHAR
│ request.uri │ VARCHAR
│ response.status │ UBIGINT
│ exception │ STRUCT("name" VARCHAR, properties STRUCT(message VARCHAR))[]
│ stack_trace │ VARCHAR
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">exception</code> column is especially interesting, as DuckDB was able to infer <code class="language-plaintext highlighter-rouge">name</code> and <code class="language-plaintext highlighter-rouge">properties</code> inside it. The column is a list of structs, and DuckDB lists are 1-based, so the query to match on a specific exception message is <code class="language-plaintext highlighter-rouge">exception[1].properties.message</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>D select exception[1].properties.message from logs;
Execution exception[[IllegalStateException: Who could have foreseen this?]]
</code></pre></div></div>
<p>DuckDB's analytical focus means we can run back-of-the-envelope queries to surface hidden patterns in logs, making use of DuckDB's <a href="https://duckdb.org/docs/sql/aggregates">aggregate functions</a>.</p>
<p>We can start off with the average response time:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">select</span> <span class="k">avg</span><span class="p">(</span><span class="n">duration_ms</span><span class="p">)</span> <span class="k">from</span> <span class="n">logs</span><span class="p">;</span>
</code></pre></div></div>
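<p>Grouping works just as well – here's a sketch using the columns that <code class="language-plaintext highlighter-rouge">describe logs</code> showed above:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT "request.method", "response.status",
       COUNT(*) AS requests,
       AVG(duration_ms) AS avg_ms
FROM logs
GROUP BY 1, 2
ORDER BY requests DESC;
</code></pre></div></div>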
<p>We can get a better breakdown using a 24-hour window, to see if the average response is slower when there's more load:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="k">AVG</span><span class="p">(</span><span class="nv">"duration_ms"</span><span class="p">)</span> <span class="n">OVER</span> <span class="p">(</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="nv">"timestamp"</span> <span class="k">ASC</span>
<span class="n">RANGE</span> <span class="k">BETWEEN</span> <span class="n">INTERVAL</span> <span class="mi">12</span> <span class="n">HOURS</span> <span class="n">PRECEDING</span>
<span class="k">AND</span> <span class="n">INTERVAL</span> <span class="mi">12</span> <span class="n">HOURS</span> <span class="n">FOLLOWING</span><span class="p">)</span>
<span class="k">FROM</span> <span class="n">shortlogs</span><span class="p">;</span>
</code></pre></div></div>
<p>It's even possible to do <a href="https://duckdb.org/docs/sql/window_functions#box-and-whisker-queries">box and whisker queries</a>:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="k">timestamp</span><span class="p">,</span>
<span class="k">MIN</span><span class="p">(</span><span class="nv">"duration_ms"</span><span class="p">)</span> <span class="n">OVER</span> <span class="k">day</span> <span class="k">AS</span> <span class="nv">"Min"</span><span class="p">,</span>
<span class="n">QUANTILE_CONT</span><span class="p">(</span><span class="nv">"duration_ms"</span><span class="p">,</span> <span class="p">[</span><span class="mi">0</span><span class="p">.</span><span class="mi">25</span><span class="p">,</span> <span class="mi">0</span><span class="p">.</span><span class="mi">5</span><span class="p">,</span> <span class="mi">0</span><span class="p">.</span><span class="mi">75</span><span class="p">])</span> <span class="n">OVER</span> <span class="k">day</span> <span class="k">AS</span> <span class="nv">"IQR"</span><span class="p">,</span>
<span class="k">MAX</span><span class="p">(</span><span class="nv">"duration_ms"</span><span class="p">)</span> <span class="n">OVER</span> <span class="k">day</span> <span class="k">AS</span> <span class="nv">"Max"</span><span class="p">,</span>
<span class="k">FROM</span> <span class="n">shortlogs</span>
<span class="n">WINDOW</span> <span class="k">day</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="nv">"timestamp"</span> <span class="k">ASC</span>
<span class="n">RANGE</span> <span class="k">BETWEEN</span> <span class="n">INTERVAL</span> <span class="mi">12</span> <span class="n">HOURS</span> <span class="n">PRECEDING</span>
<span class="k">AND</span> <span class="n">INTERVAL</span> <span class="mi">12</span> <span class="n">HOURS</span> <span class="n">FOLLOWING</span><span class="p">)</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span>
</code></pre></div></div>
<p>While DuckDB has advantages over SQLite, its storage format is <a href="https://duckdb.org/internals/storage">not stable</a> – newer versions of DuckDB cannot read old database files and vice versa. This has an immediate impact, as <a href="https://duckdb.org/docs/guides/data_viewers/tad">Tad</a> is not capable of loading newer DuckDB files.</p>
<p>In addition, support for the STRUCT data structure in Parquet is <a href="https://stackoverflow.com/questions/60227123/read-write-parquet-with-struct-column-type">iffy</a>. <a href="https://duckdb.org/docs/guides/sql_editors/dbeaver">DBeaver</a> is capable of loading the database, but will not render the exception field, instead throwing <code class="language-plaintext highlighter-rouge">SQL Error: Unsupported result column type STRUCT("name" VARCHAR, properties STRUCT(message VARCHAR))[]</code>. Tad <em>does</em> support Parquet's struct format, but is a viewer only, so that's not very useful.</p>
<p>Best advice: use DuckDB only for computation and analytics, and use <code class="language-plaintext highlighter-rouge">EXPORT DATABASE</code> to snapshot work, writing data out as JSON or to an <a href="https://duckdb.org/docs/sql/statements/attach">attached SQLite database</a> if you want long-term portable storage.</p>
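<p>As a sketch (directory and file names are illustrative; the SQLite attach needs DuckDB's <code class="language-plaintext highlighter-rouge">sqlite</code> extension, and only scalar columns will survive the trip):</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- snapshot schema plus data (CSV by default) into a directory
EXPORT DATABASE 'logs_export';

-- or copy the scalar columns into a portable SQLite file
-- (needs the sqlite extension: INSTALL sqlite; LOAD sqlite;)
ATTACH 'logs.sqlite' AS portable (TYPE SQLITE);
CREATE TABLE portable.logs AS
  SELECT "@timestamp", level, logger_name, message, duration_ms
  FROM logs;
</code></pre></div></div>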
<h2 id="conclusion">Conclusion</h2>
<p>If you want to quickly dig into structured logs, consider using SQLite or DuckDB over trying to process the JSON by hand – they take zero time to install and are vastly more powerful than anything you can do using <code class="language-plaintext highlighter-rouge">jq</code> or a log viewer.</p>Dynamic Logback and Migrating Logback-Showcase2023-01-04T07:59:15-08:002023-01-04T07:59:15-08:00https://tersesystems.com/blog/2023/01/04/dynamic-logback-and-migrating-logback-showcase<p>I did a couple of small projects over the holidays while drinking <a href="https://thenovicechefblog.com/coquito/">coquito</a>, so here they are.</p>
<p><strong>TL;DR</strong> I made a very smol dynamic logging project, and migrated <a href="https://github.com/tersesystems/terse-logback-showcase">terse-logback-showcase</a> from Heroku to <a href="https://fly.io/">fly.io</a>. It's now <a href="https://terse-logback-showcase.fly.dev/">https://terse-logback-showcase.fly.dev/</a> and <em>it has pictures of cats</em>.</p>
<h2 id="dynamic-logback">Dynamic Logback</h2>
<p>The first is a "simplest possible dynamic logging" project called <a href="https://github.com/wsargent/dynamic-logback">dynamic-logback</a>. This is a project that sets up Logback, and then periodically refreshes log levels from a file. The functionality is in <a href="https://github.com/wsargent/dynamic-logback/blob/main/src/main/scala/DynamicLevel.scala">one file</a> and is less than 100 lines of code. (After writing this, I did a search on "dynamic logback" on Github and discovered <a href="https://github.com/syamantm/dynamic-logback">https://github.com/syamantm/dynamic-logback</a> which is a more complete example.)</p>
<p>Anyway, the point is that dynamic logging is easy! You don't need a lot of infrastructure or to set up a database, you can just set up a timer task and be done with it.</p>
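<p>Here's a minimal sketch of that approach (in the spirit of the project, not a copy of it – the <code class="language-plaintext highlighter-rouge">levels.properties</code> filename and refresh period are illustrative). Call <code class="language-plaintext highlighter-rouge">LevelRefresher.start("levels.properties")</code> at startup and edit the file to change levels:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import java.io.FileInputStream
import java.util.{Properties, Timer, TimerTask}
import scala.jdk.CollectionConverters._

import org.slf4j.LoggerFactory
import ch.qos.logback.classic.{Level, LoggerContext}

object LevelRefresher {
  private val timer = new Timer("level-refresh", true) // daemon thread

  // Re-read "logger.name=LEVEL" pairs from the file every periodMs.
  def start(path: String, periodMs: Long = 5000L): Unit =
    timer.schedule(new TimerTask { def run(): Unit = refresh(path) }, 0L, periodMs)

  private def refresh(path: String): Unit = {
    val props = new Properties()
    val in = new FileInputStream(path)
    try props.load(in) finally in.close()
    // Logback levels can be set directly -- no configuration reload needed.
    val context = LoggerFactory.getILoggerFactory.asInstanceOf[LoggerContext]
    props.asScala.foreach { case (name, level) =>
      context.getLogger(name).setLevel(Level.toLevel(level))
    }
  }
}
</code></pre></div></div>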
<p>There is a common assumption that changing log levels requires a total configuration refresh. This is not helped by documentation that mentions only <a href="https://logback.qos.ch/manual/configuration.html#autoScan">autoscan</a> as an option, and encourages setting levels directly in <code class="language-plaintext highlighter-rouge">logback.xml</code>. The reality is that reloading only applies to Logback appenders and their supporting components. Log levels can be queried and modified without any heavy lifting. In SQL terms, appenders and filters are the <code class="language-plaintext highlighter-rouge">DDL</code> statements, while querying and changing log levels are the <code class="language-plaintext highlighter-rouge">DQL</code> and <code class="language-plaintext highlighter-rouge">DML</code> statements.</p>
<p>To abstract it from Log4J/Logback APIs, you could add a <code class="language-plaintext highlighter-rouge">LogLevelQuery</code>/<code class="language-plaintext highlighter-rouge">LogLevelResult</code> interaction for querying log levels, and a <code class="language-plaintext highlighter-rouge">LogLevelCommand</code>/<code class="language-plaintext highlighter-rouge">LogLevelEvent</code> for modifying log levels, and that would allow for a CQRS style API. That's probably for another project though.</p>
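<p>As a purely hypothetical sketch of those shapes (again, this isn't built anywhere yet):</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code>final case class LogLevelQuery(loggerName: String)
final case class LogLevelResult(loggerName: String, level: Option[String])

final case class LogLevelCommand(loggerName: String, newLevel: String)
final case class LogLevelEvent(loggerName: String, previous: Option[String], current: String)

// the query side is the "DQL", the command side the "DML"
trait LogLevelService {
  def query(q: LogLevelQuery): LogLevelResult
  def execute(c: LogLevelCommand): LogLevelEvent
}
</code></pre></div></div>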
<h2 id="migrating-terse-logback-showcase-from-heroku-to-flyio">Migrating terse-logback-showcase from Heroku to fly.io</h2>
<p>The showcase application used to be on Heroku, but they closed down their free tier. Luckily, <a href="https://fly.io/">fly.io</a> has an option for deploying <a href="https://fly.io/docs/languages-and-frameworks/dockerfile/">docker containers</a>.</p>
<p>The application runs on <a href="https://www.playframework.com/">Play Framework</a>, which is JVM based and has built-in docker deployment via <a href="https://sbt-native-packager.readthedocs.io/en/stable/">sbt-native-packager</a>, so the only thing I needed to do for a <code class="language-plaintext highlighter-rouge">Dockerfile</code> was run <code class="language-plaintext highlighter-rouge">sbt docker:publishLocal</code>. Play itself takes up barely any memory, but the JVM is used to having a bunch of memory available – the <a href="https://fly.io/docs/about/pricing/#free-allowances">free tier</a> is only 256 MB, so it was time to get creative.</p>
<p>I found that the <a href="https://hub.docker.com/_/ibm-semeru-runtimes"><code class="language-plaintext highlighter-rouge">ibm-semeru-runtimes:open-17-jre-focal</code></a> image with <code class="language-plaintext highlighter-rouge">-XX:MaxRAM=70m</code> was enough to get the JVM started, based on a tip from the <a href="https://community.fly.io/t/deployment-of-java-spring-api-using-dockerfile/6708">community forums</a>.</p>
<pre><code class="language-sbt">dockerBaseImage := "ibm-semeru-runtimes:open-17-jre-focal"
Universal / javaOptions += "-J-XX:MaxRAM=70m"
</code></pre>
<p>Then, I had to add the explicit <code class="language-plaintext highlighter-rouge">add-opens</code> required by JDK 17 as <code class="language-plaintext highlighter-rouge">--illegal-access=permit</code> is gone. I don't think there's <a href="https://stackoverflow.com/questions/68867895/in-java-17-how-do-i-avoid-resorting-to-add-opens">any way to avoid this</a> for now.</p>
<pre><code class="language-sbt">Universal / javaOptions ++= Seq(
"-J--add-opens=java.base/java.lang=ALL-UNNAMED",
"-J--add-opens=java.base/sun.security.ssl=ALL-UNNAMED",
"-J--add-opens=java.base/sun.security.util=ALL-UNNAMED"
)
</code></pre>
<p>I had to make sure the internal SQLite database used by <a href="https://github.com/tersesystems/blacklite/">Blacklite</a> was writable:</p>
<pre><code class="language-sbt">dockerChmodType := DockerChmodType.UserGroupWriteExecute
</code></pre>
<p>And I had to disable the PID file:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Universal / javaOptions += "-Dpidfile.path=/dev/null"
</code></pre></div></div>
<p>You can see <a href="https://github.com/tersesystems/terse-logback-showcase/blob/master/build.sbt#L25">build.sbt</a> for the actual implementation.</p>
<p>For some reason the <code class="language-plaintext highlighter-rouge">logs</code> directory wouldn't be created by <code class="language-plaintext highlighter-rouge">sbt docker:publishLocal</code>, so I added it to the <code class="language-plaintext highlighter-rouge">dist</code> directory for <a href="https://www.playframework.com/documentation/2.8.x/Deploying#Including-additional-files-in-your-distribution">deployment</a>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mkdir -p dist/logs
touch dist/logs/.README
</code></pre></div></div>
<p>I didn't have to set up HTTPS or anything special for <a href="https://github.com/tersesystems/terse-logback-showcase/blob/master/fly.toml">fly.toml</a>. I did have to set up some fly secrets:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>fly secrets set PLAY_APP_SECRET=<secret>
fly secrets set SENTRY_DSN=$SENTRY_DSN
fly secrets set HONEYCOMB_API_KEY=$HONEYCOMB_API_KEY
</code></pre></div></div>
<p>and then it was just a case of creating docker images and running <code class="language-plaintext highlighter-rouge">fly deploy</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sbt clean stage docker:publishLocal
cd target/docker/stage
fly deploy --app terse-logback-showcase
</code></pre></div></div>
<p>And now <a href="https://terse-logback-showcase.fly.dev/">https://terse-logback-showcase.fly.dev/</a> is up! It has pictures of cats.</p>Using Scalafix to Refactor Logging2022-11-18T15:20:02-08:002022-11-18T15:20:02-08:00https://tersesystems.com/blog/2022/11/18/echopraxia-scalafix<p>Problem: you want to use <a href="https://github.com/tersesystems/echopraxia-plusscala">Echopraxia</a> structured logging in your Scala application, but you already have an existing body of logging statements.
Solution: Get <a href="https://scalacenter.github.io/scalafix">scalafix</a> to rewrite the logging statements for you!</p>
<p>For Echopraxia, logging statements are based around a <a href="https://github.com/tersesystems/echopraxia-plusscala#field-builder">field builder API</a>. Scala has string interpolation, so most of the time logging statements don't have string concatenation. Instead, most logging statements look like this:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="o">(</span><span class="n">s</span><span class="s">"thing=$thing"</span><span class="o">)</span>
</code></pre></div></div>
<p>What we want is to break <code class="language-plaintext highlighter-rouge">thing</code> out into the field builder so it's not using an implicit <code class="language-plaintext highlighter-rouge">toString</code> call, and can be seen as a unique field in JSON:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="o">(</span><span class="s">"thing={}"</span><span class="o">,</span> <span class="n">fb</span> <span class="k">=></span> <span class="n">fb</span><span class="o">.</span><span class="n">value</span><span class="o">(</span><span class="s">"thing"</span><span class="o">,</span> <span class="n">thing</span><span class="o">))</span>
</code></pre></div></div>
<p>or, for multiple arguments:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="o">(</span><span class="s">"some field {} another field {}"</span><span class="o">,</span> <span class="n">fb</span> <span class="k">=></span> <span class="n">fb</span><span class="o">.</span><span class="n">list</span><span class="o">(</span>
<span class="n">fb</span><span class="o">.</span><span class="n">keyValue</span><span class="o">(</span><span class="s">"text"</span><span class="o">,</span> <span class="n">text</span><span class="o">)</span>
<span class="n">fb</span><span class="o">.</span><span class="n">keyValue</span><span class="o">(</span><span class="s">"number"</span><span class="o">,</span> <span class="n">number</span><span class="o">)</span>
<span class="o">))</span>
</code></pre></div></div>
<p>And we want to be able to recover from the case where exceptions are swallowed:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">logger</span><span class="o">.</span><span class="n">error</span><span class="o">(</span><span class="n">s</span><span class="s">"exception=$e"</span><span class="o">)</span> <span class="c1">// very bad, will swallow stack trace
</span></code></pre></div></div>
<p>and render it appropriately, but <em>only</em> for exceptions using <code class="language-plaintext highlighter-rouge">fb.exception</code> instead of <code class="language-plaintext highlighter-rouge">fb.value</code>:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">logger</span><span class="o">.</span><span class="n">error</span><span class="o">(</span><span class="n">s</span><span class="s">"exception={}"</span><span class="o">,</span> <span class="k">_</span><span class="o">.</span><span class="n">exception</span><span class="o">(</span><span class="n">e</span><span class="o">))</span> <span class="c1">// will render exception and stack trace!
</span></code></pre></div></div>
<p>So, this is not a complex refactoring, but it is more complex than IntelliJ IDEA can do out of the box. This is where Scalafix comes in. Scalafix is a refactoring and linting tool that understands how Scala code is structured semantically, using <a href="https://scalameta.org/docs/semanticdb/guide.html">SemanticDB</a>. The SemanticDB support exposes the abstract syntax tree in a Scala program so that it can be recognized and manipulated generally. Previously, the AST was available in Scala macros, but not outside of compilation – you could use Scala macros to autogenerate code, but you couldn't use them to rewrite existing code. As a result of integrating SemanticDB, Scalafix is capable of managing semantic rules like <a href="https://scalacenter.github.io/scalafix/docs/rules/ExplicitResultTypes.html">adding type annotations for explicit result types</a>.</p>
<p>I've been interested in Scalafix for a while, but mostly as an end user, and I hadn't thought about writing Scalafix rules myself. After going through it, I recommend everyone learn how to write Scalafix rules, because they can save you so much time and boilerplate, and are really pretty easy to write.</p>
<p>So how does a Scalafix semantic rule work?</p>
<p>The short version is that Scalafix has an input, and an output. The input is a <a href="https://github.com/scalacenter/scalafix/blob/main/scalafix-core/src/main/scala/scalafix/v1/SemanticDocument.scala">SemanticDocument</a> that contains a tree made up of all the <a href="https://scalameta.org/docs/semanticdb/specification.html">stuff</a> that makes up a program. And for the output, there's a <code class="language-plaintext highlighter-rouge">Patch</code> class that returns… strings.</p>
<p>Seriously, that's <a href="https://scalacenter.github.io/scalafix/docs/developers/patch.html">all there is</a>. You can remove tokens, but if you're patching things into the program, you're adding chunks of text. Initially, I thought that this was very limited, especially after being exposed to the Scala 3 macro <a href="https://docs.scala-lang.org/scala3/guides/macros/macros.html">program as data</a> model, but for refactoring it removes a number of headaches.</p>
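<p>For instance, a patch that adds an import and rewrites a matched call site could look like this (illustrative only – <code class="language-plaintext highlighter-rouge">term</code> stands in for whatever tree node the rule matched):</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import scalafix.v1._
import scala.meta._

// Patches are just textual edits; they compose with `+`.
def examplePatch(term: Term): Patch =
  Patch.addGlobalImport(importer"com.tersesystems.echopraxia.plusscala._") +
    Patch.replaceTree(term, """logger.info("thing={}", fb => fb.value("thing", thing))""")
</code></pre></div></div>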
<p>I recommend going through the <a href="https://scalacenter.github.io/scalafix/docs/developers/tutorial.html">tutorial</a>, but here I'll walk through how I built up the <a href="https://github.com/tersesystems/echopraxia-scalafix#echopraxiarewritetostructured">EchopraxiaRewriteToStructured</a> rule, starting from scratch. The complete source code is <a href="https://github.com/tersesystems/echopraxia-scalafix/blob/main/rules/src/main/scala/fix/EchopraxiaRewriteToStructured.scala">here</a>.</p>
<p>The first thing that needs doing is finding the logger statement. The basic unit of Scalafix is pattern matching, so we can start by printing out some likely programs and seeing what tree nodes look likely.</p>
<p>There's a web-based tool, <a href="https://astexplorer.net/">AST Explorer</a>, which lets you paste programs in, but I prefer printing it out inline as I'm refining the pattern matching, using <code class="language-plaintext highlighter-rouge">foo.structure</code> (you can also reverse it with <code class="language-plaintext highlighter-rouge">foo.syntax</code>):</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">EchopraxiaRewriteToStructured</span> <span class="k">extends</span> <span class="nc">SemanticRule</span><span class="o">(</span><span class="s">"EchopraxiaRewriteToStructured"</span><span class="o">)</span> <span class="o">{</span>
<span class="k">override</span> <span class="k">def</span> <span class="n">fix</span><span class="o">(</span><span class="k">implicit</span> <span class="n">doc</span><span class="k">:</span> <span class="kt">SemanticDocument</span><span class="o">)</span><span class="k">:</span> <span class="kt">Patch</span> <span class="o">=</span> <span class="o">{</span>
<span class="n">doc</span><span class="o">.</span><span class="n">tree</span><span class="o">.</span><span class="n">collect</span> <span class="o">{</span>
<span class="k">case</span> <span class="n">el</span> <span class="k">=></span>
<span class="n">println</span><span class="o">(</span><span class="s">"${el.structure}"</span><span class="o">)</span> <span class="c1">// prints out structure of tree node
</span> <span class="nc">Patch</span><span class="o">.</span><span class="n">empty</span>
<span class="o">}.</span><span class="n">asPatch</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p>From this, we can determine that the statement <code class="language-plaintext highlighter-rouge">logger.debug(s"foo")</code> is represented as:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">Apply</span><span class="o">(</span>
<span class="n">fun</span> <span class="k">=</span> <span class="nc">Select</span><span class="o">(</span><span class="n">qual</span> <span class="k">=</span> <span class="nc">Name</span><span class="o">(</span><span class="s">"logger"</span><span class="o">),</span> <span class="n">name</span> <span class="k">=</span> <span class="nc">Name</span><span class="o">(</span><span class="s">"info"</span><span class="o">)),</span>
<span class="n">args</span> <span class="k">=</span> <span class="nc">List</span><span class="o">(</span><span class="nc">Interpolate</span><span class="o">(</span><span class="n">name</span> <span class="k">=</span> <span class="s">"s"</span><span class="o">,</span> <span class="n">parts</span> <span class="k">=</span> <span class="nc">List</span><span class="o">(</span><span class="s">"foo"</span><span class="o">)))</span>
<span class="o">)</span>
</code></pre></div></div>
<p>This gets at the name, but we also want to check that we're not just latching on to anything called <code class="language-plaintext highlighter-rouge">logger</code> – it also has to be of type <code class="language-plaintext highlighter-rouge">com.tersesystems.echopraxia.plusscala.Logger</code>.</p>
<p>To do this, we have to get the qualifier's symbol information out, and then pattern match on the signature. We can do this in Scalafix by calling <code class="language-plaintext highlighter-rouge">qual.symbol</code> to get the symbol out, and then pulling the <a href="https://scalacenter.github.io/scalafix/docs/developers/symbol-information.html">SymbolInformation</a> to get at the signature. Once we have the signature, we can use <a href="https://scalacenter.github.io/scalafix/docs/developers/symbol-matcher.html">SymbolMatcher</a> to check the <code class="language-plaintext highlighter-rouge">Logger</code> symbol against the <code class="language-plaintext highlighter-rouge">TypeRef</code>.</p>
<p>Long story short, it looks like this:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">EchopraxiaRewriteToStructured</span> <span class="k">extends</span> <span class="nc">SemanticRule</span><span class="o">(</span><span class="s">"EchopraxiaRewriteToStructured"</span><span class="o">)</span> <span class="o">{</span>
<span class="k">val</span> <span class="n">loggerClass</span> <span class="k">=</span> <span class="s">"com.tersesystems.echopraxia.plusscala.Logger"</span>
<span class="k">override</span> <span class="k">def</span> <span class="n">fix</span><span class="o">(</span><span class="k">implicit</span> <span class="n">doc</span><span class="k">:</span> <span class="kt">SemanticDocument</span><span class="o">)</span><span class="k">:</span> <span class="kt">Patch</span> <span class="o">=</span> <span class="o">{</span>
<span class="n">doc</span><span class="o">.</span><span class="n">tree</span><span class="o">.</span><span class="n">collect</span> <span class="o">{</span>
<span class="k">case</span> <span class="n">logger</span> <span class="k">@</span> <span class="nc">Term</span><span class="o">.</span><span class="nc">Apply</span><span class="o">(</span>
<span class="nc">Term</span><span class="o">.</span><span class="nc">Select</span><span class="o">(</span><span class="n">loggerName</span><span class="o">,</span> <span class="n">methodName</span><span class="o">),</span>
<span class="nc">List</span><span class="o">(</span><span class="nc">Term</span><span class="o">.</span><span class="nc">Interpolate</span><span class="o">(</span><span class="nc">Term</span><span class="o">.</span><span class="nc">Name</span><span class="o">(</span><span class="s">"s"</span><span class="o">),</span> <span class="n">parts</span><span class="o">,</span> <span class="n">args</span><span class="o">))</span>
<span class="o">)</span> <span class="k">if</span> <span class="n">matchesType</span><span class="o">(</span><span class="n">loggerName</span><span class="o">)</span> <span class="k">=></span>
<span class="nc">Patch</span><span class="o">.</span><span class="n">empty</span>
<span class="o">}.</span><span class="n">asPatch</span>
<span class="o">}</span>
<span class="k">private</span> <span class="k">def</span> <span class="n">matchesType</span><span class="o">(</span>
<span class="n">qual</span><span class="k">:</span> <span class="kt">Term</span>
<span class="o">)(</span><span class="k">implicit</span> <span class="n">doc</span><span class="k">:</span> <span class="kt">SemanticDocument</span><span class="o">)</span><span class="k">:</span> <span class="kt">Boolean</span> <span class="o">=</span> <span class="o">{</span>
<span class="k">val</span> <span class="n">loggerSymbolMatcher</span> <span class="k">=</span> <span class="nc">SymbolMatcher</span><span class="o">.</span><span class="n">normalized</span><span class="o">(</span><span class="n">loggerClass</span><span class="o">)</span>
<span class="k">val</span> <span class="n">info</span><span class="k">:</span> <span class="kt">SymbolInformation</span> <span class="o">=</span> <span class="n">qual</span><span class="o">.</span><span class="n">symbol</span><span class="o">.</span><span class="n">info</span><span class="o">.</span><span class="n">get</span>
<span class="n">info</span><span class="o">.</span><span class="n">signature</span> <span class="k">match</span> <span class="o">{</span>
<span class="k">case</span> <span class="nc">MethodSignature</span><span class="o">(</span><span class="k">_</span><span class="o">,</span> <span class="k">_</span><span class="o">,</span> <span class="nc">TypeRef</span><span class="o">(</span><span class="k">_</span><span class="o">,</span> <span class="n">symbol</span><span class="o">,</span> <span class="k">_</span><span class="o">))</span> <span class="k">=></span>
<span class="n">loggerSymbolMatcher</span><span class="o">.</span><span class="n">matches</span><span class="o">(</span><span class="n">symbol</span><span class="o">)</span>
<span class="k">case</span> <span class="n">other</span> <span class="k">=></span>
<span class="kc">false</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p>Now that we have a relevant logging statement, it's time to rewrite it. We can do this using <code class="language-plaintext highlighter-rouge">Patch.replaceTree</code>, which will replace the <code class="language-plaintext highlighter-rouge">args</code> inside the <code class="language-plaintext highlighter-rouge">Apply</code> node.</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">Patch</span><span class="o">.</span><span class="n">replaceTree</span><span class="o">(</span><span class="n">logger</span><span class="o">,</span> <span class="n">rewrite</span><span class="o">(</span><span class="n">loggerName</span><span class="o">,</span> <span class="n">methodName</span><span class="o">,</span> <span class="n">parts</span><span class="o">,</span> <span class="n">args</span><span class="o">))</span>
</code></pre></div></div>
<p>Rewriting the code is… a string. The <code class="language-plaintext highlighter-rouge">parts</code> are always <code class="language-plaintext highlighter-rouge">Lit.String</code>, so calling <code class="language-plaintext highlighter-rouge">lit.value.toString</code> and sticking "{}" in between is the easiest way to parameterize them. Then, it's time to serve up the rewritten logging statement as <code class="language-plaintext highlighter-rouge">s"""$loggerTerm.$methodTerm("$template", fb => $body)"""</code>, and account for some edge cases:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">EchopraxiaRewriteToStructured</span> <span class="k">extends</span> <span class="nc">SemanticRule</span><span class="o">(</span><span class="s">"EchopraxiaRewriteToStructured"</span><span class="o">)</span> <span class="o">{</span>
<span class="c1">// ...
</span> <span class="k">private</span> <span class="k">def</span> <span class="n">rewrite</span><span class="o">(</span>
<span class="n">loggerTerm</span><span class="k">:</span> <span class="kt">Term</span><span class="o">,</span>
<span class="n">methodTerm</span><span class="k">:</span> <span class="kt">Term</span><span class="o">,</span>
<span class="n">parts</span><span class="k">:</span> <span class="kt">List</span><span class="o">[</span><span class="kt">Lit</span><span class="o">],</span>
<span class="n">args</span><span class="k">:</span> <span class="kt">List</span><span class="o">[</span><span class="kt">Term</span><span class="o">]</span>
<span class="o">)(</span><span class="k">implicit</span> <span class="n">doc</span><span class="k">:</span> <span class="kt">SemanticDocument</span><span class="o">)</span><span class="k">:</span> <span class="kt">String</span> <span class="o">=</span> <span class="o">{</span>
<span class="k">if</span> <span class="o">(</span><span class="n">args</span><span class="o">.</span><span class="n">isEmpty</span><span class="o">)</span> <span class="o">{</span>
<span class="k">val</span> <span class="n">template</span> <span class="k">=</span> <span class="n">parts</span><span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="k">_</span><span class="o">.</span><span class="n">value</span><span class="o">.</span><span class="n">toString</span><span class="o">).</span><span class="n">mkString</span><span class="o">(</span><span class="s">"{}"</span><span class="o">)</span>
<span class="n">s</span><span class="s">"""$loggerTerm.$methodTerm("$template")"""</span>
<span class="o">}</span> <span class="k">else</span> <span class="o">{</span>
<span class="k">val</span> <span class="n">template</span> <span class="k">=</span> <span class="n">parts</span><span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="k">_</span><span class="o">.</span><span class="n">value</span><span class="o">.</span><span class="n">toString</span><span class="o">).</span><span class="n">mkString</span><span class="o">(</span><span class="s">"{}"</span><span class="o">)</span>
<span class="k">val</span> <span class="n">values</span> <span class="k">=</span> <span class="n">args</span><span class="o">.</span><span class="n">map</span> <span class="o">{</span>
<span class="k">case</span> <span class="n">arg</span><span class="k">:</span> <span class="kt">Term.Name</span> <span class="o">=></span>
<span class="k">if</span> <span class="o">(</span><span class="n">isThrowable</span><span class="o">(</span><span class="n">arg</span><span class="o">.</span><span class="n">symbol</span><span class="o">.</span><span class="n">info</span><span class="o">.</span><span class="n">get</span><span class="o">.</span><span class="n">signature</span><span class="o">))</span> <span class="o">{</span>
<span class="n">s</span><span class="s">"""fb.exception($arg)"""</span>
<span class="o">}</span> <span class="k">else</span> <span class="o">{</span>
<span class="n">s</span><span class="s">"""fb.$fieldBuilderMethod("$arg", $arg)"""</span>
<span class="o">}</span>
<span class="k">case</span> <span class="n">other</span> <span class="k">=></span>
<span class="c1">// XXX I don't think this is possible?
</span> <span class="n">s</span><span class="s">"""fb.$fieldBuilderMethod("$other", $other)"""</span>
<span class="o">}</span>
<span class="k">val</span> <span class="n">body</span> <span class="k">=</span>
<span class="k">if</span> <span class="o">(</span><span class="n">values</span><span class="o">.</span><span class="n">size</span> <span class="o">==</span> <span class="mi">1</span><span class="o">)</span> <span class="n">values</span><span class="o">.</span><span class="n">head</span>
<span class="k">else</span> <span class="n">s</span><span class="s">"""fb.list(${values.mkString(", ")})"""</span>
<span class="n">s</span><span class="s">"""$loggerTerm.$methodTerm("$template", fb => $body)"""</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p>So hang on a sec… how do we know an argument is a throwable?</p>
<p>This is where it gets really interesting, because this is where we start running into the limits of Scalafix. Scalafix can look at the structure of a type, but does not expose the <a href="https://scalacenter.github.io/scalafix/docs/developers/semantic-type.html#test-for-subtyping">subtyping information</a> of a type. This is a problem, because exceptions rely heavily on subtyping to work.</p>
<p>However, there is a hack that we can try. From poking at the <a href="https://github.com/scalacenter/scalafix/issues/531">issues</a>, we can try Java runtime reflection to load the class, and see if it's assignable from <code class="language-plaintext highlighter-rouge">Throwable</code>. I don't love the manual hacking on the symbol to kludge it into a fully qualified class name, but it'll work.</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">EchopraxiaRewriteToStructured</span> <span class="k">extends</span> <span class="nc">SemanticRule</span><span class="o">(</span><span class="s">"EchopraxiaRewriteToStructured"</span><span class="o">)</span> <span class="o">{</span>
<span class="c1">// ...
</span> <span class="k">def</span> <span class="n">isThrowable</span><span class="o">(</span><span class="n">signature</span><span class="k">:</span> <span class="kt">Signature</span><span class="o">)</span><span class="k">:</span> <span class="kt">Boolean</span> <span class="o">=</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">toFqn</span><span class="o">(</span><span class="n">symbol</span><span class="k">:</span> <span class="kt">Symbol</span><span class="o">)</span><span class="k">:</span> <span class="kt">String</span> <span class="o">=</span> <span class="n">symbol</span><span class="o">.</span><span class="n">value</span>
<span class="o">.</span><span class="n">replaceAll</span><span class="o">(</span><span class="s">"/"</span><span class="o">,</span> <span class="s">"."</span><span class="o">)</span>
<span class="o">.</span><span class="n">replaceAll</span><span class="o">(</span><span class="s">"\\.$"</span><span class="o">,</span> <span class="s">"\\$"</span><span class="o">)</span>
<span class="o">.</span><span class="n">stripSuffix</span><span class="o">(</span><span class="s">"#"</span><span class="o">)</span>
<span class="o">.</span><span class="n">stripPrefix</span><span class="o">(</span><span class="s">"_root_."</span><span class="o">)</span>
<span class="n">signature</span> <span class="k">match</span> <span class="o">{</span>
<span class="k">case</span> <span class="nc">ValueSignature</span><span class="o">(</span><span class="nc">TypeRef</span><span class="o">(</span><span class="k">_</span><span class="o">,</span> <span class="n">symbol</span><span class="o">,</span> <span class="k">_</span><span class="o">))</span> <span class="k">=></span>
<span class="k">val</span> <span class="n">cl</span> <span class="k">=</span> <span class="k">this</span><span class="o">.</span><span class="n">getClass</span><span class="o">.</span><span class="n">getClassLoader</span>
<span class="k">try</span> <span class="o">{</span>
<span class="n">classOf</span><span class="o">[</span><span class="kt">Throwable</span><span class="o">].</span><span class="n">isAssignableFrom</span><span class="o">(</span><span class="n">cl</span><span class="o">.</span><span class="n">loadClass</span><span class="o">(</span><span class="n">toFqn</span><span class="o">(</span><span class="n">symbol</span><span class="o">)))</span>
<span class="o">}</span> <span class="k">catch</span> <span class="o">{</span>
<span class="k">case</span> <span class="n">e</span><span class="k">:</span> <span class="kt">Exception</span> <span class="o">=></span>
<span class="kc">false</span>
<span class="o">}</span>
<span class="k">case</span> <span class="k">_</span> <span class="k">=></span>
<span class="kc">false</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p>Finally, let's add some configuration so that we can account for <a href="https://github.com/tersesystems/echopraxia-plusscala#custom-logger">custom loggers</a> and <a href="https://github.com/tersesystems/echopraxia-plusscala#field-builder">custom field builder methods</a>. This is <a href="https://scalacenter.github.io/scalafix/docs/developers/tutorial.html#use-withconfiguration-to-make-a-rule-configurable">very simple</a>: plop down a <code class="language-plaintext highlighter-rouge">Config</code> and a <code class="language-plaintext highlighter-rouge">withConfiguration</code> method and we're pretty much done:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">import</span> <span class="nn">metaconfig.</span><span class="o">{</span><span class="nc">ConfDecoder</span><span class="o">,</span> <span class="nc">Configured</span><span class="o">}</span>
<span class="k">import</span> <span class="nn">metaconfig.generic.Surface</span>
<span class="c1">// ...
</span>
<span class="k">class</span> <span class="nc">EchopraxiaRewriteToStructured</span><span class="o">(</span>
<span class="n">config</span><span class="k">:</span> <span class="kt">EchopraxiaRewriteToStructured.Config</span>
<span class="o">)</span> <span class="k">extends</span> <span class="nc">SemanticRule</span><span class="o">(</span><span class="s">"EchopraxiaRewriteToStructured"</span><span class="o">)</span> <span class="o">{</span>
<span class="k">private</span> <span class="k">val</span> <span class="n">loggerClass</span><span class="k">:</span> <span class="kt">String</span> <span class="o">=</span> <span class="n">config</span><span class="o">.</span><span class="n">loggerClass</span>
<span class="k">private</span> <span class="k">val</span> <span class="n">fieldBuilderMethod</span><span class="k">:</span> <span class="kt">String</span> <span class="o">=</span> <span class="n">config</span><span class="o">.</span><span class="n">fieldBuilderMethod</span>
<span class="k">def</span> <span class="k">this</span><span class="o">()</span> <span class="k">=</span> <span class="k">this</span><span class="o">(</span><span class="nc">EchopraxiaRewriteToStructured</span><span class="o">.</span><span class="nc">Config</span><span class="o">())</span>
<span class="k">override</span> <span class="k">def</span> <span class="n">withConfiguration</span><span class="o">(</span><span class="n">config</span><span class="k">:</span> <span class="kt">Configuration</span><span class="o">)</span><span class="k">:</span> <span class="kt">Configured</span><span class="o">[</span><span class="kt">Rule</span><span class="o">]</span> <span class="k">=</span>
<span class="n">config</span><span class="o">.</span><span class="n">conf</span>
<span class="o">.</span><span class="n">getOrElse</span><span class="o">(</span><span class="s">"EchopraxiaRewriteToStructured"</span><span class="o">)(</span><span class="k">this</span><span class="o">.</span><span class="n">config</span><span class="o">)</span>
<span class="o">.</span><span class="n">map</span> <span class="o">{</span> <span class="n">newConfig</span> <span class="k">=></span> <span class="k">new</span> <span class="nc">EchopraxiaRewriteToStructured</span><span class="o">(</span><span class="n">newConfig</span><span class="o">)</span> <span class="o">}</span>
<span class="c1">// ...
</span><span class="o">}</span>
<span class="k">object</span> <span class="nc">EchopraxiaRewriteToStructured</span> <span class="o">{</span>
<span class="k">case</span> <span class="k">class</span> <span class="nc">Config</span><span class="o">(</span>
<span class="n">loggerClass</span><span class="k">:</span> <span class="kt">String</span> <span class="o">=</span> <span class="s">"com.tersesystems.echopraxia.plusscala.Logger"</span><span class="o">,</span>
<span class="n">fieldBuilderMethod</span><span class="k">:</span> <span class="kt">String</span> <span class="o">=</span> <span class="s">"value"</span>
<span class="o">)</span>
<span class="k">object</span> <span class="nc">Config</span> <span class="o">{</span>
<span class="k">val</span> <span class="n">default</span> <span class="k">=</span> <span class="nc">Config</span><span class="o">()</span>
<span class="k">implicit</span> <span class="k">val</span> <span class="n">surface</span><span class="k">:</span> <span class="kt">Surface</span><span class="o">[</span><span class="kt">Config</span><span class="o">]</span> <span class="k">=</span>
<span class="n">metaconfig</span><span class="o">.</span><span class="n">generic</span><span class="o">.</span><span class="n">deriveSurface</span><span class="o">[</span><span class="kt">Config</span><span class="o">]</span>
<span class="k">implicit</span> <span class="k">val</span> <span class="n">decoder</span><span class="k">:</span> <span class="kt">ConfDecoder</span><span class="o">[</span><span class="kt">Config</span><span class="o">]</span> <span class="k">=</span>
<span class="n">metaconfig</span><span class="o">.</span><span class="n">generic</span><span class="o">.</span><span class="n">deriveDecoder</span><span class="o">(</span><span class="n">default</span><span class="o">)</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
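<p>With this in place, the rule can be enabled and tuned from <code class="language-plaintext highlighter-rouge">.scalafix.conf</code>. As a sketch – the logger class and field builder method below are hypothetical stand-ins for your own:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rules = [
  EchopraxiaRewriteToStructured
]

EchopraxiaRewriteToStructured.loggerClass = "com.example.MyLogger"
EchopraxiaRewriteToStructured.fieldBuilderMethod = "keyValue"
</code></pre></div></div>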
<p>And that's it! I hope this shows how simple and straightforward refactoring in Scalafix can be. For extra credit, I also have <a href="https://github.com/tersesystems/echopraxia-scalafix#echopraxiawrapmethodwithlogger">EchopraxiaWrapMethodWithLogger</a> that will wrap a method in a <a href="https://github.com/tersesystems/echopraxia-plusscala#trace-and-flow-loggers">flow or trace logger</a>.</p>Latency and Throughput With Logback2022-10-16T19:53:50-07:002022-10-16T19:53:50-07:00https://tersesystems.com/blog/2022/10/16/latency-and-throughput-with-logback<p>I've been working with Logback for a while now, and one of the things that stands out is how people will talk about "fast" or "performant" logging, with the theory that picking the right encoder or the right appender will make things work. It's not wrong, but it's not exactly right either.</p>
<p>So, this blog post discusses latency and throughput in Logback, along with some fun non-obvious things that can cause production issues if you're not careful. And it has pictures!</p>
<h2 id="latency">Latency</h2>
<p>Latency is defined as the amount of time required to complete a single operation.</p>
<p>Latency is a surprisingly slippery concept, because as soon as you start aggregating latency times, you can end up with visualizations that can <a href="https://igor.io/latency/">omit or obscure parts of the picture</a>. Latency can be reported as averages, percentiles, histograms (useful for "long tail" latency), or heatmaps.</p>
<p>Because we're talking about conceptual latency here, we'll talk about the "average" latency between making a logging statement and that statement being logged.</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@BenchmarkMode</span><span class="o">(</span><span class="nc">Array</span><span class="o">(</span><span class="nc">Mode</span><span class="o">.</span><span class="nc">AverageTime</span><span class="o">))</span>
<span class="nd">@OutputTimeUnit</span><span class="o">(</span><span class="nc">TimeUnit</span><span class="o">.</span><span class="nc">NANOSECONDS</span><span class="o">)</span>
<span class="k">class</span> <span class="nc">SLF4JBenchmark</span> <span class="o">{</span>
<span class="k">import</span> <span class="nn">SLF4JBenchmark._</span>
<span class="nd">@Benchmark</span>
<span class="k">def</span> <span class="n">boundedDebugWithTemplate</span><span class="o">()</span><span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span>
<span class="k">if</span> <span class="o">(</span><span class="n">logger</span><span class="o">.</span><span class="n">isDebugEnabled</span><span class="o">)</span> <span class="o">{</span>
<span class="n">logger</span><span class="o">.</span><span class="n">debug</span><span class="o">(</span><span class="s">"hello world, {}"</span><span class="o">,</span> <span class="n">longAdder</span><span class="o">.</span><span class="n">incrementAndGet</span><span class="o">())</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p>And using an encoder and appender like this:</p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><configuration></span>
<span class="nt"><appender</span> <span class="na">name=</span><span class="s">"FILE"</span> <span class="na">class=</span><span class="s">"ch.qos.logback.core.FileAppender"</span><span class="nt">></span>
<span class="nt"><file></span>testFile.log<span class="nt"></file></span>
<span class="nt"><append></span>false<span class="nt"></append></span>
<span class="nt"><immediateFlush></span>false<span class="nt"></immediateFlush></span>
<span class="nt"><encoder></span>
<span class="nt"><pattern></span>%-4relative [%thread] %-5level %logger{35} - %msg%n<span class="nt"></pattern></span>
<span class="nt"></encoder></span>
<span class="nt"></appender></span>
<span class="nt"></configuration></span>
</code></pre></div></div>
<p>Say that <code class="language-plaintext highlighter-rouge">boundedDebugWithTemplate</code> takes roughly 871 nanoseconds as measured by JMH. We can visualize this as a straight line, from the time of logging to the time that bytes were appended to a file.</p>
<p><img src="/images/latency/first.png" alt="first" /></p>
<p>But logging is made up of several operations. For example, if we swap out the file appender for a no-op appender that does nothing but create the logging event and a message based off the template, we can see that the same operation takes only 33 nanoseconds. If we set the logger to INFO level, we can see the <code class="language-plaintext highlighter-rouge">isDebugEnabled</code> call takes only 1.6 nanoseconds. So in reality, what we're looking at is more like this:</p>
<p><img src="/images/latency/second.png" alt="second" /></p>
<p>Because the FileAppender is blocking and Logback runs everything in the calling thread, this means that turning on debugging in an operation will add ~871 ns to every call.</p>
<p><img src="/images/latency/third.png" alt="third" /></p>
<p>This also compounds for <em>every</em> blocking appender. The initial costs of putting together the logging event happen once, but if you have a STDOUT appender, a file appender, and a network appender, they all encode the logging event using distinct encoders, and render sequentially on the same thread.</p>
<p><img src="/images/latency/fourth.png" alt="fourth" /></p>
<p>In practical terms – the more appenders you add, the slower your code gets when you log.</p>
<p>It's important to note at this point how tiny a latency of 871 nanoseconds is – for comparison, instantiating any Java object costs around 20 nanoseconds. For most operations, logging is not the bottleneck compared to the costs of the operation itself – unnecessary database queries, blocking on network calls, and lack of caching are still the low hanging fruit.</p>
<p>However, it is still a cost. Moreover, looking at the average latency doesn't tell you about the outliers – the <a href="https://brooker.co.za/blog/2021/04/19/latency.html">"long tail" of latency</a>. If an operation blocks in any way, then that cost will be passed on to the application. And blocking can happen in the most insidious of ways.</p>
<p>The obvious source of blocking is when a logging event or message includes a blocking call. For example, calling <code class="language-plaintext highlighter-rouge">UUID.randomUUID()</code> blocks because of the internal <a href="https://braveo.blogspot.com/2013/05/uuidrandomuuid-is-slow.html">lock</a>, or calling <code class="language-plaintext highlighter-rouge">toString()</code> on a collection that contains <code class="language-plaintext highlighter-rouge">java.net.URL</code> objects, causing hundreds of <a href="https://michaelscharf.blogspot.com/2006/11/javaneturlequals-and-hashcode-make.html">DNS resolutions</a>. This can block an HTTP request for multiple seconds, and it won't be immediately obvious from looking at the logs.</p>
<p><img src="/images/latency/fourth-block1.png" alt="fourth1" /></p>
<p>Blocking is not solely an input problem, though – it can also come from Logback itself.</p>
<p>Blocking in Logback can come from appenders. Anything extending <a href="https://logback.qos.ch/manual/appenders.html#AppenderBase">AppenderBase</a> uses a <code class="language-plaintext highlighter-rouge">synchronized</code> lock that ensures only one thread is appending. While it looks like blocking in appenders is a small consistent cost, this is not always the case. For example, a rolling file appender can block on rollover. <a href="https://jira.qos.ch/browse/LOGBACK-267">LOGBACK-267</a> means that if you use <a href="https://logback.qos.ch/manual/appenders.html#FixedWindowRollingPolicy">FixedWindowRollingPolicy</a> and enable compression by specifying a <code class="language-plaintext highlighter-rouge">.gz</code> suffix, then compressing multi-gigabyte files can <a href="https://medium.com/groupon-eng/debugging-tuning-logback-bottleneck-in-a-high-throughput-java-application-5161dd43cc6d">stall the appender</a>, blocking all logging for 55 to 69 seconds. The underlying cause is that <code class="language-plaintext highlighter-rouge">FixedWindowRollingPolicy.java</code> calls <a href="https://github.com/qos-ch/logback/blob/master/logback-core/src/main/java/ch/qos/logback/core/rolling/FixedWindowRollingPolicy.java#L154"><code class="language-plaintext highlighter-rouge">compressor.compress</code></a>, as opposed to <code class="language-plaintext highlighter-rouge">TimeBasedRollingPolicy.java</code> which uses <a href="https://github.com/qos-ch/logback/blob/master/logback-core/src/main/java/ch/qos/logback/core/rolling/TimeBasedRollingPolicy.java#L178"><code class="language-plaintext highlighter-rouge">compressor.asyncCompress</code></a>.</p>
<p><img src="/images/latency/fourth-block2.png" alt="fourth2" /></p>
<p>You might think the problem of blocking can be easily avoided, but it's not quite that simple. Blocking can happen at the kernel, even when <a href="https://www.evanjones.ca/jvm-mmap-pause.html">writing to memory mapped files</a>, as the operating system manages writes. This causes <a href="https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic">issues</a>. Filesystem blocking can occur even in software RAID or a network backed VFS. In short, when files were created this made <a href="https://danluu.com/deconstruct-files/">lots of people angry</a>, and was <a href="https://rachelbythebay.com/w/2020/08/11/files/">widely regarded as a bad move</a>. I suspect that the <a href="https://github.com/logfellow/logstash-logback-encoder#tcp-appenders">TCP appenders</a> and TCP network stack work differently, but then the assumption is that <a href="https://queue.acm.org/detail.cfm?id=2655736">the network is reliable</a>.</p>
<h2 id="asynchronous-logging">Asynchronous Logging</h2>
<p>There is a way to avoid unanticipated blocking: we can log asynchronously. Asynchronous logging is a trend, with <a href="https://aws.amazon.com/blogs/developer/asynchronous-logging-corretto-17/">asynchronous GC logging in Corretto 17</a> coming out for the JDK itself.</p>
<p>There are several ways to implement asynchronous logging. <a href="https://github.com/tersesystems/echopraxia">Echopraxia</a> can address it at invocation with an <a href="https://github.com/tersesystems/echopraxia#asynchronous-logging">asynchronous logger</a>, deferring argument construction and condition evaluations and allowing <a href="https://github.com/tersesystems/echopraxia#managing-caller-info">caller information for free</a>, at the cost of a more complex method interface. Alternatively, asynchronous logging can be implemented in an appender, although this does mean that argument and <code class="language-plaintext highlighter-rouge">LoggingEvent</code> construction happen on the calling thread.</p>
<p>Logback does have an out-of-the-box async appender, but the <code class="language-plaintext highlighter-rouge">LoggingEventAsyncDisruptorAppender</code> from <a href="https://github.com/logfellow/logstash-logback-encoder#async-appenders">logstash-logback-encoder</a> is much richer feature-wise: by default it drops <em>all</em> events when full, it can warn when full, and it offers more customization of ring buffer size and behavior. From a performance perspective I'd say it's a wash for most people – note that the <a href="https://logback.qos.ch/performance.html">logback performance page</a> discusses <em>throughput</em>, so it's not an apples-to-apples comparison.</p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><appender</span> <span class="na">name=</span><span class="s">"async"</span> <span class="na">class=</span><span class="s">"net.logstash.logback.appender.LoggingEventAsyncDisruptorAppender"</span><span class="nt">></span>
<span class="nt"><appender</span> <span class="na">class=</span><span class="s">"ch.qos.logback.core.rolling.RollingFileAppender"</span><span class="nt">></span>
...
<span class="nt"></appender></span>
<span class="nt"></appender></span>
</code></pre></div></div>
<p>An async appender will accept a <code class="language-plaintext highlighter-rouge">LoggingEvent</code>, and will write it to an in-memory ring buffer that a dedicated thread drains to the enclosed appenders.</p>
<p>On average, the mean latency for a disruptor is ~50 nanoseconds, up to a <a href="https://github.com/wsargent/slf4j-benchmark#debug-enabled-with-async-appender">worst case scenario of 420 ns</a> when the queue is fully loaded. This means that the rendering thread only incurs the latency cost of 33 ns (eval + logback event) + 50 ns (enqueuing), but does not incur the latency cost of appending to file. An asynchronous boundary exists between the thread running the operation, and the thread that picks up the logger and writes to the appenders.</p>
<p><img src="/images/latency/fifth.png" alt="fifth" /></p>
<p>Using multiple threads enables logging to be concurrent, running alongside operations without interfering with them. There is a difference between concurrency and parallelism: if there's only one core available, then the two threads may run interleaved, and there may be a small delay in writing the logs. If there are multiple cores available though, then typically the thread will be writing logs in parallel.</p>
<p>There are some special cases / catches to asynchronous logging.</p>
<p>The first catch is not adding a <a href="https://logback.qos.ch/manual/configuration.html#stopContext">shutdown hook</a>; you need to let the ring buffer <a href="https://github.com/logfellow/logstash-logback-encoder#graceful-shutdown">shut down gracefully</a>, and if Logback shuts down immediately you will miss events that could be critical.</p>
<p>The second catch is using unnecessary async appenders, each wrapping a single appender. This can be a waste of threads; you only need one to create an asynchronous boundary. If you do not anticipate significant load and your appenders are fast, my recommendation is to define a single async appender at the root, before you do anything else.</p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><configuration></span>
<span class="nt"><shutdownHook</span> <span class="na">class=</span><span class="s">"ch.qos.logback.core.hook.DelayingShutdownHook"</span><span class="nt">></span>
<span class="nt"><delay></span>150<span class="nt"></delay></span>
<span class="nt"></shutdownHook></span>
<span class="nt"><root</span> <span class="na">level=</span><span class="s">"all"</span><span class="nt">></span>
<span class="nt"><appender</span> <span class="na">class=</span><span class="s">"net.logstash.logback.appender.LoggingEventAsyncDisruptorAppender"</span><span class="nt">></span>
<span class="nt"><appender</span> <span class="na">name=</span><span class="s">"FILE"</span><span class="nt">></span>...<span class="nt"></appender></span>
<span class="nt"><appender</span> <span class="na">name=</span><span class="s">"STDOUT"</span><span class="nt">></span>...<span class="nt"></appender></span>
<span class="nt"><appender</span> <span class="na">name=</span><span class="s">"TCP"</span><span class="nt">></span>...<span class="nt"></appender></span>
<span class="nt"></appender></span>
<span class="nt"></root></span>
<span class="nt"></configuration></span>
</code></pre></div></div>
<p>The third catch is what happens to asynchronous logging when there is significant load. Ring buffers can fill up when the underlying appenders are slow and do not drain the buffer fast enough, and a full ring buffer can result in <a href="https://github.com/logfellow/logstash-logback-encoder#ringbuffer-full">dropped events</a>.</p>
<p>Therefore, if you do have an appender that's awkward (and you can't fix it), you should configure a distinct appender for it and configure it so it doesn't jam up the others.</p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><configuration></span>
<span class="nt"><root</span> <span class="na">level=</span><span class="s">"all"</span><span class="nt">></span>
<span class="nt"><appender</span> <span class="na">class=</span><span class="s">"net.logstash.logback.appender.LoggingEventAsyncDisruptorAppender"</span><span class="nt">></span>
<span class="nt"><appender</span> <span class="na">name=</span><span class="s">"FILE"</span><span class="nt">></span>...<span class="nt"></appender></span>
<span class="nt"><appender</span> <span class="na">name=</span><span class="s">"STDOUT"</span><span class="nt">></span>...<span class="nt"></appender></span>
<span class="nt"><appender</span> <span class="na">name=</span><span class="s">"TCP"</span><span class="nt">></span>...<span class="nt"></appender></span>
<span class="nt"></appender></span>
<span class="nt"><appender</span> <span class="na">class=</span><span class="s">"net.logstash.logback.appender.LoggingEventAsyncDisruptorAppender"</span><span class="nt">></span>
<span class="nt"><ringBufferSize></span>[some large multiple of 2]<span class="nt"></ringBufferSize></span>
<span class="nt"><appender</span> <span class="na">class=</span><span class="s">"RollingFileAppender"</span><span class="nt">></span>
<span class="c"><!-- trigger LOGBACK-267 --></span>
<span class="nt"><rollingPolicy</span> <span class="na">class=</span><span class="s">"FixedWindowRollingPolicy"</span><span class="nt">></span>
<span class="nt"><fileNamePattern></span>backup%i.log.gz<span class="nt"></fileNamePattern></span>
...
<span class="nt"></rollingPolicy></span>
<span class="nt"><triggeringPolicy></span>
<span class="nt"><maxFileSize></span>4GB<span class="nt"></maxFileSize></span>
<span class="nt"></triggeringPolicy></span>
<span class="nt"><encoder></span>...<span class="nt"></encoder></span>
<span class="nt"></appender></span>
<span class="nt"></appender></span>
<span class="nt"></root></span>
<span class="nt"></configuration></span>
</code></pre></div></div>
<p>You may lose some events if it spills over, but that's better than stalling your application.</p>
<p>You can also add an <a href="https://github.com/logfellow/logstash-logback-encoder#appender-listeners">appender listener</a> to notify you of any dropped messages. The <code class="language-plaintext highlighter-rouge">FailureSummaryLoggingAppenderListener</code> implementation will log a summary of any dropped messages, but it does have the drawback that the listener logs the summary to the same appender that is dropping messages – so the summary itself can be lost. You are better off writing your own implementation of the <a href="https://github.com/logfellow/logstash-logback-encoder/blob/main/src/main/java/net/logstash/logback/appender/listener/AppenderListener.java">interface</a>, and using it to send dropped-message counts to your metrics or error reporting system in a scheduled runnable, using the <code class="language-plaintext highlighter-rouge">ScheduledExecutorService</code> from Logback's <code class="language-plaintext highlighter-rouge">Context</code>.</p>
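<p>Here's a rough sketch of the shape such a listener could take. The method names come from my reading of the <code class="language-plaintext highlighter-rouge">AppenderListener</code> interface linked above, so double-check them against the source, and treat the reporting side as a stub:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

import ch.qos.logback.classic.spi.ILoggingEvent;
import ch.qos.logback.core.Appender;
import net.logstash.logback.appender.listener.AppenderListener;

public class DroppedEventReporter implements AppenderListener<ILoggingEvent> {
  private final LongAdder dropped = new LongAdder();

  @Override
  public void appenderStarted(Appender<ILoggingEvent> appender) {
    // Report out-of-band on Logback's scheduled executor service, so the
    // summary does not depend on the appender that is dropping events.
    appender.getContext().getScheduledExecutorService().scheduleAtFixedRate(
      () -> report(dropped.sumThenReset()), 1, 1, TimeUnit.MINUTES);
  }

  @Override
  public void eventAppendFailed(Appender<ILoggingEvent> appender, ILoggingEvent event, Throwable reason) {
    // Append failures are reported here; an event dropped from a full
    // ring buffer shows up as a failed append.
    dropped.increment();
  }

  private void report(long count) {
    if (count > 0) {
      // send the count to your metrics or error reporting system here
    }
  }
}
</code></pre></div></div>

<p>Per the appender listeners documentation linked above, the listener is then registered on the async appender with a <code class="language-plaintext highlighter-rouge"><listener class="..."/></code> element in the Logback configuration.</p>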
<h2 id="throughput">Throughput</h2>
<p>The throughput of an application is how many operations it can process over a period of time. This is typically the metric we care about for batch operations and things that happen in the background.</p>
<p>Throughput is a tricky quantity to measure, because doing operations in bulk improves throughput, but can cause applications to seem unresponsive, and vice versa. For example, writing to STDOUT is "fast" because of I/O buffering, but writing STDOUT to a terminal is <a href="https://medium.com/spencerweekly/console-output-overhead-why-is-writing-to-stdout-so-slow-b0cc7c88704c">slow</a>, because users expect immediate feedback.</p>
<p>Let's see what raw disk throughput looks like on my laptop.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">public</span> <span class="kd">class</span> <span class="nc">Main</span> <span class="o">{</span>
<span class="kd">private</span> <span class="kd">static</span> <span class="kd">final</span> <span class="n">org</span><span class="o">.</span><span class="na">slf4j</span><span class="o">.</span><span class="na">Logger</span> <span class="n">logger</span> <span class="o">=</span> <span class="n">org</span><span class="o">.</span><span class="na">slf4j</span><span class="o">.</span><span class="na">LoggerFactory</span><span class="o">.</span><span class="na">getLogger</span><span class="o">(</span><span class="n">Main</span><span class="o">.</span><span class="na">class</span><span class="o">);</span>
<span class="kd">public</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">main</span><span class="o">(</span><span class="n">String</span><span class="o">[]</span> <span class="n">args</span><span class="o">)</span> <span class="o">{</span>
<span class="n">Timer</span> <span class="n">timer</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Timer</span><span class="o">();</span>
<span class="n">timer</span><span class="o">.</span><span class="na">schedule</span><span class="o">(</span><span class="k">new</span> <span class="n">TimerTask</span><span class="o">()</span> <span class="o">{</span>
<span class="nd">@Override</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">run</span><span class="o">()</span> <span class="o">{</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">"Exiting"</span><span class="o">);</span>
<span class="n">System</span><span class="o">.</span><span class="na">exit</span><span class="o">(</span><span class="mi">0</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">},</span> <span class="n">TimeUnit</span><span class="o">.</span><span class="na">MINUTES</span><span class="o">.</span><span class="na">toMillis</span><span class="o">(</span><span class="mi">1</span><span class="o">));</span>
<span class="kt">int</span> <span class="n">threads</span> <span class="o">=</span> <span class="mi">1</span>
<span class="k">for</span> <span class="o">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">threads</span><span class="o">;</span> <span class="n">i</span><span class="o">++)</span> <span class="o">{</span>
<span class="kd">final</span> <span class="n">String</span> <span class="n">name</span> <span class="o">=</span> <span class="s">"logger-"</span> <span class="o">+</span> <span class="n">i</span><span class="o">;</span>
<span class="n">Thread</span> <span class="n">t</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Thread</span><span class="o">(</span><span class="n">name</span><span class="o">)</span> <span class="o">{</span>
<span class="nd">@Override</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">run</span><span class="o">()</span> <span class="o">{</span>
<span class="k">while</span> <span class="o">(</span><span class="kc">true</span><span class="o">)</span> <span class="o">{</span>
<span class="n">logger</span><span class="o">.</span><span class="na">info</span><span class="o">(</span><span class="s">"Hello world!"</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="o">};</span>
<span class="n">t</span><span class="o">.</span><span class="na">start</span><span class="o">();</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p>The multi-file runs use the <a href="https://logback.qos.ch/manual/appenders.html#SiftingAppender">sifting appender</a> together with the <a href="https://dzone.com/articles/siftingappender-logging">thread name based discriminator</a>, which is sketched below after the configuration.</p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><configuration</span> <span class="na">debug=</span><span class="s">"true"</span><span class="nt">></span>
<span class="nt"><appender</span> <span class="na">name=</span><span class="s">"FILE"</span> <span class="na">class=</span><span class="s">"ch.qos.logback.core.FileAppender"</span><span class="nt">></span>
<span class="nt"><file></span>application.log<span class="nt"></file></span>
<span class="nt"><append></span>false<span class="nt"></append></span>
<span class="nt"><immediateFlush></span>false<span class="nt"></immediateFlush></span>
<span class="nt"><encoder></span>
<span class="nt"><pattern></span>%-4relative [%thread] %-5level %logger{35} - %msg%n<span class="nt"></pattern></span>
<span class="nt"></encoder></span>
<span class="nt"></appender></span>
<span class="nt"><appender</span> <span class="na">name=</span><span class="s">"SIFT"</span> <span class="na">class=</span><span class="s">"ch.qos.logback.classic.sift.SiftingAppender"</span><span class="nt">></span>
<span class="nt"><discriminator</span> <span class="na">class=</span><span class="s">"org.example.ThreadNameBasedDiscriminator"</span><span class="nt">/></span>
<span class="nt"><sift></span>
<span class="nt"><appender</span> <span class="na">name=</span><span class="s">"FILE-${threadName}"</span> <span class="na">class=</span><span class="s">"ch.qos.logback.core.FileAppender"</span><span class="nt">></span>
<span class="nt"><file></span>${threadName}.log<span class="nt"></file></span>
<span class="nt"><append></span>false<span class="nt"></append></span>
<span class="nt"><immediateFlush></span>false<span class="nt"></immediateFlush></span>
<span class="nt"><encoder></span>
<span class="nt"><pattern></span>%-4relative [%thread] %-5level %logger{35} - %msg%n<span class="nt"></pattern></span>
<span class="nt"></encoder></span>
<span class="nt"></appender></span>
<span class="nt"></sift></span>
<span class="nt"></appender></span>
<span class="nt"><root</span> <span class="na">level=</span><span class="s">"INFO"</span><span class="nt">></span>
<span class="nt"><appender-ref</span> <span class="na">ref=</span><span class="s">"FILE"</span><span class="nt">/></span>
<span class="nt"></root></span>
<span class="nt"></configuration></span>
</code></pre></div></div>
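<p>The discriminator itself is tiny. A minimal sketch along the lines of the linked article, using Logback's <code class="language-plaintext highlighter-rouge">AbstractDiscriminator</code>:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package org.example;

import ch.qos.logback.classic.spi.ILoggingEvent;
import ch.qos.logback.core.sift.AbstractDiscriminator;

// Exposes the current thread's name under the "threadName" key, which the
// sifting appender substitutes into ${threadName} above.
public class ThreadNameBasedDiscriminator extends AbstractDiscriminator<ILoggingEvent> {
  @Override
  public String getDiscriminatingValue(ILoggingEvent event) {
    return event.getThreadName();
  }

  @Override
  public String getKey() {
    return "threadName";
  }
}
</code></pre></div></div>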
<p>Run it for 1 minute, 1 thread, one FILE appender (100% core usage) = 5.5GB</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-rw-rw-r-- 1 wsargent wsargent 5815185001 Oct 16 14:13 application.log
</code></pre></div></div>
<p><img src="/images/throughput/1file1thread.png" alt="1file1thread" /></p>
<p>There are use cases where you may want to log this much! If you want to enable total debug output on your local filesystem and nothing else is involved in file IO, then yeah, why not? This is the premise behind <a href="https://tersesystems.com/blog/2020/11/26/queryable-logging-with-blacklite/">diagnostic logging in Blacklite</a>, where you keep a rolling buffer of debug events in SQLite that you can dip into for extra context when an error occurs.</p>
<p>However, in most circumstances, your issue will be <em>too much</em> throughput rather than <em>too little</em>. I've already written about the costs involved in indexing and storing logs in <a href="https://tersesystems.com/blog/2019/06/03/application-logging-in-java-part-6/">Logging Costs</a> so I won't go over them again – suffice to say that your devops team will not be happy at these numbers if you are planning to log at INFO or above and send it to a centralized logging environment.</p>
<p>Instead, I want to look more at the bottom line number. Is there any way we can make this faster?</p>
<p>First, let's just up the number of input threads logging, just to confirm that the bottleneck is on the backend.</p>
<p>1 minute, 4 threads, one FILE appender = 4.4GB</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-rw-rw-r-- 1 wsargent wsargent 4777409363 Oct 16 16:37 application.log
</code></pre></div></div>
<p>Huh. Throughput goes <em>down</em> when we log from multiple threads. This is likely due to the <a href="https://mechanical-sympathy.blogspot.com/2011/09/single-writer-principle.html">single writer principle</a>.</p>
<p>But maybe it's the encoder that's slowing things down before the data reaches the filesystem. One tip from <a href="https://corecursive.com/frontiers-of-performance-with-daniel-lemire/#io-and-file-processing-performance">this podcast</a>: Daniel Lemire says that CPU bottlenecks can come before IO bottlenecks, even though that's not popular orthodoxy. Using htop, it looked like the process was maxing out a single core, so let's work from there.</p>
<p>If we run several threads writing to several different files, we might avoid the bottleneck on a single core. Let's see what happens when we use the sifting appender to multiplex multiple threads to multiple files.</p>
<p>1 minute, 2 threads, sifting appender (90% core usage on two threads) = 3.3 GB:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-rw-rw-r-- 1 wsargent wsargent 1804883163 Oct 16 14:16 logger-0.log
-rw-rw-r-- 1 wsargent wsargent 1614980252 Oct 16 14:16 logger-1.log
</code></pre></div></div>
<p><img src="/images/throughput/1file2threads.png" alt="1file2threads" /></p>
<p>1 minute, 4 threads, sifting appender = 3.5 GB</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-rw-rw-r-- 1 wsargent wsargent 977745435 Oct 16 14:20 logger-0.log
-rw-rw-r-- 1 wsargent wsargent 914416956 Oct 16 14:20 logger-1.log
-rw-rw-r-- 1 wsargent wsargent 954032694 Oct 16 14:20 logger-2.log
-rw-rw-r-- 1 wsargent wsargent 968975679 Oct 16 14:20 logger-3.log
</code></pre></div></div>
<p><img src="/images/throughput/1file4threads.png" alt="1file4threads" /></p>
<p>Okay, so it's not that the encoder is the bottleneck. Instead, it appears to be the disk, and switching contexts between threads doesn't help the throughput. Let's write to <code class="language-plaintext highlighter-rouge">tmpfs</code> instead, using the default <a href="https://www.cyberciti.biz/tips/what-is-devshm-and-its-practical-usage.html"><code class="language-plaintext highlighter-rouge">/dev/shm</code></a>.</p>
<p>1 minute, 1 thread, one FILE appender <code class="language-plaintext highlighter-rouge">/dev/shm/application.log</code> = 8.5 GB:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-rw-rw-r-- 1 wsargent wsargent 9027620209 Oct 16 17:27 application.log
</code></pre></div></div>
<p><img src="/images/throughput/1filefsync.png" alt="1filefsync" /></p>
<p>Ah-ha! That boosted throughput by more than 50%!</p>
<p>Now let's see what happens if we add more threads, just to check.</p>
<p>1 minute, 2 threads, one FILE appender <code class="language-plaintext highlighter-rouge">/dev/shm/application.log</code> = 8.0 GB:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-rw-rw-r-- 1 wsargent wsargent 8563431350 Oct 16 18:06 application.log
</code></pre></div></div>
<p>Yep, that again reduces the throughput.</p>
<p>The message to take away from this is that if you want to maximize throughput, it's not a question of picking the right logging framework, or the right appender – you need to look at your bottlenecks.</p>
<h2 id="summary">Summary</h2>
<ul>
<li>Use conditional guards and avoid creating objects or calling methods that may block (which may include <code class="language-plaintext highlighter-rouge">toString</code>).</li>
<li>Use <code class="language-plaintext highlighter-rouge">logstash-logback-encoder</code>, preferably with a listener so you can tell if the queue fills up.</li>
<li>Be careful of edge cases with compression and rollover.</li>
<li>Usually the concern is <em>too much</em> throughput rather than <em>too little</em>.</li>
<li>If you really need it, use <code class="language-plaintext highlighter-rouge">/dev/shm</code> or <a href="https://github.com/tersesystems/blacklite/">Blacklite</a> for your log storage.</li>
</ul>I've been working with Logback for a while now, and one of the things that stands out is how people will talk about "fast" or "performant" logging, with the theory that picking the right encoder or the right appender will make things work. It's not wrong, but it's not exactly right either. So, this blog post discusses latency and throughput in Logback, along with some fun non-obvious things that can cause production issues if you're not careful. And it has pictures! Latency Latency is defined as the amount of time required to complete a single operation. Latency is a surprisingly slippery concept, because as soon as you start aggregating latency times, you can end up with visualizations that can omit or obscure parts of the picture. Latency can be defined as averages, percentiles, histograms (useful for "long tail" latency), or heatmaps. Because we're talking about conceptual latency here, we'll talk about the "average" latency between a logging statement, and a statement being logged. @BenchmarkMode(Array(Mode.AverageTime)) @OutputTimeUnit(TimeUnit.NANOSECONDS) class SLF4JBenchmark { import SLF4JBenchmark._ @Benchmark def boundedDebugWithTemplate(): Unit = if (logger.isDebugEnabled) { logger.debug("hello world, {}", longAdder.incrementAndGet()) } } And using an encoder and appender like this: <configuration> <appender name="FILE" class="ch.qos.logback.core.FileAppender"> <file>testFile.log</file> <append>false</append> <immediateFlush>false</immediateFlush> <encoder> <pattern>%-4relative [%thread] %-5level %logger{35} - %msg%n</pattern> </encoder> </appender> </configuration> Say that boundedDebugWithTemplate takes roughly 871 nanoseconds as measured by JMH. We can visualize this as a straight line, from the time of logging to the time that bytes were appended to a file. But logging is made up of several operations. For example, if we swap out the file appender for an no-op appender that does nothing but create the logging event and a message based off the template, we can see that the same operation takes only 33 nanoseconds. If we set the logger to INFO level, we can see the isLoggingDebug call takes only 1.6 nanoseconds. So in reality, what we're looking at is more like this: Because the FileAppender is blocking and Logback runs everything in the calling thread, this means that turning on debugging in an operation will add ~871 ns to every call. This also compounds for every blocking appender. The initial costs of putting together the logging event happen once, but if you have a STDOUT appender, a file appender, and a network appender, they all encode the logging event using distinct encoders, and render sequentially on the same thread. In practical terms – the more appenders you add, the slower your code gets when you log. It's important to note at this point how tiny a latency of 871 nanoseconds is – for comparison, instantiating any Java object costs around 20 nanoseconds. For most operations, logging is not the bottleneck compared to the costs of the operation itself – unnecessary database queries, blocking on network calls, and lack of caching are still the low hanging fruit. However, it is still a cost. Moreover, looking at the average latency doesn't tell you about the outliers – the "long tail" of latency. If an operation blocks in any way, then that cost will be passed on to the application. And blocking can happen in the most insidious of ways. The obvious source of blocking is when a logging event or message includes a blocking call. 
For example, calling UUID.randomUUID() blocks because of the internal lock, or calling toString() on a collection that contains java.net.URL objects, causing hundreds of DNS resolutions. This can block an HTTP request for multiple seconds, and it won't be immediately obvious from looking at the logs. But blocking is not solely an input problem though – blocking can come from Logback itself. Blocking in Logback can come from appenders. Anything extending AppenderBase uses a synchronized lock that ensures only one thread is appending. While it looks like blocking in appenders is a small consistent cost, this is not always the case. For example, a rolling file appender can block on rollover. LOGBACK-267 means that if you use FixedWindowRollingPolicy and enable compression by specifying a .gz suffix, then compressing multi-gigabyte files can stall the appender, blocking all logging for 55 to 69 seconds. The underlying cause is that FixedWindowRollingPolicy.java calls compressor.compress, as opposed to TimeBasedRollingPolicy.java which uses compressor.asyncCompress. You might think the problem of blocking can be easily avoided, but it's not quite that simple. Blocking can happen at the kernel, even when writing to memory mapped files, as the operating system manages writes. This causes issues. Filesystem blocking can occur even in software RAID or a network backed VFS. In short, when files were created this made lots of people angry, and was widely regarded as a bad move. I suspect that the TCP appenders and TCP network stack work differently, but then the assumption is that the network is reliable. Asynchronous Logging There is a way to avoid unanticipated blocking: we can log asynchronously. Asynchronous logging is a trend, with asynchronous GC logging in Corretto 17 coming out for the JDK itself. There are several ways to implement asynchronous logging. Echopraxia can address it at invocation with an asynchronous logger, deferring argument construction and condition evaluations and allowing caller information for free, at the cost of a more complex method interface. Alternatively, asynchronous logging can be implemented in an appender, although this does mean that argument and LoggingEvent construction happen on the calling thread. Logback does have an out of the box async appender, but the LoggingEventAsyncDisruptorAppender from logstash-logback-encoder is much richer from a feature-based perspective; it by default drops all events when full, can warn when full, and has more customization available on ringbuffer size and behavior. From a performance perspective I'd say it's a wash for most people – note that the logback performance page discusses throughput, so it's not an apples to apples comparison. <appender name="async" class="net.logstash.logback.appender.LoggingEventAsyncDisruptorAppender"> <appender class="ch.qos.logback.core.rolling.RollingFileAppender"> ... </appender> </appender> An async appender will accept LoggerEvent, and will write to an in-memory ring buffer that is used by a dedicated thread to write to the enclosed appenders. On average, the mean latency for a disruptor is ~50 nanoseconds, up to a worst case scenario of 420 ns when the queue is fully loaded. This means that the rendering thread only incurs the latency cost of 33 ns (eval + logback event) + 50 ns (enqueuing), but does not incur the latency cost of appending to file. An asynchronous boundary exists between the thread running the operation, and the thread that picks up the logger and writes to the appenders. 
Using multiple threads enables logging to be concurrent, running alongside operations without interfering with them. There is a difference between concurrency and parallelism: if there's only one core available, then the two threads may run interleaved, and there may be a small delay in writing the logs. If there are multiple cores available though, then typically the thread will be writing logs in parallel. There are some special cases / catches to asynchronous logging. The first catch is not adding a shutdown hook; you need to let the ring buffer gracefully shutdown, and if Logback shuts down immediately you will miss events that could be critical. The second catch is to use unnecessary async appenders, each wrapping a single appender. This can be a waste of threads; you only need one to create an asynchronous boundary. If you do not anticipate significant load and your appenders are fast, my recommendation is to define the appender at the root, before you do anything else. <configuration> <shutdownHook class="ch.qos.logback.core.hook.DelayingShutdownHook"> <delay>150</delay> </shutdownHook> <root level="all"> <appender class="net.logstash.logback.appender.LoggingEventAsyncDisruptorAppender"> <appender name="FILE">...</appender> <appender name="STDOUT">...</appender> <appender name="TCP">...</appender> </appender> </root> </configuration> The third catch is what happens to asynchronous logging when there is significant load. Ring buffers can fill up when the underlying appenders are slow and do not drain the buffer fast enough, and a full ring buffer can result in dropped events. Therefore, if you do have an appender that's awkward (and you can't fix it), you should configure a distinct appender for it and configure it so it doesn't jam up the others. <configuration> <root level="all"> <appender class="net.logstash.logback.appender.LoggingEventAsyncDisruptorAppender"> <appender name="FILE">...</appender> <appender name="STDOUT">...</appender> <appender name="TCP">...</appender> </appender> <appender class="net.logstash.logback.appender.LoggingEventAsyncDisruptorAppender"> <ringBufferSize>[some large multiple of 2]</ringBufferSize> <appender class="RollingFileAppender"> <!-- trigger LOGBACK-267 --> <rollingPolicy class="FixedWindowRollingPolicy"> <fileNamePattern>backup%i.log.gz</fileNamePattern> ... </rollingPolicy> <triggeringPolicy> <maxFileSize>4GB</maxFileSize> </triggeringPolicy> <encoder>...</encoder> </appender> </appender> </root> </configuration> You may lose some events if it spills over, but that's better than stalling your application. You can also add an appender listener to notify you of any dropped messages. The FailureSummaryLoggingAppenderListener implementation will log a summary of any dropped messages, but it does have the drawback that the listener logs the summary to the same appender that is dropping messages – so the summary itself can be lost. You are better off writing your own implementation from the interface, and using it to send to your metrics or error reporting system in a scheduled runnable using the ScheduledExecutionService from Logback's Context. Throughput The throughput of an application is how many operations it can process over a period of time. This is typically the metric we care about for batch operations and things that happen in the background. Throughput is a tricky quantity to measure, because doing operations in bulk improves throughput, but can cause applications to seem unresponsive, and vice versa. 
For example, writing to STDOUT is "fast" because of I/O buffering, but writing STDOUT to a terminal is slow, because users expect immediate feedback. Let's see what raw disk throughput looks like on my laptop. public class Main { private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(Main.class); public static void main(String[] args) { Timer timer = new Timer(); timer.schedule(new TimerTask() { @Override public void run() { System.out.println("Exiting"); System.exit(0); } }, TimeUnit.MINUTES.toMillis(1)); int threads = 1 for (int i = 0; i < threads; i++) { final String name = "logger-" + i; Thread t = new Thread(name) { @Override public void run() { while (true) { logger.info("Hello world!"); } } }; t.start(); } } } Using the sifting appender together with the thread name based discriminator. <configuration debug="true"> <appender name="FILE" class="ch.qos.logback.core.FileAppender"> <file>application.log</file> <append>false</append> <immediateFlush>false</immediateFlush> <encoder> <pattern>%-4relative [%thread] %-5level %logger{35} - %msg%n</pattern> </encoder> </appender> <appender name="SIFT" class="ch.qos.logback.classic.sift.SiftingAppender"> <discriminator class="org.example.ThreadNameBasedDiscriminator"/> <sift> <appender name="FILE-${threadName}" class="ch.qos.logback.core.FileAppender"> <file>${threadName}.log</file> <append>false</append> <immediateFlush>false</immediateFlush> <encoder> <pattern>%-4relative [%thread] %-5level %logger{35} - %msg%n</pattern> </encoder> </appender> </sift> </appender> <root level="INFO"> <appender-ref ref="FILE"/> </root> </configuration> Run it for 1 minute, 1 thread, one FILE appender (100% core usage) = 5.5GB -rw-rw-r-- 1 wsargent wsargent 5815185001 Oct 16 14:13 application.log There are use cases where you may want to log this much! If you want to enable total debug output on your local filesystem and nothing else is involved in file IO, then yeah, why not? This is the premise behind diagnostic logging in Blacklite, where you keep a rolling buffer of debug events in SQLite that you can dip into for extra context when an error occurs. However, in most circumstances, your issue will be too much throughput rather than too little. I've already written about the costs involved in indexing and storing logs in Logging Costs so I won't go over them again – suffice to say that your devops team will not be happy at these numbers if you are planning to log at INFO or above and send it to a centralized logging environment. Instead, I want to look more at the bottom line number. Is there any way we can make this faster? First, let's just up the number of input threads logging, just to confirm that the bottleneck is on the backend. 1 minute, 4 threads, one FILE appender = 4.4GB -rw-rw-r-- 1 wsargent wsargent 4777409363 Oct 16 16:37 application.log Huh. Throughput goes down when we log from multiple threads. This is likely due to single writer principle. But maybe it's the encoder that's slowing things down before it reaches the filesystem. One tip from this podcast where Daniel Lemure says that CPU bottlenecks can come before IO bottlenecks, even though it's not popular orthodoxy. Using htop, it looked like it was maxing out a single core in the process, so let's work from there. If we ran several threads to several different files, then we can avoid the bottleneck on a single core. Let's see what happens when we use the sifting appender to multiplex multiple threads to multiple files. 
<p>1 minute, 2 threads, sifting appender (90% core usage on two threads) = 3.3 GB:</p>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-rw-rw-r-- 1 wsargent wsargent 1804883163 Oct 16 14:16 logger-0.log
-rw-rw-r-- 1 wsargent wsargent 1614980252 Oct 16 14:16 logger-1.log
</code></pre></div></div>
<p>1 minute, 4 threads, sifting appender = 3.5 GB:</p>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-rw-rw-r-- 1 wsargent wsargent 977745435 Oct 16 14:20 logger-0.log
-rw-rw-r-- 1 wsargent wsargent 914416956 Oct 16 14:20 logger-1.log
-rw-rw-r-- 1 wsargent wsargent 954032694 Oct 16 14:20 logger-2.log
-rw-rw-r-- 1 wsargent wsargent 968975679 Oct 16 14:20 logger-3.log
</code></pre></div></div>
<p>Okay, so it's not the encoder that's the bottleneck. Instead, it appears to be the disk, and switching contexts between threads doesn't help the throughput. Let's write to tmpfs instead, using the default <code class="language-plaintext highlighter-rouge">/dev/shm</code>.</p>
<p>1 minute, 1 thread, one FILE appender writing to /dev/shm/application.log = 8.5 GB:</p>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-rw-rw-r-- 1 wsargent wsargent 9027620209 Oct 16 17:27 application.log
</code></pre></div></div>
<p>Ah-ha! That almost doubled the throughput! Now let's see what happens if we add more threads, just to check.</p>
<p>1 minute, 2 threads, one FILE appender writing to /dev/shm/application.log = 8.0 GB:</p>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-rw-rw-r-- 1 wsargent wsargent 8563431350 Oct 16 18:06 application.log
</code></pre></div></div>
<p>Yep, that again reduces the throughput. The message to take away from this is that if you want to maximize throughput, it's not a question of picking the right logging framework or the right appender – you need to look at your bottlenecks.</p>
<h2 id="summary">Summary</h2>
<ul>
  <li>Use conditional guards, and avoid creating objects or calling methods that may block (which may include <code class="language-plaintext highlighter-rouge">toString</code>).</li>
  <li>Use logstash-logback-encoder, preferably with a listener so you can tell if the queue fills up.</li>
  <li>Be careful of edge cases with compression and rollover.</li>
  <li>Usually the concern is too much throughput rather than too little.</li>
  <li>If you really need it, use <code class="language-plaintext highlighter-rouge">/dev/shm</code> or Blacklite for your log storage.</li>
</ul>
Adding Echopraxia to Akka2022-10-02T14:45:34-07:002022-10-02T14:45:34-07:00https://tersesystems.com/blog/2022/10/02/adding-echopraxia-to-akka<p><strong>TL;DR</strong> I released <a href="https://github.com/tersesystems/echopraxia-plusakka">echopraxia-plusakka</a>, a library that integrates Echopraxia with Akka's component system, which also resulted in adding a "direct" API to <a href="https://github.com/tersesystems/echopraxia">echopraxia</a> based off SLF4J markers.</p>
<hr />
<p>It's been a minute. After releasing the <a href="https://github.com/tersesystems/echopraxia-plusscala">Scala API</a> for Echopraxia and <a href="https://tersesystems.com/blog/2022/06/12/what-scala-adds-to-a-logging-api/">writing it up</a>, I've been working my way up the chain and trying to exploit/break the API with progressively more demanding use cases.</p>
<p><a href="https://github.com/akka/akka">Akka</a> has been a personal favorite testing ground of mine. Akka is deeply concurrent, and as such using a debugger is nearly pointless – even if you add breakpoints, you'll trip over timeouts if you take too long to return a message. As such there's really only two reliable ways to debug and observe Akka code. Unit tests… and <a href="https://blog.softwaremill.com/akka-streams-pitfalls-to-avoid-part-2-f93e60746c58">logging</a>.</p>
<p>So. The task I set myself was to add structured logging to Akka. I already had an advantage in that I'm familiar with Akka internals, and in the end it was fairly straightforward with only a couple of surprises.</p>
<h2 id="akka-logging">Akka Logging</h2>
<p>Akka's logging depends on an underlying <a href="https://doc.akka.io/api/akka/2.6/akka/event/LoggingAdapter.html"><code class="language-plaintext highlighter-rouge">LoggingAdapter</code></a> which goes through an <a href="https://doc.akka.io/docs/akka/current/typed/logging.html#event-bus">event bus</a> to <a href="https://doc.akka.io/api/akka/2.6/akka/event/slf4j/index.html">akka-slf4j</a>.</p>
<p>The first obstacle to adding structured logging is that the <code class="language-plaintext highlighter-rouge">MarkerLoggingAdapter</code> serializes arguments into a String before publishing it to the event bus, using <a href="https://github.com/akka/akka/blob/v2.6.20/akka-actor/src/main/scala/akka/event/Logging.scala#L1769">formatN</a> to convert arguments.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class MarkerLoggingAdapter extends BusLogging {
def error(marker: LogMarker, cause: Throwable, template: String, arg1: Any): Unit =
if (isErrorEnabled(marker))
bus.publish(Error(cause, logSource, logClass,
format1(template, arg1), // returns `String`
mdc, marker))
}
</code></pre></div></div>
<p>Because <code class="language-plaintext highlighter-rouge">MarkerLoggingAdapter</code> converts arguments to String eagerly, any arguments that pass through Akka's logging will be flattened and can't be retrieved later – there is no <code class="language-plaintext highlighter-rouge">Error(msg, arg1)</code> passed through to the event bus and then to <a href="https://github.com/akka/akka/blob/v2.6.20/akka-slf4j/src/main/scala/akka/event/slf4j/Slf4jLogger.scala#L56"><code class="language-plaintext highlighter-rouge">Slf4jLogger</code></a>.</p>
<p>It is still possible to pass through structured data though! Because the <code class="language-plaintext highlighter-rouge">MarkerLoggingAdapter</code> passes the <code class="language-plaintext highlighter-rouge">LogMarker</code> through untouched, using <code class="language-plaintext highlighter-rouge">Slf4jLogMarker</code> will pass an <code class="language-plaintext highlighter-rouge">org.slf4j.Marker</code> along to Logback, and we can piggyback information on the way.</p>
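<p>For example, something like this should carry structured fields through the event bus (a sketch, assuming Akka 2.6's <code class="language-plaintext highlighter-rouge">Slf4jLogMarker</code> and <code class="language-plaintext highlighter-rouge">Logging.withMarker</code>, plus logstash-logback-encoder's <code class="language-plaintext highlighter-rouge">Markers.append</code>):</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import akka.actor.Actor
import akka.event.Logging
import akka.event.slf4j.Slf4jLogMarker
import net.logstash.logback.marker.Markers

class MyActor extends Actor {
  // MarkerLoggingAdapter accepts a LogMarker on each call
  private val log = Logging.withMarker(this)

  override def receive: Receive = {
    case msg =>
      // the wrapped org.slf4j.Marker survives the eager String formatting
      // that flattens the message arguments
      log.info(Slf4jLogMarker(Markers.append("messageType", msg.getClass.getSimpleName)),
        "received {}", msg)
  }
}
</code></pre></div></div>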
<p>This led me to think about using Echopraxia directly against SLF4J.</p>
<h2 id="direct-api">Direct API</h2>
<p>Echopraxia does allow you to log using an <code class="language-plaintext highlighter-rouge">org.slf4j.Logger</code> directly for simple cases. For example, arguments work fine:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">FieldBuilder</span> <span class="n">fb</span> <span class="o">=</span> <span class="n">FieldBuilder</span><span class="o">.</span><span class="na">instance</span><span class="o">();</span>
<span class="n">org</span><span class="o">.</span><span class="na">slf4j</span><span class="o">.</span><span class="na">Logger</span> <span class="n">slf4jLogger</span> <span class="o">=</span> <span class="n">org</span><span class="o">.</span><span class="na">slf4j</span><span class="o">.</span><span class="na">LoggerFactory</span><span class="o">.</span><span class="na">getLogger</span><span class="o">(</span><span class="s">"com.example.Main"</span><span class="o">);</span>
<span class="n">slf4jLogger</span><span class="o">.</span><span class="na">info</span><span class="o">(</span><span class="s">"SLF4J message {}"</span><span class="o">,</span> <span class="n">fb</span><span class="o">.</span><span class="na">string</span><span class="o">(</span><span class="s">"foo"</span><span class="o">,</span> <span class="s">"bar"</span><span class="o">));</span>
</code></pre></div></div>
<p>However, exceptions in SLF4J get "eaten" if they have a template placeholder, so if you want to keep the exception, you need to pass it in twice:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">slf4jLogger</span><span class="o">.</span><span class="na">error</span><span class="o">(</span><span class="s">"SLF4J exception {}"</span><span class="o">,</span> <span class="n">fb</span><span class="o">.</span><span class="na">exception</span><span class="o">(</span><span class="n">e</span><span class="o">),</span> <span class="n">e</span><span class="o">);</span>
</code></pre></div></div>
<p>However, <a href="https://github.com/tersesystems/echopraxia#conditions">conditions</a> and <a href="https://github.com/tersesystems/echopraxia#context">context fields</a> do not exist in the SLF4J API. If we want to use SLF4J, then it's time to <a href="https://tersesystems.com/blog/2019/05/18/application-logging-in-java-part-4/">fake it with markers</a>.</p>
<p>I added some <a href="https://github.com/tersesystems/echopraxia#direct-logback--slf4j-api">direct API</a> features to Echopraxia. Using the direct API, context fields can be represented by <code class="language-plaintext highlighter-rouge">FieldMarker</code>, and conditions by <code class="language-plaintext highlighter-rouge">ConditionMarker</code>. This passes information through to the backend appropriately.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">com.tersesystems.echopraxia.logback.*</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">net.logstash.logback.marker.Markers</span><span class="o">;</span>
<span class="n">FieldBuilder</span> <span class="n">fb</span> <span class="o">=</span> <span class="n">FieldBuilder</span><span class="o">.</span><span class="na">instance</span><span class="o">();</span>
<span class="n">FieldMarker</span> <span class="n">fields</span> <span class="o">=</span> <span class="n">FieldMarker</span><span class="o">.</span><span class="na">apply</span><span class="o">(</span>
<span class="n">fb</span><span class="o">.</span><span class="na">list</span><span class="o">(</span>
<span class="n">fb</span><span class="o">.</span><span class="na">string</span><span class="o">(</span><span class="s">"sessionId"</span><span class="o">,</span> <span class="s">"value"</span><span class="o">),</span>
<span class="n">fb</span><span class="o">.</span><span class="na">number</span><span class="o">(</span><span class="s">"correlationId"</span><span class="o">,</span> <span class="mi">1</span><span class="o">)</span>
<span class="o">)</span>
<span class="o">);</span>
<span class="n">ConditionMarker</span> <span class="n">conditionMarker</span> <span class="o">=</span> <span class="n">ConditionMarker</span><span class="o">.</span><span class="na">apply</span><span class="o">(</span>
<span class="n">Condition</span><span class="o">.</span><span class="na">stringMatch</span><span class="o">(</span><span class="s">"sessionId"</span><span class="o">,</span> <span class="n">s</span> <span class="o">-></span> <span class="n">s</span><span class="o">.</span><span class="na">raw</span><span class="o">().</span><span class="na">equals</span><span class="o">(</span><span class="s">"value"</span><span class="o">)))</span>
<span class="o">);</span>
<span class="n">logger</span><span class="o">.</span><span class="na">info</span><span class="o">(</span><span class="n">Markers</span><span class="o">.</span><span class="na">aggregate</span><span class="o">(</span><span class="n">fieldMarker</span><span class="o">,</span> <span class="n">conditionMarker</span><span class="o">),</span> <span class="s">"condition and marker"</span><span class="o">);</span>
</code></pre></div></div>
<p>This is only half the story though – the condition still needs to be evaluated, and because that doesn't go through an Echopraxia logger, that means adding a <code class="language-plaintext highlighter-rouge">ConditionTurboFilter</code>:</p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><configuration></span>
<span class="nt"><turboFilter</span> <span class="na">class=</span><span class="s">"com.tersesystems.echopraxia.logback.ConditionTurboFilter"</span><span class="nt">/></span>
<span class="nt"></configuration></span>
</code></pre></div></div>
<p>And then also when rendering JSON, we need to swap out the <code class="language-plaintext highlighter-rouge">FieldMarker</code> with actual <a href="https://github.com/logfellow/logstash-logback-encoder#event-specific-custom-fields">event specific custom fields</a> that <code class="language-plaintext highlighter-rouge">logstash-logback-encoder</code> will recognize, using <code class="language-plaintext highlighter-rouge">LogstashFieldAppender</code>.</p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><configuration></span>
<span class="c"><!-- ... --></span>
<span class="nt"><root</span> <span class="na">level=</span><span class="s">"INFO"</span><span class="nt">></span>
<span class="c"><!-- replaces fields with logstash markers and structured arguments --></span>
<span class="nt"><appender</span> <span class="na">class=</span><span class="s">"com.tersesystems.echopraxia.logstash.LogstashFieldAppender"</span><span class="nt">></span>
<span class="nt"><appender</span> <span class="na">class=</span><span class="s">"ch.qos.logback.core.FileAppender"</span><span class="nt">></span>
<span class="nt"><file></span>application.log<span class="nt"></file></span>
<span class="nt"><encoder</span> <span class="na">class=</span><span class="s">"net.logstash.logback.encoder.LogstashEncoder"</span><span class="nt">/></span>
<span class="nt"></appender></span>
<span class="nt"></appender></span>
<span class="nt"></root></span>
<span class="nt"></configuration></span>
</code></pre></div></div>
<p>This… is a hack, and I don't love it. The problem here is that there is no central pipeline for creating and manipulating Logback's <code class="language-plaintext highlighter-rouge">LoggingEvent</code>. The turbo filter API will only let you return <code class="language-plaintext highlighter-rouge">FilterReply</code> and the actual creation of a <code class="language-plaintext highlighter-rouge">LoggingEvent</code> happens internally. So… if you want to tweak the logging event, you have to have an appender transform it, then pass it through to the appender's children. This is the approach used in <a href="https://tersesystems.com/blog/2019/05/27/application-logging-in-java-part-5/">composite appenders</a>.</p>
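<p>For reference, the transform-and-delegate pattern looks roughly like this (a sketch, using Logback's <code class="language-plaintext highlighter-rouge">AppenderAttachableImpl</code> to manage the children):</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import ch.qos.logback.classic.spi.ILoggingEvent
import ch.qos.logback.core.spi.AppenderAttachableImpl
import ch.qos.logback.core.{Appender, UnsynchronizedAppenderBase}

// Transform the logging event, then fan it out to the child appenders.
abstract class TransformingAppender extends UnsynchronizedAppenderBase[ILoggingEvent] {
  private val aai = new AppenderAttachableImpl[ILoggingEvent]

  // subclasses decide how to rewrite the event
  protected def transform(event: ILoggingEvent): ILoggingEvent

  // called for each nested appender element (appender children are legal,
  // as noted above)
  def addAppender(appender: Appender[ILoggingEvent]): Unit =
    aai.addAppender(appender)

  override protected def append(event: ILoggingEvent): Unit =
    aai.appendLoopOnAppenders(transform(event))
}
</code></pre></div></div>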
<p>This is complicated by Logback not officially supporting <code class="language-plaintext highlighter-rouge">appender-ref</code> for appenders themselves. You can add <code class="language-plaintext highlighter-rouge">appender-ref</code> from the root:</p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><configuration></span>
<span class="c"><!-- ... --></span>
<span class="nt"><root</span> <span class="na">level=</span><span class="s">"DEBUG"</span><span class="nt">></span>
<span class="nt"><appender-ref</span> <span class="na">ref=</span><span class="s">"FILE"</span> <span class="nt">/></span>
<span class="nt"></root></span>
<span class="nt"></configuration></span>
</code></pre></div></div>
<p>but even though it's perfectly legal to have appender children, to add <code class="language-plaintext highlighter-rouge">appender-ref</code> on appenders, you need to explicitly loosen the <code class="language-plaintext highlighter-rouge">AppenderRefAction</code> to match (which can cause complaints):</p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><configuration></span>
<span class="c"><!-- loosen the rule on appender refs so appenders can reference them --></span>
<span class="nt"><newRule</span> <span class="na">pattern=</span><span class="s">"*/appender/appender-ref"</span>
<span class="na">actionClass=</span><span class="s">"ch.qos.logback.core.joran.action.AppenderRefAction"</span><span class="nt">/></span>
<span class="c"><!-- ... --></span>
<span class="nt"><appender</span> <span class="na">name=</span><span class="s">"CONSOLE_AND_FILE"</span> <span class="na">class=</span><span class="s">"com.tersesystems.logback.CompositeAppender"</span><span class="nt">></span>
<span class="nt"><appender-ref</span> <span class="na">ref=</span><span class="s">"CONSOLE"</span><span class="nt">/></span>
<span class="nt"><appender-ref</span> <span class="na">ref=</span><span class="s">"FILE"</span><span class="nt">/></span>
<span class="nt"></appender></span>
<span class="nt"><root</span> <span class="na">level=</span><span class="s">"DEBUG"</span><span class="nt">></span>
<span class="nt"><appender-ref</span> <span class="na">ref=</span><span class="s">"CONSOLE_AND_FILE"</span> <span class="nt">/></span>
<span class="nt"></root></span>
<span class="nt"></configuration></span>
</code></pre></div></div>
<p>On a tangent, because Logback runs through appenders in sequence in the same thread, it's possible for synchronous appenders to block asynchronous appenders:</p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><configuration></span>
<span class="nt"><root</span> <span class="na">level=</span><span class="s">"INFO"</span><span class="nt">></span>
<span class="c"><!-- this runs first in the executing thread --></span>
<span class="nt"><appender</span> <span class="na">class=</span><span class="s">"ch.qos.logback.core.FileAppender"</span><span class="nt">></span>
<span class="c"><!-- ... --></span>
<span class="nt"></appender></span>
<span class="c"><!-- only gets the event after first appender... --></span>
<span class="nt"><appender</span> <span class="na">class=</span><span class="s">"net.logstash.logback.appender.LoggingEventAsyncDisruptorAppender"</span><span class="nt">></span>
<span class="c"><!-- ... --></span>
<span class="nt"></appender></span>
<span class="nt"></root></span>
<span class="nt"></configuration></span>
</code></pre></div></div>
<p>As such, either you have multiple async appenders, or you wrap all the IO appenders inside a disruptor so you only have the overhead of one thread. This means that appenders can really serve three different roles: managing concurrency, event transformation, and IO sinks with encoders.</p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><configuration></span>
<span class="nt"><root</span> <span class="na">level=</span><span class="s">"INFO"</span><span class="nt">></span>
<span class="c"><!-- immediately move off the rendering thread... --></span>
<span class="nt"><appender</span> <span class="na">class=</span><span class="s">"net.logstash.logback.appender.LoggingEventAsyncDisruptorAppender"</span><span class="nt">></span>
<span class="c"><!-- ...transform event in pipeline... --></span>
<span class="nt"><appender</span> <span class="na">class=</span><span class="s">"com.tersesystems.echopraxia.logstash.LogstashFieldAppender"</span><span class="nt">></span>
<span class="c"><!-- ...render to IO/Network/STDOUT --></span>
<span class="nt"><appender</span> <span class="na">class=</span><span class="s">"ch.qos.logback.core.ConsoleAppender"</span><span class="nt">></span>
<span class="nt"><encoder></span>
<span class="nt"><pattern></span>[%-5level] %logger{15} - message%n%xException{10}<span class="nt"></pattern></span>
<span class="nt"></encoder></span>
<span class="nt"></appender></span>
<span class="nt"><appender</span> <span class="na">class=</span><span class="s">"ch.qos.logback.core.FileAppender"</span><span class="nt">></span>
<span class="nt"><file></span>application.log<span class="nt"></file></span>
<span class="nt"><encoder</span> <span class="na">class=</span><span class="s">"net.logstash.logback.encoder.LogstashEncoder"</span><span class="nt">/></span>
<span class="nt"></appender></span>
<span class="nt"></appender></span>
<span class="nt"></appender></span>
<span class="nt"></root></span>
<span class="nt"></configuration></span>
</code></pre></div></div>
<p>Anyhoo.</p>
<p>Adding the direct API means that there is a fallback position, but I found that it was still very fiddly. <a href="https://github.com/tersesystems/echopraxia#filters">Filters</a> and other features that depend on composing loggers are not available in SLF4J. Aggregating multiple markers is awkward, even leveraging implicit conversion.</p>
<p>The second option was to sidestep the <code class="language-plaintext highlighter-rouge">LoggingAdapter</code> altogether and extend Akka's models with a structured logging API. There are two models in Akka: actors and streams, and they each have their own approach.</p>
<h2 id="field-builders">Field Builders</h2>
<p>The first goal was to provide values for Akka components. The plan was to create structured output that would correspond to the <code class="language-plaintext highlighter-rouge">toString</code> debug output. But Akka components such as <code class="language-plaintext highlighter-rouge">ActorSystem</code> and <code class="language-plaintext highlighter-rouge">ActorPath</code> make heavy use of internal APIs that are only accessible under the <code class="language-plaintext highlighter-rouge">akka</code> package. Solution: define the package as <code class="language-plaintext highlighter-rouge">akka.echopraxia</code> to open up the API.</p>
<p>First, a pure trait so you can provide your mapping:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">package</span> <span class="nn">akka.echopraxia.actor</span>
<span class="k">trait</span> <span class="nc">AkkaFieldBuilder</span> <span class="k">extends</span> <span class="nc">FieldBuilder</span> <span class="o">{</span>
<span class="k">implicit</span> <span class="k">def</span> <span class="n">byteStringToValue</span><span class="k">:</span> <span class="kt">ToValue</span><span class="o">[</span><span class="kt">ByteString</span><span class="o">]</span>
<span class="k">implicit</span> <span class="k">def</span> <span class="n">addressToValue</span><span class="k">:</span> <span class="kt">ToValue</span><span class="o">[</span><span class="kt">akka.actor.Address</span><span class="o">]</span>
<span class="k">implicit</span> <span class="k">def</span> <span class="n">actorRefToValue</span><span class="k">:</span> <span class="kt">ToValue</span><span class="o">[</span><span class="kt">akka.actor.ActorRef</span><span class="o">]</span>
<span class="k">implicit</span> <span class="k">def</span> <span class="n">actorPathToValue</span><span class="k">:</span> <span class="kt">ToValue</span><span class="o">[</span><span class="kt">akka.actor.ActorPath</span><span class="o">]</span>
<span class="k">implicit</span> <span class="k">def</span> <span class="n">actorSystemToValue</span><span class="k">:</span> <span class="kt">ToValue</span><span class="o">[</span><span class="kt">akka.actor.ActorSystem</span><span class="o">]</span>
<span class="c1">// ...
</span><span class="o">}</span>
</code></pre></div></div>
<p>and then some default implementations:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">trait</span> <span class="nc">DefaultAkkaFieldBuilder</span> <span class="k">extends</span> <span class="nc">AkkaFieldBuilder</span> <span class="o">{</span>
<span class="k">override</span> <span class="k">implicit</span> <span class="k">val</span> <span class="n">byteStringToValue</span><span class="k">:</span> <span class="kt">ToValue</span><span class="o">[</span><span class="kt">ByteString</span><span class="o">]</span> <span class="k">=</span> <span class="n">bs</span> <span class="k">=></span> <span class="nc">ToObjectValue</span><span class="o">(</span>
<span class="n">keyValue</span><span class="o">(</span><span class="s">"length"</span> <span class="o">-></span> <span class="n">bs</span><span class="o">.</span><span class="n">length</span><span class="o">),</span>
<span class="n">keyValue</span><span class="o">(</span><span class="s">"utf8String"</span> <span class="o">-></span> <span class="n">bs</span><span class="o">.</span><span class="n">utf8String</span><span class="o">)</span>
<span class="o">)</span>
<span class="k">override</span> <span class="k">implicit</span> <span class="k">val</span> <span class="n">addressToValue</span><span class="k">:</span> <span class="kt">ToValue</span><span class="o">[</span><span class="kt">akka.actor.Address</span><span class="o">]</span> <span class="k">=</span> <span class="o">{</span> <span class="n">address</span> <span class="k">=></span>
<span class="nc">ToValue</span><span class="o">(</span><span class="n">address</span><span class="o">.</span><span class="n">toString</span><span class="o">)</span>
<span class="o">}</span>
<span class="k">override</span> <span class="k">implicit</span> <span class="k">val</span> <span class="n">actorRefToValue</span><span class="k">:</span> <span class="kt">ToValue</span><span class="o">[</span><span class="kt">akka.actor.ActorRef</span><span class="o">]</span> <span class="k">=</span> <span class="o">{</span> <span class="n">actorRef</span> <span class="k">=></span>
<span class="nc">ToValue</span><span class="o">(</span><span class="n">actorRef</span><span class="o">.</span><span class="n">path</span><span class="o">)</span>
<span class="o">}</span>
<span class="k">override</span> <span class="k">implicit</span> <span class="k">val</span> <span class="n">actorPathToValue</span><span class="k">:</span> <span class="kt">ToValue</span><span class="o">[</span><span class="kt">akka.actor.ActorPath</span><span class="o">]</span> <span class="k">=</span> <span class="o">{</span> <span class="n">actorPath</span> <span class="k">=></span>
<span class="nc">ToObjectValue</span><span class="o">(</span>
<span class="n">keyValue</span><span class="o">(</span><span class="s">"address"</span> <span class="o">-></span> <span class="n">actorPath</span><span class="o">.</span><span class="n">address</span><span class="o">),</span>
<span class="n">keyValue</span><span class="o">(</span><span class="s">"name"</span> <span class="o">-></span> <span class="n">actorPath</span><span class="o">.</span><span class="n">name</span><span class="o">),</span>
<span class="n">keyValue</span><span class="o">(</span><span class="s">"uid"</span> <span class="o">-></span> <span class="n">actorPath</span><span class="o">.</span><span class="n">uid</span><span class="o">)</span>
<span class="o">)</span>
<span class="o">}</span>
<span class="k">override</span> <span class="k">implicit</span> <span class="k">def</span> <span class="n">actorSystemToValue</span><span class="k">:</span> <span class="kt">ToValue</span><span class="o">[</span><span class="kt">akka.actor.ActorSystem</span><span class="o">]</span> <span class="k">=</span> <span class="o">{</span> <span class="n">actorSystem</span> <span class="k">=></span>
<span class="nc">ToObjectValue</span><span class="o">(</span>
<span class="n">keyValue</span><span class="o">(</span><span class="s">"system"</span> <span class="o">-></span> <span class="n">actorSystem</span><span class="o">.</span><span class="n">name</span><span class="o">),</span>
<span class="n">keyValue</span><span class="o">(</span><span class="s">"startTime"</span> <span class="o">-></span> <span class="n">actorSystem</span><span class="o">.</span><span class="n">startTime</span><span class="o">),</span>
<span class="o">)</span>
<span class="o">}</span>
<span class="c1">// ...
</span><span class="o">}</span>
</code></pre></div></div>
<p>So far, so good.</p>
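<p>A quick usage sketch (the <code class="language-plaintext highlighter-rouge">LoggerFactory</code> import path is assumed from the Scala API, and we're inside an actor so <code class="language-plaintext highlighter-rouge">self</code> is in scope):</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import akka.actor.Actor
import akka.echopraxia.actor.DefaultAkkaFieldBuilder
import com.tersesystems.echopraxia.plusscala.LoggerFactory

class Greeter extends Actor {
  // Akka types like ActorRef resolve through the implicit ToValue
  // instances above, rather than falling back to toString
  private val log = LoggerFactory.getLogger.withFieldBuilder(DefaultAkkaFieldBuilder)

  override def receive: Receive = {
    case msg => log.info("{} received message", fb => fb.keyValue("self" -> self))
  }
}
</code></pre></div></div>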
<h2 id="akka-actors">Akka Actors</h2>
<p>There are two Scala APIs for Akka actors, typed actors and "classic" untyped actors. The <a href="https://doc.akka.io/docs/akka/current/typed/logging.html">logging in Akka Typed</a>, and the <a href="https://doc.akka.io/docs/akka/current/logging.html">logging in Akka Classic</a> are a little different, but they both provide additional context in the form of MDC.</p>
<p>Next: create an echopraxia equivalent to <code class="language-plaintext highlighter-rouge">ActorLogging</code>. This is a bit complicated, because an echopraxia logger needs a field builder, and that means that an actor has to be able to provide it. That's okay – we can define the field builder requirement by adding <code class="language-plaintext highlighter-rouge">AkkaFieldBuilderProvider</code> and <code class="language-plaintext highlighter-rouge">DefaultAkkaFieldBuilderProvider</code>:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">trait</span> <span class="nc">AkkaFieldBuilderProvider</span> <span class="o">{</span>
<span class="k">type</span> <span class="kt">FieldBuilderType</span> <span class="k"><:</span> <span class="kt">AkkaFieldBuilder</span>
<span class="k">protected</span> <span class="k">def</span> <span class="n">fieldBuilder</span><span class="k">:</span> <span class="kt">FieldBuilderType</span>
<span class="o">}</span>
<span class="k">trait</span> <span class="nc">DefaultAkkaFieldBuilderProvider</span> <span class="k">extends</span> <span class="nc">AkkaFieldBuilderProvider</span> <span class="o">{</span>
<span class="k">override</span> <span class="k">type</span> <span class="kt">FieldBuilderType</span> <span class="o">=</span> <span class="nc">DefaultAkkaFieldBuilder</span><span class="o">.</span><span class="k">type</span>
<span class="kt">override</span> <span class="kt">protected</span> <span class="kt">def</span> <span class="kt">fieldBuilder:</span> <span class="kt">FieldBuilderType</span> <span class="o">=</span> <span class="nc">DefaultAkkaFieldBuilder</span>
<span class="o">}</span>
</code></pre></div></div>
<p>and then use a <a href="https://docs.scala-lang.org/tour/self-types.html">self type</a> to create a logger using the given field builder:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">package</span> <span class="nn">akka.echopraxia.actor</span>
<span class="k">trait</span> <span class="nc">ActorLogging</span> <span class="o">{</span>
<span class="k">this:</span> <span class="kt">Actor</span> <span class="kt">with</span> <span class="kt">AkkaFieldBuilderProvider</span> <span class="o">=></span>
<span class="k">protected</span> <span class="k">val</span> <span class="n">log</span><span class="k">:</span> <span class="kt">Logger</span><span class="o">[</span><span class="kt">FieldBuilderType</span><span class="o">]</span> <span class="k">=</span> <span class="o">...</span>
<span class="o">}</span>
</code></pre></div></div>
<p>And that opens the door to actors with an Echopraxia logger:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">trait</span> <span class="nc">LoggingActor</span> <span class="k">extends</span> <span class="nc">Actor</span> <span class="k">with</span> <span class="nc">ActorLogging</span> <span class="k">with</span> <span class="nc">DefaultAkkaFieldBuilderProvider</span>
<span class="k">class</span> <span class="nc">MyActor</span> <span class="k">extends</span> <span class="nc">LoggingActor</span> <span class="o">{</span>
<span class="k">override</span> <span class="k">def</span> <span class="n">preRestart</span><span class="o">(</span><span class="n">reason</span><span class="k">:</span> <span class="kt">Throwable</span><span class="o">,</span> <span class="n">message</span><span class="k">:</span> <span class="kt">Option</span><span class="o">[</span><span class="kt">Any</span><span class="o">])</span><span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="o">{</span>
<span class="n">log</span><span class="o">.</span><span class="n">error</span><span class="o">(</span><span class="s">"Restarting due to [{}] when processing [{}]"</span><span class="o">,</span> <span class="n">fb</span> <span class="k">=></span> <span class="n">fb</span><span class="o">.</span><span class="n">list</span><span class="o">(</span>
<span class="n">fb</span><span class="o">.</span><span class="n">exception</span><span class="o">(</span><span class="n">reason</span><span class="o">),</span>
<span class="n">fb</span><span class="o">.</span><span class="n">string</span><span class="o">(</span><span class="s">"message"</span> <span class="o">-></span> <span class="n">message</span><span class="o">.</span><span class="n">toString</span><span class="o">),</span>
<span class="n">fb</span><span class="o">.</span><span class="n">keyValue</span><span class="o">(</span><span class="s">"self"</span> <span class="o">-></span> <span class="n">self</span><span class="o">.</span><span class="n">path</span><span class="o">)</span>
<span class="o">))</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p>This is pretty straightforward and almost transparent.</p>
<p>Next, we need an implicit <code class="language-plaintext highlighter-rouge">EchopraxiaLoggingAdapter</code> for our version of <code class="language-plaintext highlighter-rouge">LoggingReceive</code>:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">trait</span> <span class="nc">EchopraxiaLoggingAdapter</span><span class="o">[</span><span class="kt">FB</span><span class="o">]</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">core</span><span class="k">:</span> <span class="kt">CoreLogger</span>
<span class="k">def</span> <span class="n">fieldBuilder</span><span class="k">:</span> <span class="kt">FB</span>
<span class="o">}</span>
<span class="k">object</span> <span class="nc">EchopraxiaLoggingAdapter</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">apply</span><span class="o">[</span><span class="kt">FB</span><span class="o">](</span><span class="n">logger</span><span class="k">:</span> <span class="kt">Logger</span><span class="o">[</span><span class="kt">FB</span><span class="o">])</span><span class="k">:</span> <span class="kt">EchopraxiaLoggingAdapter</span><span class="o">[</span><span class="kt">FB</span><span class="o">]</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">EchopraxiaLoggingAdapter</span><span class="o">[</span><span class="kt">FB</span><span class="o">]</span> <span class="o">{</span>
<span class="k">override</span> <span class="k">def</span> <span class="n">core</span><span class="k">:</span> <span class="kt">CoreLogger</span> <span class="o">=</span> <span class="n">logger</span><span class="o">.</span><span class="n">core</span>
<span class="k">override</span> <span class="k">def</span> <span class="n">fieldBuilder</span><span class="k">:</span> <span class="kt">FB</span> <span class="o">=</span> <span class="n">logger</span><span class="o">.</span><span class="n">fieldBuilder</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p>And we're good to go. <code class="language-plaintext highlighter-rouge">LoggingReceive</code> will take an <code class="language-plaintext highlighter-rouge">Any</code> because it's untyped, so that gets rendered with a standard <code class="language-plaintext highlighter-rouge">toString</code>.</p>
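<p>Wiring that up looks something like this (a sketch: <code class="language-plaintext highlighter-rouge">LoggingReceive</code> here is the <code class="language-plaintext highlighter-rouge">akka.echopraxia.actor</code> variant, which I'm assuming mirrors the block syntax of Akka's own <code class="language-plaintext highlighter-rouge">LoggingReceive</code>):</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class MyActor extends LoggingActor {
  // the implicit adapter hands LoggingReceive the actor's core logger
  // and field builder
  private implicit val adapter: EchopraxiaLoggingAdapter[FieldBuilderType] =
    EchopraxiaLoggingAdapter(log)

  override def receive: Receive = LoggingReceive {
    case msg => sender() ! msg
  }
}
</code></pre></div></div>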
<h2 id="akka-typed">Akka Typed</h2>
<p>For Akka Typed, it's pretty much the same thing. We'll need an <code class="language-plaintext highlighter-rouge">AkkaTypedFieldBuilder</code> similar to the <code class="language-plaintext highlighter-rouge">AkkaFieldBuilder</code>, but working with typed <code class="language-plaintext highlighter-rouge">Behavior[T]</code> means that we can require the message passed through to have a <code class="language-plaintext highlighter-rouge">fieldBuilder.ToValue</code> defined on it, which is easiest to do on the logger itself through implicits rather than through <code class="language-plaintext highlighter-rouge">Behaviors.logMessages</code>.</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">val</span> <span class="n">logger</span> <span class="k">=</span> <span class="nc">LoggerFactory</span><span class="o">.</span><span class="n">getLogger</span><span class="o">.</span><span class="n">withFieldBuilder</span><span class="o">(</span><span class="nc">MyFieldBuilder</span><span class="o">).</span><span class="n">withActorContext</span><span class="o">(</span><span class="n">context</span><span class="o">)</span>
<span class="c1">// Then log SayHello messages
</span><span class="n">logger</span><span class="o">.</span><span class="n">debugMessages</span><span class="o">[</span><span class="kt">SayHello</span><span class="o">]</span> <span class="o">{</span>
<span class="nc">Behaviors</span><span class="o">.</span><span class="n">receiveMessage</span> <span class="o">{</span> <span class="n">message</span> <span class="k">=></span>
<span class="k">val</span> <span class="n">replyTo</span> <span class="k">=</span> <span class="n">context</span><span class="o">.</span><span class="n">spawn</span><span class="o">(</span><span class="nc">GreeterBot</span><span class="o">(</span><span class="n">max</span> <span class="k">=</span> <span class="mi">3</span><span class="o">),</span> <span class="n">message</span><span class="o">.</span><span class="n">name</span><span class="o">)</span>
<span class="n">greeter</span> <span class="o">!</span> <span class="nc">Greeter</span><span class="o">.</span><span class="nc">Greet</span><span class="o">(</span><span class="n">message</span><span class="o">.</span><span class="n">name</span><span class="o">,</span> <span class="n">replyTo</span><span class="o">)</span>
<span class="nc">Behaviors</span><span class="o">.</span><span class="n">same</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p>Okay, but how does that work?</p>
<p>Well, there's a <code class="language-plaintext highlighter-rouge">Behaviors.intercept</code> method that does what we want. This isn't in the documentation, but it is part of the <a href="https://github.com/akka/akka/blob/v2.6.20/akka-actor-typed/src/main/scala/akka/actor/typed/scaladsl/Behaviors.scala#L159">public API</a>, so we can use it to install a <code class="language-plaintext highlighter-rouge">LogMessagesInterceptor</code>:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">object</span> <span class="nc">Implicits</span> <span class="o">{</span>
<span class="k">implicit</span> <span class="k">class</span> <span class="nc">AkkaLoggerOps</span><span class="o">[</span><span class="kt">FB</span> <span class="k"><:</span> <span class="kt">AkkaTypedFieldBuilder</span><span class="o">](</span><span class="n">logger</span><span class="k">:</span> <span class="kt">Logger</span><span class="o">[</span><span class="kt">FB</span><span class="o">])</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">debugMessages</span><span class="o">[</span><span class="kt">T:</span> <span class="kt">ToValue</span> <span class="kt">:</span> <span class="kt">ClassTag</span><span class="o">](</span><span class="n">behavior</span><span class="k">:</span> <span class="kt">Behavior</span><span class="o">[</span><span class="kt">T</span><span class="o">])</span><span class="k">:</span> <span class="kt">Behavior</span><span class="o">[</span><span class="kt">T</span><span class="o">]</span> <span class="k">=</span>
<span class="nc">Behaviors</span><span class="o">.</span><span class="n">intercept</span><span class="o">(()</span> <span class="k">=></span> <span class="k">new</span> <span class="nc">LogMessagesInterceptor</span><span class="o">[</span><span class="kt">T</span><span class="o">](</span><span class="nc">Level</span><span class="o">.</span><span class="nc">DEBUG</span><span class="o">,</span> <span class="n">logger</span><span class="o">))(</span><span class="n">behavior</span><span class="o">)</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="k">class</span> <span class="nc">LogMessagesInterceptor</span><span class="o">[</span><span class="kt">T:</span> <span class="kt">ToValue</span> <span class="kt">:</span> <span class="kt">ClassTag</span><span class="o">](</span><span class="k">val</span> <span class="n">level</span><span class="k">:</span> <span class="kt">Level</span><span class="o">,</span> <span class="n">logger</span><span class="k">:</span> <span class="kt">Logger</span><span class="o">[</span><span class="kt">FB</span><span class="o">])</span> <span class="k">extends</span> <span class="nc">BehaviorInterceptor</span><span class="o">[</span><span class="kt">T</span>, <span class="kt">T</span><span class="o">]</span> <span class="o">{</span>
<span class="k">import</span> <span class="nn">LogMessagesInterceptor._</span>
<span class="k">override</span> <span class="k">def</span> <span class="n">aroundReceive</span><span class="o">(</span><span class="n">ctx</span><span class="k">:</span> <span class="kt">TypedActorContext</span><span class="o">[</span><span class="kt">T</span><span class="o">],</span> <span class="n">msg</span><span class="k">:</span> <span class="kt">T</span><span class="o">,</span> <span class="n">target</span><span class="k">:</span> <span class="kt">ReceiveTarget</span><span class="o">[</span><span class="kt">T</span><span class="o">])</span><span class="k">:</span> <span class="kt">Behavior</span><span class="o">[</span><span class="kt">T</span><span class="o">]</span> <span class="k">=</span> <span class="o">{</span>
<span class="n">log</span><span class="o">(</span><span class="nc">LogMessageTemplate</span><span class="o">,</span> <span class="n">fb</span> <span class="k">=></span> <span class="o">{</span>
<span class="k">import</span> <span class="nn">fb._</span>
<span class="n">list</span><span class="o">(</span>
<span class="n">value</span><span class="o">(</span><span class="s">"self"</span> <span class="o">-></span> <span class="n">ctx</span><span class="o">.</span><span class="n">asScala</span><span class="o">.</span><span class="n">self</span><span class="o">),</span>
<span class="n">value</span><span class="o">(</span><span class="s">"message"</span> <span class="o">-></span> <span class="n">msg</span><span class="o">)</span>
<span class="o">)</span>
<span class="o">})</span>
<span class="n">target</span><span class="o">(</span><span class="n">ctx</span><span class="o">,</span> <span class="n">msg</span><span class="o">)</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p>If it doesn't have a <code class="language-plaintext highlighter-rouge">fieldBuilder.ToValue</code>, it doesn't compile, and all output uses the given field builder instead of a global unstructured <code class="language-plaintext highlighter-rouge">toString</code>. Finally, at long last.</p>
<h2 id="akka-streams">Akka Streams</h2>
<p>Integrating Echopraxia with Akka Streams is a bit different, as it involves type enrichment on the <code class="language-plaintext highlighter-rouge">SourceOps</code> and <code class="language-plaintext highlighter-rouge">FlowOps</code> methods (and their <code class="language-plaintext highlighter-rouge">Context</code> equivalents). This follows from the <a href="https://doc.akka.io/docs/akka/current/stream/stream-customize.html#extending-flow-operators-with-custom-operators">custom operator</a> suggestions given in the docs:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">trait</span> <span class="nc">Implicits</span> <span class="o">{</span>
<span class="k">implicit</span> <span class="k">class</span> <span class="nc">SourceLogging</span><span class="o">[</span><span class="kt">Out</span>, <span class="kt">Mat</span><span class="o">](</span><span class="n">s</span><span class="k">:</span> <span class="kt">Source</span><span class="o">[</span><span class="kt">Out</span>, <span class="kt">Mat</span><span class="o">])</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">elog</span><span class="o">[</span><span class="kt">FB</span> <span class="k"><:</span> <span class="kt">AkkaStreamFieldBuilder</span><span class="o">](</span><span class="k">implicit</span> <span class="n">log</span><span class="k">:</span> <span class="kt">EchopraxiaLoggingAdapter</span><span class="o">[</span><span class="kt">FB</span><span class="o">])</span><span class="k">:</span> <span class="kt">SourceLoggingStage</span><span class="o">[</span><span class="kt">FB</span>, <span class="kt">Out</span>, <span class="kt">Mat</span><span class="o">]</span> <span class="k">=</span> <span class="o">{</span>
<span class="k">new</span> <span class="nc">SourceLoggingStage</span><span class="o">(</span><span class="n">s</span><span class="o">,</span> <span class="n">log</span><span class="o">.</span><span class="n">core</span><span class="o">,</span> <span class="n">log</span><span class="o">.</span><span class="n">fieldBuilder</span><span class="o">)</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="k">implicit</span> <span class="k">class</span> <span class="nc">SourceWithContextLogging</span><span class="o">[</span><span class="kt">Out</span>, <span class="kt">Ctx</span>, <span class="kt">Mat</span><span class="o">](</span><span class="n">s</span><span class="k">:</span> <span class="kt">SourceWithContext</span><span class="o">[</span><span class="kt">Out</span>, <span class="kt">Ctx</span>, <span class="kt">Mat</span><span class="o">])</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">elog</span><span class="o">[</span><span class="kt">FB</span> <span class="k"><:</span> <span class="kt">AkkaStreamFieldBuilder</span><span class="o">](</span><span class="k">implicit</span> <span class="n">log</span><span class="k">:</span> <span class="kt">EchopraxiaLoggingAdapter</span><span class="o">[</span><span class="kt">FB</span><span class="o">])</span><span class="k">:</span> <span class="kt">SourceWithContextLoggingStage</span><span class="o">[</span><span class="kt">FB</span>, <span class="kt">Out</span>, <span class="kt">Ctx</span>, <span class="kt">Mat</span><span class="o">]</span> <span class="k">=</span> <span class="o">{</span>
<span class="k">new</span> <span class="nc">SourceWithContextLoggingStage</span><span class="o">(</span><span class="n">s</span><span class="o">,</span> <span class="n">log</span><span class="o">.</span><span class="n">core</span><span class="o">,</span> <span class="n">log</span><span class="o">.</span><span class="n">fieldBuilder</span><span class="o">)</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="k">implicit</span> <span class="k">class</span> <span class="nc">FlowLogging</span><span class="o">[</span><span class="kt">In</span>, <span class="kt">Out</span>, <span class="kt">Mat</span><span class="o">](</span><span class="n">f</span><span class="k">:</span> <span class="kt">Flow</span><span class="o">[</span><span class="kt">In</span>, <span class="kt">Out</span>, <span class="kt">Mat</span><span class="o">])</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">elog</span><span class="o">[</span><span class="kt">FB</span> <span class="k"><:</span> <span class="kt">AkkaStreamFieldBuilder</span><span class="o">](</span><span class="k">implicit</span> <span class="n">log</span><span class="k">:</span> <span class="kt">EchopraxiaLoggingAdapter</span><span class="o">[</span><span class="kt">FB</span><span class="o">])</span><span class="k">:</span> <span class="kt">FlowLoggingStage</span><span class="o">[</span><span class="kt">FB</span>, <span class="kt">In</span>, <span class="kt">Out</span>, <span class="kt">Mat</span><span class="o">]</span> <span class="k">=</span> <span class="o">{</span>
<span class="k">new</span> <span class="nc">FlowLoggingStage</span><span class="o">(</span><span class="n">f</span><span class="o">,</span> <span class="n">log</span><span class="o">.</span><span class="n">core</span><span class="o">,</span> <span class="n">log</span><span class="o">.</span><span class="n">fieldBuilder</span><span class="o">)</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="k">implicit</span> <span class="k">class</span> <span class="nc">FlowWithContextLogging</span><span class="o">[</span><span class="kt">In</span>, <span class="kt">Out</span>, <span class="kt">Ctx</span>, <span class="kt">Mat</span><span class="o">](</span><span class="n">flow</span><span class="k">:</span> <span class="kt">FlowWithContext</span><span class="o">[</span><span class="kt">In</span>, <span class="kt">Ctx</span>, <span class="kt">Out</span>, <span class="kt">Ctx</span>, <span class="kt">Mat</span><span class="o">])</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">elog</span><span class="o">[</span><span class="kt">FB</span> <span class="k"><:</span> <span class="kt">AkkaStreamFieldBuilder</span><span class="o">](</span><span class="k">implicit</span> <span class="n">log</span><span class="k">:</span> <span class="kt">EchopraxiaLoggingAdapter</span><span class="o">[</span><span class="kt">FB</span><span class="o">])</span><span class="k">:</span> <span class="kt">FlowWithContextLoggingStage</span><span class="o">[</span><span class="kt">FB</span>, <span class="kt">In</span>, <span class="kt">Out</span>, <span class="kt">Ctx</span>, <span class="kt">Mat</span><span class="o">]</span> <span class="k">=</span> <span class="o">{</span>
<span class="k">new</span> <span class="nc">FlowWithContextLoggingStage</span><span class="o">(</span><span class="n">flow</span><span class="o">,</span> <span class="n">log</span><span class="o">.</span><span class="n">core</span><span class="o">,</span> <span class="n">log</span><span class="o">.</span><span class="n">fieldBuilder</span><span class="o">)</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p>Because the <code class="language-plaintext highlighter-rouge">SourceLoggingStage</code> is not a <code class="language-plaintext highlighter-rouge">SourceOps</code> or <code class="language-plaintext highlighter-rouge">FlowOps</code> itself, it does require an <code class="language-plaintext highlighter-rouge">elog.info("name")</code> to close the loop. This takes out the implicit <code class="language-plaintext highlighter-rouge">LoggingOptions</code> (which I really don't like) and allows for <code class="language-plaintext highlighter-rouge">elog.withCondition</code> and <code class="language-plaintext highlighter-rouge">elog.withFields</code> similar to the <code class="language-plaintext highlighter-rouge">Logger</code> API.</p>
<p>Each logging stage class breaks down into a call to <code class="language-plaintext highlighter-rouge">EchopraxiaLog</code>, which has structured logging for the graph stage and exposes the graph stage operation as <code class="language-plaintext highlighter-rouge">operationKey</code>.</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">final</span> <span class="k">case</span> <span class="k">class</span> <span class="nc">EchopraxiaLog</span><span class="o">[</span><span class="kt">FB</span> <span class="k"><:</span> <span class="kt">AkkaStreamFieldBuilder</span>, <span class="kt">T</span><span class="o">](</span><span class="n">extract</span><span class="k">:</span> <span class="o">(</span><span class="kt">FB</span><span class="o">,</span> <span class="kt">T</span><span class="o">)</span> <span class="k">=></span> <span class="nc">Field</span><span class="o">)</span>
<span class="k">extends</span> <span class="nc">SimpleLinearGraphStage</span><span class="o">[</span><span class="kt">T</span><span class="o">]</span> <span class="o">{</span>
<span class="k">override</span> <span class="k">def</span> <span class="n">createLogic</span><span class="o">(</span><span class="n">inheritedAttributes</span><span class="k">:</span> <span class="kt">Attributes</span><span class="o">)</span><span class="k">:</span> <span class="kt">GraphStageLogic</span> <span class="o">=</span>
<span class="k">new</span> <span class="nc">GraphStageLogic</span><span class="o">(</span><span class="n">shape</span><span class="o">)</span> <span class="k">with</span> <span class="nc">OutHandler</span> <span class="k">with</span> <span class="nc">InHandler</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">decider</span><span class="k">:</span> <span class="kt">Decider</span> <span class="o">=</span> <span class="n">inheritedAttributes</span><span class="o">.</span><span class="n">mandatoryAttribute</span><span class="o">[</span><span class="kt">SupervisionStrategy</span><span class="o">].</span><span class="n">decider</span>
<span class="k">override</span> <span class="k">def</span> <span class="n">onPush</span><span class="o">()</span><span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="o">{</span>
<span class="k">try</span> <span class="o">{</span>
<span class="k">val</span> <span class="n">elem</span> <span class="k">=</span> <span class="n">grab</span><span class="o">(</span><span class="n">in</span><span class="o">)</span>
<span class="n">log</span><span class="o">.</span><span class="n">log</span><span class="o">(</span><span class="n">level</span><span class="o">,</span> <span class="s">"[{}]: {} {}"</span><span class="o">,</span> <span class="o">(</span><span class="n">fb</span><span class="k">:</span> <span class="kt">FB</span><span class="o">)</span> <span class="k">=></span> <span class="n">fb</span><span class="o">.</span><span class="n">list</span><span class="o">(</span>
<span class="n">fb</span><span class="o">.</span><span class="n">string</span><span class="o">(</span><span class="n">nameKey</span><span class="o">,</span> <span class="n">name</span><span class="o">),</span>
<span class="n">fb</span><span class="o">.</span><span class="n">string</span><span class="o">(</span><span class="n">operationKey</span><span class="o">,</span> <span class="s">"push"</span><span class="o">),</span>
<span class="n">extract</span><span class="o">(</span><span class="n">fb</span><span class="o">,</span> <span class="n">elem</span><span class="o">)</span>
<span class="o">),</span> <span class="n">fieldBuilder</span><span class="o">)</span>
<span class="n">push</span><span class="o">(</span><span class="n">out</span><span class="o">,</span> <span class="n">elem</span><span class="o">)</span>
<span class="o">}</span> <span class="k">catch</span> <span class="o">{</span>
<span class="k">case</span> <span class="nc">NonFatal</span><span class="o">(</span><span class="n">ex</span><span class="o">)</span> <span class="k">=></span>
<span class="n">decider</span><span class="o">(</span><span class="n">ex</span><span class="o">)</span> <span class="k">match</span> <span class="o">{</span>
<span class="k">case</span> <span class="nc">Supervision</span><span class="o">.</span><span class="nc">Stop</span> <span class="k">=></span> <span class="n">failStage</span><span class="o">(</span><span class="n">ex</span><span class="o">)</span>
<span class="k">case</span> <span class="k">_</span> <span class="k">=></span> <span class="n">pull</span><span class="o">(</span><span class="n">in</span><span class="o">)</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="c1">// ...
</span> <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">extract</code> method means that you can decide how you want to render the field, using <code class="language-plaintext highlighter-rouge">fb.value</code> or <code class="language-plaintext highlighter-rouge">keyValue</code>:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">val</span> <span class="n">s</span> <span class="k">=</span> <span class="nc">Source</span><span class="o">(</span><span class="mi">1</span> <span class="n">to</span> <span class="mi">4</span><span class="o">)</span>
<span class="o">.</span><span class="n">elog</span>
<span class="o">.</span><span class="n">withCondition</span><span class="o">(</span><span class="n">condition</span><span class="o">)</span>
<span class="o">.</span><span class="n">withFields</span><span class="o">(</span><span class="n">fb</span> <span class="k">=></span> <span class="n">fb</span><span class="o">.</span><span class="n">keyValue</span><span class="o">(</span><span class="s">"foo"</span><span class="o">,</span> <span class="s">"bar"</span><span class="o">))</span>
<span class="o">.</span><span class="n">info</span><span class="o">(</span><span class="s">"before"</span><span class="o">,</span> <span class="o">(</span><span class="n">fb</span><span class="o">,</span> <span class="n">el</span><span class="o">)</span> <span class="k">=></span> <span class="n">fb</span><span class="o">.</span><span class="n">keyValue</span><span class="o">(</span><span class="s">"elem"</span><span class="o">,</span> <span class="n">el</span><span class="o">))</span>
</code></pre></div></div>
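<p>Since the stage logs through the fixed <code class="language-plaintext highlighter-rouge">"[{}]: {} {}"</code> template shown above, the choice shows up in the third placeholder. A minimal sketch, with the line-oriented output in comments as I'd expect it to render:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// with fb.keyValue("elem", el) a line renders as: [before]: push elem=1
// with fb.value("elem", el) it renders as:        [before]: push 1
// the JSON output contains "elem": 1 either way
val s = Source(1 to 4)
  .elog
  .info("before", (fb, el) => fb.value("elem", el))
</code></pre></div></div>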
<p>One of my hopes was that I could represent a <code class="language-plaintext highlighter-rouge">Flow</code> as a tree through structured logging. From what I can tell, this isn't possible: streams are "traversable," and a traversal builder doesn't have a complete picture of the flow before it's built – I can render the inlets and outlets, but that's not the same thing. It's the same problem with functions – referential transparency means you have to use something like <a href="https://typelevel.org/blog/2013/10/18/treelog.html">treelog</a> to describe a computation.</p>
<h2 id="unexpected-wins">Unexpected Wins</h2>
<p>It's really nice to see how the early bets have paid off. One unexpected win was being able to resolve <code class="language-plaintext highlighter-rouge">ByteString</code> as a structured <code class="language-plaintext highlighter-rouge">utf8</code> string instead of an array of bytes. The default implementation contains both the byte length and the UTF-8 string:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">override</span> <span class="k">implicit</span> <span class="k">val</span> <span class="n">byteStringToValue</span><span class="k">:</span> <span class="kt">ToValue</span><span class="o">[</span><span class="kt">ByteString</span><span class="o">]</span> <span class="k">=</span> <span class="n">bs</span> <span class="k">=></span> <span class="nc">ToObjectValue</span><span class="o">(</span>
<span class="n">keyValue</span><span class="o">(</span><span class="s">"length"</span> <span class="o">-></span> <span class="n">bs</span><span class="o">.</span><span class="n">length</span><span class="o">),</span>
<span class="n">keyValue</span><span class="o">(</span><span class="s">"utf8String"</span> <span class="o">-></span> <span class="n">bs</span><span class="o">.</span><span class="n">utf8String</span><span class="o">)</span>
<span class="o">)</span>
</code></pre></div></div>
<p>Being able to render <code class="language-plaintext highlighter-rouge">ByteString</code> as either binary-focused or text-focused output, depending on which field builder is passed in, is great.</p>
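<p>A binary-focused builder is just another implementation of the same implicit. Here's a minimal sketch – <code class="language-plaintext highlighter-rouge">HexAkkaFieldBuilder</code> is a hypothetical name, and the hex rendering is illustrative:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code>trait HexAkkaFieldBuilder extends DefaultAkkaFieldBuilder {
  // render the raw bytes as hex pairs instead of decoding to UTF-8
  override implicit val byteStringToValue: ToValue[ByteString] = bs =>
    ToValue(bs.toArray.map(b => f"$b%02x").mkString(" "))
}

object HexAkkaFieldBuilder extends HexAkkaFieldBuilder
</code></pre></div></div>
<p>Swapping this builder in changes every <code class="language-plaintext highlighter-rouge">ByteString</code> field in the output without touching the logging statements themselves.</p>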
<p>Another win is the <a href="https://github.com/tersesystems/echopraxia-plusscala#automatic-type-class-derivation">automatic type class derivation</a> – since messages are typically case classes, it's trivial to add mappings in field builders for them.</p>
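<p>For example – a sketch of semi-automatic derivation based on the plusscala generic module, where <code class="language-plaintext highlighter-rouge">SayHello</code> and the builder names are illustrative:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import com.tersesystems.echopraxia.plusscala.generic._

final case class SayHello(name: String)

// gen derives the ToValue mapping from the case class fields
trait MyFieldBuilder extends DefaultAkkaFieldBuilder with SemiAutoDerivation {
  implicit lazy val sayHelloToValue: ToValue[SayHello] = gen[SayHello]
}

object MyFieldBuilder extends MyFieldBuilder
</code></pre></div></div>
<p>With that in scope, any logger requiring a <code class="language-plaintext highlighter-rouge">ToValue[SayHello]</code> – such as the typed <code class="language-plaintext highlighter-rouge">debugMessages</code> interceptor above – compiles without a hand-written mapping.</p>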