Logging vs Memory

I've been continuing to work on Blindsight, focusing on performance, cleaning up some of the internals, and adding some new features like statement interpolation and event buffers.

The good news: after performance tweaks, almost all operations in Blindsight cost less than 100 ns when logging is enabled, evaluating a dynamic condition costs between 16 ns and 24 ns, and when using Condition.never it is down to 1 ns. There are fancy ways to eliminate disabled calls entirely, but I'm happy with this for now.

Allocation Costs

Assembling a benchmark suite taught me a lot about the relative costs of JVM operations. Whenever I found an optimization, it was because repeatedly checking a predicate was faster than creating a new object. Checking a predicate costs 1 to 3 ns; creating a new object never takes less than 20 ns. CPUs have been getting faster than RAM for a while now, so this makes sense, but it's another thing to see it in practice.

Creating an object is very cheap, but it's not free. When you call new Foo(), the JVM creates a new instance of Foo. Every thread has a thread-local allocation buffer (TLAB) that new objects can be placed in. If you run a profiler and see huge int[] arrays in the JVM, those are actually unused TLAB buffers waiting to be filled with new objects. If the object is too large to fit, there's the option of a slow allocation outside the TLAB, and the possibility of some TLAB waste. Once a new object is allocated, it goes through system initialization, which sets fields to default values, and then user initialization, which calls the constructor, although the JVM may coalesce the two when it can prove nobody will observe the difference. So there are initialization costs, although for an array the cost is largely dominated by metadata writes: new byte[1] is roughly 20 ns.
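To get a feel for the difference, here's a crude sketch of the comparison. This is illustrative only; the names are made up, and a real measurement should use JMH, since naive nanoTime loops are distorted by JIT warmup and dead-code elimination.

```scala
// Crude comparison sketch; real numbers should come from JMH.
object AllocCost {
  @volatile var enabled: Boolean = false

  // A predicate check is a field read: a few nanoseconds.
  def predicateCheck(): Boolean = enabled

  // An allocation is a TLAB pointer bump plus metadata writes: ~20 ns at best.
  def allocate(): Array[Byte] = new Array[Byte](1)

  // Average per-iteration time of `body`, in nanoseconds.
  def timeNanos(iterations: Int)(body: () => Any): Long = {
    val start = System.nanoTime()
    var i = 0
    while (i < iterations) { body(); i += 1 }
    (System.nanoTime() - start) / iterations
  }
}
```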

I found this out again when I started benchmarking a macro-based statement interpolator. My first implementation parsed the arguments out into parameterized logging, with an object array initialization:

val dayOfWeek = "Monday"
val temp = 72 
val statement: Statement = st"It is ${dayOfWeek} and the temperature is ${temp} degrees."

statement.message must be("It is {} and the temperature is {} degrees.")
statement.arguments must be(Array(Argument(dayOfWeek), Argument(temp)))

I thought I could make this faster by changing the macro to identify primitives that would be converted with toString anyway, and using a thread-local StringBuilder to append them, in the style of fast-string-interpolator:

statement.message must be("It is " + dayOfWeek + " and the temperature is " + temp + " degrees.")

Nope. Not only did string concatenation cost more in allocations, but the cost also increased with the number of arguments. It turns out that the best performance was 29 ns, using parameterized string literals. It's one string instead of many, there's only one array allocation, and the interpolation happens only if logging is enabled, on a different thread.
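The idea behind a parameterized statement can be sketched with a simplified stand-in (not Blindsight's actual classes): capture the template and arguments cheaply at the call site, and defer the string building until the statement is actually rendered.

```scala
// Simplified stand-in for a parameterized statement: the "{}" template and the
// arguments are stored as-is, and string building happens only on render.
final case class SimpleStatement(message: String, arguments: Seq[Any]) {
  // Replace each "{}" placeholder with the next argument, SLF4J-style.
  def render: String = {
    val sb = new StringBuilder
    var rest = message
    var i = 0
    while (rest.contains("{}") && i < arguments.length) {
      val idx = rest.indexOf("{}")
      sb.append(rest.substring(0, idx)).append(arguments(i))
      rest = rest.substring(idx + 2)
      i += 1
    }
    sb.append(rest).toString
  }
}
```

The cheap part is the construction; `render` can run on the logging backend's thread, and only when the statement is actually written out.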

This showed up again when I started looking at the fluent API. For the fluent API, accumulating a statement means holding functions until the final log statement is called:

case class BuilderImpl(
  mkrs: () => Markers,
  m: () => Message,
  args: () => Arguments,
  e: Option[Throwable]
) {
  // ...
  override def argument[T: ToArgument](instance: => T): FluentMethod.Builder = {
    copy(args = () => args() + Argument(instance))
  }

  override def log(): Unit = {
    val markers = mkrs()
    if (shouldLog(markers)) {
      val statement = e
        .map(ee => Statement(markers, m(), args(), ee))
        .getOrElse(Statement(markers, m(), args()))
      // ... hand the statement off to the underlying logger
    }
  }
}
This means that the fluent API is the slowest part of Blindsight: 274 ns when fully loaded with arguments.

Allocation Pressure

Allocation also has a more subtle cost. My initial thought was that short-lived objects are free: either they get allocated to a TLAB in Eden and are picked up in a minor GC when no longer reachable, or they are local variables subject to scalar replacement. The last time I focused on benchmarks, I was looking at overall logging costs, and pointed out that the latency and storage costs of logging are far more important than the throughput. However, there's a performance metric I didn't cover in that blog post: the memory allocation rate, and the pressure it puts on the JVM in garbage collection.

Allocation pressure is mostly mentioned in passing and shows up under different names, for example "GC pressure" or "GC churn", which makes it more difficult to track down.

Too much allocation pressure, and more minor GCs are needed to scan the live objects and move them to survivor space. This in turn means that some objects survive long enough to gain tenure and be moved to the old generation, where they are managed by major GCs, which are much more expensive. This problem again shows up under different names, such as "premature tenuring" or "premature promotion".

The interesting thing is that excessive allocation pressure is a very common problem. Kirk Pepperdine goes into this problem in The Trouble with Memory:

"You get all of these funny downstream costs that you don’t even think about. In terms of the allocation, it’s still quick. If the objects die very quickly, there’s zero cost to collect them, so that’s true. That’s what garbage collection people have been telling you all the time, “Go, don’t worry about it. Just create objects. It’s free to collect them.” It may be free to collect them, but quick times a large number does equal slow. If you have high creation rates, it’s not free to create. It may be free to collect, but it’s not free to create at the higher rate." – The Trouble with Memory

He follows up on this at Jfokus, where at 10:25 he starts talking about logging frameworks and the impact they have on performance, and repeats an anecdote about a 2x to 4x performance improvement:

So I gave my usual logging rant at a workshop that I gave. I was in the Netherlands about a year or so ago. And that night, they went and stripped all the logging out of their transactional monitoring framework that they’re using, which wasn’t getting them the performance they wanted, which is why I was there giving the workshop in the first place. And when they stripped out all their logging, the throughput jumped by a factor of four. And they’re going like, “Oh, so now our performance problem not only went away, we’ve been tuning this thing so long that actually, when they got rid of the real problem, it went much faster than they needed, like twice as fast as what they needed to go.” – The Trouble with Memory

This is not to say that logging is bad. The issue that Pepperdine mentions in detail is that logging APIs do not take memory allocation into account. He goes into this in detail in the mechanical sympathy mailing list:

I've been railing against Log4J, JDK logging and more recently LogBack for being completely inadequate as logging frameworks. I don't see that logback or SLF4J have really addressed any of the multitude of problems that are presented when using these frameworks. In addition, the similar ill thought out sloppy interfaces were proposed in JEP158 with the Unified JVM logging framework. – Thu, 13 Jun 2013 22:42:47

And discusses logging patterns and idioms as the root cause:

IM(not so)HO, all logger and in particular JDK logging has been a disaster from both an API and underlying design POV. […] The API’s in all loggers encourage programming styles that are responsible for excessive memory pressure. Though I gripe about the obscene amount of work that logging frameworks engage in, generally it’s not the logging framework that is directly responsible for memory pressure, it’s what has to be done to use the logging framework that is the source of the problem. – Thu, 26 May 2016 01:32:27

Pepperdine goes into detail into the problems with the API, notably string concatenation:

Are you suggesting that with the new version you are considering moving away from static methods and logging strings as a standard API? I think the standard log4j API can be preserved and the performance greatly improved, however this API places serious limitations on performance, semantics, and testability.

Forced string concatenation being one of the biggest. – Tue, 07 May 2013 04:35:07

This makes sense, considering the overhead of allocating a StringBuilder, and fast-string-interpolator backs this up by showing the performance savings of reusing a thread-local StringBuilder for concatenation.
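The thread-local StringBuilder trick can be sketched as follows. This is a simplified stand-in for what fast-string-interpolator generates, not its actual code; note that the final toString still allocates the result String, so the saving is in reusing the builder's internal buffer rather than allocating a fresh one per concatenation.

```scala
// Reuse one StringBuilder per thread instead of allocating a new one
// for every concatenation (the trick fast-string-interpolator uses).
object FastConcat {
  private val local = new ThreadLocal[java.lang.StringBuilder] {
    override def initialValue(): java.lang.StringBuilder =
      new java.lang.StringBuilder(512)
  }

  def concat(parts: Any*): String = {
    val sb = local.get()
    sb.setLength(0) // reset the existing buffer rather than reallocate
    parts.foreach(p => sb.append(p))
    sb.toString
  }
}
```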

Blindsight and Memory Allocation

My belief is that Blindsight handles a number of these concerns. Statements are converted to parameterized strings at compile time using statement interpolation, so there is no string concatenation in play. Conditional logging ensures that logging can be dynamically scaled up and down at runtime in response to application needs. And finally, the type class system ensures that only the exact information needed for logging is rendered, and you have the option of importing different implicit type classes.
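The type class approach can be sketched with a simplified stand-in. The ToArgument name mirrors Blindsight's type class, but this is illustrative, not the real API; the point is that each domain type renders only the fields you decide are needed for logging.

```scala
// Stand-in for a ToArgument-style type class (not Blindsight's real API).
trait ToArgument[A] { def render(a: A): String }

object ToArgument {
  implicit val forInt: ToArgument[Int] = new ToArgument[Int] {
    def render(a: Int): String = a.toString
  }

  // A hypothetical domain type: render only the fields safe for logging.
  final case class User(id: Long, name: String, password: String)
  implicit val forUser: ToArgument[User] = new ToArgument[User] {
    def render(u: User): String = s"User(id=${u.id})" // no name, no password
  }
}

object Render {
  def argumentOf[A](a: A)(implicit ta: ToArgument[A]): String = ta.render(a)
}
```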

There are a couple of things that could be changed in Blindsight: String has an implicit conversion to Message, which makes it far too easy to write logger.info(s"${herp} ${derp}") instead of logger.info(st"${herp} ${derp}") and get string concatenation under the hood. So maybe that should be more explicit, and maybe there should be an API that is completely Message-less.

That's another reason I've been on the fence about thunks and blocks. While thunks and blocks look nicer than an if statement, they use call-by-name evaluation, which involves a one-off closure object for each logging statement. For the sake of diagnostic logging, I added lazy blocks in 1.3.0, passing a function into a block that's only executed if the predicate check succeeds:

logger.debug { log =>
  val debugInfo = ...
  log(st"This can be an expensive ${response} with ${debugInfo} information")
}

I ran some benchmarks using JMH's -prof gc option. What I found confirmed my suspicions. I put together a full statement on memory performance in Blindsight, but here's the TL;DR version:

  • When disabled, lazy logging is free: no object allocation, and CPU time the same as a bare predicate check.
  • When enabled, there is indeed a small object allocation overhead.

I subsequently found Ionuț Baloșin has a post on logging patterns and their performance impact, which finds much the same thing.

If you're logging for operational purposes, e.g. an error or an event, then you may be better off using conditional logging with regular logging statements rather than lazy logging, or at least guarding with an if statement.
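The difference in evaluation semantics can be shown with a stub logger; this is a stand-in, not the Blindsight API. Eager arguments are built whether or not logging is enabled, while by-name arguments are only evaluated when enabled, at the cost of a closure allocation per call site.

```scala
// Stand-in logger that counts how often the "expensive" argument is evaluated.
class StubLogger(val debugEnabled: Boolean) {
  var evaluations = 0
  var lastMessage: Option[String] = None

  def expensive(): String = { evaluations += 1; "debug info" }

  // Eager: the argument is built before the call, enabled or not.
  def debugEager(msg: String): Unit = if (debugEnabled) lastMessage = Some(msg)

  // By-name: the argument is evaluated only when enabled,
  // at the cost of a closure object per call site.
  def debugLazy(msg: => String): Unit = if (debugEnabled) lastMessage = Some(msg)
}
```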

So now, it's time to do something fun. Let's go through what you should do to make sure you don't have memory allocation pressure problems or GC pauses.

Finding Allocation Rate

So, the first thing to do is to find the memory allocation rate. This is harder than it seems, because the JVM doesn't expose this directly:

Memory churn is an issue. But observability bad so cant see. – QCon Java Performance Panel

but there is some guidance on what to look for:

I'd say that anytime you're above one gigabyte per second - remember, we tend to allocate the buffer 100 bytes at a time. If we did it a gig at a time, each individual location takes the same amount of time, but if we're doing it 100 at a time then, of course, you have a frequency issue. And the frequency issue seems to be really bad when you cross this one gig threshold approximately. And by the time you get it below 300 megabytes per second, then pretty much I'm just going to ignore that problem. In between it's sort of like, well, I may or may decide to do something about it, depending upon what other things that are going on. – The Trouble with Memory

So the ideal allocation rate is below 300 MB/second, but generally anything under 1 GB/second is fine.

We'll find the allocation rate and add some profiling in three different ways: using Java Flight Recorder for a picture of overall health, and using async-profiler and jvm-alloc-rate-meter for finer-grained detail. Technically, you can also find the allocation rate by performing some calculations on the GC logs, but this is manual and error-prone.
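For reference, the manual GC-log calculation works roughly like this: between consecutive young collections, the bytes allocated are approximately the next collection's heap-before minus the previous collection's heap-after. A sketch, assuming you've already parsed the log into records (names here are illustrative):

```scala
object GcLogAllocRate {
  // One record per young GC, parsed out of the GC log.
  final case class YoungGc(timeSec: Double, beforeBytes: Long, afterBytes: Long)

  // Bytes allocated between consecutive GCs = next.before - prev.after,
  // summed and divided by the elapsed wall-clock time.
  def meanRateBytesPerSec(gcs: Seq[YoungGc]): Double = {
    val allocated = gcs.sliding(2).collect {
      case Seq(prev, next) => (next.beforeBytes - prev.afterBytes).toDouble
    }.sum
    val elapsed = gcs.last.timeSec - gcs.head.timeSec
    if (elapsed <= 0) 0.0 else allocated / elapsed
  }
}
```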

Add jvm-alloc-rate-meter

The first thing we need is an in-house metric of the allocation rate, using jvm-alloc-rate-meter. You can check mvnrepository for the latest version.

resolvers += "clojars" at "https://repo.clojars.org/"
libraryDependencies += "nl.grons" %% "metrics4-scala" % "4.1.9"
libraryDependencies += "com.clojure-goes-fast" % "jvm-alloc-rate-meter" % "0.1.3"

And then hook it into metrics-scala and add it as a singleton in your framework:

import java.util.concurrent.TimeUnit
import javax.inject.Inject
import com.codahale.metrics.{Histogram, SlidingTimeWindowArrayReservoir}
import jvm_alloc_rate_meter.MeterThread
import nl.grons.metrics4.scala.DefaultInstrumented
import org.slf4j.LoggerFactory
import play.api.inject.ApplicationLifecycle
import scala.concurrent.Future

class AllocService @Inject()(lifecycle: ApplicationLifecycle) extends DefaultInstrumented {
  private val logger = LoggerFactory.getLogger(getClass)

  private val hist = new Histogram(new SlidingTimeWindowArrayReservoir(10, TimeUnit.SECONDS))
  metrics.registry.register("jvm_alloc_rate", hist)

  private val t = new MeterThread(hist.update(_))
  t.start()

  logger.info("New allocation thread created")

  lifecycle.addStopHook { () =>
    logger.info("Terminating allocation meter thread")
    t.terminate()
    Future.successful(())
  }
}
After that, you can pull the histogram and log the mean allocation rate:

class AllocationRateReporter extends DefaultInstrumented {
  import com.tersesystems.blindsight._
  private val logger = LoggerFactory.getLogger

  def report(): Unit = {
    val snapshot = metrics.registry.histogram("jvm_alloc_rate").getSnapshot
    val mean = snapshot.getMean
    import DSL._
    val allocRateFields = bobj(
      "alloc_rate.mean_bsec" -> mean,
      "alloc_rate.mean_mbsec" -> mean / 1e6
    )
    logger.info("Allocation rate {}", allocRateFields)
  }
}

Or use any of the metrics reporters.

Also worth mentioning is jvm-hiccup-meter, which works like jHiccup but again lets you export the metric into your observability framework.

After that, we can look at where and how memory is being allocated. For this, we'll use Java Flight Recorder (which we can leave on all the time) and profilers (which we can attach to running processes).

Enable Flight Recording

Java Flight Recorder comes out of the box in JDK 14, and does not require Oracle licensing. We'll use the "profile" settings here, which have a bit more overhead but are very useful.

-XX:StartFlightRecording=disk=true,dumponexit=true,filename=recording.jfr,settings=profile
Since you're starting the JVM with JFR already running, you can bring up Java Mission Control and connect to the process to see the flight recording as it happens. You can see the allocation rate in the allocation tab, which provides a direct view. Keep in mind that the allocation rate bars JFR shows you are aggregated over 15 seconds by default, and you have to zoom in with the mousewheel until it gives you the MB/sec view.

There's other stuff you can do with JFR, like event streaming and flamegraphs, but we won't deal with that here.

Add Async Profiler

async-profiler is a low overhead sampling profiler that does not suffer from safepoint bias. It's lower level than JFR, and it's very good at putting together a view of where the application is spending the most time.

We want to enable the following flags:

-XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints

This will ensure that simple inlined methods will show up in the profile.

You'll also want to turn on the following options for CPU profiling:

sudo sh -c 'echo 1 > /proc/sys/kernel/perf_event_paranoid'

sudo sh -c 'echo 0 > /proc/sys/kernel/kptr_restrict'

You will need a version of the JVM with debug symbols. The easiest way to do this is by using a fastdebug build.

If you are on Linux, you can download a fastdebug binary from https://www.chriswhocodes.com/ and install it with Jabba:

jabba install fastdebug@11=tgz+https://www.chriswhocodes.com/downloads/jdk11-chriswhocodes-nightly-linux-x86_64-fastdebug.tgz
jabba use fastdebug@11

Jabba also supports the XZ format, so you can download nightly and JDK 15 builds if you want to try out bleeding edge stuff:

jabba install shipilev-fastdebug@nightly=tgx+https://builds.shipilev.net/openjdk-jdk/openjdk-jdk-latest-linux-x86_64-fastdebug.tar.xz
jabba install shipilev-fastdebug@15=tgx+https://builds.shipilev.net/openjdk-jdk15/openjdk-jdk15-latest-linux-x86_64-fastdebug.tar.xz

Running the profiler in allocation mode can be done with the following:

./profiler.sh -e alloc -d 30 -f alloc-flame-graph.svg --title "Allocation profile" --width 1600 $(pgrep -f highcpu)

This runs the profiler for 30 seconds and generates a flame graph called alloc-flame-graph.svg that contains the allocation hotspots.

Low Overhead Heap Profiling

Alternatively, low-overhead heap profiling is a new JVMTI feature in Java 11 that has some advantages over JFR's view of allocation and may be more accurate than async-profiler. You can use heapsampler to produce flame graph output.

The biggest pain is that you have to compile heapsampler yourself:

$ git clone https://github.com/odnoklassniki/jvmti-tools.git $HOME/work/jvmti-tools
$ cd $HOME/work/jvmti-tools/heapsampler
$ sdk use java 14.0.1.hs-adpt    
$ g++ -O2 -fPIC -shared -I $JAVA_HOME/include -I $JAVA_HOME/include/linux -olibheapsampler.so heapsampler.cpp

And then you can run it:

jcmd <pid> JVMTI.agent_load $HOME/work/jvmti-tools/heapsampler/libheapsampler.so [interval]

Another option: if you're using a recent version of YourKit, JEP-331 support is included out of the box, and you can go to the allocation telemetry. For a closer look, you can use the object counting mode.

Be careful when using profilers, as they can change the inlining that C2 does, so that escape analysis and scalar replacement aren't quite the same. You can use JITWatch if you're really focused on the nitty-gritty details.

Heapothesys / HeapFragger

There are a couple of tools that let you play with memory allocation rates directly, rather than trying to induce them with a load test.

Heapothesys is an Amazon Corretto tool that simulates memory allocation as a benchmark. You can use it to play with a JVM's heap allocation rate and the heap occupancy without having to run a load test explicitly, and figure out from there how the garbage collector responds. It's designed to be run standalone, so you can't just plug it into the application, but it's a great way to see how the garbage collector responds to different inputs.

HeapFragger is a utility that also drives the memory allocation rate – this has the advantage that you can run it inside your application, so you can see how the app responds when starved for resources. You'll typically run it as a Java agent:

java -Xms4g -Xmx4g -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDetails -Xloggc:g1gc.log -javaagent:HeapFragger.jar="-a 1000 -s 1024 -v" -javaagent:jHiccup.jar="-d 0 -i 1000 -l g1gc.hlog" org.heaputils.Idle -t 1000000000

But you can also call HeapFragger directly inside your code to fragment the heap deliberately.

Fixing Allocation Rate

The simplest thing to do is to add more memory, and use G1GC. Provide at least 2 GB of memory, because you want the region size to be at least 1 MB.

You can resize the young generation directly with -XX:G1MaxNewSizePercent, but doing so means that the G1GC pause time ergonomics will no longer work. You can see a detailed GC analysis here, and Monica Beckwith has a detailed description of G1GC in JDK 9.

Also be aware that there are some odd edge cases in G1GC where -XX:G1MixedGCLiveThresholdPercent=100 may be more effective.

One warning: don't use -XX:+DisableExplicitGC, as it may disable the reclamation of off-heap direct ByteBuffers. This is especially important if you're using Netty or another system that relies heavily on NIO. No idea why -XX:+ExplicitGCInvokesConcurrent isn't the default.

Tweak Inlining and Biased Locking

Inlining is the first step to general JIT optimizations, and the defaults are set too low. The latest JDK increases them.

-XX:MaxInlineLevel=18 -XX:MaxInlineSize=270 -XX:MaxTrivialSize=12

This has been reported multiple times, and there's no downside. It's even in the default sbt jvmopts now. Note that the Bloop documentation's recommendation of ParallelGC for throughput was backed out of sbt, because the worst-case scenario is Not Good, and G1GC is more reliable overall.

While you're at it, you can disable biased locking, which is a hack for the early Java collections API that used synchronized all over the place. In JDK 15, biased locking will be deprecated and disabled by default.

-XX:-UseBiasedLocking
According to Hazelcast JET, any low-latency Java application should have this disabled.

Enable Scala Inlining

Scala 2.12 comes with an inlining optimizer, but it's disabled by default and you have to specify the scope of inlining with the compiler options. For example, Blindsight runs through the inliner with the following options:

scalacOptions ++= Seq(
  "-opt:l:inline",
  "-opt-inline-from:com.tersesystems.blindsight.**" // inline only from the project's own packages
)
You do not need to add the @inline annotation to your methods. It will just work, and it will give the C2 compiler more to play with.

The downside to inlining is that debugging an inlined application is less accurate, because the inliner does not adjust the debug info in the bytecode, so the line number information doesn't always line up appropriately. In addition, libraries that are published to be consumed by other projects should not inline code from the classpath.

Turn on GC logging on tmpfs filesystem

You'll want to turn on GC logging so you can get better insight into what garbage collection is doing. This is especially useful for Shenandoah, which has very useful logging.

There is a big problem with GC logging though – it can cause large GC pauses on background IO traffic. Background IO traffic, of course, is exactly what logging produces.

For this reason, you should log to a tmpfs mount, which ensures that an fsync has no effect:

sudo mkdir -p /mnt/tmpfs
sudo mount -t tmpfs tmpfs /mnt/tmpfs

And then you can specify the file path and options safely:

-Xlog:gc*:file=/mnt/tmpfs/gc.log:time,uptime,level,tags:filecount=5,filesize=10m
You should make sure you have enough memory to store your GC logs so that the machine never goes to swap. If you want to fit more information into tmpfs, you can use zstandard with logrotate, which gives much better compression.

This does mean that you'll lose the GC logs if the entire machine crashes, but you do have the option to save off the files at intervals.

If you'd prefer not to use tmpfs at all, use a fast SSD drive for GC logging so this isn't an issue, or at least send GC logs to a different disk. Don't assume that your cloud instances have fast disk drives: the default disk you get on GCP, regional persistent disks, tops out around 240 MiB/s. That is an order of magnitude slower than the NVMe disk in your laptop.

There is also the option of logging GC events internally in the application. You can do this through the GarbageCollectorMXBean in JMX, although it does take some work to set up. The issue there is that JMX is a little awkward to deal with in the base case, and does not have a great track record of being reliable – MemoryPoolMXBean.getCollectionUsage() wouldn't work when both -XX:+UseG1GC and -XX:+ExplicitGCInvokesConcurrent were set, and the CollectionUsage.used values did not reflect the mixed GC results. These bugs are fixed in Amazon Corretto JDK 8, but aren't backported in OpenJDK.
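For completeness, basic GC counters are available from the standard GarbageCollectorMXBean without any of the notification machinery; a minimal sketch using only the JDK:

```scala
import java.lang.management.ManagementFactory

// Read per-collector name, collection count, and accumulated time (ms) via JMX.
object GcStats {
  def report(): List[(String, Long, Long)] = {
    val it = ManagementFactory.getGarbageCollectorMXBeans.iterator()
    var acc = List.empty[(String, Long, Long)]
    while (it.hasNext) {
      val bean = it.next()
      acc = (bean.getName, bean.getCollectionCount, bean.getCollectionTime) :: acc
    }
    acc.reverse
  }
}
```

Counts and times may be -1 if the collector doesn't support them; the richer (and buggier, as noted above) data lives in MemoryPoolMXBean and the notification APIs.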

A better option is to make sure you're using a version of the JVM that has Java Flight Recorder enabled. For JDK 1.8, this means downloading u262, which has the JFR Backport. After that, you can dump JFR to disk, and go through the events.

Disable hsperfdata

There's a similar write problem with the hsperfdata files used by jps and jstat. This particular bug took four months to solve, and the blog post is worth reading just for the process of tracking it down.

In this case, the location is hardcoded to /tmp, so you have two options.

The first option is to mount /tmp as tmpfs. This has been recommended as a good all-around win by Dustin Kirkland, and seems to have been the assumption by the hotspot dev team, since Solaris has been doing this since 1994.

$ echo "tmpfs /tmp tmpfs rw,nosuid,nodev" | sudo tee -a /etc/fstab
$ sudo reboot

The second option is to disable it entirely.

-XX:-UsePerfData
Turning that off means command line tools like jstat and jps won't work, but that seems a small price to pay.

Limit Heap Byte Buffers

Yet another issue that Evan Jones found has to do with heap byte buffers.

Even when you use a heap ByteBuffer for I/O, it is copied into a temporary direct ByteBuffer. The JDK caches one temporary buffer per thread, without any memory limits. As a result, if you call I/O methods with large heap ByteBuffers from multiple threads, your process can use a huge amount of additional native memory, which looks like a native memory leak. This can cause your process to unexpectedly run into memory limits and get killed.

There's a property to limit the size of the cached buffer in JDK 9 and recent versions of JDK 8:

-Djdk.nio.maxCachedBufferSize=262144
And as Jones says:

I suggest running all services with -Djdk.nio.maxCachedBufferSize=262144 to ensure the JDK doesn't cache buffers larger than 256 kB. I would really love to have this get set as the default, but unfortunately that seems unlikely.

Use Appropriate Production Logging

Much of this is already covered in the previous blog posts and in Terse Logback, but the backend configuration does have an impact on performance. In particular, you should look out for the following:

  • Use an LMAX disruptor based asynchronous appender like logstash-logback-encoder or log4j2-async.
  • Don't log to standard output / console in production. It's always going to be slower than logging to a file or network. Even if you're in a container, there are better options. (Yes, 12factor is wrong here.)
  • Use a dedicated disk for logging to avoid background IO traffic.
  • Set up metrics on your logging rates and success/failure rates.
  • Consider tying the jvm-alloc-rate-meter to a Filter or TurboFilter, so you can immediately reject messages when the allocation rate is too high.
  • Consider removing or limiting duplicate log messages so that they don't spam the logs.
  • Consider specifying maximum sizes for log messages as a way of dealing with unbounded toString writes, truncating or filtering when necessary.
  • Don't use caller info like line numbers or file in your logger configuration, as it relies on 1.8 stack trace mechanics.
  • Be aware that the cost of logging does increase with the number of fields you log.
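The filter idea from the list above can be sketched as plain logic. FilterReply here is a stand-in that mirrors Logback's enum, and the rate supplier would be fed from the jvm-alloc-rate-meter histogram; the names are illustrative, not a real Logback TurboFilter.

```scala
// Stand-in for Logback's FilterReply enum.
sealed trait FilterReply
case object Accept extends FilterReply
case object Deny extends FilterReply
case object Neutral extends FilterReply

// Deny log events outright while the allocation rate is above a threshold;
// otherwise stay neutral and let normal level checks apply.
class AllocRateGate(currentRateBytesPerSec: () => Long, thresholdBytesPerSec: Long) {
  def decide(): FilterReply =
    if (currentRateBytesPerSec() > thresholdBytesPerSec) Deny else Neutral
}
```

In a real TurboFilter, `decide` would take the marker, logger, level, and arguments, and the rate supplier would read the histogram's latest value.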

Upgrade to JDK 14

Another option that doesn't require code changes is to upgrade to JDK 14. Moving to JDK 14 will immediately improve your GC benchmarks and help reduce allocation rates, even if you stick with G1GC.

This will happen even if you change nothing else. The reason why? Better string management, with footprint and performance improvements of up to 10%.

Thanks to JEP-254 (compact strings), Strings have a lower memory footprint, because they're backed by byte[] instead of char[]. Strings containing only Latin-1 characters allocate one byte per character, and use two bytes per character only when storing other Unicode code points. Less memory, fewer problems.

Likewise, JEP-280 improves string concatenation by leveraging invokedynamic and picking one of several strategies for concatenating efficiently. You would be amazed how much effort was put into JDK 9 String concatenation.

The interesting thing about JEP-280 is that it targets string concatenation specifically at the compile stage. Since Scala 2.12's string interpolation produces plain String concatenation under the hood, Java string concatenation in JDK 9+ might actually be faster than Scala string interpolation, depending on what strategy is used. (This is also a good argument for upgrading your Scala version to at least 2.12 if you haven't already.)

Once you've upgraded to JDK 14, you have more options: you can use G1GC and turn on Graal, or you can switch to one of the new garbage collectors.

Use Graal JIT Compiler

Graal is a new JIT compiler that optimizes bytecode into machine code, like the C2 compiler in HotSpot. The Graal inliner is much more aggressive, which opens the door to more escape analysis, resulting in more scalar replacement.
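Scalar replacement is easiest to see with a small example: when an object provably never escapes a method, the JIT can replace it with its fields and skip the heap allocation entirely. A sketch of the kind of code that qualifies:

```scala
object ScalarReplacement {
  final case class Point(x: Int, y: Int)

  // Neither Point escapes this method, so after escape analysis the JIT
  // can scalar-replace them: no heap allocation happens in hot code.
  def distSquared(ax: Int, ay: Int, bx: Int, by: Int): Int = {
    val a = Point(ax, ay)
    val b = Point(bx, by)
    val dx = a.x - b.x
    val dy = a.y - b.y
    dx * dx + dy * dy
  }
}
```

The more aggressively the inliner works, the more call chains collapse into a single compilation unit where this analysis can fire.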

It's available in JDK 14 (GraalVM CE at any rate), so all you need is the following options:

-XX:+UnlockExperimentalVMOptions -XX:+UseJVMCICompiler

In particular, Graal is extremely effective for applications that produce a lot of objects, have a high degree of polymorphic calls, and make a myriad of tiny nested calls, which describes many FP-oriented Scala applications. Twitter in particular has found an 11% performance improvement using Graal.

The only downside to Graal is that it doesn't support the newer GCs. However, Graal is stable enough that it's also being used for sbt.

Use Shenandoah or ZGC

Using Shenandoah or ZGC can just make allocation problems go away. They are super fast. Typical pauses for both are less than a millisecond.

For ZGC, you should enable large pages, as it will generally yield better performance (in terms of throughput, latency and start up time) and comes with no real disadvantage, except that it's slightly more complicated to setup. (I have not been able to configure huge pages at all, but I think this may be a consequence of running in Virtualbox.)

-XX:+UnlockExperimentalVMOptions -XX:+UseZGC -XX:+UseLargePages

If you still have a high allocation rate, you may want to set -XX:ConcGCThreads so that the application does not allocate garbage faster than the GC can collect it.

For Shenandoah, you should pick an appropriate heuristic:

-XX:+UnlockExperimentalVMOptions -XX:+UseShenandoahGC -XX:ShenandoahGCHeuristics=<name>

There are some great presentations on Shenandoah in particular that explain the reasons behind various GC decisions and the implications.

Logging vs Business Events

Logging is always a side effect, so we should always be able to throw logs away and reduce logging. If you lose logs, it may be annoying, but it should never be business-impacting.

If the logs are being sent into data analysis tools like Apache Spark or Databricks, which then determine what the business should do – that's not logging. That's a critical business event. If you absolutely can't afford to lose any of them, you should be using something other than SLF4J: think about Kafka with Avro or Parquet, and scale horizontally to ensure that you can capture everything.

Conditional Logging

When the volume of logging is just too damn high, then it's time to start adding conditional logging.

Set up a Condition for logging, and set up some targeting so that you can turn the logging on and off at will – you can do this using JMX or with a feature flag solution.

Or, you can go with a dynamic sampling approach, and only log unusual events at a higher resolution.
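Dynamic sampling can be sketched as a condition that lets through some fraction of statements. This is a stand-in for a Blindsight Condition, with the rate and RNG as illustrative parameters; a real version might bump the rate for unusual events.

```scala
import scala.util.Random

// Stand-in sampling condition: log roughly `rate` fraction of statements.
final class SamplingCondition(rate: Double, rng: Random) extends (() => Boolean) {
  def apply(): Boolean = rng.nextDouble() < rate
}
```

A logger would consult this before evaluating the statement, so skipped statements cost only the predicate check.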

And then hook in the MeterThread as a Condition so that logging is conditional on memory pressure.

Diagnostic Logging

For diagnostic logging (DEBUG and INFO), you may want to apply the same trick of using a tmpfs mount and logging to JSON using a RollingFileAppender.

This will avoid the cost of writing to disk and give you something like a bounded in-memory ring buffer, but with the advantage that you can access it from external processes and can copy it to disk in the event of an error. Since this is happening in logback, you can use zstandard directly with the CompressingFileAppender, and you can even use dictionary compression on each entry to leverage greater space savings.
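The ring-buffer behavior described above can be sketched in plain code; this is a conceptual stand-in for the bounded buffer, not the tmpfs or appender setup itself.

```scala
import java.util.ArrayDeque

// Bounded in-memory log buffer: keep only the last `capacity` entries,
// dropping the oldest when full.
final class RingLogBuffer(capacity: Int) {
  private val deque = new ArrayDeque[String](capacity)

  def append(entry: String): Unit = {
    if (deque.size() >= capacity) deque.removeFirst() // drop oldest
    deque.addLast(entry)
  }

  // Snapshot of entries, oldest first.
  def dump(): List[String] = {
    var acc = List.empty[String]
    val it = deque.iterator()
    while (it.hasNext) acc = it.next() :: acc
    acc.reverse
  }
}
```

The tmpfs approach gives you the same bounded, cheap behavior, with the extra advantage that external processes can read the files and copy them out on error.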


If logging happens and you want all of it, then you're going to have to minimize the object allocation costs of logging.

Blindsight comes with an integration to Logback through logstash-logback-encoder, which serializes using Jackson. Jackson is surprisingly efficient at serialization, but it's not completely garbage free. You can log to a binary format like SMILE or CBOR, but it's not clear that it's any faster. Given that you can parse gigabytes of JSON per second, it's not like the format itself is inherently inefficient.

One alternative might be to use DSL-JSON as a backend, which has some impressive benchmarks and reportedly almost zero allocations.

Because Blindsight is written against SLF4J, you might think that using Log4J 2 in garbage-free mode would be an option. The problem is that JSONLayout is not a supported garbage-free layout, so it doesn't work with structured logging.

For me, structured logging in JSON is the ultimate killer feature. The power of having one and only one data format – JSON – that you can throw at any processing tool chain is an overwhelming advantage. Any custom made low level binary tracing format (CTF, etc) pales in comparison to the network effect advantages that JSON gives as a format.

Off Heap Tools

In the worst case, you can use frugal memory management techniques and leverage Chronicle Queue and MapDB, which have off-heap implementations that can be useful for managing super-large JSON documents. I have no experience with this, and am mildly terrified at the prospect.


If you're running into odd behavior in your application, you could have excessive memory allocation, and some of that could be caused by logging. Blindsight has features that ensure logging only happens when and how you want, with minimal overhead.

Next blog post, I'm going to put together a demo application that puts some of these ideas into practice and shows the results. Wish me luck.