Application Logging in Java: Logging Costs

This is part of a series of blog posts on application logging in Java.

There are several questions that always get asked about logging.

  1. Will logging slow down my computer?
  2. How much latency does logging add?
  3. How much I/O does logging add?
  4. How much CPU does logging add?

These are interesting questions, so I put together a set of benchmarks using JMH. They're also the wrong questions.

The right questions to ask are:

  1. How much money is my logging costing me?
  2. Knowing the budget for logging, can I justify logging at INFO here?
  3. Knowing that it will cause alerts, can I justify logging at WARN here?
  4. Knowing that it will cause someone to get paged at 3 am, can I justify logging at ERROR here?

I will explain the above questions in more detail, but first let's answer the performance questions.

Microbenchmarking is typically done with JMH, but it must be done carefully. The Log4J 2 results, for example, use several bad JMH practices that make the benchmarks incorrect – the code has been corrected by this PR, but the performance page has not been updated with the corrected benchmarks, and still includes references to bugs fixed in 2017. On top of that, the benchmarks only go down to the millisecond level and do not reflect real-world conditions.

So I've put together another SLF4J Benchmark that avoids these JMH pitfalls and focuses on what happens in most code. I'm not interested in the question of how much logging is possible; I'm interested in the base case. Let's assume that we're not I/O or GC bound and CPU utilization is less than 70% – a normal, boring, ordinary application. How much CPU time do we burn in logging?
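To show the shape of the measurement, here's a minimal JMH sketch of the disabled-logging case. This is not the benchmark project's actual code; it assumes a Logback binding with the root level set to INFO, so that DEBUG statements are disabled:

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class DisabledDebugBenchmark {
    private static final Logger logger =
        LoggerFactory.getLogger(DisabledDebugBenchmark.class);

    // Benchmark state, so JMH doesn't constant-fold the argument away.
    private String name = "world";

    @Benchmark
    public void guarded() {
        if (logger.isDebugEnabled()) {
            logger.debug("hello " + name);
        }
    }

    @Benchmark
    public void raw() {
        logger.debug("hello world");
    }

    @Benchmark
    public void concatenating() {
        logger.debug("hello " + name);
    }
}

Average-time mode reported in nanoseconds is what produces numbers like the ones below.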

I ran these tests on a Dell XPS 15 9560 using an Intel® Core™ i7-7700HQ CPU @ 2.80GHz × 8 and a standard SSD. It's a nice laptop, but it's still a laptop.

For Logback, when logging is disabled:

  • Using a conditional statement such as if (logger.isDebugEnabled()) { … } takes 1.6 nanoseconds.
  • A raw statement such as logger.debug("hello world") takes 1.8 nanoseconds.
  • A statement that builds its argument with string concatenation, such as logger.debug("hello " + name), takes 60 nanoseconds, because the argument is constructed even though the logger discards it (see the sketch below).
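That 60 nanosecond case is exactly why SLF4J has parameterized messages: the template and the arguments are handed over separately, and the string is only built if the level is enabled. A minimal sketch:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ParameterizedLogging {
    private static final Logger logger =
        LoggerFactory.getLogger(ParameterizedLogging.class);

    void handle(String name) {
        // Concatenates eagerly: pays ~60 ns even when DEBUG is off.
        logger.debug("hello " + name);

        // Parameterized: the message is only formatted if DEBUG is enabled,
        // so the disabled cost stays close to the raw-statement case.
        logger.debug("hello {}", name);
    }
}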

For Logback, when logging is enabled, CPU time depends heavily on the appender:

  • With a no-op appender, logging takes between 24 and 84 nanoseconds.
  • With a disruptor-based async appender logging to no-op, between 150 and 350 nanoseconds.
  • With a straight file appender with immediateFlush=false, between 636 and 850 nanoseconds.

For Log4J 2, CPU time also depends on the appender:

  • With a no-op appender, logging takes between 135 and 244 nanoseconds.
  • With a disruptor-based async appender logging to no-op, between 860 and 1047 nanoseconds.
  • With a straight file appender with bufferedIO=true and immediateFlush=false, between 307 and 405 nanoseconds.

There's no huge difference between Log4J 2 and Logback. 1000 nanoseconds is 0.001 milliseconds. A decent HTTP response takes around 70–100 milliseconds, and a decent HTTP framework will process around 10K requests a second on an AWS c4 instance. If you're using event-based logging, you'll generate 10K logging events per second, per instance, plus a smattering of errors and warnings on top of that in production. Logging is not a significant source of latency here.
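The back-of-the-envelope arithmetic, taking the worst-case figure of 1000 nanoseconds per event:

public class LoggingOverhead {
    public static void main(String[] args) {
        double nsPerEvent = 1_000;      // worst case from the appender numbers above
        double eventsPerSec = 10_000;   // one event per request at 10K requests/second
        double cpuMsPerSec = nsPerEvent * eventsPerSec / 1_000_000;
        // 10 ms of CPU per wall-clock second: about 1% of a single core.
        System.out.printf("%.0f ms/s, %.1f%% of one core%n",
            cpuMsPerSec, cpuMsPerSec / 10);
    }
}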

Even when not contained in a conditional, the overhead is trivial in normal program flow compared to other common operations like switching between threads – see Operation Cost in CPU Cycles for more details.

However, if we wanted to log as much as possible, how much? (How much log could a logger log, if a logger could log logs?)

The appenders are measured in terms of throughput, rather than latency.

For Logback:

  • A file appender with immediateFlush=false can perform ~1789 ops/ms.
  • A disruptor-based async appender can perform ~3677 ops/ms against a no-op appender.

I'll skip the benchmarks and discussion about the ideal configuration, and just say you should use an async appender to wrap writing to file or network, and add a shutdown hook to close it safely.
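For illustration, here's one way that wiring might look if you configure Logback programmatically – a sketch, not the benchmark's configuration, and in practice you'd usually express the same thing in logback.xml:

import ch.qos.logback.classic.AsyncAppender;
import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;
import ch.qos.logback.classic.LoggerContext;
import ch.qos.logback.classic.encoder.PatternLayoutEncoder;
import ch.qos.logback.classic.spi.ILoggingEvent;
import ch.qos.logback.core.FileAppender;
import org.slf4j.LoggerFactory;

public class AsyncLoggingSetup {
    public static void main(String[] args) {
        LoggerContext context = (LoggerContext) LoggerFactory.getILoggerFactory();

        PatternLayoutEncoder encoder = new PatternLayoutEncoder();
        encoder.setContext(context);
        encoder.setPattern("%date %level [%thread] %logger{10} %msg%n");
        encoder.start();

        FileAppender<ILoggingEvent> file = new FileAppender<>();
        file.setContext(context);
        file.setName("FILE");
        file.setFile("application.log");
        file.setImmediateFlush(false);   // buffer writes; flushing happens on stop
        file.setEncoder(encoder);
        file.start();

        AsyncAppender async = new AsyncAppender();
        async.setContext(context);
        async.setName("ASYNC");
        async.addAppender(file);
        async.start();

        Logger root = context.getLogger(Logger.ROOT_LOGGER_NAME);
        root.detachAndStopAllAppenders(); // drop the default console appender
        root.setLevel(Level.INFO);
        root.addAppender(async);

        // The shutdown hook: stopping the context drains the async queue and
        // flushes the buffered file appender before the JVM exits.
        Runtime.getRuntime().addShutdownHook(new Thread(context::stop));

        root.info("async logging configured");
    }
}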

But let's look at what that throughput implies.

A file appender spitting out "hello world" to a log file at a rate of 1789 ops/ms will generate roughly 56 GiB of data in 5 minutes, or 184.32 MiB per second (the arithmetic is sketched below).

❱ ls -lah testFile.log          
-rw-r--r-- 1 wsargent wsargent 56G May 25 13:52 testFile.log

Processing that log file by counting the words in it also takes minutes.

❱ wc testFile.log
547633050  2738165250 59144369400 testFile.log
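Those numbers hang together: the wc output works out to about 108 bytes per line, and at 1789 lines per millisecond that's the quoted 184 MiB per second. A quick sanity check:

public class LogVolume {
    public static void main(String[] args) {
        // From the wc output above: 59,144,369,400 bytes / 547,633,050 lines.
        long bytesPerLine = 59_144_369_400L / 547_633_050L;   // = 108
        double bytesPerSec = 1789 * 1000.0 * bytesPerLine;    // 1789 ops/ms
        System.out.printf("%.2f MiB/s%n", bytesPerSec / (1 << 20));               // ≈ 184.26
        System.out.printf("%.1f GiB in 5 min%n", bytesPerSec * 300 / (1L << 30)); // ≈ 54
    }
}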

Let's be real here: you're not going to read through a 56 GiB file. You won't find a text editor that can open it. You're probably not going to grep it. If you have 56 GiB of data, you're going to upload it somewhere and have a logging tool index it so you can search it later.

So, a laptop can generate 56 GiB of data in five minutes… let's scale that back to something more reasonable. Say we generate 1 GB of logging a minute, on a server-class machine that's writing to the network. There are 1440 minutes in a day, so that's 1.44 TB a day.

This is where we get into asking the questions of budget.

  • With Loggly, $1285 a month lets you store 20 GB of data for 15 days.
  • For Rapid7, 300 GB of data a month will cost you $433.
  • Using Papertrail, 1500 GB of data with a 3-day searchable window will cost you $2,480 a month.
  • Sumo Logic pricing works out to $108 per GB / month and gives you 30 GB of data.
  • Logz.io says a daily volume of 100 GB costs $4600 a month.
  • LogDNA costs $1.50/GB per month with 7-day retention, so 1.44 TB comes to $2160 a month.
  • Splunk pricing is… hard to track down.

Amazon Kinesis Firehose to S3 is another possible option, but you'll have to figure out the costs yourself. Firehose ingestion costs $0.029 per GB, storing the files in S3 runs $0.023 per GB per month, and you can run Amazon Elasticsearch Service at $0.283 per hour for an m5.xlarge to be able to search those logs. Doable. Maybe.
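A rough sketch of that arithmetic – first-month list prices only, using the 1.44 TB/day scenario above, and ignoring S3 request charges, data transfer, and the storage attached to the Elasticsearch node:

public class FirehoseEstimate {
    public static void main(String[] args) {
        double gbPerDay = 1_440;                  // the 1.44 TB/day scenario above
        double firehose = gbPerDay * 30 * 0.029;  // $0.029 per GB ingested
        // S3 bills per GB-month; logs accumulate over the month, so on
        // average half of the month's 43,200 GB is sitting in the bucket.
        double s3 = gbPerDay * 30 / 2 * 0.023;
        double es = 0.283 * 24 * 30;              // one m5.xlarge, on all month
        System.out.printf("Firehose $%.0f + S3 $%.0f + Elasticsearch $%.0f = $%.0f%n",
            firehose, s3, es, firehose + s3 + es);
    }
}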

But this is for a single machine, and one visible trend in computing is toward many machines, all doing centralized logging. If you are logging indiscriminately, and running in the cloud where you scale up or down in response to demand, then that number can swing at unexpected rates.

Anticipating some of the objections: yes, this deals with raw uncompressed logs, because that's what most of these solutions deal with. And yes, you can ameliorate these costs by running your own ELK stack, but at that point you have the cost of administration to deal with.

So, logging is free. But only for the developer. All the costs of logging are incurred downstream of the developer, and logging statements at INFO level can add up fast.

This doesn't even get into the human costs of logging at WARN and ERROR levels. Spurious warnings teach operations to ignore them, and possibly to miss grey failure along with them. Spurious errors can set pagers off and cost the company time and energy. Logging levels have human costs.

Logging is free. Logs are expensive.