Michael Christofides: Can auto_explain (with timing) have low overhead?

Introduction

auto_explain is a very cool PostgreSQL module that can be used to log the execution plans of your slowest queries automatically. It does this by tracking every query (or a sample of them), only logging the output for those slower than a threshold you set.

This post isn’t intended to be an introduction to auto_explain, but that could be a good future post. For now, if you’re new to it, I’d recommend reading the auto_explain section of the PostgreSQL documentation.

Here we’ll be looking into the overhead of auto_explain. Many places warn about its overhead, but I’ve found concrete information difficult to come by.

The log_timing parameter is generally cited as the main culprit, but I have heard that log_buffers can also have measurable overhead. Anecdotal numbers range from “around 3%” to “about a factor of 2” (quite the range), but I’ve not seen any benchmarking.

As a result of these warnings, I suspect many don’t use auto_explain in production, or at least do so with log_timing disabled. But I have also seen serious organizations running auto_explain in production with log_timing on, which made me question whether the overhead could be relatively low in certain conditions.

Of course, collecting plans without the per-operation timing information can still be useful. Naturally it does limit you a bit when trying to work out why the query was slow though. We also currently require these per-operation timings in pgMustard, which gave me extra motivation to look into this.

Álvaro Hernández, the founder of the PostgreSQL consulting firm OnGres, did a great job last year of publicizing the impact of a slow system clock on the overhead of EXPLAIN ANALYZE. Near the end of that post, he also noted that the overhead differs between operations, using joins as an expensive example.

As such, my hypothesis is that auto_explain.log_timing can have a low overhead 1) on a system with low system clock overhead, and 2) on a transactional workload.

It would be interesting to look into a variety of system clock overheads and workloads, but that will need to wait for another day.
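
If you want a rough feel for what per-operation timing costs on your own system before reaching for auto_explain, one informal check is to run EXPLAIN ANALYZE on a representative query with and without the TIMING option and compare the reported execution times over a few runs (the pgbench_accounts table here is just an illustration, taken from the pgbench schema used later):

-- gathers actual row counts and loop counts, but skips the per-operation clock calls
EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM pgbench_accounts;

-- the default: also times every operation in the plan
EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM pgbench_accounts;

The gap between the two reported execution times gives a crude indication of how expensive the timing calls are for that particular plan shape.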

Set up

This was my set-up for these tests:

  • PostgreSQL 13.2, installed via Postgres.app 2.4.2, on macOS 10.15.7

  • 2 cores, 4GB RAM, SSD

  • Changes to postgresql.conf:

    • shared_buffers = 1GB

    • effective_cache_size = 3GB

    • maintenance_work_mem = 256MB

  • pgbench

    • Default TPC-B like workload

    • 10x scale (~160MB) – less than shared_buffers

    • 4 clients – roughly optimal for transactions per second on this set-up

    • 1800 second (30 minute) test duration

  • pg_test_timing

For now, I prioritized using a simple system I was familiar with.
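
To double-check that those postgresql.conf changes were picked up after the restart, you can ask the server directly (psql invocations shown purely as an illustration; adjust connection options to suit your setup):

michael$ psql -c "SHOW shared_buffers;"
michael$ psql -c "SHOW effective_cache_size;"
michael$ psql -c "SHOW maintenance_work_mem;"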

Here is the pg_test_timing result:

michael$ pg_test_timing
Testing timing overhead for 3 seconds.
Per loop time including overhead: 90.61 ns
Histogram of timing durations:
...

According to the pg_test_timing documentation “Good results will show most (>90%) individual timing calls take less than one microsecond”, so it looks like we’re OK on that front.
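
The default run is only three seconds; if you want a steadier number, pg_test_timing accepts a duration option, for example:

michael$ pg_test_timing -d 10

For these tests the short default run seemed sufficient.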

Results

Baseline

After a fresh install of PostgreSQL, with those parameter changes, it was time to get a baseline to compare against.

michael$ pgbench -i -s 10
...

michael$ pgbench -c 4 -T 1800
...
latency average = 1.585 ms
tps = 2523.482861 (including connections establishing)
tps = 2523.488970 (excluding connections establishing)

Which gave me a baseline average latency of 1.585ms.

Log every query plan

auto_explain is off by default. It also has a log_min_duration parameter, which defaults to -1 (log nothing). So it is normal to set log_min_duration to a threshold (in milliseconds) above which a query’s duration would make it worth investigating. As a starting point, I wanted to test logging the execution plan for every query, so I set the minimum duration to 0. Note that we are not yet tracking timings, or any other execution details for that matter. The sample_rate parameter defaults to 1 (so 100% of queries will be in our “sample”), but I wanted to include it for clarity.

So I added the following to my postgresql.conf file and restarted:

shared_preload_libraries = auto_explain
auto_explain.sample_rate = 1
auto_explain.log_min_duration = 0

Note: the sample_rate parameter can be lowered to run auto_explain on a random sample of queries (e.g. 0.1 for 10%). I’d expect this to be a very effective way to reduce the overhead, at the cost of an incomplete picture.
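
For example (not something I used in any of the tests below), explaining a random ~10% of queries would look like this:

auto_explain.sample_rate = 0.1   # explain a random ~10% of queries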

I ran pgbench with the same settings as before:

michael$ pgbench -i -s 10
...
michael$ pgbench -c 4 -T 1800

...
latency average = 1.994 ms
tps = 2005.883159 (including connections establishing)
tps = 2005.887957 (excluding connections establishing)

This resulted in an average latency of 1.994ms, about 26% slower than our baseline. This surprised me, given that the anecdotal numbers I’d heard included tracking timings, and this test did not! I believe the difference is down to the logging overhead on this system (default stderr). This test generated ~6GB of logs, which is another good reason not to log everything!
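
How big that logging penalty is will depend a lot on where the logs are being written and how fast that destination is, so this figure is very specific to my setup. You can check where your own logs go with something like:

michael$ psql -c "SELECT name, setting FROM pg_settings WHERE name IN ('log_destination', 'logging_collector', 'log_directory');"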

Log only the slowest 

I decided to try with a far more sensible threshold, this time only logging the output when the query execution time exceeds 10ms.

So the new settings are:

shared_preload_libraries = auto_explain
auto_explain.sample_rate = 1
auto_explain.log_min_duration = 10

And, after another restart:

michael$ pgbench -i -s 10
...
michael$ pgbench -c 4 -T 1800

...
latency average = 1.597 ms
tps = 2504.327824 (including connections establishing)
tps = 2504.334064 (excluding connections establishing)

This resulted in an average latency of 1.597ms, only 0.8% slower than our baseline of 1.585ms. For this workload, the threshold of 10ms meant we avoided logging ~99.9% of the query plans, only logging the absolute slowest queries and their plans (still a few hundred plans, since the test ran for half an hour). It is worth noting that explain details are still being tracked for every query, as auto_explain doesn’t know up front whether a query will exceed our threshold (see the auto_explain source code).
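
Because that tracking happens for every query regardless of whether the plan ends up in the log, it is worth confirming exactly which auto_explain settings are live before a long run. Once the library is loaded via shared_preload_libraries, its settings show up in pg_settings:

michael$ psql -c "SELECT name, setting FROM pg_settings WHERE name LIKE 'auto_explain%';"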

Overall, this seemed to me a much more realistic starting point for the log_timing test.

The overhead of analyze without timing

Turning on log_analyze will track the EXPLAIN ANALYZE details for each query. By default, this would track timing on a per-operation basis, so to start with we’ll turn log_timing off.

Here are our new settings:

shared_preload_libraries = auto_explain
auto_explain.sample_rate = 1
auto_explain.log_min_duration = 10
auto_explain.log_analyze = true
auto_explain.log_timing = false

After another restart we can run our test again:

michael$ pgbench -i -s 10
...
michael$ pgbench -c 4 -T 1800
...

latency average = 1.590 ms
tps = 2515.130430 (including connections establishing)
tps = 2515.136821 (excluding connections establishing)

This resulted in an average latency of 1.590ms, only 0.3% slower than our baseline of 1.585ms. While this is marginally faster than the previous test, I suspect the difference is within our margin of error, since we are now tracking more information (and it being faster doesn’t make any sense to me).

The overhead of timing

Let’s try the same again, but now with timing on:

shared_preload_libraries = auto_explain
auto_explain.sample_rate = 1
auto_explain.log_min_duration = 10
auto_explain.log_analyze = true
auto_explain.log_timing = true

After another restart:

michael$ pgbench -i -s 10
...
michael$ pgbench -c 4 -T 1800
...

latency average = 1.620 ms
tps = 2469.825217 (including connections establishing)
tps = 2469.831173 (excluding connections establishing)

This resulted in an average latency of 1.620ms, 2.2% slower than our baseline of 1.585ms, and 1.9% slower than without timing. This is below even the lowest anecdotal numbers I’d heard – interesting!

Bonus: the overhead of buffers, verbose, etc

Since I’d heard buffers could also have an overhead, I decided to look into that too. It turned out not to be measurable on this workload, so for brevity I have grouped it with every other available setting, in case any of them had a measurable impact. This includes switching to the (more verbose) JSON format of explain, which is currently required for reviewing plans in pgMustard.

So the new auto_explain parameters in the postgresql.conf file are:

shared_preload_libraries = auto_explain
auto_explain.sample_rate = 1
auto_explain.log_min_duration = 10
auto_explain.log_analyze = true
auto_explain.log_timing = true
auto_explain.log_buffers = true
auto_explain.log_verbose = true
auto_explain.log_triggers = true
auto_explain.log_settings = true
auto_explain.log_nested_statements = true
auto_explain.log_format = JSON

After another restart we can run our test again:

michael$ pgbench -i -s 10
...
michael$ pgbench -c 4 -T 1800
...

latency average = 1.613 ms
tps = 2479.817711 (including connections establishing)
tps = 2479.823888 (excluding connections establishing)

This resulted in an average latency of 1.613ms, 1.8% slower than our baseline of 1.585ms. As before, I can’t explain this being faster than the previous test, given it is measuring and logging more, so I think we can assume an error margin of at least a few tenths of a percent. But overall, I was pleasantly surprised that buffers et al didn’t seem to add any measurable overhead on this workload.

Summary

So, once we set a sensible minimum duration, the overhead of auto_explain on this small, transactional workload (with low system clock overhead), was less than 1% with timing off, and ~2% with timing on. This was without sampling, so details were tracked for every query, but only logged for the slowest.

The overhead was much higher, over 25%, when we ran auto_explain with a log_min_duration of 0ms (log everything), even without the analyze and timing parameters. I believe this was due to logging overhead.

Test                              Mean latency    Overhead vs baseline
Baseline                          1.585 ms        –
Min duration 0                    1.994 ms        25.8%
Min duration 10                   1.597 ms        0.8%
Analyze on, Timing off, Min 10    1.590 ms        0.3%
Analyze on, Timing on, Min 10     1.620 ms        2.2%
Everything on, Min 10             1.613 ms        1.8%

Conclusion and next steps

On this transactional workload, with low system clock overhead, the cost of auto_explain.log_timing was only around 2%. This was lower than I expected, given the warnings and numbers I’d heard, and I think would be considered a worthwhile investment for many. And remember, this was without sampling, so explain details were being tracked for every query.

Timing was the only parameter that I found to have measurable overhead on this system/workload, but remember that we had low system clock overhead and no complex queries.

As a next step, it would also be good to check these results on a box that more closely resembles what people run in production. Looking into the overhead on an analytical workload would also be very interesting, as would trying to measure the impact of different system clock overheads. Finally, I believe that a large reduction in sample_rate would bring a correspondingly large reduction in overhead (at the cost of missing a matching proportion of the slow query plans), but this would be good to measure.

It would be great to hear which, if any of those, people would be interested in seeing investigated next.

Many thanks to Nikolay Samokhvalov and Haki Benita for reviewing drafts of this post.

