pageserver: add per-timeline read amp histogram #10566
We need to check with the infra team whether they are fine with a per-timeline histogram. I remember it overloading the metrics system before (i.e., it would take a minute to scrape pageserver metrics).
7414 tests run: 7063 passed, 0 failed, 351 skipped. Flaky tests (8): Postgres 17.
Results for 8617a8d at 2025-01-30T11:32:48.632Z.
I split out the fix of the global |
## Problem

The current global `pageserver_layers_visited_per_vectored_read_global` metric does not appear to accurately measure read amplification. It divides the layer count by the number of reads in a batch, but this means that e.g. 10 reads with 100 L0 layers will only measure a read amp of 10 per read, while the actual read amp was 100. While the cost of layer visits is amortized across the batch, and some layers may not intersect with a given key, each visited layer contributes directly to the observed latency for every read in the batch, which is what we care about.

Touches neondatabase/cloud#23283.

Extracted from #10566.

## Summary of changes

* Count the number of layers visited towards each read in the batch, instead of the average across the batch.
* Rename `pageserver_layers_visited_per_vectored_read_global` to `pageserver_layers_per_read_global`.
* Reduce the read amp log warning threshold down from 512 to 100.
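The difference between the two accounting schemes can be sketched with the numbers from the example above (10 reads, 100 layers visited). This is only an illustration: the function names `averaged_read_amp` and `per_read_amp` are hypothetical and are not taken from the pageserver code.

```rust
/// Old accounting (hypothetical helper): layers visited divided by the
/// number of reads in the batch, giving one averaged value per batch.
/// Note: a batch size of 0 yields NaN or infinity rather than a panic.
fn averaged_read_amp(layers_visited: u64, reads_in_batch: u64) -> f64 {
    layers_visited as f64 / reads_in_batch as f64
}

/// New accounting (hypothetical helper): every read in the batch is charged
/// with all layers visited for the batch, giving one observation per read.
fn per_read_amp(layers_visited: u64, reads_in_batch: u64) -> Vec<u64> {
    vec![layers_visited; reads_in_batch as usize]
}

fn main() {
    let layers_visited = 100;
    let reads_in_batch = 10;

    let averaged = averaged_read_amp(layers_visited, reads_in_batch);
    let per_read = per_read_amp(layers_visited, reads_in_batch);

    println!("averaged across batch: {averaged}");
    println!("observed per read: {per_read:?}");
}
```

Under the per-read scheme the histogram receives ten observations of 100 rather than a single observation of 10, which matches the latency each read actually experienced.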
To reduce the cardinality, I lowered the resolution here to:
The global metric has slightly better resolution:
I think that should be enough to give us a rough idea of timeline read amp. If the global metric shows e.g. >256, we can likely narrow it down to a handful of tenants with the per-timeline histogram. I'm going to merge this as-is; we can always change or remove it later if we need to.
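To make the cardinality concern concrete, here is a rough sketch of how the number of exported time series scales with bucket count. It relies on the standard Prometheus histogram exposition, where each label set produces one series per explicit bucket plus the `+Inf` bucket, `_sum`, and `_count`. The timeline count and bucket counts below are made-up figures for illustration, not the values used in this PR.

```rust
/// Number of time series a Prometheus histogram exposes: for each label set,
/// one series per explicit bucket, plus the `+Inf` bucket, `_sum` and `_count`.
fn histogram_series(label_sets: u64, explicit_buckets: u64) -> u64 {
    label_sets * (explicit_buckets + 3)
}

fn main() {
    // Hypothetical figures, chosen only to show the scaling.
    let timelines = 10_000;
    let coarse_buckets = 4;
    let fine_buckets = 10;

    println!(
        "coarse per-timeline histogram: {} series",
        histogram_series(timelines, coarse_buckets)
    );
    println!(
        "fine per-timeline histogram: {} series",
        histogram_series(timelines, fine_buckets)
    );
    // A global histogram has a single label set, so finer buckets are cheap.
    println!(
        "fine global histogram: {} series",
        histogram_series(1, fine_buckets)
    );
}
```

This is why a coarse per-timeline histogram paired with a finer global one is a reasonable trade-off: the per-timeline cost is multiplied by the number of timelines, while the global cost is not.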
## Problem

We don't have per-timeline observability for read amplification.

Touches https://github.com/neondatabase/cloud/issues/23283.

## Summary of changes

Add a per-timeline `pageserver_layers_per_read` histogram.

NB: per-timeline histograms are expensive, but probably worth it in this case.