pageserver: add per-timeline read amp histogram #10566

erikgrinaker · 2025-01-29T17:31:42Z

Problem

We don't have per-timeline observability for read amplification.

Touches https://github.com/neondatabase/cloud/issues/23283.

Summary of changes

Add a per-timeline pageserver_layers_per_read histogram.

NB: per-timeline histograms are expensive, but probably worth it in this case.

skyzh

We need to check with the infra team whether they are fine with a per-timeline histogram. I remember it overloading the metrics system before (i.e., it would take a minute to scrape pageserver metrics).

github-actions · 2025-01-29T19:01:42Z

7414 tests run: 7063 passed, 0 failed, 351 skipped (full report)

Flaky tests (8)

Postgres 17

test_pageserver_gc_compaction_smoke[with_branches]: release-x86-64-without-lfc
test_pgdata_import_smoke[None-1024-RelBlockSize.MULTIPLE_RELATION_SEGMENTS]: debug-x86-64-without-lfc, release-arm64-without-lfc, release-arm64-with-lfc
test_pgdata_import_smoke[8-1024-RelBlockSize.MULTIPLE_RELATION_SEGMENTS]: release-arm64-with-lfc, release-arm64-without-lfc, debug-x86-64-without-lfc
test_isolation[None]: release-x86-64-with-lfc

Code coverage* (full report)

functions: 33.4% (8511 of 25500 functions)
lines: 49.1% (71477 of 145523 lines)

* collected from Rust tests only

The comment gets automatically updated with the latest test results (8617a8d at 2025-01-30T11:32:48.632Z).

erikgrinaker · 2025-01-29T21:35:56Z

I split out the fix of the global pageserver_layers_per_read_global metric to #10573, and kept this only for the per-timeline metric, to discuss if the cardinality is worth it. Will revisit tomorrow.

## Problem

The current global `pageserver_layers_visited_per_vectored_read_global` metric does not appear to accurately measure read amplification. It divides the layer count by the number of reads in a batch, which means that e.g. 10 reads with 100 L0 layers will only measure a read amp of 10 per read, while the actual read amp was 100. While the cost of layer visits is amortized across the batch, and some layers may not intersect with a given key, each visited layer contributes directly to the observed latency for every read in the batch, which is what we care about.

Touches neondatabase/cloud#23283.

Extracted from #10566.

## Summary of changes

* Count the number of layers visited towards each read in the batch, instead of the average across the batch.
* Rename `pageserver_layers_visited_per_vectored_read_global` to `pageserver_layers_per_read_global`.
* Reduce the read amp log warning threshold from 512 to 100.
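
To make the difference between the two accounting schemes concrete, here is a minimal Python sketch of the arithmetic described in the #10573 summary. The function names are hypothetical and are not taken from the pageserver code (which is written in Rust); the sketch only reproduces the "10 reads with 100 L0 layers" example.

```python
def averaged_layers_per_read(layers_visited: int, reads_in_batch: int) -> float:
    """Old accounting as described above: divide the layer count by the batch size."""
    if reads_in_batch <= 0:
        raise ValueError("a batch must contain at least one read")
    return layers_visited / reads_in_batch


def counted_layers_per_read(layers_visited: int, reads_in_batch: int) -> float:
    """New accounting as described above: every visited layer counts towards each read."""
    if reads_in_batch <= 0:
        raise ValueError("a batch must contain at least one read")
    return float(layers_visited)


if __name__ == "__main__":
    # The example from the description: a batch of 10 reads visiting 100 layers.
    print(averaged_layers_per_read(100, 10))  # 10.0
    print(counted_layers_per_read(100, 10))   # 100.0
```

The second function ignores the batch size on purpose: the point of the change is that batching does not reduce the number of layers each read has to wait for.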

erikgrinaker · 2025-01-30T10:21:27Z

To reduce the cardinality, I lowered the resolution here to:

[1.0, 5.0, 10.0, 25.0, 50.0, 100.0]

The global metric has slightly better resolution:

[1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0, 512.0, 1024.0]

I think that should be enough to give us a rough idea of timeline read amp. If the global metric shows e.g. >256, we can likely narrow it down to a handful of tenants with the per-timeline histogram.
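
As a rough illustration of the trade-off between the two bucket lists, the Python sketch below counts how many time series one histogram label set produces and shows which bucket a given read amp value falls into first. It assumes the standard Prometheus histogram exposition (one `_bucket` series per upper bound, plus `le="+Inf"`, `_sum` and `_count`). The helper names and the timeline count are made up for illustration and do not come from the PR.

```python
from bisect import bisect_left

# Bucket bounds quoted in the comment above.
PER_TIMELINE_BUCKETS = [1.0, 5.0, 10.0, 25.0, 50.0, 100.0]
GLOBAL_BUCKETS = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0, 512.0, 1024.0]


def series_per_label_set(buckets: list[float]) -> int:
    """Series emitted per label set: each bound, le="+Inf", _sum and _count."""
    return len(buckets) + 3


def smallest_bucket(buckets: list[float], value: float) -> str:
    """Return the smallest upper bound with value <= bound, or "+Inf" if none."""
    idx = bisect_left(buckets, value)
    return str(buckets[idx]) if idx < len(buckets) else "+Inf"


if __name__ == "__main__":
    hypothetical_timelines = 10_000  # assumption, not a figure from the PR
    print(series_per_label_set(PER_TIMELINE_BUCKETS) * hypothetical_timelines)
    print(series_per_label_set(GLOBAL_BUCKETS))
    # A read amp of 300 is only resolved by the global buckets; the
    # per-timeline histogram can just report that it exceeded 100.
    print(smallest_bucket(GLOBAL_BUCKETS, 300.0))
    print(smallest_bucket(PER_TIMELINE_BUCKETS, 300.0))
```

Under these assumptions the per-timeline histogram costs 9 series per timeline against 14 for a single global label set, and any timeline with read amp above 100 shows up in its `+Inf` bucket, which is enough to identify the offending timelines even though the exact magnitude is only visible in the global metric.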

I'm going to merge this as-is; we can always change or remove it later if we need to.

erikgrinaker requested review from problame and skyzh January 29, 2025 17:31

erikgrinaker requested a review from a team as a code owner January 29, 2025 17:31

erikgrinaker changed the title ~~pageserver: improve read amp metric~~ pageserver: improve layers per read metric Jan 29, 2025

skyzh approved these changes Jan 29, 2025

erikgrinaker mentioned this pull request Jan 29, 2025

pageserver: add pageserver_deltas_per_read_global metric #10570

Merged

erikgrinaker mentioned this pull request Jan 29, 2025

pageserver: improve read amp metric #10573

Merged

erikgrinaker force-pushed the erik/layers-per-read-metric branch from d302445 to f368c55 Compare January 29, 2025 21:34

erikgrinaker changed the base branch from main to erik/layers-per-read-global January 29, 2025 21:34

erikgrinaker changed the title ~~pageserver: improve layers per read metric~~ pageserver: add per-timeline read amp metric Jan 29, 2025

erikgrinaker changed the title ~~pageserver: add per-timeline read amp metric~~ pageserver: add per-timeline read amp histogram Jan 29, 2025

Base automatically changed from erik/layers-per-read-global to main January 30, 2025 09:35

pageserver: add per-timeline read amp metric

8617a8d

erikgrinaker force-pushed the erik/layers-per-read-metric branch from f368c55 to 8617a8d Compare January 30, 2025 10:09

erikgrinaker enabled auto-merge January 30, 2025 10:21

erikgrinaker added this pull request to the merge queue Jan 30, 2025

Merged via the queue into main with commit 6a2afa0 Jan 30, 2025
82 checks passed

erikgrinaker deleted the erik/layers-per-read-metric branch January 30, 2025 11:32
