
pageserver: concurrent IO on the get page read path #9378

Open
3 of 5 tasks
Tracked by #9376
VladLazar opened this issue Oct 14, 2024 · 3 comments
Assignees
problame

Labels
a/performance Area: relates to performance of the system · c/storage/pageserver Component: storage: pageserver · c/storage Component: storage

Comments

@VladLazar
Contributor

VladLazar commented Oct 14, 2024

Problem

The read path does sequential IOs.

We should issue IOs concurrently within one layer visit and move on to the next layer without waiting for IOs
from the previous visit to complete.

Tasks & Milestones

@VladLazar VladLazar added a/performance Area: relates to performance of the system c/storage Component: storage c/storage/pageserver Component: storage: pageserver labels Oct 14, 2024
@VladLazar VladLazar self-assigned this Oct 14, 2024
@problame problame assigned problame and unassigned VladLazar Nov 17, 2024
@problame
Contributor

Status update:

Next steps:

  • clean up deep_layers_with_delta.py script / make it reusable and use in batching benchmark
  • clean up ad-hoc changes made to benchmarks (new params are ok, removing param space not)
  • convince ourselves that the current design works fine with limited slots, because the background task advances the io futures ( => don't need tokio-epoll-uring#63: DRAFT: remove slots concepts, allocate for each IO instead )
  • review interaction with sparse keyspace
  • review teardown / VirtualFile lifecycle
    • basically: the VirtualFile is now wrapped in an Arc<> that lives inside the spawned IOs, potentially longer than the get_values_reconstruct_data call that spawned the IO; can it happen that compaction deletes the file from the filesystem while the Arc still exists?
  • // TODO: IO futures are not failable - we could expect
    • at a minimum, understand how exactly we're behaving if we bail on the second of three on_disk_values
  • figure out how to monitor impact in staging/prod wrt compaction debt
    • maybe just do a big ingest before and after enabling it?

@problame
Contributor

Had a call with Vlad:

convince ourselves that the current design works fine with limited slots, because the background task advances the io futures ( => don't need neondatabase/tokio-epoll-uring#63 )

We don't need it because each of the IO futures only uses one tokio-epoll-uring slot at a time, i.e., there's never a "hold tokio-epoll-uring slot and try to acquire another one".

However, that's totally non-obvious and thus brittle.
Should fix that after we have removed the Serial implementation and transitioned away from opaque IO futures to a well-defined operation like "do one vectored read against this VirtualFile and collect the results into this oneshot".
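
For illustration, a minimal sketch of what such a descriptive operation could look like (type and field names are hypothetical, not the actual pageserver types):

```rust
use std::sync::Arc;
use tokio::sync::oneshot;

// Hypothetical placeholders standing in for the real pageserver types.
struct VirtualFile;
struct OnDiskValue(Vec<u8>);

/// "Do one vectored read against this VirtualFile and collect the results
/// into this oneshot" -- expressed as data instead of an opaque future.
struct VectoredReadOp {
    file: Arc<VirtualFile>,                  // file to read from
    ranges: Vec<(u64, usize)>,               // (offset, length) pairs to read
    result_tx: oneshot::Sender<Result<Vec<OnDiskValue>, std::io::Error>>,
}
```

Expressed as data, the resource needs of an IO become visible to whoever executes it, instead of being hidden inside an opaque, compiler-generated future.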

review interaction with sparse keyspace

We convinced ourselves that concurrent IO and sparse keyspace are orthogonal.

The intersection between the two is only in the traversal code around update_key.

    /// Update the state collected for a given key.
    /// Returns true if this was the last value needed for the key and false otherwise.
    ///
    /// If the key is done after the update, mark it as such.
    ///
    /// If the key is in the sparse keyspace (i.e., aux files), we do not track them in
    /// `key_done`.
    pub(crate) fn update_key(
        &mut self,
        key: &Key,
        lsn: Lsn,
        completes: bool,
        value: tokio::sync::oneshot::Receiver<Result<OnDiskValue, std::io::Error>>,
    ) -> ValueReconstructSituation {
        let state = self.keys.entry(*key).or_default();
        let is_sparse_key = key.is_sparse();
        match state.situation {
            ValueReconstructSituation::Complete => {
                if is_sparse_key {
                    // Sparse keyspace might be visited multiple times because
                    // we don't track unmapped keyspaces.
                    return ValueReconstructSituation::Complete;
                } else {
                    unreachable!()
                }
            }
            ValueReconstructSituation::Continue => {
                state.on_disk_values.push((lsn, value));
            }
        }
        if completes && state.situation == ValueReconstructSituation::Continue {
            state.situation = ValueReconstructSituation::Complete;
            if !is_sparse_key {
                self.keys_done.add_key(*key);
            }
        }
        state.situation
    }

Before, it was called after the IO completed and used the read Value to decide whether to mark the state as complete and whether to add the key to keys_done.

Now, instead of a read Value, we have the completes argument.

Looking at the code, the two are equivalent.

review teardown / VirtualFile lifecycle

The current code extends the lifetime of the "raw" file abstractions too much, i.e.

  • Delta and Image layer's VirtualFile is extended beyond the lifetime of the DownloadedLayer
    • This can cause eviction to happen, the local file to be removed, and thus the spawned IO to fail.
    • This is a bug and needs to be fixed before merging.
  • InMemoryLayer's EphemeralFile lifetime is extended
    • No bug potential because the EphemeralFile::drop is what does the deletion from disk.

Decision: hold DownloadedLayer and InMemoryLayer alive in the spawned IO futures, instead of the "raw" file abstractions VirtualFile / EphemeralFile. That fixes the bug AND expresses more clearly to future readers of the code the implications of the spawned io futures.
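
A minimal sketch of that decision, with hypothetical stand-ins for the real types: the spawned IO future owns the layer handle, so the layer (and its file) stays alive for as long as the IO is in flight.

```rust
use std::future::Future;
use std::sync::Arc;
use tokio::sync::oneshot;

// Hypothetical stand-in: keeping a DownloadedLayer alive keeps the layer file
// resident on local disk (eviction waits until all references are gone).
struct DownloadedLayer;
impl DownloadedLayer {
    async fn read_value(&self, _offset: u64) -> Result<Vec<u8>, std::io::Error> {
        Ok(Vec::new()) // placeholder for the actual vectored read
    }
}

fn make_value_io(
    layer: Arc<DownloadedLayer>, // hold the layer handle, not the raw VirtualFile
    offset: u64,
    tx: oneshot::Sender<Result<Vec<u8>, std::io::Error>>,
) -> impl Future<Output = ()> {
    async move {
        // `layer` is moved into the future, so eviction/compaction cannot
        // delete the local file while this IO is still in flight.
        let res = layer.read_value(offset).await;
        let _ = tx.send(res); // receiver may already be gone; fine for a sketch
    }
}
```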

// TODO: IO futures are not failable - we could expect

Didn't get to this. Will DM with Vlad about it.


The other issues will be addressed post-merge when we experiment in staging:

  • new automated benchmarks utilizing L0 layer stack
  • slots performance implications
  • figure out how to monitor impact in staging/prod wrt compaction debt

@problame
Contributor

problame commented Jan 17, 2025

// TODO: IO futures are not failable - we could expect

The TODO was just an observation while writing the code.

Regarding error handling in general, what we want as part of the first cut of this PR is that when get_vectored_impl returns, even with an error, all IOs are done.
For now we can assume get_vectored_impl is polled to completion.

The above desired behavior is not the case right now.
We do poll to completion, but collect_pending_ios will wrap both mpsc RecvError and IO errors into a PagestreamError::Other and bail.

IO errors were wrapped into PagestreamError::Other even before the current state of the concurrent IO PR, so continuing to do that is fine.

But, we really want to wait for the IOs to drain.

Ideas (I will think of something):

  • poison pill message to cancel all preceding IOs?
  • cancellation token / dropguard in the ValuesReconstructData? (rough sketch below)
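
A rough sketch of the dropguard variant, purely to illustrate the idea (hypothetical shape, assuming tokio-util's CancellationToken; not the actual ValuesReconstructState):

```rust
use tokio_util::sync::CancellationToken;

// Hypothetical sketch: dropping the reconstruct state cancels all IOs that
// were spawned for it, e.g. when we bail early on a traversal error.
struct ValuesReconstructState {
    cancel_ios: CancellationToken,
    // ... per-key oneshot receivers etc. would live here ...
}

impl Drop for ValuesReconstructState {
    fn drop(&mut self) {
        // Spawned IO futures would select! on this token and exit early.
        self.cancel_ios.cancel();
    }
}
```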

github-merge-queue bot pushed a commit that referenced this issue Jan 22, 2025
## Refs

- Epic: #9378

Co-authored-by: Vlad Lazar <vlad@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>

## Problem

The read path does its IOs sequentially.
This means that if N values need to be read to reconstruct a page,
we will do N IOs and getpage latency is `O(N*IoLatency)`.

## Solution

With this PR we gain the ability to issue IO concurrently within one
layer visit **and** to move on to the next layer without waiting for IOs
from the previous visit to complete.

This is an evolved version of the work done at the Lisbon hackathon,
cf #9002.

## Design

### `will_init` now sourced from disk btree index keys

On the algorithmic level, the only change is that the
`get_values_reconstruct_data`
now sources `will_init` from the disk btree index key (which is
PS-page_cache'd), instead
of from the `Value`, which is only available after the IO completes.

### Concurrent IOs, Submission & Completion 

To separate IO submission from waiting for its completion, while
simultaneously
feature-gating the change, we introduce the notion of an `IoConcurrency`
struct
through which IO futures are "spawned".

An IO is an opaque future, and waiting for completions is handled
through
`tokio::sync::oneshot` channels.
The oneshot Receivers take the place of the `img` and `records` fields
inside `VectoredValueReconstructState`.

When we're done visiting all the layers and submitting all the IOs along
the way, we concurrently `collect_pending_ios` for each value, which means
for each value there is a future that awaits all the oneshot receivers
and then calls into walredo to reconstruct the page image.
Walredo is now invoked concurrently for each value instead of
sequentially.
Walredo itself remains unchanged.

The spawned IO futures are driven to completion by a sidecar tokio task
that
is separate from the task that performs all the layer visiting and
spawning of IOs.
That task receives the IO futures via an unbounded mpsc channel and
drives them to completion inside a `FuturesUnordered`.

(The behavior from before this PR is available through
`IoConcurrency::Sequential`,
which awaits the IO futures in place, without "spawning" or "submitting"
them
anywhere.)
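
A minimal sketch of that sidecar-task shape (not the actual pageserver code;
assumes the `futures` and `tokio` crates):

```rust
use futures::stream::{FuturesUnordered, StreamExt};
use std::future::Future;
use std::pin::Pin;
use tokio::sync::mpsc;

type IoFuture = Pin<Box<dyn Future<Output = ()> + Send>>;

// IO futures arrive over an unbounded mpsc channel and are driven to
// completion in a FuturesUnordered, independently of the spawning task.
async fn sidecar_task(mut rx: mpsc::UnboundedReceiver<IoFuture>) {
    let mut in_flight: FuturesUnordered<IoFuture> = FuturesUnordered::new();
    loop {
        tokio::select! {
            maybe_io = rx.recv() => match maybe_io {
                Some(io) => in_flight.push(io), // a newly "spawned" IO
                None => break,                  // all senders are gone
            },
            Some(()) = in_flight.next(), if !in_flight.is_empty() => {
                // one IO future finished; its result was already delivered
                // through the oneshot channel it carries
            }
        }
    }
    // drain whatever is still in flight before exiting
    while in_flight.next().await.is_some() {}
}
```

Because the sidecar task owns the `FuturesUnordered`, the spawning task can
keep visiting layers without ever being responsible for polling the IOs it
spawned.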

#### Alternatives Explored

A few words on the rationale behind having a sidecar *task* and what
alternatives were considered.

One option is to queue up all IO futures in a FuturesUnordered that is
polled
the first time when we `collect_pending_ios`.

Firstly, the IO futures are opaque, compiler-generated futures that need
to be polled at least once to submit their IO. "At least once" because
tokio-epoll-uring may not be able to submit the IO to the kernel on
first
poll right away.

Second, there are deadlocks if we don't drive the IO futures to
completion
independently of the spawning task.
The reason is that both the IO futures and the spawning task may hold
some
_and_ try to acquire _more_ shared limited resources.
For example, both spawning task and IO future may try to acquire
* a VirtualFile file descriptor cache slot async mutex (observed during
impl)
* a tokio-epoll-uring submission slot (observed during impl)
* a PageCache slot (currently this is not the case but we may move more
code into the IO futures in the future)
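
A toy illustration of that deadlock shape (not pageserver code): a shared
pool with a single slot, where the spawning task holds the slot while
awaiting, in place, an IO future that needs a slot of its own. By
construction, this program hangs.

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

#[tokio::main]
async fn main() {
    // stands in for any of the shared, limited resources listed above
    let slots = Arc::new(Semaphore::new(1));

    // The spawning task already holds the only slot...
    let held = slots.clone().acquire_owned().await.unwrap();

    let io_future = {
        let slots = slots.clone();
        async move {
            // ...and the IO future needs a slot of its own.
            let _slot = slots.acquire_owned().await.unwrap();
            // (the actual IO would happen here)
        }
    };

    // Awaiting the IO future in place (Sequential-style) deadlocks:
    // `held` is never released, so the IO future never gets a slot.
    io_future.await;
    drop(held);
}
```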

Another option is to spawn a short-lived `tokio::task` for each IO
future.
We implemented and benchmarked it during development, but found little
throughput improvement and moderate mean & tail latency degradation.
Concerns about pressure on the tokio scheduler made us discard this
variant.

The sidecar task could be obsoleted if the IOs were not arbitrary code
but a well-defined struct.
However,
1. the opaque futures approach taken in this PR allows leaving the
existing
   code unchanged, which
2. allows us to implement the `IoConcurrency::Sequential` mode for
feature-gating
   the change.

Once the new mode (sidecar task implementation) is rolled out everywhere,
and `::Sequential` removed, we can think about a descriptive submission
& completion interface.
The problems around deadlocks pointed out earlier will need to be solved
then.
For example, we could eliminate VirtualFile file descriptor cache and
tokio-epoll-uring slots.
The latter has been drafted in
neondatabase/tokio-epoll-uring#63.

See the lengthy doc comment on `spawn_io()` for more details.

### Error handling

There are two error classes during reconstruct data retrieval:
* traversal errors: index lookup, move to next layer, and the like
* value read IO errors

A traversal error fails the entire get_vectored request, as before this
PR.
A value read error only fails that value.

In any case, we preserve the existing behavior that once
`get_vectored` returns, all IOs are done. Panics and failing
to poll `get_vectored` to completion will leave the IOs dangling,
which is safe but shouldn't happen, and so, a rate-limited
log statement will be emitted at warning level.
There is a doc comment on `collect_pending_ios` giving more code-level
details and rationale.
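
A simplified sketch of the draining behavior (not the actual
`collect_pending_ios`; error mapping is condensed): even after one value IO
fails, the remaining oneshot receivers are still awaited, so no IO is left
dangling when the function returns.

```rust
use tokio::sync::oneshot;

async fn drain_value_ios(
    ios: Vec<oneshot::Receiver<Result<Vec<u8>, std::io::Error>>>,
) -> Result<Vec<Vec<u8>>, std::io::Error> {
    let mut values = Vec::with_capacity(ios.len());
    let mut first_err: Option<std::io::Error> = None;
    for rx in ios {
        match rx.await {
            Ok(Ok(value)) => values.push(value),
            // value read IO error: remember the first one, keep draining
            Ok(Err(io_err)) => {
                if first_err.is_none() {
                    first_err = Some(io_err);
                }
            }
            // sender dropped: the IO side was torn down before completing
            Err(_recv_err) => {
                if first_err.is_none() {
                    first_err = Some(std::io::Error::new(
                        std::io::ErrorKind::Other,
                        "IO dropped before completion",
                    ));
                }
            }
        }
    }
    match first_err {
        Some(e) => Err(e),
        None => Ok(values),
    }
}
```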

### Feature Gating

The new behavior is opt-in via pageserver config.
The `Sequential` mode is the default.
The only significant change in `Sequential` mode compared to before
this PR is the buffering of results in the `oneshot`s.

## Code-Level Changes

Prep work:
  * Make `GateGuard` clonable.

Core Feature:
* Traversal code: track  `will_init` in `BlobMeta` and source it from
the Delta/Image/InMemory layer index, instead of determining `will_init`
  after we've read the value. This avoids having to read the value to
  determine whether traversal can stop.
* Introduce `IoConcurrency` & its sidecar task.
  * `IoConcurrency` is the clonable handle.
  * It connects to the sidecar task via an `mpsc`.
* Plumb through `IoConcurrency` from high level code to the
  individual layer implementations' `get_values_reconstruct_data`.
  We piggy-back on the `ValuesReconstructState` for this.
   * The sidecar task should be long-lived, so, `IoConcurrency` needs
     to be rooted up "high" in the call stack.
   * Roots as of this PR:
     * `page_service`: outside of pagestream loop
     * `create_image_layers`: when it is called
     * `basebackup`(only auxfiles + replorigin + SLRU segments)
   * Code with no roots that uses `IoConcurrency::sequential`
     * any `Timeline::get` call
       * `collect_keyspace` is a good example
       * follow-up: #10460
     * `TimelineAdaptor` code used by the compaction simulator, unused in
       practice
     * `ingest_xlog_dbase_create`
* Transform Delta/Image/InMemoryLayer to
  * do their values IO in a distinct `async {}` block
  * extend the residence of the Delta/Image layer until the IO is done
  * buffer their results in a `oneshot` channel instead of straight
    in `ValuesReconstructState` 
* the `oneshot` channel is wrapped in `OnDiskValueIo` /
`OnDiskValueIoWaiter`
    types that aid in expressiveness and are used to keep track of
    in-flight IOs so we can print warnings if we leave them dangling.
* Change `ValuesReconstructState` to hold the receiving end of the
 `oneshot` channel aka `OnDiskValueIoWaiter`.
* Change `get_vectored_impl` to `collect_pending_ios` and issue walredo
concurrently, in a `FuturesUnordered`.

Testing / Benchmarking:
* Support queue-depth in pagebench for manual benchmarking.
* Add test suite support for setting concurrency mode ps config
   field via a) an env var and b) via NeonEnvBuilder.
* Hacky helper to have sidecar-based IoConcurrency in tests.
   This will be cleaned up later.

More benchmarking will happen post-merge in nightly benchmarks, plus in
staging/pre-prod.
Some intermediate helpers for manual benchmarking have been preserved in
#10466 and will be landed in
later PRs.
(L0 layer stack generator!)

Drive-By:
* test suite actually didn't enable batching by default because
`config.compatibility_neon_binpath` is always Truthy in our CI
environment
  => https://neondb.slack.com/archives/C059ZC138NR/p1737490501941309
* initial logical size calculation wasn't always polled to completion,
which was
  surfaced through the added WARN logs emitted when dropping a 
  `ValuesReconstructState` that still has inflight IOs.
* remove the timing histograms
`pageserver_getpage_get_reconstruct_data_seconds`
and `pageserver_getpage_reconstruct_seconds` because with planning,
value read
IO, and walredo happening concurrently, one can no longer attribute
latency
to any one of them; we'll revisit this when Vlad's work on
tracing/sampling
  through RequestContext lands.
* remove code related to `get_cached_lsn()`.
  The logic around this has been dead at runtime for a long time,
  ever since the removal of the materialized page cache in #8105.

## Testing

Unit tests use the sidecar task by default and run both modes in CI.
Python regression tests and benchmarks also use the sidecar task by
default.
We'll test more in staging and possibly preprod.

# Future Work

Please refer to the parent epic for the full plan.

The next step will be to fold the plumbing of IoConcurrency
into RequestContext so that the function signatures get cleaned up.

Once `Sequential` isn't used anymore, we can take the next
big leap which is replacing the opaque IOs with structs
that have well-defined semantics.

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>