Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pageserver: retry wrapper on manifest upload #10524

Merged
merged 2 commits into from
Jan 27, 2025

Conversation

jcsp
Copy link
Collaborator

@jcsp jcsp commented Jan 27, 2025

Problem

On remote storage errors (e.g. I/O timeout) uploading tenant manifest, all of compaction could fail. This is a problem IRL because we shouldn't abort compaction on a single IO error, and in tests because it generates spurious failures.

Related: https://github.com/orgs/neondatabase/projects/51/views/2?sliceBy%5Bvalue%5D=jcsp&pane=issue&itemId=93692919&issue=neondatabase%7Cneon%7C10389

Summary of changes

  • Use backoff::retry when uploading tenant manifest

@jcsp jcsp requested a review from a team as a code owner January 27, 2025 17:49
@jcsp jcsp requested review from problame and arpad-m January 27, 2025 17:49
@jcsp jcsp added t/bug Issue Type: Bug c/storage/pageserver Component: storage: pageserver labels Jan 27, 2025
Copy link

github-actions bot commented Jan 27, 2025

7429 tests run: 7042 passed, 0 failed, 387 skipped (full report)


Flaky tests (6)

Postgres 17

Postgres 16

Postgres 14

Code coverage* (full report)

  • functions: 33.5% (8500 of 25351 functions)
  • lines: 49.3% (71493 of 145086 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
9b433cb at 2025-01-27T21:10:25.381Z :recycle:

Copy link
Member

@arpad-m arpad-m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory we could move the lock holding into the inner function, so that it's not held while there is the backoff waiting. But I don't think this will be much of an issue.

@jcsp
Copy link
Collaborator Author

jcsp commented Jan 27, 2025

Yeah, I considered moving the lock in, but wouldn't want to break the fairness between multiple callers by having one of them go to the back of the queue if it's unlucky enough to hit a remote storage glitch

@jcsp jcsp enabled auto-merge January 27, 2025 19:31
@jcsp jcsp added this pull request to the merge queue Jan 27, 2025
Merged via the queue into main with commit d73f4a6 Jan 27, 2025
84 checks passed
@jcsp jcsp deleted the jcsp/issue-10389-manifest-retries branch January 27, 2025 21:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants