Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storcon: timeline table, deletion and creation #10440

Draft
wants to merge 19 commits into
base: main
Choose a base branch
from

Conversation

arpad-m
Copy link
Member

@arpad-m arpad-m commented Jan 18, 2025

This PR adds an optional setting to the storage controller's timeline creation endpoint to also create them on the safekeepers.

In order to support individual safekeepers going offline, we immediately return from the http endpoint if news from a quorum of safekeepers reaches us that they have successfully created the timeline.

We introduce a new safekeeper reconciler task to the storage controller which reconciles outstanding timelines. For durably storing which timelines to reconcile, it uses two specific columns in the newly added timelines table: status_kind and status. Former is for quickly determining which status we want to do reconciliations for (we ask the db to maintain an index of it so that we can query specific states), latter is for status specific metadata in the json format.

TODO until ready for review:

  • notification API
  • tests

For follow-ups:

  • at tenant deletions, issue timeline deletion directly, instead of going via the reconciler: block on all timeline deletions completing enough (at least one SK returns success)
  • race-condition-free state setting function for timeline
  • cache safekeepers list instead of reloading them from the db every time
  • use utilization endpoint instead of COUNT(*) for determining target safekeepers
  • pruning task to delete timelines with old enough deleted_at

Part of #9011

@arpad-m arpad-m force-pushed the arpad/sk_timelines_schema branch from 7fe0060 to 361f491 Compare January 21, 2025 14:55
Copy link

github-actions bot commented Jan 21, 2025

7403 tests run: 7016 passed, 0 failed, 387 skipped (full report)


Flaky tests (6)

Postgres 17

Postgres 16

Postgres 14

Code coverage* (full report)

  • functions: 33.3% (8495 of 25545 functions)
  • lines: 49.0% (71476 of 145859 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
c660d78 at 2025-01-27T20:03:25.839Z :recycle:

@arpad-m arpad-m force-pushed the arpad/sk_timelines_schema branch from 46ddaaa to 0ec5046 Compare January 21, 2025 18:48
@arpad-m arpad-m force-pushed the arpad/sk_timelines_schema branch from 9ddb0dc to 5bdda19 Compare January 22, 2025 17:04
github-merge-queue bot pushed a commit that referenced this pull request Jan 22, 2025
Add APIs for timeline creation and deletion to the safekeeper client
crate. Going to be used later in #10440.

Split off from #10440.

Part of #9011
@arpad-m arpad-m force-pushed the arpad/sk_timelines_schema branch from a7bd98b to 7ec08ee Compare January 22, 2025 23:28
Copy link
Contributor

@problame problame left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just took a cursory look to prepare for tomorrow's meeting.

Comment on lines 3574 to 3577
// Can't do return Err because of async block, must do ? plus unreachable!()
return Err(ApiError::InternalServerError(anyhow!(
"import pgdata doesn't specify the start lsn, aborting creation on safekeepers"
)))?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't understand this comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah sorry I will remove it, it was from an earlier state where there was an async block instead of an async fn

Comment on lines +3553 to +3554
// notify cplane about creation
// TODO
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This todo indicates that this method is in the wrong place.

Really, this method belongs into the impl Reconciler, where we have self.compute_hook.notify(ShardUpdate{...}).
Which today calls cplane, but, architecturally, cplane doesn't actually need to know which safekeepers a compute should connect to; cplane only needs to funnel through the ShardUpdate to the compute. It could/should be an opaque blob to cplane.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact I've been thinking about extending compute_hook for this purpose because that one already more or less has a connection to the control plane.

As for where to put the hook call, I think we can already call the hook the moment a quorum of safekeepers is reached. How to determine that is a good question, I suppose we can run db queries for that. But it needs to be something that is race condition free i.e. we don't want two reconcilers finishing in parallel both issue the reconcile (or the opposite, two reconcilers finishing in parallel not issue reconciles).

tenant_id: TenantId,
timeline_info: &TimelineInfo,
create_mode: models::TimelineCreateRequestMode,
) -> Result<(u32, Vec<NodeId>), ApiError> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This naked u32 should be wrapped in a newtype so it's clear what it means.

Had to jump to caller to learn it's the safekeeper generation.

Generation numbers are super important to get right, correctness of the service depends on them.

Comment on lines +292 to +294

pub safekeepers: Option<Vec<NodeId>>,
pub safekeepers_generation: Option<u32>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would actually be better to use the same data structure here that the reconciler upcall will use (the TODO: notify cplane in tenant_timeline_create_safekeepers_reconcile)

Copy link
Contributor

@arssher arssher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

submitting superficial things on a first glance

@@ -280,6 +280,18 @@ pub struct TimelineCreateRequest {
pub new_timeline_id: TimelineId,
#[serde(flatten)]
pub mode: TimelineCreateRequestMode,
/// Whether to also create timeline on the safekeepers (specific to storcon API)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking to have this as a global command line argument -- it avoids cplane change and it doesn't seem useful to enable this only for some timelines.

#[serde(flatten)]
pub timeline_info: TimelineInfo,

pub safekeepers: Option<Vec<NodeId>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -1135,6 +1198,189 @@ impl Persistence {
})
.await
}

/// Timelines must be persisted before we schedule them for the first time.
pub(crate) async fn insert_timeline(&self, entry: TimelinePersistence) -> DatabaseResult<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's call it upsert.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not really an upsert call though, it only inserts. If the timeline already exists, it returns an error.

I think an upsert like functionality is dangerous, especially for something that happens regularly. upsert APIs are prone to race conditions.

That being said, we should return something more nice than an error if the entry already exists, so that we can use this information to make higher level calls idempotent.

.do_nothing()
.execute(conn)?;

if inserted_updated != 1 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is inserted_updated if DO NOTHING happened?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably it's 0. yeah, this case should probably be improved.


macro_rules! measured_request {
($name:literal, $method:expr, $node_id: expr, $invoke:expr) => {{
let labels = crate::metrics::PageserverRequestLabelGroup {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PageserverRequestLabelGroup

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants