storcon: track safekeepers in memory, send heartbeats to them #10583
base: main
Conversation
7414 tests run: 7055 passed, 6 failed, 353 skipped (full report)

Failures on Postgres 17
Failures on Postgres 14
Flaky tests (6): Postgres 17

Test coverage report is not available. The comment gets automatically updated with the latest test results.
31b3e8b at 2025-01-30T13:02:09.835Z
I'm looking at the code, but just noting in advance that heartbeats are somewhat premature at this point; it would be ok to choose safekeepers only by timeline count, since initially we are going to trigger migrations only manually anyway. Implementing something which tracks which nodes are alive and which are not, and acts automatically, is a separate, non-trivial task; e.g. if a safekeeper is down for a couple of minutes during a deploy, it is not necessary to move timelines off of it.
Yeah, at the start it makes sense to use the heartbeat information only for new timelines; migrating away from a safekeeper that's down can come later. The main reason why I'm sending the heartbeats is so that I can obtain the per-safekeeper timeline count, because the earlier method (doing
If I tried to do heartbeats from scratch, I'd rather make it more straightforward, without two tasks and communication between them. But we're bolting on top of existing code, okay.
heartbeat.rs would be much nicer if it had a top-level (or Heartbeater) comment describing what it is about and its API: it spawns a task which accepts heartbeat requests; explain AvailablityDeltas etc. In particular, the meaning of AvailablityDeltas is unobvious. A delta is not returned for a node if it was and still is offline, but it is returned if it was up and is up again (which is reasonable given that we collect some stats which change, but again you need to look more closely to grasp this).
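For illustration, here is a minimal standalone sketch of the delta semantics described above. All names (SkState, Heartbeater, deltas) are invented for the example and are not the PR's actual types:

```rust
use std::collections::HashMap;

/// Illustrative only: a heartbeater keeps the last observed state per node and,
/// after each round, returns only the entries worth reporting upward ("deltas").
#[derive(Clone, PartialEq, Debug)]
enum SkState {
    Offline,
    Available { timeline_count: u64 },
}

struct Heartbeater {
    last: HashMap<u64, SkState>, // node id -> last observed state
}

impl Heartbeater {
    fn new() -> Self {
        Self { last: HashMap::new() }
    }

    /// Compare the freshly probed states against the previous round.
    fn deltas(&mut self, fresh: HashMap<u64, SkState>) -> Vec<(u64, SkState)> {
        let mut out = Vec::new();
        for (node, state) in &fresh {
            let report = match self.last.get(node) {
                // Offline -> Offline: nothing new to report.
                Some(SkState::Offline) if *state == SkState::Offline => false,
                // Online -> Online still counts: the stats may have changed.
                Some(SkState::Available { .. }) if matches!(state, SkState::Available { .. }) => true,
                Some(prev) => prev != state,
                // First observation of this node is always reported.
                None => true,
            };
            if report {
                out.push((*node, state.clone()));
            }
        }
        self.last = fresh;
        out
    }
}

fn main() {
    let mut hb = Heartbeater::new();
    let round1 = HashMap::from([(1, SkState::Available { timeline_count: 3 }), (2, SkState::Offline)]);
    println!("{:?}", hb.deltas(round1)); // both reported: first observation
    let round2 = HashMap::from([(1, SkState::Available { timeline_count: 4 }), (2, SkState::Offline)]);
    println!("{:?}", hb.deltas(round2)); // node 1 reported again, node 2 suppressed
}
```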
The main reason why I'm sending the heartbeats is so that I can obtain the per-safekeeper timeline count, because the earlier method
Ok, I see. We could also load these stats once on startup (from the storcon db, no need to ask safekeepers) and then update in-memory counters. But either works.
I submitted some notes, but they are not critical.
    max_retries: u32,
    timeout: Duration,
    cancel: &CancellationToken,
) -> Option<mgmt_api::Result<T>>
It is confusing that cancel might result in either None or in Error::cancelled.
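A purely illustrative sketch of the point (the ApiError type and normalize helper below are invented, not mgmt_api's types): collapsing the two cancellation shapes into one at the call site spares callers from handling cancellation twice.

```rust
// Hypothetical: a cancelled call can surface either as the outer Option being
// None (gave up before/between attempts) or as the inner Result carrying a
// cancellation error (cancelled mid-request). Normalizing both to one shape
// removes the ambiguity the comment points at.
#[derive(Debug)]
enum ApiError {
    Cancelled,
    Other(String),
}

fn normalize<T>(res: Option<Result<T, ApiError>>) -> Result<T, ApiError> {
    match res {
        None => Err(ApiError::Cancelled),
        Some(Err(ApiError::Cancelled)) => Err(ApiError::Cancelled),
        Some(other) => other,
    }
}

fn main() {
    let cancelled_before: Option<Result<u32, ApiError>> = None;
    println!("{:?}", normalize(cancelled_before)); // Err(Cancelled)
    println!("{:?}", normalize(Some(Err::<u32, _>(ApiError::Other("oops".into())))));
}
```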
    utilization,
}
} else {
    SafekeeperState::Offline
If with_client_retries returned a 'fatal' error, the error itself is not logged anywhere, because the retry helper logs only non-permanent errors. Here we don't care much, but generally it is not good.
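To illustrate the concern under an assumed retry-helper shape (ReqError and run_with_retries are invented here, not the PR's code): if only transient errors are logged inside the loop, a permanent error needs an explicit log line at the point where retrying is abandoned, otherwise it can propagate without ever hitting the log.

```rust
#[derive(Debug)]
enum ReqError {
    Transient(String),
    Permanent(String),
}

fn run_with_retries(
    mut attempt: impl FnMut() -> Result<(), ReqError>,
    max_retries: u32,
) -> Result<(), ReqError> {
    for n in 0..=max_retries {
        match attempt() {
            Ok(()) => return Ok(()),
            Err(ReqError::Transient(msg)) => {
                // Transient errors are logged and retried.
                eprintln!("attempt {n} failed (will retry): {msg}");
            }
            Err(e @ ReqError::Permanent(_)) => {
                // Without this line the permanent error would leave the function unlogged.
                eprintln!("permanent error, giving up: {e:?}");
                return Err(e);
            }
        }
    }
    Err(ReqError::Permanent("retries exhausted".into()))
}

fn main() {
    let mut count = 0;
    let res = run_with_retries(
        || {
            count += 1;
            if count < 3 {
                Err(ReqError::Transient("timeout".into()))
            } else {
                Err(ReqError::Permanent("404".into()))
            }
        },
        5,
    );
    println!("{res:?}");
}
```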
    None => { break; }
}
},
_ = self.cancel.cancelled() => { return Err(HeartbeaterError::Cancel); }
Redundant, because the outer task loop already checks cancellation. Not sure if we have a general policy here, but I like avoiding redundant code because it encourages you to understand the task hierarchy, which is important.
}

tracing::info!(
    "Heartbeat round complete for {} safekeepers, {} offline",
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this is noisy and not really informative. Logging all safekeepers' state whenever something changes might be a reasonable compromise.
let mut deltas = Vec::new();
let now = Instant::now();
for (node_id, ps_state) in new_state.iter_mut() {
ps_state
pub(crate) fn set_availability(&mut self, availability: SafekeeperState) {
    self.availability = availability;
}

pub(crate) async fn with_client_retries<T, O, F>(
nit: it would be nice to have a short comment on what it does, i.e. run a safekeeper HTTP request with retries.
@@ -206,6 +209,8 @@ struct ServiceState {
    nodes: Arc<HashMap<NodeId, Node>>,
    safekeepers: Arc<HashMap<NodeId, Safekeeper>>,
Why Arc here? Without it we can update in place, which is sometimes easier.
#[derive(Clone)]
pub struct Safekeeper {
    pub(crate) skp: SafekeeperPersistence,
    cancel: CancellationToken,
with_client_retries accepts token separately, let's remove this
The token of with_client_retries is, I think, for when the higher-level request gets aborted. This cancellation token is for when the safekeeper object gets deleted (or the ps shuts down).
ok, but if nothing uses it yet I'd remove it to avoid confusion
);
} else {
    // TODO support updates
    tracing::warn!(
We'll see these warnings on each deploy; let's fix it (fairly easy) or silence it?
Yeah, I think I'll need to implement this anyway to make the tests pass. The issue is that arbitrary pieces of data can be changed, which is a bit scary... but whatever, it's possible to write that code. And apparently it's needed.
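A hedged sketch of what "support updates" could look like, using the standard HashMap entry API and an invented SafekeeperInfo struct (the real persisted record has different fields): overwrite the stored data when a re-registration actually changes something, and stay silent when it doesn't, so routine redeploys don't warn.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
struct SafekeeperInfo {
    host: String,
    http_port: u16,
}

fn upsert(registry: &mut HashMap<u64, SafekeeperInfo>, id: u64, incoming: SafekeeperInfo) {
    match registry.entry(id) {
        Entry::Occupied(mut e) => {
            if *e.get() != incoming {
                // Re-registration with changed data: apply the update in place.
                println!("updating safekeeper {id}: {:?} -> {incoming:?}", e.get());
                e.insert(incoming);
            }
            // Identical re-registration: nothing to do, no warning needed.
        }
        Entry::Vacant(e) => {
            // First registration of this node.
            e.insert(incoming);
        }
    }
}

fn main() {
    let mut reg = HashMap::new();
    upsert(&mut reg, 1, SafekeeperInfo { host: "sk-1".into(), http_port: 7676 });
    upsert(&mut reg, 1, SafekeeperInfo { host: "sk-1.new".into(), http_port: 7676 });
    println!("{reg:?}");
}
```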
}
}

let mut offline = 0;
nit: .filter(...).count(...)
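A tiny sketch of the suggestion, with a simplified stand-in enum: the mutable counter can be replaced by an iterator chain.

```rust
#[derive(PartialEq)]
enum SafekeeperState {
    Available,
    Offline,
}

fn main() {
    let states = vec![
        SafekeeperState::Available,
        SafekeeperState::Offline,
        SafekeeperState::Offline,
    ];
    // Equivalent to `let mut offline = 0;` plus a for-loop that increments on Offline.
    let offline = states
        .iter()
        .filter(|s| **s == SafekeeperState::Offline)
        .count();
    println!("{offline} safekeepers offline");
}
```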
In #9011, we want to schedule timelines to safekeepers. In order to do such scheduling, we need information about how utilized a safekeeper is and if it's available or not.
Therefore, send constant heartbeats to the safekeepers and try to figure out if they are online or not.
Includes some code from #10440.
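As a rough illustration of how the heartbeat results could feed that scheduling (a sketch with invented types, not the scheduler from #9011): among safekeepers currently reported online, pick the one with the fewest timelines.

```rust
struct SkStatus {
    node_id: u64,
    online: bool,          // availability learned from heartbeats
    timeline_count: u64,   // utilization learned from heartbeats
}

fn pick_safekeeper(statuses: &[SkStatus]) -> Option<u64> {
    statuses
        .iter()
        .filter(|s| s.online)
        .min_by_key(|s| s.timeline_count)
        .map(|s| s.node_id)
}

fn main() {
    let statuses = vec![
        SkStatus { node_id: 1, online: true, timeline_count: 12 },
        SkStatus { node_id: 2, online: false, timeline_count: 0 },
        SkStatus { node_id: 3, online: true, timeline_count: 5 },
    ];
    println!("{:?}", pick_safekeeper(&statuses)); // Some(3)
}
```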