storcon: track safekeepers in memory, send heartbeats to them #10583

Open · wants to merge 10 commits into base: main
Conversation

arpad-m (Member) commented Jan 30, 2025

In #9011, we want to schedule timelines to safekeepers. To do that scheduling, we need information about how utilized a safekeeper is and whether it is available.

Therefore, send periodic heartbeats to the safekeepers and use the responses to determine whether each one is online.

Includes some code from #10440.
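
As a rough illustration of the approach (a minimal sketch with invented names and an assumed interval, not the PR's code):

```rust
use std::time::Duration;
use tokio_util::sync::CancellationToken;

// Minimal sketch: periodically probe every safekeeper and record whether
// it answered (and how utilized it is).
async fn heartbeat_loop(cancel: CancellationToken) {
    let interval = Duration::from_secs(5); // assumed heartbeat interval
    loop {
        tokio::select! {
            _ = tokio::time::sleep(interval) => {
                // One heartbeat round: query each safekeeper's utilization
                // endpoint; responders are recorded as available (with
                // their utilization), non-responders as offline.
            }
            _ = cancel.cancelled() => return,
        }
    }
}
```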

@arpad-m arpad-m requested review from arssher and VladLazar January 30, 2025 12:00
@arpad-m arpad-m requested a review from a team as a code owner January 30, 2025 12:00

7414 tests run: 7055 passed, 6 failed, 353 skipped (full report)


Failures on Postgres 17

Failures on Postgres 14

# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_scrubber_tenant_snapshot[release-pg14-4] or test_safekeeper_deployment_time_update[release-pg17] or test_safekeeper_deployment_time_update[release-pg17] or test_safekeeper_deployment_time_update[debug-pg17] or test_safekeeper_deployment_time_update[release-pg17] or test_safekeeper_deployment_time_update[release-pg17]"
Flaky tests (6)

Postgres 17

Test coverage report is not available

This comment gets automatically updated with the latest test results: 31b3e8b at 2025-01-30T13:02:09.835Z

arssher (Contributor) commented Jan 31, 2025

I'm looking at the code, but just noting in advance that heartbeats are somewhat premature at this point; it would be OK to choose safekeepers by timeline count alone, since initially we are going to trigger migrations only manually anyway.

Implementing something that tracks which nodes are alive and which are not, and acts on that automatically, is a separate, non-trivial task; e.g. if a safekeeper is down for a couple of minutes during a deploy, that is not in itself a reason to move timelines off it.
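
(For illustration, a hedged sketch of the kind of grace period this implies; the threshold and names are invented:)

```rust
use std::time::{Duration, Instant};

// Illustrative only: don't treat a safekeeper as a migration candidate
// unless it has been offline longer than a deploy-sized grace period.
const OFFLINE_GRACE: Duration = Duration::from_secs(10 * 60); // assumed

fn should_migrate_away(offline_since: Option<Instant>) -> bool {
    match offline_since {
        Some(since) => since.elapsed() > OFFLINE_GRACE,
        None => false, // node is online
    }
}
```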

arpad-m (Member, Author) commented Jan 31, 2025

Yeah, at the start it makes sense to use the heartbeat information only for new timelines; migrating away from a safekeeper that's down can come later.

The main reason I'm sending the heartbeats is so that I can obtain the per-safekeeper timeline count, because the earlier method (doing a COUNT(*) on the timelines table) had been criticized by @problame.

arssher (Contributor) left a comment

If I were doing heartbeats from scratch, I'd rather make it more straightforward, without two tasks and communication between them. But we're bolting on top of existing code, okay.

heartbeat.rs would be much nicer if it had a top-level (or Heartbeater) comment describing what it is about and its API: that it spawns a task which accepts heartbeat requests, an explanation of AvailablityDeltas, etc. In particular, the meaning of AvailablityDeltas is unobvious: no delta is returned for a node that was and still is offline, but one is returned for a node that was up and is up again (which is reasonable given that we collect stats which change, but again you need to look more closely to grasp this).

> The main reason why I'm sending the heartbeats is so that I can obtain the per-safekeeper timeline count, because the earlier method

Ok, I see. We could also load these stats once on startup (from the storcon db, no need to ask safekeepers) and then update in-memory counters. But either works.

I submitted some notes, but they are not critical.
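
(To illustrate the semantics described above, a sketch of delta computation: only the name AvailablityDeltas is taken from the PR, the rest of the types are invented. Offline-to-offline yields no delta, while up-to-up does, because the utilization stats may have changed.)

```rust
use std::collections::HashMap;

type NodeId = u64; // stand-in for the real NodeId type

#[derive(Clone)]
enum SafekeeperState {
    Available { utilization: u64 },
    Offline,
}

// Sketch: deltas returned by a heartbeat round. A node that was offline
// and still is produces no entry; every other transition (including
// up -> up) is reported, since its stats may have changed.
struct AvailablityDeltas(Vec<(NodeId, SafekeeperState)>);

fn compute_deltas(
    old: &HashMap<NodeId, SafekeeperState>,
    new: &HashMap<NodeId, SafekeeperState>,
) -> AvailablityDeltas {
    let mut deltas = Vec::new();
    for (id, state) in new {
        let was_offline = matches!(old.get(id), Some(SafekeeperState::Offline));
        let is_offline = matches!(state, SafekeeperState::Offline);
        if !(was_offline && is_offline) {
            deltas.push((*id, state.clone()));
        }
    }
    AvailablityDeltas(deltas)
}
```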

max_retries: u32,
timeout: Duration,
cancel: &CancellationToken,
) -> Option<mgmt_api::Result<T>>
arssher (Contributor):

It is confusing that cancel might result in either None or Error::cancelled.
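
(One hedged way to remove the ambiguity, sketched with invented types rather than the PR's signatures: fold cancellation into a single error type so None can no longer mean "cancelled".)

```rust
use tokio_util::sync::CancellationToken;

// Hypothetical sketch: the retry helper returns one error type, so
// cancellation is always Err(RetryError::Cancelled), never a bare None.
#[derive(Debug)]
enum RetryError {
    Cancelled,
    Other(String),
}

async fn with_retries<T, F, Fut>(
    mut op: F,
    max_retries: u32,
    cancel: &CancellationToken,
) -> Result<T, RetryError>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, String>>,
{
    let mut last = RetryError::Other("no attempts made".into());
    for _ in 0..=max_retries {
        tokio::select! {
            _ = cancel.cancelled() => return Err(RetryError::Cancelled),
            res = op() => match res {
                Ok(v) => return Ok(v),
                Err(e) => last = RetryError::Other(e),
            },
        }
    }
    Err(last)
}
```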

utilization,
}
} else {
SafekeeperState::Offline
arssher (Contributor):

If with_client_retries returns a 'fatal' error, the error itself is not logged anywhere, because the retry loop logs only non-permanent errors. Here we don't care much, but in general that is not good.
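
(A small sketch of the fix with an invented helper, assuming transient errors were already logged by the retry loop:)

```rust
// Sketch: ensure a permanent, non-retried error still gets logged once
// before the node is marked offline.
fn log_terminal_error<T, E: std::fmt::Display>(node_id: u64, res: Result<T, E>) -> Option<T> {
    match res {
        Ok(v) => Some(v),
        Err(e) => {
            // Transient errors were already logged by the retry loop; this
            // covers the 'fatal' path that previously went unlogged.
            tracing::warn!("safekeeper {}: request failed permanently: {}", node_id, e);
            None
        }
    }
}
```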

None => { break; }
}
},
_ = self.cancel.cancelled() => { return Err(HeartbeaterError::Cancel); }
arssher (Contributor):

Redundant, because the outer task loop already checks cancellation. Not sure if we have a general policy here, but I like avoiding redundant code because it encourages you to understand the task hierarchy, which is important.

}

tracing::info!(
"Heartbeat round complete for {} safekeepers, {} offline",
arssher (Contributor):

nit: this is noisy and not really informative. Logging all safekeeper states whenever something changes might be a reasonable compromise.
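
(For example, a sketch of the suggested compromise with assumed inputs: only emit the summary when this round's deltas are non-empty.)

```rust
// Sketch: log safekeeper availability only when something changed in
// this heartbeat round, rather than unconditionally.
fn log_if_changed(num_deltas: usize, total: usize, offline: usize) {
    if num_deltas > 0 {
        tracing::info!(
            "safekeeper availability changed: {} total, {} offline",
            total,
            offline
        );
    }
}
```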


let mut deltas = Vec::new();
let now = Instant::now();
for (node_id, ps_state) in new_state.iter_mut() {
arssher (Contributor):

nit: the name ps_state looks like a pageserver leftover; these are safekeepers.

pub(crate) fn set_availability(&mut self, availability: SafekeeperState) {
self.availability = availability;
}
pub(crate) async fn with_client_retries<T, O, F>(
arssher (Contributor):

nit: it would be nice to have a short comment on what this does, i.e. runs a safekeeper HTTP request with retries.

@@ -206,6 +209,8 @@ struct ServiceState {

nodes: Arc<HashMap<NodeId, Node>>,

safekeepers: Arc<HashMap<NodeId, Safekeeper>>,
arssher (Contributor):

Why Arc here? Without it we can update in place, which is sometimes easier.
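
(For context, a sketch of the two update styles with illustrative types: with Arc, writers clone and swap the snapshot; without it, they mutate in place.)

```rust
use std::collections::HashMap;
use std::sync::Arc;

type NodeId = u64;

#[derive(Clone)]
struct Safekeeper; // stand-in

// With Arc: readers hold cheap snapshots, but each write clones the map
// and swaps in a new Arc (copy-on-write style).
fn update_shared(map: &mut Arc<HashMap<NodeId, Safekeeper>>, id: NodeId, sk: Safekeeper) {
    let mut new_map = (**map).clone();
    new_map.insert(id, sk);
    *map = Arc::new(new_map);
}

// Without Arc: just mutate under whatever lock guards the map.
fn update_in_place(map: &mut HashMap<NodeId, Safekeeper>, id: NodeId, sk: Safekeeper) {
    map.insert(id, sk);
}
```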

#[derive(Clone)]
pub struct Safekeeper {
pub(crate) skp: SafekeeperPersistence,
cancel: CancellationToken,
arssher (Contributor):

with_client_retries accepts the token separately; let's remove this.

arpad-m (Member, Author):

The token passed to with_client_retries is, I think, for when the higher-level request gets aborted. This cancellation token is for when the safekeeper object gets deleted (or the ps shuts down).

arssher (Contributor):

OK, but if nothing uses it yet, I'd remove it to avoid confusion.

);
} else {
// TODO support updates
tracing::warn!(
arssher (Contributor):

We'll see these warnings on each deploy; let's either fix this (fairly easy) or silence it?

arpad-m (Member, Author):

Yeah, I think I'll need to implement this anyway to make the tests pass. The issue is that arbitrary pieces of data can be changed, which is a bit scary... but whatever, it's possible to write that code. And apparently it's needed.
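
(A sketch of what update support might look like; all field names are hypothetical. The idea: compare the incoming registration against the stored record and overwrite on change instead of warning.)

```rust
use std::collections::HashMap;

#[derive(Clone, PartialEq)]
struct SafekeeperInfo {
    // hypothetical fields that may change across deploys
    host: String,
    port: u16,
    availability_zone: String,
}

fn upsert(records: &mut HashMap<u64, SafekeeperInfo>, id: u64, incoming: SafekeeperInfo) {
    match records.get_mut(&id) {
        Some(existing) if *existing != incoming => {
            tracing::info!("safekeeper {}: registration changed, updating", id);
            *existing = incoming;
        }
        Some(_) => {} // unchanged: nothing to do, no warning
        None => {
            records.insert(id, incoming);
        }
    }
}
```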

}
}

let mut offline = 0;
arssher (Contributor):

nit: .filter(...).count()
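
(i.e., roughly, as a self-contained sketch with assumed types:)

```rust
#[derive(PartialEq)]
enum SafekeeperState {
    Available,
    Offline,
}

// The suggested replacement for the manual offline counter.
fn count_offline(states: &[(u64, SafekeeperState)]) -> usize {
    states
        .iter()
        .filter(|(_, state)| *state == SafekeeperState::Offline)
        .count()
}
```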
