storcon: track safekeepers in memory, send heartbeats to them #10583

Open · wants to merge 10 commits into base: main
Conversation

arpad-m (Member) commented Jan 30, 2025

In #9011, we want to schedule timelines to safekeepers. To do that scheduling, we need information about how utilized a safekeeper is and whether it is available.

Therefore, send periodic heartbeats to the safekeepers and use the responses to determine whether each one is online.

Includes some code from #10440.
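
As a rough illustration of the approach (a minimal sketch with invented names and an assumed interval, not the PR's code):

```rust
use std::time::Duration;
use tokio_util::sync::CancellationToken;

// Minimal sketch: periodically probe every safekeeper and record whether
// it answered (and how utilized it is).
async fn heartbeat_loop(cancel: CancellationToken) {
    let interval = Duration::from_secs(5); // assumed heartbeat interval
    loop {
        tokio::select! {
            _ = tokio::time::sleep(interval) => {
                // One heartbeat round: query each safekeeper's utilization
                // endpoint; responders are recorded as available (with
                // their utilization), non-responders as offline.
            }
            _ = cancel.cancelled() => return,
        }
    }
}
```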

@arpad-m arpad-m requested review from arssher and VladLazar January 30, 2025 12:00
@arpad-m arpad-m requested a review from a team as a code owner January 30, 2025 12:00

7414 tests run: 7055 passed, 6 failed, 353 skipped (full report)


Failures on Postgres 17

Failures on Postgres 14

# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_scrubber_tenant_snapshot[release-pg14-4] or test_safekeeper_deployment_time_update[release-pg17] or test_safekeeper_deployment_time_update[release-pg17] or test_safekeeper_deployment_time_update[debug-pg17] or test_safekeeper_deployment_time_update[release-pg17] or test_safekeeper_deployment_time_update[release-pg17]"
Flaky tests (6)

Postgres 17

Test coverage report is not available

This comment gets automatically updated with the latest test results: 31b3e8b at 2025-01-30T13:02:09.835Z

arssher (Contributor) commented Jan 31, 2025

I'm looking at the code, but just noting in advance that heartbeats are somewhat premature at this point; it would be OK to choose safekeepers by timeline count alone, since initially we are going to trigger migrations only manually anyway.

Implementing something that tracks which nodes are alive and which are not, and acts on that automatically, is a separate, non-trivial task; e.g. if a safekeeper is down for a couple of minutes during a deploy, that is not in itself a reason to move timelines off it.
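
(For illustration, a hedged sketch of the kind of grace period this implies; the threshold and names are invented:)

```rust
use std::time::{Duration, Instant};

// Illustrative only: don't treat a safekeeper as a migration candidate
// unless it has been offline longer than a deploy-sized grace period.
const OFFLINE_GRACE: Duration = Duration::from_secs(10 * 60); // assumed

fn should_migrate_away(offline_since: Option<Instant>) -> bool {
    match offline_since {
        Some(since) => since.elapsed() > OFFLINE_GRACE,
        None => false, // node is online
    }
}
```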

arpad-m (Member, Author) commented Jan 31, 2025

Yeah, at the start it makes sense to use the heartbeat information only for new timelines; migrating away from a safekeeper that's down can come later.

The main reason I'm sending the heartbeats is so that I can obtain the per-safekeeper timeline count, because the earlier method (doing a COUNT(*) on the timelines table) had been criticized by @problame.

arssher (Contributor) left a comment

If I were doing heartbeats from scratch, I'd rather make it more straightforward, without two tasks and communication between them. But we're bolting on top of existing code, okay.

heartbeat.rs would be much nicer if it had a top-level (or Heartbeater) comment describing what it is about and its API: that it spawns a task which accepts heartbeat requests, an explanation of AvailablityDeltas, etc. In particular, the meaning of AvailablityDeltas is unobvious: no delta is returned for a node that was and still is offline, but one is returned for a node that was up and is up again (which is reasonable given that we collect stats which change, but again you need to look more closely to grasp this).

> The main reason why I'm sending the heartbeats is so that I can obtain the per-safekeeper timeline count, because the earlier method

Ok, I see. We could also load these stats once on startup (from the storcon db, no need to ask safekeepers) and then update in-memory counters. But either works.

I submitted some notes, but they are not critical.
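
(To illustrate the semantics described above, a sketch of delta computation: only the name AvailablityDeltas is taken from the PR, the rest of the types are invented. Offline-to-offline yields no delta, while up-to-up does, because the utilization stats may have changed.)

```rust
use std::collections::HashMap;

type NodeId = u64; // stand-in for the real NodeId type

#[derive(Clone)]
enum SafekeeperState {
    Available { utilization: u64 },
    Offline,
}

// Sketch: deltas returned by a heartbeat round. A node that was offline
// and still is produces no entry; every other transition (including
// up -> up) is reported, since its stats may have changed.
struct AvailablityDeltas(Vec<(NodeId, SafekeeperState)>);

fn compute_deltas(
    old: &HashMap<NodeId, SafekeeperState>,
    new: &HashMap<NodeId, SafekeeperState>,
) -> AvailablityDeltas {
    let mut deltas = Vec::new();
    for (id, state) in new {
        let was_offline = matches!(old.get(id), Some(SafekeeperState::Offline));
        let is_offline = matches!(state, SafekeeperState::Offline);
        if !(was_offline && is_offline) {
            deltas.push((*id, state.clone()));
        }
    }
    AvailablityDeltas(deltas)
}
```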

max_retries: u32,
timeout: Duration,
cancel: &CancellationToken,
) -> Option<mgmt_api::Result<T>>
arssher (Contributor):

It is confusing that cancel might result in either None or Error::cancelled.
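
(One hedged way to remove the ambiguity, sketched with invented types rather than the PR's signatures: fold cancellation into a single error type so None can no longer mean "cancelled".)

```rust
use tokio_util::sync::CancellationToken;

// Hypothetical sketch: the retry helper returns one error type, so
// cancellation is always Err(RetryError::Cancelled), never a bare None.
#[derive(Debug)]
enum RetryError {
    Cancelled,
    Other(String),
}

async fn with_retries<T, F, Fut>(
    mut op: F,
    max_retries: u32,
    cancel: &CancellationToken,
) -> Result<T, RetryError>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, String>>,
{
    let mut last = RetryError::Other("no attempts made".into());
    for _ in 0..=max_retries {
        tokio::select! {
            _ = cancel.cancelled() => return Err(RetryError::Cancelled),
            res = op() => match res {
                Ok(v) => return Ok(v),
                Err(e) => last = RetryError::Other(e),
            },
        }
    }
    Err(last)
}
```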

utilization,
}
} else {
SafekeeperState::Offline
arssher (Contributor):

If with_client_retries returns a 'fatal' error, the error itself is not logged anywhere, because the retry loop logs only non-permanent errors. Here we don't care much, but in general that is not good.
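
(A small sketch of the fix with an invented helper, assuming transient errors were already logged by the retry loop:)

```rust
// Sketch: ensure a permanent, non-retried error still gets logged once
// before the node is marked offline.
fn log_terminal_error<T, E: std::fmt::Display>(node_id: u64, res: Result<T, E>) -> Option<T> {
    match res {
        Ok(v) => Some(v),
        Err(e) => {
            // Transient errors were already logged by the retry loop; this
            // covers the 'fatal' path that previously went unlogged.
            tracing::warn!("safekeeper {}: request failed permanently: {}", node_id, e);
            None
        }
    }
}
```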

None => { break; }
}
},
_ = self.cancel.cancelled() => { return Err(HeartbeaterError::Cancel); }
arssher (Contributor):

Redundant, because the outer task loop already checks cancellation. Not sure if we have a general policy here, but I like avoiding redundant code because it encourages you to understand the task hierarchy, which is important.

}

tracing::info!(
"Heartbeat round complete for {} safekeepers, {} offline",
arssher (Contributor):

nit: this is noisy and not really informative. Logging all safekeeper states whenever something changes might be a reasonable compromise.
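
(For example, a sketch of the suggested compromise with assumed inputs: only emit the summary when this round's deltas are non-empty.)

```rust
// Sketch: log safekeeper availability only when something changed in
// this heartbeat round, rather than unconditionally.
fn log_if_changed(num_deltas: usize, total: usize, offline: usize) {
    if num_deltas > 0 {
        tracing::info!(
            "safekeeper availability changed: {} total, {} offline",
            total,
            offline
        );
    }
}
```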


let mut deltas = Vec::new();
let now = Instant::now();
for (node_id, ps_state) in new_state.iter_mut() {
arssher (Contributor):

nit: the name ps_state looks like a pageserver leftover; these are safekeepers.

pub(crate) fn set_availability(&mut self, availability: SafekeeperState) {
self.availability = availability;
}
pub(crate) async fn with_client_retries<T, O, F>(
arssher (Contributor):

nit: it would be nice to have a short comment on what this does, i.e. runs a safekeeper HTTP request with retries.

@@ -206,6 +209,8 @@ struct ServiceState {

nodes: Arc<HashMap<NodeId, Node>>,

safekeepers: Arc<HashMap<NodeId, Safekeeper>>,
arssher (Contributor):

Why Arc here? Without it we can update in place, which is sometimes easier.
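
(For context, a sketch of the two update styles with illustrative types: with Arc, writers clone and swap the snapshot; without it, they mutate in place.)

```rust
use std::collections::HashMap;
use std::sync::Arc;

type NodeId = u64;

#[derive(Clone)]
struct Safekeeper; // stand-in

// With Arc: readers hold cheap snapshots, but each write clones the map
// and swaps in a new Arc (copy-on-write style).
fn update_shared(map: &mut Arc<HashMap<NodeId, Safekeeper>>, id: NodeId, sk: Safekeeper) {
    let mut new_map = (**map).clone();
    new_map.insert(id, sk);
    *map = Arc::new(new_map);
}

// Without Arc: just mutate under whatever lock guards the map.
fn update_in_place(map: &mut HashMap<NodeId, Safekeeper>, id: NodeId, sk: Safekeeper) {
    map.insert(id, sk);
}
```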

#[derive(Clone)]
pub struct Safekeeper {
pub(crate) skp: SafekeeperPersistence,
cancel: CancellationToken,
arssher (Contributor):

with_client_retries accepts the token separately; let's remove this.

arpad-m (Member, Author):

The token passed to with_client_retries is, I think, for when the higher-level request gets aborted. This cancellation token is for when the safekeeper object gets deleted (or the ps shuts down).

arssher (Contributor):

OK, but if nothing uses it yet, I'd remove it to avoid confusion.

);
} else {
// TODO support updates
tracing::warn!(
arssher (Contributor):

We'll see these warnings on each deploy; let's either fix this (fairly easy) or silence it?

arpad-m (Member, Author):

Yeah, I think I'll need to implement this anyway to make the tests pass. The issue is that arbitrary pieces of data can be changed, which is a bit scary... but whatever, it's possible to write that code. And apparently it's needed.
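
(A sketch of what update support might look like; all field names are hypothetical. The idea: compare the incoming registration against the stored record and overwrite on change instead of warning.)

```rust
use std::collections::HashMap;

#[derive(Clone, PartialEq)]
struct SafekeeperInfo {
    // hypothetical fields that may change across deploys
    host: String,
    port: u16,
    availability_zone: String,
}

fn upsert(records: &mut HashMap<u64, SafekeeperInfo>, id: u64, incoming: SafekeeperInfo) {
    match records.get_mut(&id) {
        Some(existing) if *existing != incoming => {
            tracing::info!("safekeeper {}: registration changed, updating", id);
            *existing = incoming;
        }
        Some(_) => {} // unchanged: nothing to do, no warning
        None => {
            records.insert(id, incoming);
        }
    }
}
```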

}
}

let mut offline = 0;
arssher (Contributor):

nit: .filter(...).count()
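
(i.e., roughly, as a self-contained sketch with assumed types:)

```rust
#[derive(PartialEq)]
enum SafekeeperState {
    Available,
    Offline,
}

// The suggested replacement for the manual offline counter.
fn count_offline(states: &[(u64, SafekeeperState)]) -> usize {
    states
        .iter()
        .filter(|(_, state)| *state == SafekeeperState::Offline)
        .count()
}
```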
