[POC] Add support for automatically joining new OmniPaxos nodes #2478

Draft
tillrohrmann wants to merge 25 commits into main from omnipaxos-reconfiguration

Conversation

tillrohrmann
Contributor

This PR contains a first variant of an embedded metadata store based on OmniPaxos. The metadata store has a durable log and supports reconfiguration. The metadata store awaits provisioning. Once provisioned, a single Restate node that runs the metadata store role acts as the metadata store cluster. As more nodes join the cluster, those that run the metadata store role try to join the metadata store cluster by requesting a reconfiguration. Once enough metadata store nodes have joined, one should be able to kill floor((n - 1)/2) of them and things should continue working.

If you want to try things out, you can spawn three Restate servers with the configuration files you find here. Then you need to provision the cluster via

restatectl cluster provision --num-partitions 2 --bifrost-provider replicated --replication-property 2 --yes

After a short while you should see all nodes running the partition processors. If you see the log line "Run as active metadata store node" on every node, then the metadata store cluster should contain all nodes and you should be able to kill a single random node at will.

Internally, the OmniPaxosMetadataStore works as follows:

  1. Check whether an OmniPaxosConfiguration is persisted to disk. If yes, start as an active metadata store.
  2. Check whether a NodesConfiguration is known. If yes, this indicates a prior provisioning. Start as a passive metadata store.
  3. Await the provision signal to create the initial NodesConfiguration and start as an active metadata store (see the sketch below).
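
For illustration, here is a minimal sketch of that startup decision; the enum and function names are made up for this sketch and are not the actual API of the OmniPaxosMetadataStore:

```rust
/// Sketch only: the inputs stand in for the checks described in the list above.
enum StartupMode {
    /// Run the OmniPaxos instance and serve requests.
    Active,
    /// Try to join an existing metadata store cluster.
    Passive,
    /// Wait for the provision signal, then bootstrap and become active.
    AwaitProvision,
}

fn decide_startup_mode(
    has_persisted_omni_paxos_configuration: bool,
    knows_nodes_configuration: bool,
) -> StartupMode {
    if has_persisted_omni_paxos_configuration {
        // 1. We were a member of a metadata store cluster before: resume as active.
        StartupMode::Active
    } else if knows_nodes_configuration {
        // 2. A NodesConfiguration is known, so provisioning happened earlier:
        //    start passive and try to join the existing cluster.
        StartupMode::Passive
    } else {
        // 3. Fresh node: await the provision signal, create the initial
        //    NodesConfiguration, and then start as an active metadata store.
        StartupMode::AwaitProvision
    }
}
```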

When running as a passive metadata store, the node tries to join an existing cluster by randomly asking one of the nodes that run the metadata store role. Joining a cluster entails a reconfiguration and sending the whole log to the new joiner.

An active metadata store runs the OmniPaxos library and processes metadata store requests and join requests. A metadata store only reacts to requests if it is the leader. This is a simplification to avoid requests that hang because they are never committed.

The draft PR still has many rough edges. The main goal is to verify the overall direction and to discuss the design decisions (some of them being quite questionable). I've tried to highlight the parts that reviewers should take a closer look at. The best way to look at the PR is probably to look at the final version of the OmniPaxosMetadataStore struct in crates/metadata-store/src/omnipaxos/store.rs.

Some of the missing features are (non-exhaustive):

  • Trimming of the log
  • Snapshotting of the KvMemoryStorage
  • Integrating the OmniPaxosMetadataStore into the MetadataStoreState lifecycle
  • Let MetadataStoreClient select addresses based on MetadataStoreState::Active.
  • Observability
  • Improving log messages
  • Fine tuning timeouts
  • Hardening
  • Testing

This commit makes it configurable which metadata store will be run
by the node when starting the Restate server.
This commit adds the skeleton of the Raft metadata store. At the moment
only a single node with memory storage is supported.

This fixes restatedev#1785.
The raft metadata store does not accept new proposals if there is
no known leader. In this situation, requests failed with an internal
ProposalDropped error. This commit changes the behavior so that a
ProposalDropped error is translated into an unavailable Tonic
status. That way, the request will get retried automatically.
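
As an illustration of that translation, here is a minimal sketch; the RequestError enum is a stand-in and not the store's actual error type, only the ProposalDropped-to-unavailable mapping is taken from the commit description:

```rust
use tonic::Status;

// Hypothetical internal error type standing in for the store's request error.
enum RequestError {
    ProposalDropped,
    Internal(String),
}

// Map the internal error onto a gRPC status. `unavailable` is treated as
// retryable by the client's retry policy, while `internal` is terminal.
fn to_status(err: RequestError) -> Status {
    match err {
        RequestError::ProposalDropped => {
            Status::unavailable("proposal dropped because there is no known leader; please retry")
        }
        RequestError::Internal(msg) => Status::internal(msg),
    }
}
```
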
This commit adds RocksDbStorage which implements raft::Storage.
The RocksDbStorage is a durable storage implementation which is
used by the RaftMetadataStore to store the raft state durably.

This fixes restatedev#1791.
The OmniPaxos metadata store stores its state in memory.
This commit introduces the ability to specify multiple addresses for the
metadata store endpoint. On error, the GrpcMetadataStoreClient randomly
switches to another endpoint. Moreover, this commit makes the OmniPaxosMetadataStore
only accept requests if it is the leader. Additionally, it fails all pending
callbacks if it loses leadership to avoid hanging requests if the request
was not decided.
The Restate version enables OmniPaxos to run with a single peer.
Comment on lines 86 to 89

// todo we seem to have a race condition between this call and the provision step which might
// write a different logs configuration
// self.bifrost.admin().init_metadata().await?;

This problem should already exist in the current main. The problem is that we first provision the NodesConfiguration and then try to write the Logs with the configured LogsConfiguration. If the node now succeeds at joining the cluster and is the first to reach the point where it starts the BifrostService, then it can happen that this line writes the initial Logs which does not respect the LogsConfiguration that is specified in the provision command.

rpc ConnectTo(stream NetworkMessage) returns (stream NetworkMessage);

// Try to join an existing metadata store cluster
rpc JoinCluster(JoinClusterRequest) returns (JoinClusterResponse);

Still need to properly document the contract and which status codes are returned in which cases.

Comment on lines +105 to +107
if response.as_ref().is_err_and(|err| err.is_network_error()) {
    self.choose_different_endpoint();
}

Poor man's solution to handle multiple endpoints. Requests that are rejected by non-leaders return an Unavailable status. That's why this, combined with the retry policy, helps us find the actual leader by retrying. It also helps with metadata store nodes that are down. A very simplistic solution.
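
For illustration, a minimal sketch of that endpoint selection; only choose_different_endpoint and the network-error trigger come from the snippet above, the struct and field names are assumptions:

```rust
use rand::seq::SliceRandom;
use std::sync::atomic::{AtomicUsize, Ordering};

// Keeps the list of configured metadata store endpoints and the index of the
// endpoint currently in use. Assumes at least one endpoint is configured.
struct EndpointSelector {
    endpoints: Vec<String>,
    current: AtomicUsize,
}

impl EndpointSelector {
    fn current_endpoint(&self) -> &str {
        &self.endpoints[self.current.load(Ordering::Relaxed) % self.endpoints.len()]
    }

    // On a network error, switch to a random endpoint different from the current
    // one. Leader rejections surface as `unavailable` and are retried by the
    // surrounding retry policy, which eventually lands on the actual leader.
    fn choose_different_endpoint(&self) {
        let current = self.current.load(Ordering::Relaxed) % self.endpoints.len();
        let candidates: Vec<usize> = (0..self.endpoints.len())
            .filter(|idx| *idx != current)
            .collect();
        if let Some(next) = candidates.choose(&mut rand::thread_rng()) {
            self.current.store(*next, Ordering::Relaxed);
        }
    }
}
```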

Comment on lines +220 to +224
// This is potentially dangerous because we might have provisioned before on a
// different node but delivering of the response failed.
// todo harden by choosing oneself if one runs the metadata store. If not, then pick
// a single node to reach out to.
self.choose_different_endpoint();

Dangerous because we might end up provisioning different metadata store nodes.

Comment on lines +122 to +145
// Not really happy about making the `KvMemoryStorage` aware of the NodesConfiguration. I
// couldn't find a better way to let a restarting metadata store know about the latest
// addresses of its peers which it reads from the NodesConfiguration. An alternative could
// be to not support changing addresses. Changing addresses will also only be possible as
// long as we maintain a quorum of running nodes. Otherwise, the nodes might not find each
// other to form quorum.
if let Some(metadata_writer) = self.metadata_writer.as_mut() {
    if key == NODES_CONFIG_KEY {
        let mut data = self
            .kv_entries
            .get(&key)
            .expect("to be present")
            .value
            .as_ref();
        match StorageCodec::decode::<NodesConfiguration, _>(&mut data) {
            Ok(nodes_configuration) => {
                metadata_writer.submit(Arc::new(nodes_configuration));
            }
            Err(err) => {
                debug!("Failed deserializing NodesConfiguration: {err}");
            }
        }
    }
}

Maybe it is better to not rely on the NodesConfiguration to learn about the other nodes' addresses and instead require that their addresses don't change because in the general case this cannot work (stopping all nodes and restarting them with different addresses).

… on NodesConfiguration

With this change, nodes can change their addresses as long as a quorum of running metadata
store nodes keeps running so that the NodesConfiguration can be updated. It is a bit questionable
how much this feature is worth. However, w/o it, metadata store nodes behave slightly differently
than normal nodes (worker, log store, etc.) because they cannot change their address.
Using the generational node id as the OmniPaxos NodeId is a bit problematic because
it will get bumped on every restart. If we fail while joining an existing OmniPaxos
cluster and before persisting the configuration, we will try joining with a different
id. If we allow higher generations to be accepted, then this is problematic because we
can't distinguish between this situation and the loss of the disk. If we are strict
about the id equality, then we might get the system stuck because the restarted node
cannot join the OmniPaxos cluster which might expect its previous generation to be
present. Therefore, this commit introduces the StorageId which is a random number
stored in the RocksDbStorage. Whenever we try to join an OmniPaxos cluster we send
this StorageId along with the PlainNodeId. Based on the StorageId we can detect if
a node lost its disk because it would generate a new one.
@tillrohrmann force-pushed the omnipaxos-reconfiguration branch from 332157b to 9364667 on January 9, 2025, 15:27
Comment on lines +122 to +134
// make sure that the storage is initialized with a storage id to be able to detect disk losses
let storage_id = if let Some(storage_id) = rocksdb_storage
    .get_storage_id()
    .map_err(|err| BuildError::InitStorage(err.to_string()))?
{
    storage_id
} else {
    let storage_id = random();
    rocksdb_storage
        .set_storage_id(storage_id)
        .map_err(|err| BuildError::InitStorage(err.to_string()))?;
    storage_id
};

The durable storage_id and the node's PlainNodeId together constitute the MemberId which is used to check whether a node is part of a configuration or not. In case of a disk loss we will generate a new StorageId, which requires a reconfiguration for the node to rejoin an OmniPaxos cluster.
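
For illustration, a sketch of such a member identity; the concrete definition in the PR may differ, and the type aliases are stand-ins for the real Restate types:

```rust
use rand::random;

// Stand-ins for the real types.
type PlainNodeId = u32;
type StorageId = u64;

// A cluster member is identified by its plain node id plus the random storage id
// persisted in RocksDbStorage. After a disk loss a fresh StorageId is generated,
// so the MemberId changes and the node has to rejoin via a reconfiguration
// instead of silently reusing its old membership.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct MemberId {
    node_id: PlainNodeId,
    storage_id: StorageId,
}

// Mirrors the snippet above: reuse the persisted id if present, otherwise
// generate a random one (persisting the new id is elided here).
fn load_or_create_storage_id(persisted: Option<StorageId>) -> StorageId {
    persisted.unwrap_or_else(random::<StorageId>)
}
```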

Comment on lines +189 to +200
// Try to read a persisted nodes configuration in order to learn about the addresses of our
// potential peers and the metadata store states.
if let Some(nodes_configuration) = self
    .rocksdb_storage
    .get_nodes_configuration()
    .map_err(|err| BuildError::InitStorage(err.to_string()))?
{
    metadata_writer
        .update(Arc::new(nodes_configuration))
        .await?
}
}

Some information, like the nodes' addresses or this node's MetadataStoreState, is stored in the NodesConfiguration. Since we cannot guarantee that we can join a cluster w/o it (e.g. when restarting all nodes at the same time, or when failing over during a reconfiguration where we haven't received the OmniPaxos configuration yet), we persist it ourselves.

Comment on lines +268 to +272
TaskCenter::spawn_unmanaged(TaskKind::Background, "provision-responder", async move {
    while let Some(request) = provision_rx.recv().await {
        let _ = request.result_tx.send(Ok(false));
    }
})?;

Reject all future provision requests.

}

node_config.current_generation.as_plain()
} else {

This case is needed if another node received the provision signal (e.g. a non-metadata-store node) and tries to provision the metadata store with a NodesConfiguration that does not contain the metadata store node.

Comment on lines +499 to +500
// todo remove additional indirection from Arc
connection_manager.store(Some(Arc::new(new_connection_manager)));

The ConnectionManager needs to know our own metadata peer id to validate that connections are set up correctly. Since we only know what our id will be once we have joined an OmniPaxos cluster, we need to instantiate it lazily. This is not super beautiful :-(
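
The `store(Some(Arc::new(..)))` call above looks like an arc_swap::ArcSwapOption; assuming that, a minimal sketch of the lazy initialization could look like this (the struct and method names are made up):

```rust
use arc_swap::ArcSwapOption;
use std::sync::Arc;

// Stand-in for the real ConnectionManager, which needs to know this node's
// metadata peer id to validate that connections are set up correctly.
struct ConnectionManager {
    own_peer_id: u64,
}

// The slot starts empty and is only filled once the node has joined an
// OmniPaxos cluster and therefore knows its peer id.
struct Networking {
    connection_manager: ArcSwapOption<ConnectionManager>,
}

impl Networking {
    fn new() -> Self {
        Self {
            connection_manager: ArcSwapOption::empty(),
        }
    }

    // Called once the peer id is known; replaces the (possibly empty) slot.
    fn install_connection_manager(&self, own_peer_id: u64) {
        self.connection_manager
            .store(Some(Arc::new(ConnectionManager { own_peer_id })));
    }

    // Connection handlers load the current manager and have to reject incoming
    // connections while it has not been initialized yet.
    fn connection_manager(&self) -> Option<Arc<ConnectionManager>> {
        self.connection_manager.load_full()
    }
}
```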

Comment on lines +575 to +577
Ok(()) = nodes_config_watch.changed() => {
    self.update_node_addresses(&Metadata::with_current(|m| m.nodes_config_ref()));
},

This is to support changing node/peer addresses that are communicated via the NodesConfiguration.
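
A rough sketch of what such an address refresh could look like, assuming the networking layer keeps a plain address map keyed by node id (the type aliases are stand-ins, not the real Restate types):

```rust
use std::collections::HashMap;

// Stand-ins for the real types.
type PlainNodeId = u32;
type AdvertisedAddress = String;

// Address map used when (re)connecting to metadata store peers. It is refreshed
// whenever the NodesConfiguration watch fires, so a peer that restarted with a
// new address stays reachable, provided a quorum kept running long enough to
// commit the updated NodesConfiguration.
fn update_node_addresses(
    addresses: &mut HashMap<PlainNodeId, AdvertisedAddress>,
    nodes_config: impl IntoIterator<Item = (PlainNodeId, AdvertisedAddress)>,
) {
    for (node_id, address) in nodes_config {
        // Overwrite any stale entry; new connections pick up the latest address.
        addresses.insert(node_id, address);
    }
}
```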

Comment on lines +641 to +647
    // we lost leadership :-( notify callers that their requests might not get committed
    // because we don't know whether the leader will start with the same log as we have.
    self.kv_storage
        .fail_callbacks(|| RequestError::Unavailable("lost leadership".into()));
    self.fail_join_callbacks(|| JoinClusterError::NotLeader);
} else if !previous_is_leader && self.is_leader {
    debug!(configuration_id = %self.cluster_config.configuration_id, "Won leadership");

One problem I had was that OmniPaxos does not tell you the lsn for a given proposal. Therefore, it is hard to track whether a proposal made it or was replaced by a different proposal (e.g. because of a leadership change and a different log suffix). That's why I am quite defensive here to avoid hanging requests.

The correlation between request and log entries is currently done by having an explicit ulid that is part of a request entry.
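
For illustration, a sketch of that ulid-based correlation; the callback table and type aliases are assumptions, only the idea of tagging each request entry with a ulid and failing all pending callbacks on leadership loss comes from the comment above:

```rust
use std::collections::HashMap;
use tokio::sync::oneshot;
use ulid::Ulid;

// Hypothetical request/response types.
type Response = Vec<u8>;
type RequestError = String;

// Pending requests are keyed by the ulid embedded in the proposed log entry.
// When the entry is decided and applied, the ulid is used to complete the
// matching callback. Since OmniPaxos does not report the lsn of a proposal,
// this is the only way to correlate requests with decided entries.
#[derive(Default)]
struct PendingCallbacks {
    callbacks: HashMap<Ulid, oneshot::Sender<Result<Response, RequestError>>>,
}

impl PendingCallbacks {
    fn register(&mut self, request_id: Ulid) -> oneshot::Receiver<Result<Response, RequestError>> {
        let (tx, rx) = oneshot::channel();
        self.callbacks.insert(request_id, tx);
        rx
    }

    // Called when a decided log entry carrying this ulid has been applied.
    fn complete(&mut self, request_id: Ulid, response: Response) {
        if let Some(tx) = self.callbacks.remove(&request_id) {
            let _ = tx.send(Ok(response));
        }
    }

    // Called on leadership loss: we don't know whether the new leader's log
    // still contains our proposals, so fail everything and let clients retry.
    fn fail_callbacks(&mut self, err: impl Fn() -> RequestError) {
        for (_, tx) in self.callbacks.drain() {
            let _ = tx.send(Err(err()));
        }
    }
}
```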

Comment on lines +771 to +775
Self::reset_storage_for_new_configuration(
    &mut rocksdb_storage,
    &omni_paxos_configuration,
    last_decided_index,
)?;

Stop signs are not part of the actual log. Therefore, we need to manually reset the decided index and clear some fields in the RocksDbStorage in order to reuse the same storage for the next configuration.

Comment on lines +798 to +807
// remember the latest configuration we have seen for future checks
Self::reset_storage_for_new_configuration(
    &mut rocksdb_storage,
    &OmniPaxosConfiguration {
        own_member_id: self.own_member_id,
        cluster_config: stop_sign.next_config,
        members: metadata.members,
    },
    last_decided_index,
)?;

This might be helpful if this node is removed from the next configuration but some other nodes ask this node whether it knows which nodes are part of the new configuration and what the log is.

Comment on lines +1088 to +1089
// todo only try joining if MetadataStoreState::Candidate
let mut join_cluster = std::pin::pin!(Self::join_cluster(None, storage_id));

A potential follow-up is to use the MetadataStoreState stored in the NodesConfiguration to control which nodes should join a metadata store cluster or not. That could give us a declarative way to define the nodes that should run an active metadata store if they succeed at joining the cluster.


Relying on the NodesConfiguration to store metadata-store-relevant information (e.g. the MetadataStoreState and addresses) makes things considerably harder because of the circular dependency: we need the metadata store to write/read the NodesConfiguration, but the NodesConfiguration also controls the behavior of the metadata store. This becomes especially problematic if we fail during a reconfiguration w/o having received the log yet. In particular, if this node is now needed for the metadata store to become available, there is no easy way to obtain the NodesConfiguration by following the normal join flow.

Comment on lines +1130 to +1140
// We cannot assume that we have already joined the cluster and obtained our generational node id.
// That's why we need to retrieve the plain node id based on our name from the latest known
// NodesConfiguration. If the node has no assigned plain node id, then it first needs to obtain
// it by following the regular join path before it can join the metadata store cluster.
let my_node_id = if let Some(node_config) =
    nodes_config.find_node_by_name(Configuration::pinned().common.node_name())
{
    node_config.current_generation.as_plain()
} else {
    bail!(
        "The node with name '{}' has not obtained a node id yet. W/o the node id, it cannot join the metadata store cluster.",
        Configuration::pinned().common.node_name()
    );
};

This is an example where we need a persisted NodesConfiguration to remember our PlainNodeId if we fail during reconfiguration but then are needed for the availability of the metadata store.

Comment on lines +1143 to +1149
let active_metadata_store_node = nodes_config
    .iter()
    .filter_map(|(node, config)| {
        if config.has_role(Role::MetadataStore) && node != my_node_id {
            Some(node)
        } else {
            None
        }
    })
    .choose(&mut thread_rng())
    .ok_or(anyhow::anyhow!(
        "No other metadata store present in the cluster. This indicates a misconfiguration."
    ))?;

Should consider the MetadataStoreState::Active in the future.

Comment on lines +1176 to +1181
// once the log grows beyond the configured grpc max message size (by default 4 MB) this
// will no longer work :-( If we shared the snapshot we still have the same problem once the
// snapshot grows beyond 4 MB. Then we need a separate channel (e.g. object store) or
// support for chunked transfer.
let log_prefix = flexbuffers::from_slice(response.log_prefix.as_ref())?;
let omni_paxos_configuration =

Bound to break eventually.

@tillrohrmann

fyi @pcholakov for the future testing efforts.

The provision grpc call translates into a put command because the
local metadata store does not need to be provisioned.