Adds Health gRPC Server and Refactors Main() #148

danehans · 2025-01-04T01:02:05Z

Adds a health gRPC server and refactors main() for better lifecycle management:

- Introduced a health gRPC server to handle liveness and readiness probes.
- Refactored main() to manage server goroutines using sync.WaitGroup.
- Added graceful shutdown for servers and controller manager.
- Improved logging consistency and ensured datastore readiness checks.

Fixes #96
Fixes #175

kfswain · 2025-01-06T23:06:29Z

/lgtm

ahg-g · 2025-01-06T18:00:10Z

pkg/ext-proc/backend/datastore.go

+		}
+		ready = true
+		return false
+	})


At startup, I think we want to ensure that the extension has synced with the API server and fetched the models, but readiness should not depend on at least one model being defined.

The health probe now uses a client to check the API server for the configured InferencePool and that at least one InferenceModel exists in the same namespace. Should this probe also check that at least one InferenceModel references the configured InferencePool?

I don't think that the health check needs to block on at least one InferenceModel. On the other hand, since the extension is currently 1:1 with an InferencePool, I think it makes sense to ensure that the extension successfully initialized its assigned InferencePool.

pkg/ext-proc/main.go

k8s-ci-robot · 2025-01-09T05:24:47Z

New changes are detected. LGTM label has been removed.

k8s-ci-robot · 2025-01-09T05:24:50Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: danehans
Once this PR has been reviewed and has the lgtm label, please ask for approval from kfswain. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

- Introduced a health gRPC server to handle liveness and readiness probes.
- Refactored main() to manage server goroutines.
- Added graceful shutdown for servers and controller manager.
- Improved logging consistency and ensured.
- Validates CLI flags.

Signed-off-by: Daneyon Hansen <daneyon.hansen@solo.io>

ahg-g · 2025-01-09T22:31:34Z

pkg/ext-proc/health.go

+	// Ensure at least 1 InferenceModel
+	if len(modelList.Items) == 0 {
+		return fmt.Errorf("no InferenceModels exist in namespace %s", *poolNamespace)
+	}


I am not sure this is necessary.

ahg-g · 2025-01-09T22:44:25Z

pkg/ext-proc/main.go

+		*targetPodHeader,
+	)
+
+	// Wait for first error from any goroutine


or the controller manager returning gracefully

ahg-g · 2025-01-09T22:56:22Z

pkg/ext-proc/health.go

+	client.Client
+}
+
+func (s *healthServer) Check(ctx context.Context, in *healthPb.HealthCheckRequest) (*healthPb.HealthCheckResponse, error) {


I think this can check that the datastore populated the inference pool instead of actually doing a pull from the server?
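A datastore-backed check along these lines might look roughly like the sketch below. It is an illustration only: `datastore`, `setPool`, `hasSyncedPool`, and `check` are hypothetical names, and `servingStatus` is a local stand-in for the SERVING / NOT_SERVING values of the gRPC health protocol rather than the generated `healthPb` types.

```go
package main

import (
	"fmt"
	"sync"
)

// servingStatus mirrors the SERVING / NOT_SERVING values of the gRPC health
// protocol. It is a local stand-in, not the generated healthPb type.
type servingStatus int

const (
	notServing servingStatus = iota
	serving
)

// datastore is a hypothetical in-memory cache filled by the controllers.
type datastore struct {
	mu       sync.RWMutex
	poolName string // empty until the InferencePool has been reconciled
}

// setPool records that the assigned InferencePool has been cached.
func (d *datastore) setPool(name string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.poolName = name
}

// hasSyncedPool reports whether the InferencePool is present in the cache.
func (d *datastore) hasSyncedPool() bool {
	d.mu.RLock()
	defer d.mu.RUnlock()
	return d.poolName != ""
}

// check reports serving once the pool is cached. It makes no API server
// call and does not look at how many InferenceModels exist.
func check(d *datastore) servingStatus {
	if d.hasSyncedPool() {
		return serving
	}
	return notServing
}

func main() {
	ds := &datastore{}
	fmt.Println(check(ds) == serving) // false: pool not reconciled yet
	ds.setPool("my-pool")
	fmt.Println(check(ds) == serving) // true: pool is cached
}
```

Reading from the cache keeps the probe cheap and ties readiness to the extension's own state, which also fits the earlier point that readiness should not block on an InferenceModel existing.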

k8s-ci-robot · 2025-01-10T02:40:55Z

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot requested a review from ahg-g January 4, 2025 01:02

k8s-ci-robot added the "cncf-cla: yes" label (indicates the PR's author has signed the CNCF CLA) Jan 4, 2025

k8s-ci-robot requested a review from kfswain January 4, 2025 01:02

k8s-ci-robot added the "size/L" label (denotes a PR that changes 100-499 lines, ignoring generated files) Jan 4, 2025

k8s-ci-robot assigned kfswain Jan 6, 2025

k8s-ci-robot added the "lgtm" label ("Looks good to me", indicates that a PR is ready to be merged) Jan 6, 2025

ahg-g reviewed Jan 6, 2025

k8s-ci-robot added the "needs-rebase" label (indicates a PR cannot be merged because it has merge conflicts with HEAD) Jan 7, 2025

danehans force-pushed the issue_96 branch from b554907 to 13d7f04 January 9, 2025 05:24

k8s-ci-robot removed the "lgtm" label Jan 9, 2025

k8s-ci-robot removed the "needs-rebase" label Jan 9, 2025

danehans force-pushed the issue_96 branch from 13d7f04 to f46c9ec January 9, 2025 05:32

danehans requested a review from ahg-g January 9, 2025 16:04

danehans changed the title from "Adds Health gRPC Server and Refactor Main()" to "Adds Health gRPC Server and Refactors Main()" Jan 9, 2025

ahg-g reviewed Jan 9, 2025

k8s-ci-robot added the "needs-rebase" label (indicates a PR cannot be merged because it has merge conflicts with HEAD) Jan 10, 2025
