Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds Health gRPC Server and Refactors Main() #148

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

danehans
Copy link
Contributor

@danehans danehans commented Jan 4, 2025

Adds a health gRPC Server and refactors main() for better lifecycle management:

  • Introduced a health gRPC server to handle liveness and readiness probes.
  • Refactored main() to manage server goroutines using sync.WaitGroup.
  • Added graceful shutdown for servers and controller manager.
  • Improved logging consistency and ensured datastore readiness checks.

Fixes #96
Fixes #175

@k8s-ci-robot k8s-ci-robot requested a review from ahg-g January 4, 2025 01:02
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 4, 2025
@k8s-ci-robot k8s-ci-robot requested a review from kfswain January 4, 2025 01:02
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 4, 2025
@kfswain
Copy link
Collaborator

kfswain commented Jan 6, 2025

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 6, 2025
}
ready = true
return false
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At startup, I think we want to ensure that the extension did a sync with the api server and fetched the models, but not declare itself ready only if at least one model is defined.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The health probe now uses a client to check the API server for the configured InferencePool and that at least one InferenceModel exists in the same namespace. Should this probe also check that at least one InferenceModel references the configured InferencePool?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that the health check needs to block on at least one InferenceModel. On the other hand, since extension is currently 1:1 with InferencePool, I think it makes sense to ensure that the extension successfully initialized the assigned InferencePool.

pkg/ext-proc/main.go Outdated Show resolved Hide resolved
pkg/ext-proc/main.go Outdated Show resolved Hide resolved
pkg/ext-proc/main.go Outdated Show resolved Hide resolved
pkg/ext-proc/main.go Outdated Show resolved Hide resolved
pkg/ext-proc/main.go Outdated Show resolved Hide resolved
pkg/ext-proc/main.go Outdated Show resolved Hide resolved
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 7, 2025
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 9, 2025
@k8s-ci-robot
Copy link
Contributor

New changes are detected. LGTM label has been removed.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: danehans
Once this PR has been reviewed and has the lgtm label, please ask for approval from kfswain. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 9, 2025
- Introduced a health gRPC server to handle liveness and readiness probes.
- Refactored main() to manage server goroutines.
- Added graceful shutdown for servers and controller manager.
- Improved logging consistency and ensured.
- Validates CLI flags.

Signed-off-by: Daneyon Hansen <daneyon.hansen@solo.io>
@danehans danehans requested a review from ahg-g January 9, 2025 16:04
@danehans danehans changed the title Adds Health gRPC Server and Refactor Main() Adds Health gRPC Server and Refactors Main() Jan 9, 2025
// Ensure at least 1 InferenceModel
if len(modelList.Items) == 0 {
return fmt.Errorf("no InferenceModels exist in namespace %s", *poolNamespace)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure this is necessary.

*targetPodHeader,
)

// Wait for first error from any goroutine
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or the controller manager returning gracefully

client.Client
}

func (s *healthServer) Check(ctx context.Context, in *healthPb.HealthCheckRequest) (*healthPb.HealthCheckResponse, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can check that the datastore populated the inference pool instead of actually doing a pull from the server?

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 10, 2025
@k8s-ci-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Validate EPP Flags Add a liveness/readiness endpoint to the extension
4 participants