v0.1 API Review #154

Open · kfswain wants to merge 3 commits into v0.0

Conversation

@kfswain (Collaborator) commented Jan 6, 2025

This PR is not intended to be merged; it is merely a point of reference for review.

Slides: https://docs.google.com/presentation/d/1gtOJS1YA0Ax8KvsGPrHiyZR2dBoWnlLd9aACyojmk68/edit#slide=id.p

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 6, 2025
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kfswain
Once this PR has been reviewed and has the lgtm label, please assign nikhita for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jan 6, 2025
@kfswain (Collaborator, Author) commented Jan 6, 2025

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 6, 2025
@mrbobbytables mrbobbytables removed their request for review January 6, 2025 21:13
@shaneutt (Member) commented Jan 8, 2025

/cc

@k8s-ci-robot k8s-ci-robot requested a review from shaneutt January 8, 2025 17:18
Comment on lines +134 to +136
// Sheddable defines the lowest level of criticality. Requests to this band will be shed before
// all other bands.
Sheddable Criticality = "Sheddable"

nit: The name Sheddable exposes the outcome of this Criticality level rather than describing the level itself. I suggest using NonCritical or some other descriptive name.

Collaborator Author

I do agree that it describes what the criticality mechanism does.

I'm having a hard time coming up with an adequate name that wouldn't be synonymous. Non-Critical could be viewed as anything that isn't at the Critical level, which would include both Default and Sheddable.

I suppose we could use BestEffort to be in line with the Pod QoS classes. But we don't have an equivalent for Guaranteed, and only a loose equivalent for Burstable.

Definitely open to better names though. LMKWYT

Member

Also open to other name ideas here. Currently we have:

  • Critical
  • Default
  • Sheddable

I don't really love Criticality: NonCritical or Criticality: Critical. Maybe there are 3 different terms that clearly indicate 3 distinct layers of criticality? A naive example would be High, Medium, Low, but that's not the best.

cc @ahg-g

Contributor

Thoughts on changing Criticality to Priority or introducing a CriticalityClass concept similar to PriorityClass, where the level of criticality can be user-defined? For example:

apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: CriticalityClass
metadata:
  name: non-essential
value: 100
---
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferenceModel
metadata:
  name: inferencemodel-sample
spec:
  criticalityClassName: non-essential
  modelName: tweet-summary
  poolRef:
    name: vllm-llama2-7b-pool
  targetModels:
  - name: tweet-summary-0

Collaborator Author

I'm so torn here. I agree with the idea of a user defined priority system... However:

I worry how that might interact with an algo. If we open the gates to user defined prio, we are essentially letting untested priority sets run in prod and implicitly supporting that. I'm no Load Balancing algo expert, but from my limited experience, it seems that could be incredibly challenging to guarantee that it works correctly.
If we keep a discrete set, we maintain the bounds on what is possible, and are able to provide stronger promises.

I also could be worrying about nothing. LMKWYT

@ahg-g (Contributor) Jan 9, 2025

Just to summarize today's discussion during the community meeting:

  1. There is some agreement on continuing to use an enum and a predefined set of criticality levels
  2. Change it to a reference and not default it in the api to allow extending this in the future with another parameter forming a oneOf semantics.

We have another comment thread on changing the name for Default, but regarding Sheddable I feel it offers clear semantics as to what it means: that it is the first to be dropped when capacity is constrained. I don't have better suggestions, though, to address @candita's point, which is completely valid!
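
To make the second point concrete, here is a minimal sketch of what it could look like: the current predefined levels kept as an enum, the field made an optional pointer, and no kubebuilder default. Field placement and doc comments are illustrative, not the final API.

package v1alpha1

// Criticality defines how important it is to serve the model compared to
// other models referencing the same pool.
type Criticality string

const (
	// Critical is the highest band; these requests are served first.
	Critical Criticality = "Critical"
	// Default is the current middle band (naming under discussion in this thread).
	Default Criticality = "Default"
	// Sheddable requests are shed before all other bands.
	Sheddable Criticality = "Sheddable"
)

type InferenceModelSpec struct {
	// Criticality is deliberately left without a kubebuilder default so the
	// field can later evolve into a oneOf/union with another parameter.
	//
	// +optional
	Criticality *Criticality `json:"criticality,omitempty"`

	// ...other fields omitted for brevity...
}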


Throwing out an idea. Instead of Criticality there could be a simple Sheddable boolean, described with caveats like: when model workloads exceed resources, some model workloads may be dropped (or moved to the end of a scheduling queue) in order to avoid starvation of higher value model workloads. If this model allows for best-effort scheduling, marking it as Sheddable will help obtain results faster from higher value model workloads.

This begs the question about whether the scheduling priority belongs on workloads rather than models. Is there an API object for workloads?
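
Purely as an illustration of the boolean alternative floated above (the field name and wording are hypothetical, not proposed API):

// Sheddable indicates that, when model workloads exceed available resources,
// requests for this model may be dropped or moved to the end of the
// scheduling queue to avoid starving higher-value model workloads.
//
// +optional
Sheddable *bool `json:"sheddable,omitempty"`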

Collaborator Author

The InferenceModel object is intended to represent a model as an (inference) workload. We elected not to use the term workload, as K8s already has a pretty well-established definition: https://kubernetes.io/docs/concepts/workloads/

All that to say, I think we are in agreement.

api/inferencemodel_types.go (outdated comment thread, resolved)
type InferenceModelStatus struct {
// Conditions track the state of the InferenceModel.
Conditions []metav1.Condition `json:"conditions,omitempty"`
}
@candita Jan 8, 2025

I suggest using a specifically typed condition as the type of the array, rather than a generic one. This helps to set expectations about what types of conditions are generated. For example, start with a set that describes the availability and degradation of the InferenceModel?

Collaborator Author

Ack! I left the array as metav1.Condition to match Gateway's prior art: https://github.com/kubernetes-sigs/gateway-api/blob/0e465a76f43137e1d39771ea1a75b6190e7b4ac6/apis/v1/gateway_types.go#L735

But I have created typed condition types/reasons.
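
For context, typed condition types/reasons following Gateway API's prior art could look roughly like this; the condition and reason names below are illustrative, not necessarily the final set:

// InferenceModelConditionType is a type of condition for the InferenceModel.
type InferenceModelConditionType string

// InferenceModelConditionReason is a reason for an InferenceModel condition.
type InferenceModelConditionReason string

const (
	// ModelConditionAccepted indicates whether the InferenceModel has been
	// accepted by a referenced InferencePool.
	ModelConditionAccepted InferenceModelConditionType = "Accepted"

	// ModelReasonAccepted is the expected reason when the model is accepted.
	ModelReasonAccepted InferenceModelConditionReason = "Accepted"

	// ModelReasonPending is the initial reason before a controller reconciles.
	ModelReasonPending InferenceModelConditionReason = "Pending"
)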

Member

@candita I completely agree that some sample typed conditions would be helpful here so we can have consistent status across implementations.

As far as a new Condition type, I don't think we've done that before. In Gateway API, we've exclusively used metav1.Condition and then added typed conditions as recommendations for what we'd expect to be set within those conditions. As far as I know, we've never actually added custom validation or types for these conditions though.


@kfswain @robscott sounds good. Having a default/starting condition like set in https://github.com/kubernetes-sigs/gateway-api/blob/main/apis/v1/gateway_types.go#L734 would be great.

Member

+1, completely agree, this also feels like something we should try to get in before releasing v0.1
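
A minimal sketch of such a starting condition, modeled on the Gateway example linked above; the exact type/reason names would follow whatever typed conditions we settle on:

// Conditions track the state of the InferenceModel.
//
// +optional
// +listType=map
// +listMapKey=type
// +kubebuilder:validation:MaxItems=8
// +kubebuilder:default={{type: "Accepted", status: "Unknown", reason: "Pending", message: "Waiting for controller", lastTransitionTime: "1970-01-01T00:00:00Z"}}
Conditions []metav1.Condition `json:"conditions,omitempty"`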

api/inferencemodel_types.go (outdated comment thread, resolved)
// InferencePoolStatus defines the observed state of InferencePool
type InferencePoolStatus struct {
// Conditions track the state of the InferencePool.
Conditions []metav1.Condition `json:"conditions,omitempty"`
Collaborator Author

The same applies to this section as well.

Comment on lines +58 to +61
// The number must be in the range 1 to 65535.
//
// +kubebuilder:validation:Minimum=1
// +kubebuilder:validation:Maximum=65535

If applicable, I suggest reducing this port range to a smaller list of well-known ports that users can rely on for firewall configuration purposes. Also, don't allow overlap with other well-known ports like those used for DNS, HTTP/S, etc.

Collaborator Author

Holding off on changing this one, just to gather consensus on what range we should limit it to. But I do agree with the idea.

It's possible that we could start with a small range and relax it as needed, as going the other direction would be nigh impossible.

Member

@candita this is meant to be a reference to a port number on a Pod. I can't think of any reasonable way to limit that, since Kubernetes has likely scaled far enough that there's probably at least one case of each individual port being in use across the many clusters that exist.

@robscott (Member) commented Jan 9, 2025

Thanks to everyone for the help reviewing this! Any/all comments are very appreciated.

We also went over this at a high level in a meeting with SIG-Net TLs (and others), walking through some slides as a reference point. Since this is starting as an x-k8s.io API, formal API review is optional, but we're trying to get as much feedback as we can early in the process. Barring any blocking feedback, we're hoping to release v0.1 next week.

Copying SIG-Net TLs and chairs if any have time for additional review cycles.

/cc @aojea @danwinship @MikeZappa87 @shaneutt @thockin

@shaneutt (Member) commented Jan 9, 2025

Copying SIG-Net TLs and chairs if any have time for additional review cycles.

I've set aside time tomorrow (Friday) to review 👍

// defaults to 1.
//
// +optional
// +kubebuilder:default=1
@danehans (Contributor) Jan 9, 2025

Based on API review feedback from @smarterclayton, this default should be removed. Defaults are easier to add later as the API matures.

Contributor

Additionally, weight is a concept specific to a load-balancing/scheduling extension. During the API review, we discussed the need to document this dependency.

Member

+1 on all points. I think we can tackle all of these items with smaller individual PRs instead of trying to solve everything in one go, so if anyone's looking for ways to jump in, feel free to grab some of these actionable comments and turn them into PRs.

Collaborator Author

Ah, I was going to capture them here and then just copy the contents of the files back into their original homes. Handling them piecewise works also. Open to whatever.

// +kubebuilder:default=1
// +kubebuilder:validation:Minimum=0
// +kubebuilder:validation:Maximum=1000000
Weight int32 `json:"weight,omitempty"`
Contributor

If defaulting is removed, should this field be a pointer since it's optional? API conventions state this should be a pointer if it's optional.

Member

Yep, agree on both points - default should be removed and this should become a pointer.
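
Sketch of the field after both changes; the doc comment is paraphrased, and the note about the load-balancing/scheduling extension reflects the dependency discussed earlier in this thread:

// Weight influences the proportion of traffic routed to this target model.
// Note: weight is only meaningful when a load-balancing/scheduling extension
// honors it; see the documentation note discussed above.
//
// +optional
// +kubebuilder:validation:Minimum=0
// +kubebuilder:validation:Maximum=1000000
Weight *int32 `json:"weight,omitempty"`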

// Criticality defines how important it is to serve the model compared to other models referencing the same pool.
//
// +optional
// +kubebuilder:default="Default"
@danehans (Contributor) Jan 9, 2025

Based on feedback from @smarterclayton, this default should be removed to allow this concept to evolve by introducing oneOf (union types). If this default is being removed, should we also remove the associated "Default" enum?

Member

That's a good point - if we're no longer defaulting, we can't really call the middle value "default", so we likely need at least one new name here. Any ideas?

cc @ahg-g

Contributor

Agreed, based on today's discussion we should remove this defaulting.

Regarding "Default", I think we have two options:

  1. Rename to Medium or BestEffort
  2. Remove it completely and stick with two only: Critical and Sheddable. This is perhaps acceptable now that we have a clearer path to extending the scope in the future.

wdyt?

Collaborator Author

If we are boiling this down to two values, do we want to make the Criticality type an indirection to bool rather than string, and then just rename it to Critical?

@candita mentions the same: #154 (comment)

Contributor

I am still trying to understand the intent of criticality. Can someone help me understand what "serve" means in the godocs? Does "serve" mean that, if the Node does not have enough resources (e.g. memory), model servers of an InferenceModel with criticality: BestEffort will be evicted? Or is criticality a networking classification concept?

Collaborator Author

Fair question. We used to have a glossary to define some of these terms (where we call it priority).

But to quickly summarize: Criticality is intended to be a load-balancing concept. So if there is always enough capacity to place a request on any model server, criticality does nothing. But should we need to queue requests, criticality will prioritize the requests that come from a critical workload. We can't give exact guarantees the way we can with scheduling on a node, as traffic can wax and wane much more quickly.
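
Purely to illustrate that intent (this is not the extension's actual implementation), the queueing behavior described above could be pictured as:

package sketch

// request is a placeholder for a queued inference request.
type request struct{ model string }

// bands lists the criticality levels, highest priority first; the names
// match the current enum values.
var bands = []string{"Critical", "Default", "Sheddable"}

// nextRequest drains higher-criticality queues first, so Sheddable requests
// are the last to be served (and effectively the first to be shed) when
// capacity is constrained.
func nextRequest(queues map[string][]request) (request, bool) {
	for _, band := range bands {
		if q := queues[band]; len(q) > 0 {
			return q[0], true
		}
	}
	return request{}, false
}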

Contributor

@kfswain thanks for the glossary reference. Consider including points from the glossary or ^ to the Criticality godocs to better define the concept. Without the additional details, I can see someone interpreting Criticality as pod priority, network priority, etc.

api/inferencepool_types.go (outdated comment thread, resolved)
@k8s-ci-robot (Contributor)

@kfswain: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test: pull-gateway-api-inference-extension-test-unit-main
Commit: 613fd37
Required: true
Rerun command: /test pull-gateway-api-inference-extension-test-unit-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
