Cannot set resources Requests and Limits for workflow pods #3641
Comments
Hey @jonathan-fileread, is there a way to configure this in the default values.yaml file provided with the gha-runner-scale-set charts? |
@kanakaraju17 Hey Kanaka, unfortunately not. You need to create a separate pod template in order to define the workflow pod, as the values.yaml only defines the runner pod settings. |
@jonathan-fileread, any idea why the file is not getting mounted in the runner pods? I'm using the following configuration and encountering the error below:
ConfigMap Configuration
The pods fail and end up with the below error:
Have you tried recreating it in your environment? Have you come across this error before? It seems to be a mounting issue where the file is not found. |
@kanakaraju17 You can follow the official guide which worked for me at least :) In your case that would be something like: ConfigMap:
Usage:
|
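For anyone landing here later: a minimal sketch of the pattern from that guide, assuming a hypothetical ConfigMap named hook-extension in whatever namespace the scale set is installed into; the special container name "$job" is what the Kubernetes hook uses to target the workflow's job container:

apiVersion: v1
kind: ConfigMap
metadata:
  name: hook-extension          # hypothetical name, referenced from the runner values
  namespace: arc-runners        # hypothetical namespace; use the scale set's namespace
data:
  content: |
    spec:
      containers:
        - name: "$job"          # targets the job (workflow) container
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              cpu: "2"
              memory: "4Gi"

The runner pod then mounts this ConfigMap and points ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE at the mounted file, as in the values sketch further down.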
Hey @georgblumenschein, adding the env variables below when deploying the gha-runner-scale-set doesn't seem to take effect.
Additional ENV Variable Added:
The workflow pods should include the ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE environment variable and volume mount, but neither appears when describing the pods; the output is currently missing this variable. Expected result: below is the values.yaml template used to append the environment variable:
Problem / current output: describing the AutoscalingRunnerSet doesn't show the added env variables either.
Expected behavior: the ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE env variable and the corresponding volume mounts are added to the pods that come up. |
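For reference, here is roughly how the guide wires this up on the runner side, with the env var and volume mount placed on the runner container inside template.spec (paths and the ConfigMap name are illustrative and match the hook-extension sketch above):

template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
          - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
            value: /home/runner/pod-template/content    # path of the mounted ConfigMap key
        volumeMounts:
          - name: pod-template
            mountPath: /home/runner/pod-template
            readOnly: true
    volumes:
      - name: pod-template
        configMap:
          name: hook-extension                          # hypothetical ConfigMap holding the workflow pod template

If the variable does not show up on the runner pods, it is worth double-checking that this block sits under template.spec in values.yaml (not at the chart's top level), since only that section is copied into the AutoscalingRunnerSet.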
Hey @kanakaraju17, after 2 days of trial and error I managed to get a working scenario with resource limits applied. Funny thing is we were overcomplicating it using the "hook-extensions". All we need to do is add it directly in the Helm values. Below is a snippet of the values to pass into Helm (although I am using a HelmRelease with FluxCD, the principle still applies): values:
containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    storageClassName: "standard"
    resources:
      requests:
        storage: 10Gi
githubConfigSecret: gh-secret
githubConfigUrl: "https://github.com/<Organisation>"
runnerGroup: "k8s-nonprod"
runnerScaleSetName: "self-hosted-k8s" # used as a runner label
minRunners: 1
maxRunners: 10
template:
  spec:
    securityContext:
      fsGroup: 1001
    imagePullSecrets:
      - name: cr-secret
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        resources:
          limits:
            cpu: "2000m"
            memory: "5Gi"
          requests:
            cpu: "200m"
            memory: "512Mi"
I have confirmed that this has been working for me; some CodeQL workflows are now failing due to "insufficient RAM" lol. Hope it helps. |
@marcomarques-bt, I assume the above configuration works only for the runner pods and not for the pods where the workflow runs, i.e. the workflow pods. Referring to the image below: the configuration works for the first pod but not for the second pod, where the actual job runs. |
It seems that, similar to the issue mentioned earlier, tolerations cannot be configured either. |
👋 Hey, thanks for opening this topic. I have managed to get this going, but we have some large runners and ran into an issue where, if there are no resources available on the node, the workflow pod fails to schedule...
and it needs to be scheduled on the same node as the runner because of the PVC. This whole thing doesn't make much sense: we want people to specify, for example, a large runner in Kubernetes mode, and in the end we get an idle pod that just tries to spin up a new pod. |
@kanakaraju17 thanks for opening this issue. Did you ever find a mechanism to enforce resource limits? |
For @cboettig and those following this thread and the interaction between @kanakaraju17 and @georgblumenschein: I have made it work with the following configuration. I am sharing it as JSON, as it makes it clearer that the ConfigMap is properly formatted. ConfigMap:
runner-scaleset values:
This will add the resource requests and limits only for the |
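For illustration only (the values below are hypothetical, not the exact config above), a JSON-style hook template that sets requests and limits on the "$job" container can look like this; JSON is valid YAML, so the hook reads it the same way:

apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-pod-template     # hypothetical name
data:
  content: |
    {
      "spec": {
        "containers": [
          {
            "name": "$job",
            "resources": {
              "requests": { "cpu": "500m", "memory": "1Gi" },
              "limits": { "cpu": "2", "memory": "4Gi" }
            }
          }
        ]
      }
    }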
That is all really good, but the moment you set resources on the workflow pod and there is no space on the node hosting the controller pod, you are out of luck... It won't wait for resources to become available; it just fails. We are in the process of evaluating the option of using the kube-scheduler, but that requires changing the PVC to RWX (ReadWriteMany), which is expensive and has its limitations. We are in AWS and have tried EFS and IO2, but neither works well. GitHub should really implement this properly, as it is really handicapped at the moment. |
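For context, the RWX direction amounts to switching the work volume claim in the chart values to ReadWriteMany and pointing it at a storage class that supports multi-attach (the class name below is a hypothetical EFS/NFS CSI class):

containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteMany"]    # RWX, so the workflow pod is no longer pinned to the runner's node
    storageClassName: "efs-sc"        # hypothetical CSI storage class with multi-attach support
    resources:
      requests:
        storage: 10Gi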
@velkovb you are right; by setting requests on the workflow pod but not on the controller pod, we quickly ran into that issue: the controller pod always has room on the node, but the whole action fails if there's no room for its corresponding workflow pod. So far we have worked around it by assigning requests to the controller pod and none to the workflow one. That way the workflow pod always has room, and we count on it cannibalizing the resources assigned to the controller pod, since the controller is very lightweight. This is not ideal, but it is the best we could come up with without RWX. What issues have you experienced with IO2? That was my next alternative to try, so we can use the kube-scheduler and not worry about controller and workflow pods having to land on the same node. |
I don't think I get how that works. If you set requests for the controller pod, won't it actually reserve it for that pod and not give it to anything else? I would see it work for CPU but not sure it does for memory?
Multi-attach works in |
The requests guarantee that the specified amount of CPU is available at scheduling time, but if the workflow pod needs CPU time and the controller is idle, it will take it from it. This is not the case for memory, which is why I have only set requests for CPU. |
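A minimal sketch of that workaround in the scale set values, assuming requests are set only on the runner ("controller") container and only for CPU, so the workflow pod can borrow the reserved CPU while memory stays unreserved:

template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        resources:
          requests:
            cpu: "2"   # reserves node CPU at scheduling time; an idle runner lets the workflow pod use it
          # intentionally no memory request or limit: memory cannot be shared back the same way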
@velkovb We've migrated to an RWX setup with an NFS CSI storage class to avoid the multi-attach error of RWO; however, we're experiencing slowness with workflow pods being provisioned (it usually takes 3 minutes per GitHub Actions job). I suspect it has something to do with FS slowness (not sure if it's the provisioning or just using it in general). Do you have any recommendations? We've opened a ticket here #3834 |
My findings were that the slowness is in the container hook's job preparation. The first log message I see after the container starts is https://github.com/actions/runner-container-hooks/blob/main/packages/k8s/src/hooks/prepare-job.ts#L45. The slowness is not in PVC provisioning, as that goes really fast. The workspace seems to be only ~250MB, so I am not sure why it is so slow. |
This is also bothering me: how is something seemingly so basic a requirement not a standard option out of the box? I'm considering trying the following approaches to ensure the workflow pod will fit (resource-wise) and be scheduled onto the same node:
To address different workload needs, I’m planning to define multiple runner scale sets with varying sizes, allowing developers to select the one that fits their requirements:
Each class maps to specific instance types:
Has anyone else approached this problem in a similar way? If so, I’d love to hear any pointers or lessons learned. Also, if anyone sees potential holes in my plan or areas for improvement, please let me know! |
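As a rough illustration of the multiple-scale-set idea (all names and numbers below are hypothetical), each size class could be a separate Helm release of gha-runner-scale-set whose values differ only in the scale set name, the node selection, and the hook template it mounts:

# values-medium.yaml, one release per size class
runnerScaleSetName: "k8s-medium"    # what developers put in runs-on
containerMode:
  type: "kubernetes"
template:
  spec:
    nodeSelector:
      node.kubernetes.io/instance-type: "m5.2xlarge"   # hypothetical mapping of this class to an instance type
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
          - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
            value: /home/runner/pod-template/content    # hook template sized for "medium" jobs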
@jasonwbarnett We started with a similar, more granular approach to resource ratios but noticed that it was not followed strictly and the Kubernetes nodes actually had a lot of free resources. Besides what you have in mind, we had a further breakdown (x64/arm64 and dind/non-dind), which ended up as a lot of scale sets. The idea of having CPU on the controller and memory on the workflow pod is a good workaround :) |
@velkovb Thanks for sharing your experience! It's interesting to hear that a more granular approach led to inefficiencies due to underutilized resources. I can definitely see how managing over 30 scale sets could become unwieldy. Your T-shirt sizing approach with small, medium, and large options (and differentiating x64/arm64 and dind/non-dind) sounds like a practical way to simplify things while still offering flexibility. Did you have the scale sets mapped to specific Karpenter node pools, or something else? I hadn't considered not setting resource limits on the pods so they can use free node resources; that's an intriguing idea. I imagine it works well with your setup of a single large spot node pool, as you can maximize utilization without worrying too much about strict separation. Thanks for the feedback on the controller CPU and workflow pod memory approach! I'll experiment further with these ideas and keep the potential for over-complication in mind. If you don't mind me asking, how do you handle scaling with the T-shirt sizes: do you find it works well with just the large node pool, or are there edge cases where it gets tricky? |
@jasonwbarnett Just one large spot node pool, plus an on-demand one for edge cases of really long-running jobs. We were monitoring our nodes and the resource usage was rarely going above 30%, which is what made us try no resource limits. For workloads we always set memory request = memory limit, but here, due to the short lifetime of a job pod, we believe it won't be a problem. We run roughly 20k jobs a day and so far it seems to be working fine :) In our config we have some warm runners for the most used types, and we use overprovisioning to keep 1-2 warm nodes (as that is usually the slowest thing). |
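For reference, one common way to keep warm nodes with overprovisioning (a generic sketch, not necessarily the exact setup described above) is a low-priority placeholder deployment that gets preempted as soon as real runner or workflow pods need the capacity:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10            # below the default priority of 0, so placeholders are preempted first
globalDefault: false
description: "Placeholder pods that hold warm node capacity"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: warm-node-placeholder
spec:
  replicas: 2         # roughly 1-2 nodes' worth of headroom; tune to node size
  selector:
    matchLabels:
      app: warm-node-placeholder
  template:
    metadata:
      labels:
        app: warm-node-placeholder
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "3"        # hypothetical sizing so each replica fills most of a node
              memory: "12Gi"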
Checks
Controller Version
0.9.2
Deployment Method
Helm
Checks
To Reproduce
Describe the bug
The runner pods, which have names ending with "workflow," should have the specified resource requests and limits for CPU and memory when they are created.
Describe the expected behavior
The workflow pod that is created during the pipeline execution should have specific CPU and memory limits and requests set. However, it is not starting with the specified resources and limits.
Additionally, an extra pod is being created when the pipeline runs, alongside the existing runner pods. We need to understand the purpose of the existing runner pod if a new pod is also being initiated. Details of the extra pod are in the screenshot below.
Additional Context
Controller Logs
https://gist.github.com/kanakaraju17/31a15aa0a1b5a04fb7eaab6996c02d40 [this is not related to the resource request constraint for the runner pods]
Runner Pod Logs