Actions runner controller gets stuck when pods get evicted #3840

Closed
4 tasks done
PythonCoderAS opened this issue Dec 10, 2024 · 3 comments
Labels
  • bug: Something isn't working
  • gha-runner-scale-set: Related to the gha-runner-scale-set mode
  • needs triage: Requires review from the maintainers

Comments

@PythonCoderAS

Controller Version

0.9.3

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Trigger many jobs that require runners (for example, a matrix with a large number of combinations).
2. Trigger a disk pressure taint on the node (for example, by running out of PVC space on the storage allocator).
3. Remove any other node from the cluster.

Describe the bug

My cluster hit a disk pressure taint and 15 pods were evicted. After the taint was removed, the evicted pods stayed in place and no new pods were launched, so no self-hosted jobs ran until I manually deleted the evicted pods.
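
The manual cleanup mentioned above could be scripted roughly as follows. This is a minimal sketch, not part of the original report: it assumes evicted pods show up with `status.phase` set to `Failed` and `status.reason` set to `Evicted` in the output of `kubectl get pods --output json`, and the helper names and the `arc-runners` namespace are hypothetical. It only prints the delete commands, so nothing is removed until you run them yourself.

```python
import json
import subprocess


def evicted_pod_names(pod_list: dict) -> list[str]:
    """Return the names of pods reported as Failed with reason Evicted."""
    names = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") == "Failed" and status.get("reason") == "Evicted":
            names.append(pod["metadata"]["name"])
    return names


def delete_commands(names: list[str], namespace: str) -> list[list[str]]:
    """Build one `kubectl delete pod` command per evicted pod (not executed here)."""
    return [
        ["kubectl", "delete", "pod", name, "--namespace", namespace]
        for name in names
    ]


def print_cleanup_plan(namespace: str = "arc-runners") -> None:
    """Dry run: list pods in the (assumed) runner namespace and print the deletes."""
    raw = subprocess.run(
        ["kubectl", "get", "pods", "--namespace", namespace, "--output", "json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    for command in delete_commands(evicted_pod_names(json.loads(raw)), namespace):
        print(" ".join(command))
```

Calling `print_cleanup_plan()` with the namespace your runner pods live in prints the commands for review. Whether deleting the pods alone is enough to get the scale set moving again may depend on the controller version.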

Describe the expected behavior

The controller should either delete the evicted pods automatically or create new pods alongside them.

Additional Context

githubConfigUrl: "https://github.com/HARP-research-Inc"
githubConfigSecret:
  github_token: "<redacted>"
containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    storageClassName: "local-path"
    resources:
      requests:
        storage: 2Gi
minRunners: 3

Controller Logs

https://gist.github.com/PythonCoderAS/19d8fc2f3e6a0623ca8311bede0b027e

Runner Pod Logs

https://gist.github.com/PythonCoderAS/1f39f9769e9ce8e3060f1e4791d7508d

PythonCoderAS added the bug, gha-runner-scale-set, and needs triage labels on Dec 10, 2024

Contributor

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

@luis02lopez

Same here. It looks like the autoscaler thinks the pods are still running and doesn't deploy new ones.

@nikola-jokic
Collaborator

Hey @PythonCoderAS,

This is by design. The controller is responsible for trying to provision runners to match the desired count. If there is an issue with the cluster, causing 5 pod failures, we use it as a signal that something is wrong and should be taken care of. The controller cannot know when the issue is resolved, so it stops, leaving you to fix the issue and clean up the failed runners.

Otherwise, the controller will keep trying to recreate pods that will likely fail again. Recreation would cause pressure on the back-end for something that will likely fail. It can also get rate-limited by the Kubernetes API, causing issues for other scale sets. Therefore, this mechanism can help diagnose the issue and reduce the likelihood of affecting other scale sets.
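
To make the behavior described above concrete, here is a toy model of it. It is a sketch under stated assumptions, not the controller's actual code: the limit of 5 comes from the comment above, but the per-runner counter, the `RunnerSlot` type, and the rule that failed runners still count toward the desired total (which would match what was observed earlier in this thread) are illustrative guesses.

```python
from dataclasses import dataclass

# Taken from the comment above: 5 pod failures are treated as a signal.
FAILURE_LIMIT = 5


@dataclass
class RunnerSlot:
    """Hypothetical stand-in for one runner the controller keeps track of."""

    name: str
    pod_failures: int = 0

    @property
    def failed(self) -> bool:
        # Assumption: failures are counted per runner, not per scale set.
        return self.pod_failures >= FAILURE_LIMIT


def record_pod_failure(slot: RunnerSlot) -> bool:
    """Count one pod failure; return True while the pod should be recreated."""
    slot.pod_failures += 1
    return not slot.failed


def runners_to_create(desired: int, slots: list[RunnerSlot]) -> int:
    """Assumption: failed slots keep occupying a place until they are deleted."""
    return max(0, desired - len(slots))
```

In this model, three slots that have each reached the limit leave `runners_to_create(3, slots)` at 0, so nothing new starts until the failed slots are removed, which is the stuck state reported in this issue.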

Since this is already tracked by #2721, I will close this issue.
