Temporary failure in name resolution while running tasks using KubernetesExecutor #22319
Replies: 6 comments
-
Hi @Siddharthk, are you still facing this issue? I am seeing the same error when running a DAG with 500 parallel tasks as part of a stress test on Airflow. Each task takes an iterator parameter, and by changing it I can modify how long the task runs. The duration does not seem to matter: I have hit the issue with tasks lasting anywhere from a few seconds to more than 20 minutes. I am using KubernetesExecutor, and when fetching the pods using kubectl I am getting this:
And when checking the logs of one of the tasks I am getting this error:
These are some of the configuration variables of my airflow cluster:
And besides that config, I have set up the |
-
You may find this StackOverflow entry interesting: https://stackoverflow.com/questions/56188537 From what I can gather, this tends to happen when a lot of requests are sent around in Kubernetes very quickly: the DNS service is overwhelmed and cannot respond fast enough, so you need to tune your Kubernetes setup to make it work. This is probably not something Airflow has control over.
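If tuning the cluster is not immediately possible, task code that opens its own connections could retry the lookup. This is only a sketch: `resolve_with_retry` and its parameters are hypothetical names, not part of Airflow, and it assumes a Unix-like system where `socket.EAI_AGAIN` is defined (that is the code behind "Temporary failure in name resolution").

```python
import socket
import time


def resolve_with_retry(host, port, attempts=5, base_delay=0.5,
                       resolver=socket.getaddrinfo, sleep=time.sleep):
    """Resolve host:port, retrying only on transient DNS failures.

    `resolver` and `sleep` are injectable so the helper can be tested
    without touching the network (hypothetical helper, not an Airflow API).
    """
    for attempt in range(attempts):
        try:
            return resolver(host, port)
        except socket.gaierror as exc:
            # The first argument of gaierror is the EAI_* error code.
            transient = exc.args[0] == socket.EAI_AGAIN
            if not transient or attempt == attempts - 1:
                raise
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** attempt)
```

Retrying only works around the symptom inside code you control; failures that happen while Airflow itself resolves the metadata database host would still need the DNS service to be fixed.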
-
Thanks, @uranusjr. I will check that entry.
-
We are using a very old K8s distribution and experience this issue
We use the official Apache Airflow Helm Chart and have pgbouncer activated with default arguments, as per the documentation. We are in the process of migrating all of our K8s infrastructure to AWS EKS. I will report whether the newer K8s distribution fixes this issue once we have completed that migration.
-
It's due to the kube-dns component. You should usually see a warning in its logs that a health check response took more than 1 second, which means kube-dns is overwhelmed. I suggest either scaling the kube-dns service manually or setting up autoscaling for it based on some metric.
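One way to pick a replica count is to size DNS proportionally to the cluster, which is roughly what the Kubernetes cluster-proportional-autoscaler does in its linear mode. A minimal sketch of that arithmetic follows; the function name and the default values are illustrative assumptions, not settings taken from this cluster.

```python
import math


def dns_replicas(cores, nodes, cores_per_replica=256, nodes_per_replica=16,
                 min_replicas=2, max_replicas=None):
    """Linear sizing: use the larger of the core-based and node-based counts.

    All defaults are illustrative assumptions; tune them for your cluster.
    """
    by_cores = math.ceil(cores / cores_per_replica)
    by_nodes = math.ceil(nodes / nodes_per_replica)
    # Never go below the floor, so a single pod is not a point of failure.
    replicas = max(by_cores, by_nodes, min_replicas)
    if max_replicas is not None:
        replicas = min(replicas, max_replicas)
    return replicas
```

For a stress test that launches hundreds of pods at once, the node count can jump quickly, so sizing for the peak rather than the idle cluster is probably the safer choice.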
-
Thanks for the insight, @MaxKavun.
-
Apache Airflow version: airflow:1.10.10.1-alpha2-python3.6
Kubernetes version (if you are using kubernetes) (use kubectl version): 1.16.8
Environment:
Cloud provider or hardware configuration: AWS EKS
OS (e.g. from /etc/os-release): Redhat
Install tools: Official Helm Chart
What happened:
Tasks are throwing the error below:
When I clear the tasks and run them again, they succeed. The logs do not give any indication of where the issue occurred.
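Since clearing and re-running succeeds, the failure looks transient, so automatic task retries may stand in for the manual clear until the underlying DNS problem is fixed. A minimal sketch, assuming the DAG is built with a `default_args` dict; the values are illustrative, not recommendations:

```python
from datetime import timedelta

# Retry settings passed to every task in the DAG via default_args.
# The keys are standard Airflow operator arguments; the values are
# illustrative assumptions.
default_args = {
    "retries": 3,                              # re-run a failed task up to 3 times
    "retry_delay": timedelta(minutes=1),       # wait before the first retry
    "retry_exponential_backoff": True,         # grow the wait on each retry
    "max_retry_delay": timedelta(minutes=10),  # cap the backoff
}
```

This only hides the symptom: if DNS stays overloaded for longer than the retry window, the tasks will still fail.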
What you expected to happen:
All tasks should run successfully when they are initially triggered.
Any help is appreciated.