Temporary failure in name resolution while running tasks using KubernetesExecutor #22319

Siddharthk · 2020-10-09T16:08:14Z


Apache Airflow version: airflow:1.10.10.1-alpha2-python3.6

Kubernetes version (if you are using kubernetes) (use kubectl version): 1.16.8

Environment:

Cloud provider or hardware configuration: AWS EKS
OS (e.g. from /etc/os-release): Redhat
Install tools: Official Helm Chart

What happened:
Tasks are throwing the error below:

Traceback (most recent call last):
  File "/home/airflow/.local/bin/airflow", line 25, in <module>
    from airflow.configuration import conf
  File "/home/airflow/.local/lib/python3.6/site-packages/airflow/__init__.py", line 31, in <module>
    from airflow.utils.log.logging_mixin import LoggingMixin
  File "/home/airflow/.local/lib/python3.6/site-packages/airflow/utils/__init__.py", line 24, in <module>
    from .decorators import apply_defaults as _apply_defaults
  File "/home/airflow/.local/lib/python3.6/site-packages/airflow/utils/decorators.py", line 36, in <module>
    from airflow import settings
  File "/home/airflow/.local/lib/python3.6/site-packages/airflow/settings.py", line 121, in <module>
    prefix=conf.get('scheduler', 'statsd_prefix'))
  File "/home/airflow/.local/lib/python3.6/site-packages/statsd/client/udp.py", line 35, in __init__
    host, port, fam, socket.SOCK_DGRAM)[0]
  File "/usr/local/lib/python3.6/socket.py", line 745, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -3] Temporary failure in name resolution

When I clear the tasks and run them again, they work. The logs do not indicate where the issue occurred.

myapp-1a78cff297294ecf9bf625ae084d4f30                      0/1     Error       0          65m <-- initial trigger
myapp-61752820f4ce4f3a9b7d718740eeaaac                      0/1     Completed   0          58m <-- after manually clearing task

What you expected to happen:

All tasks should run when they are initially triggered.

Any help is appreciated.

rodrigoechaide · 2021-08-03T13:32:29Z


Hi @Siddharthk, are you still facing this issue? I am seeing the same problem when running a DAG with 500 parallel tasks as part of some stress tests on Airflow. Each task takes an iterator parameter, and by changing it I can control how long the task runs. Task duration does not seem to matter: I have hit the issue with tasks lasting anywhere from a few seconds to more than 20 minutes. I am using KubernetesExecutor, and when I list the pods with kubectl I get this:

k get pods -n airflow | grep Error
performancetest500tasksinparallel20taskperformancetest500tasksd.04c772dbda6c47b79b017c90b73055af   0/1     Error       0          8m27s
performancetest500tasksinparallel20taskperformancetest500tasksd.05269b668c8043c7b7ac32c0e06ce2bc   0/1     Error       0          6m31s
performancetest500tasksinparallel20taskperformancetest500tasksd.0819b03e3fda475abfd3893dc7598ffb   0/1     Error       0          8m56s
performancetest500tasksinparallel20taskperformancetest500tasksd.09f9fa1367194deabead2b7d6de72c83   0/1     Error       0          8m2s
performancetest500tasksinparallel20taskperformancetest500tasksd.0c61b81e7dfc4d17846c89d78eefac0c   0/1     Error       0          5m59s
performancetest500tasksinparallel20taskperformancetest500tasksd.0d0b39ea912a48c898d13b5392c0ee7e   0/1     Error       0          8m41s
performancetest500tasksinparallel20taskperformancetest500tasksd.0d1e17539b934616a0f72a05b530d88e   0/1     Error       0          8m33s
performancetest500tasksinparallel20taskperformancetest500tasksd.12e3fd2a030340589e251c987652c61e   0/1     Error       0          9m16s
performancetest500tasksinparallel20taskperformancetest500tasksd.1312a64638e34ee488d5f8839a29c0e6   0/1     Error       0          7m25s
performancetest500tasksinparallel20taskperformancetest500tasksd.1508cf02371d4dff8c925a3855a60911   0/1     Error       0          7m31s
performancetest500tasksinparallel20taskperformancetest500tasksd.1d3c9140a24e42c29fe5def938832759   0/1     Error       0          7m17s
performancetest500tasksinparallel20taskperformancetest500tasksd.1e5cee28a93b4f62bc1c06d1bb6ed785   0/1     Error       0          8m30s
performancetest500tasksinparallel20taskperformancetest500tasksd.214e5df400c24764b9104e5e324dc314   0/1     Error       0          8m55s
performancetest500tasksinparallel20taskperformancetest500tasksd.272b9e6502ce49078c68741731aa8144   0/1     Error       0          7m39s
performancetest500tasksinparallel20taskperformancetest500tasksd.2840867f20a34a4fae6ad71ff1ef2803   0/1     Error       0          6m3s
performancetest500tasksinparallel20taskperformancetest500tasksd.2aca869d190d4a17a60653788d73e090   0/1     Error       0          7m22s
performancetest500tasksinparallel20taskperformancetest500tasksd.2d6f588cba464f2c9aec0f75eff105a5   0/1     Error       0          6m32s
performancetest500tasksinparallel20taskperformancetest500tasksd.31513adf9a4d4faa910b8eeedf53b960   0/1     Error       0          8m48s
performancetest500tasksinparallel20taskperformancetest500tasksd.3600857bd1784617b4322ec304924870   0/1     Error       0          8m58s
performancetest500tasksinparallel20taskperformancetest500tasksd.3659ef7cbcb345e99ba557e6ca6b881d   0/1     Error       0          9m1s

When I check the logs of one of the tasks, I get this error:

k logs performancetest500tasksinparallel20taskperformancetest500tasksd.6426f08f727c4f15b2c041ce98f163d5 -n airflow
[2021-08-03 12:44:59,468] {cli_action_loggers.py:105} WARNING - Failed to log action with (psycopg2.OperationalError) could not translate host name "qa-airflow.carnijjbfa3r.eu-west-1.rds.amazonaws.com" to address: Temporary failure in name resolution

(Background on this error at: http://sqlalche.me/e/13/e3q8)
[2021-08-03 12:44:59,469] {dagbag.py:496} INFO - Filling up the DagBag from /opt/airflow/dags/git/performance_test_500_tasks_in_parallel_2_0.py
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2336, in _wrap_pool_connect
    return fn()
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 364, in connect
    return _ConnectionFairy._checkout(self)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 778, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 495, in checkout
    rec = pool._do_get()
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/impl.py", line 241, in _do_get
    return self._create_connection()
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 309, in _create_connection
    return _ConnectionRecord(self)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 440, in __init__
    self.__connect(first_connect_check=True)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 661, in __connect
    pool.logger.debug("Error on connect(): %s", e)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
    compat.raise_(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
    raise exception
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 656, in __connect
    connection = pool._invoke_creator(self)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/strategies.py", line 114, in connect
    return dialect.connect(*cargs, **cparams)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 508, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/usr/local/lib/python3.9/site-packages/psycopg2/__init__.py", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: could not translate host name "qa-airflow.carnijjbfa3r.eu-west-1.rds.amazonaws.com" to address: Temporary failure in name resolution


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/airflow", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/site-packages/airflow/__main__.py", line 40, in main
    args.func(args)
  File "/usr/local/lib/python3.9/site-packages/airflow/cli/cli_parser.py", line 48, in command
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/airflow/utils/cli.py", line 91, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/airflow/cli/commands/task_command.py", line 227, in task_run
    ti.refresh_from_db()
  File "/usr/local/lib/python3.9/site-packages/airflow/utils/session.py", line 70, in wrapper
    return func(*args, session=session, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/airflow/models/taskinstance.py", line 625, in refresh_from_db
    ti = qry.first()
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/query.py", line 3429, in first
    ret = list(self[0:1])
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/query.py", line 3203, in __getitem__
    return list(res)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/query.py", line 3535, in __iter__
    return self._execute_and_instances(context)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/query.py", line 3556, in _execute_and_instances
    conn = self._get_bind_args(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/query.py", line 3571, in _get_bind_args
    return fn(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/query.py", line 3550, in _connection_from_session
    conn = self.session.connection(**kw)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 1142, in connection
    return self._connection_for_bind(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 1150, in _connection_for_bind
    return self.transaction._connection_for_bind(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 433, in _connection_for_bind
    conn = bind._contextual_connect()
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2302, in _contextual_connect
    self._wrap_pool_connect(self.pool.connect, None),
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2339, in _wrap_pool_connect
    Connection._handle_dbapi_exception_noconnection(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1583, in _handle_dbapi_exception_noconnection
    util.raise_(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
    raise exception
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2336, in _wrap_pool_connect
    return fn()
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 364, in connect
    return _ConnectionFairy._checkout(self)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 778, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 495, in checkout
    rec = pool._do_get()
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/impl.py", line 241, in _do_get
    return self._create_connection()
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 309, in _create_connection
    return _ConnectionRecord(self)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 440, in __init__
    self.__connect(first_connect_check=True)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 661, in __connect
    pool.logger.debug("Error on connect(): %s", e)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
    compat.raise_(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
    raise exception
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 656, in __connect
    connection = pool._invoke_creator(self)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/strategies.py", line 114, in connect
    return dialect.connect(*cargs, **cparams)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 508, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/usr/local/lib/python3.9/site-packages/psycopg2/__init__.py", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not translate host name "qa-airflow.carnijjbfa3r.eu-west-1.rds.amazonaws.com" to address: Temporary failure in name resolution

(Background on this error at: http://sqlalche.me/e/13/e3q8)

These are some of the configuration variables of my Airflow cluster:

  AIRFLOW_HOME: "/opt/airflow"
  AIRFLOW__CORE__DAGS_FOLDER: "/opt/airflow/dags/git"
  AIRFLOW__LOGGING__BASE_LOG_FOLDER: "/opt/airflow/logs"
  AIRFLOW__LOGGING__LOGGING_LEVEL: "INFO" # DEBUG, INFO, WARNING, ERROR or CRITICAL.
  AIRFLOW__LOGGING__FAB_LOGGING_LEVEL: "WARNING"
  AIRFLOW__LOGGING__LOG_FILENAME_TEMPLATE: "{{ ti.dag_id }}/{{ ti.task_id }}/{{ ts }}/{{ try_number }}.log"
  AIRFLOW__LOGGING__LOG_FORMAT: "%(message)s"
  AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL: "60"
  AIRFLOW__CORE__DAG_CONCURRENCY: "500"
  AIRFLOW__CORE__PARALLELISM: "500"
  AIRFLOW__CORE__SQL_ALCHEMY_POOL_SIZE: "0"
  AIRFLOW__CORE__EXECUTOR: "KubernetesExecutor"
  AIRFLOW__API__AUTH_BACKEND: "airflow.api.auth.backend.default"
  AIRFLOW__CORE__LOAD_EXAMPLES: "False"
  AIRFLOW__KUBERNETES__WORKER_CONTAINER_TAG: "1.1"
  AIRFLOW__KUBERNETES__DELETE_WORKER_PODS: "True"
  AIRFLOW__KUBERNETES__NAMESPACE: "airflow"
  AIRFLOW__KUBERNETES__WORKER_PODS_CREATION_BATCH_SIZE: "1" 
  AIRFLOW__KUBERNETES__POD_TEMPLATE_FILE: "/opt/airflow/template.yaml"
  AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION: "False"

Besides that config, I have set the default_pool size to 500 slots so that 500 tasks can run in parallel.
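For readers less familiar with the variable names listed above: Airflow reads configuration overrides from environment variables of the form AIRFLOW__{SECTION}__{KEY}, with double underscores as separators. The sketch below only illustrates that naming rule; `parse_airflow_env` is a hypothetical helper, not part of Airflow.

```python
def parse_airflow_env(name):
    """Split an AIRFLOW__{SECTION}__{KEY} variable name into (section, key).

    Hypothetical helper for illustration only. Returns None for names that
    do not follow the pattern (for example AIRFLOW_HOME, single underscore).
    """
    prefix = "AIRFLOW__"
    if not name.startswith(prefix):
        return None
    # Everything before the first double underscore is the section.
    section, sep, key = name[len(prefix):].partition("__")
    if not sep or not section or not key:
        return None
    return section.lower(), key.lower()


# Names taken from the configuration listed in this comment.
for var in ("AIRFLOW__CORE__PARALLELISM",
            "AIRFLOW__KUBERNETES__WORKER_PODS_CREATION_BATCH_SIZE",
            "AIRFLOW_HOME"):
    print(var, "->", parse_airflow_env(var))
```

Read this way, the two 500-valued settings both live in the [core] section, while the pod creation batch size belongs to [kubernetes].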


uranusjr · 2021-08-03T13:36:53Z

Collaborator

You may find this StackOverflow entry interesting: https://stackoverflow.com/questions/56188537

From what I can gather, this tends to happen when a lot of traffic is generated inside Kubernetes very quickly: the cluster DNS is overwhelmed and cannot respond to lookups fast enough, so you need to tweak your Kubernetes setup to make it work. This is probably not something Airflow has control over.
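As a client-side stopgap while the cluster DNS is being tuned, a lookup can be retried when the resolver reports a temporary failure. The following is a minimal sketch, not something Airflow provides: `resolve_with_retry` and its parameters are hypothetical, and it retries only on `socket.EAI_AGAIN`, the error code behind "Temporary failure in name resolution" (the `[Errno -3]` seen in the traceback on Linux).

```python
import random
import socket
import time


def resolve_with_retry(host, port, attempts=5, base_delay=0.2,
                       resolver=socket.getaddrinfo, sleep=time.sleep):
    """Resolve host:port, retrying only on temporary DNS failures.

    Hypothetical helper. `resolver` and `sleep` are injectable so the
    retry logic can be exercised without a real DNS server.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return resolver(host, port)
        except socket.gaierror as exc:
            # Re-raise permanent errors, and the last temporary one.
            if exc.errno != socket.EAI_AGAIN or attempt == attempts - 1:
                raise
            # Exponential backoff with jitter, so that hundreds of pods
            # started together do not all retry at the same instant.
            sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

A call such as `resolve_with_retry("my-db-host", 5432)` (host name is a placeholder) would then tolerate a few dropped lookups. This only masks the symptom; an overloaded DNS service still needs to be fixed on the Kubernetes side.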


rodrigoechaide · 2021-08-03T13:52:30Z


Thanks, @uranusjr. I will check that entry.


lewismc · 2022-07-13T19:12:10Z


We are using a very old K8s distribution and are experiencing this issue:

kubectl version --output=yaml
clientVersion:
  buildDate: "2022-05-03T13:46:05Z"
  compiler: gc
  gitCommit: 4ce5a8954017644c5420bae81d72b09b735c21f0
  gitTreeState: clean
  gitVersion: v1.24.0
  goVersion: go1.18.1
  major: "1"
  minor: "24"
  platform: darwin/amd64
kustomizeVersion: v4.5.4
serverVersion:
  buildDate: "2019-04-08T18:22:05Z"
  compiler: gc
  gitCommit: 16236ce91790d4c75b79f6ce96841db1c843e7d2
  gitTreeState: dirty
  gitVersion: v1.11.9-dirty
  goVersion: go1.10.8
  major: "1"
  minor: 11+
  platform: linux/amd64

We use the official Apache Airflow Helm Chart and have pgbouncer activated with default arguments, as per the documentation.

We are in the process of migrating all of our K8s infrastructure to AWS EKS. Once that is done, I will document whether the newer K8s distribution fixes this issue.


MaxKavun · 2022-09-08T12:31:40Z


It's due to the kube-dns component. Usually you should see a log warning that the health check response took more than 1 second, which means kube-dns is overwhelmed. I suggest either scaling the kube-dns service manually or setting up autoscaling for it based on some metric.
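One metric that is commonly used for this is cluster size. As far as I know, the Kubernetes cluster-proportional-autoscaler in its linear mode sizes the DNS deployment as max(ceil(cores / coresPerReplica), ceil(nodes / nodesPerReplica)). Below is a rough sketch of that arithmetic only; `dns_replicas` is a hypothetical helper, and the default ratios are illustrative assumptions rather than values from this thread.

```python
import math


def dns_replicas(cores, nodes, cores_per_replica=256,
                 nodes_per_replica=16, minimum=1):
    """Estimate DNS replicas from cluster size (linear scaling rule).

    Hypothetical helper; the default ratios are assumptions and should
    be tuned for the actual lookup load of the cluster.
    """
    by_cores = math.ceil(cores / cores_per_replica)
    by_nodes = math.ceil(nodes / nodes_per_replica)
    return max(minimum, by_cores, by_nodes)


# Example: a 20-node cluster with 64 cores in total.
print(dns_replicas(cores=64, nodes=20))
```

A burst of several hundred short-lived worker pods can overload DNS even on a small cluster, so a size-based rule may need lower ratios or a higher minimum than the defaults assumed here.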


lewismc · 2022-10-13T20:51:51Z


Thanks for the insight, @MaxKavun.

