Warning Event "Readiness probe failed: HTTP probe failed with statuscode: 500" occurred during upgrade #12401

Open
dbbDylan opened this issue Nov 22, 2024 · 8 comments
Labels
kind/support Categorizes issue or PR as a support question. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@dbbDylan

What happened:

  • Background: investigating zero-downtime upgrades of ingress-nginx-controller.
  • Strategy: I ran helm upgrade --reuse-values to perform the upgrade.

The upgrade goes smoothly if no requests are sent while it runs. However, when I use Grafana K6 to send a steady stream of HTTPS requests, errors occur at the moment the new controller pod is fully initialized and the old pod begins to terminate. The problem lasts only briefly, but it is consistently reproducible.

Here is the warning event:
readiness-probe-failed

And here is the K6 test log:

$ sh run.sh

         /\      Grafana   /‾‾/
    /\  /  \     |\  __   /  /
   /  \/    \    | |/ /  /   ‾‾\
  /          \   |   (  |  (‾)  |
 / __________ \  |_|\_\  \_____/

     execution: local
        script: script.js
        output: -

     scenarios: (100.00%) 1 scenario, 1024 max VUs, 2m30s max duration (incl. graceful stop):
              * default: 1024 looping VUs for 2m0s (gracefulStop: 30s)
      
WARN[0087] Request Failed                                error="Post \"http://my-hostname/v1/tests/post\": EOF"                                                                                                                          
WARN[0087] Request Failed                                error="Post \"http://my-hostname/v1/tests/post\": read tcp 10.59.89.82:59064->10.47.104.129:80: wsarecv: An existing connection was forcibly closed by the remote host."        
WARN[0087] Request Failed                                error="Post \"http://my-hostname/v1/tests/post\": EOF"                                                                                                                                    
WARN[0087] Request Failed                                error="Post \"http://my-hostname/v1/tests/post\": read tcp 10.59.89.82:59082->10.47.104.129:80: wsarecv: An existing connection was forcibly closed by the remote host."                                                                                                                          
WARN[0087] Request Failed                                error="Post \"http://my-hostname/v1/tests/post\": EOF"                                                                                                                          

     data_received..................: 37 MB 295 kB/s
     data_sent......................: 12 MB 93 kB/s
     http_req_blocked...............: avg=23.5ms   min=0s       med=0s    max=731.46ms p(90)=0s    p(95)=510.49µs
     http_req_connecting............: avg=14.79ms  min=0s       med=0s    max=343.54ms p(90)=0s    p(95)=0s
     http_req_duration..............: avg=2.81s    min=3.12ms   med=2.8s  max=10.18s   p(90)=4.82s p(95)=5.07s
       { expected_response:true }...: avg=2.81s    min=313.71ms med=2.81s max=10.18s   p(90)=4.83s p(95)=5.07s
     http_req_failed................: 0.26% 117 out of 43956
     http_req_receiving.............: avg=468.21µs min=0s       med=0s    max=14.93ms  p(90)=987µs p(95)=2.21ms
     http_req_sending...............: avg=21.26µs  min=0s       med=0s    max=8.52ms   p(90)=0s    p(95)=0s
     http_req_tls_handshaking.......: avg=0s       min=0s       med=0s    max=0s       p(90)=0s    p(95)=0s
     http_req_waiting...............: avg=2.81s    min=3.12ms   med=2.8s  max=10.18s   p(90)=4.82s p(95)=5.07s
     http_reqs......................: 43956 350.979203/s
     iteration_duration.............: avg=2.83s    min=13.56ms  med=2.82s max=10.18s   p(90)=4.85s p(95)=5.09s
     iterations.....................: 43956 350.979203/s
     vus............................: 10    min=10           max=1024
     vus_max........................: 1024  min=1024         max=1024

                                                                                                                                                                                                                                                      
running (2m05.2s), 0000/1024 VUs, 43956 complete and 0 interrupted iterations                                                                                                                                                                         
default ✓ [======================================] 1024 VUs  2m0s

During this period I see numerous empty responses, and there are no error logs in the ingress-nginx-controller pod. However, a TCP connection that was established beforehand remains uninterrupted (tested with the command telnet ${my-tcp-service} ${port}).

So I want to confirm: does the upgrade itself cause a short-lived service interruption of the ingress-nginx-controller?

What you expected to happen:

No warnings should occur during the upgrade, and every request should be handled, whether or not the returned status code is 200.

NGINX Ingress controller version (exec into the pod and run /nginx-ingress-controller --version): v1.11.2 & v1.11.3

Kubernetes version (use kubectl version):

Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.5

Environment:

  • Cloud provider or hardware configuration: The clusters are managed through Gardener, so I have no permission to check this.

  • OS (e.g. from /etc/os-release): linux-amd64

  • Kernel (e.g. uname -a):

  • Install tools:

    • Please mention how/where was the cluster created like kubeadm/kops/minikube/kind etc.
  • Basic cluster related info:

    • kubectl get nodes -o wide
    $ kubectl get nodes -o wide
    NAME                                                       STATUS   ROLES    AGE   VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE              KERNEL-VERSION       CONTAINER-RUNTIME
    shoot--gtlcdevqa--dylan-test-worker-tt7i7-z1-6db57-25mdw   Ready    <none>   88m   v1.30.5   10.180.0.213   <none>        Garden Linux 1592.3   6.6.62-cloud-amd64   containerd://1.7.20
    shoot--gtlcdevqa--dylan-test-worker-tt7i7-z1-6db57-mn8zv   Ready    <none>   89m   v1.30.5   10.180.0.187   <none>        Garden Linux 1592.3   6.6.62-cloud-amd64   containerd://1.7.20
  • How was the ingress-nginx-controller installed:

    • If helm was used then please show output of helm ls -A | grep -i ingress
    $ helm ls -A | grep -i ingress
    ingress-nginx           ingress-nginx           28              2024-11-18 16:34:27.1373854 +0800 CST   deployed    ingress-nginx-4.11.3            1.11.3
    • If helm was used then please show output of helm -n <ingresscontrollernamespace> get values <helmreleasename>
    $ helm -n ingress-nginx get values ingress-nginx
    USER-SUPPLIED VALUES:
    controller:
      allowSnippetAnnotations: true
      config:
        client-body-timeout: "360"
        proxy-body-size: 1024m
        proxy-buffer-size: 16k
        proxy-connect-timeout: "30"
        proxy-read-timeout: "3600"
        proxy-send-timeout: "900"
        proxy-set-headers: ingress-nginx/custom-headers
      extraArgs:
        configmap: $(POD_NAMESPACE)/ingress-nginx-controller
        controller-class: k8s.io/ingress-nginx
        default-ssl-certificate: ingress-nginx/gtlconlycert
        enable-ssl-passthrough: "true"
        ingress-class: nginx
        publish-service: $(POD_NAMESPACE)/ingress-nginx-controller
        tcp-services-configmap: $(POD_NAMESPACE)/ingress-nginx-tcp
        validating-webhook: :8443
        validating-webhook-certificate: /usr/local/certificates/cert
        validating-webhook-key: /usr/local/certificates/key
        watch-ingress-without-class: "true"
      metrics:
        enabled: true
        service:
          annotations:
            prometheus.io/port: "10254"
            prometheus.io/scrape: "true"
        serviceMonitor:
          enabled: true
          namespace: kube-prometheus-stack
          scrapeInterval: 500ms
    tcp:
      "31080": prod/blackduck-report:1081
  • Current State of the controller:

    • kubectl describe ingressclasses
    Name:         nginx
    Labels:       app.kubernetes.io/component=controller
                  app.kubernetes.io/instance=ingress-nginx
                  app.kubernetes.io/managed-by=Helm
                  app.kubernetes.io/name=ingress-nginx
                  app.kubernetes.io/part-of=ingress-nginx
                  app.kubernetes.io/version=1.11.3
                  helm.sh/chart=ingress-nginx-4.11.3
    Annotations:  meta.helm.sh/release-name: ingress-nginx
                  meta.helm.sh/release-namespace: ingress-nginx
    Controller:   k8s.io/ingress-nginx
    Events:       <none>
    • kubectl -n <ingresscontrollernamespace> get all -A -o wide
    $ kubectl -n ingress-nginx get all -o wide
    NAME                                            READY   STATUS    RESTARTS   AGE     IP            NODE                                                       NOMINATED NODE   READINESS GATES
    pod/ingress-nginx-controller-67fbb67c7b-tpfpt   1/1     Running   0          3d22h   100.64.1.23   shoot--gtlcdevqa--dylan-test-worker-tt7i7-z1-6db57-25mdw   <none>           <none>
    
    NAME                                         TYPE           CLUSTER-IP       EXTERNAL-IP     PORT(S)                                      AGE   SELECTOR
    service/ingress-nginx-controller             LoadBalancer   100.111.24.47    10.47.104.129   80:31686/TCP,443:32033/TCP,31080:31568/TCP   25d   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx
    service/ingress-nginx-controller-admission   ClusterIP      100.106.5.80     <none>          443/TCP                                      25d   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx
    service/ingress-nginx-controller-metrics     ClusterIP      100.110.133.77   <none>          10254/TCP                                    14d   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx
    
    NAME                                       READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS   IMAGES                                                                                                                     SELECTOR
    deployment.apps/ingress-nginx-controller   1/1     1            1           25d   controller   registry.k8s.io/ingress-nginx/controller:v1.11.3@sha256:d56f135b6462cfc476447cfe564b83a45e8bb7da2774963b00d12161112270b7   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx
    
    NAME                                                  DESIRED   CURRENT   READY   AGE    CONTAINERS   IMAGES                                                                                                                     SELECTOR
    replicaset.apps/ingress-nginx-controller-56bcbbf9bc   0         0         0       4d1h   controller   registry.k8s.io/ingress-nginx/controller:v1.11.2@sha256:d5f8217feeac4887cb1ed21f27c2674e58be06bd8f5184cacea2a69abaf78dce   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx,pod-template-hash=56bcbbf9bc
    replicaset.apps/ingress-nginx-controller-67fbb67c7b   1         1         1       4d1h   controller   registry.k8s.io/ingress-nginx/controller:v1.11.3@sha256:d56f135b6462cfc476447cfe564b83a45e8bb7da2774963b00d12161112270b7   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx,pod-template-hash=67fbb67c7b
    • kubectl -n <ingresscontrollernamespace> describe po <ingresscontrollerpodname>
    $ kubectl describe po -n ingress-nginx ingress-nginx-controller-67fbb67c7b-tpfpt
    Name:             ingress-nginx-controller-67fbb67c7b-tpfpt
    Namespace:        ingress-nginx
    Priority:         0
    Service Account:  ingress-nginx
    Node:             shoot--gtlcdevqa--dylan-test-worker-tt7i7-z1-6db57-25mdw/10.180.0.213
    Start Time:       Fri, 22 Nov 2024 16:11:19 +0800
    Labels:           app.kubernetes.io/component=controller
                      app.kubernetes.io/instance=ingress-nginx
                      app.kubernetes.io/managed-by=Helm
                      app.kubernetes.io/name=ingress-nginx
                      app.kubernetes.io/part-of=ingress-nginx
                      app.kubernetes.io/version=1.11.3
                      helm.sh/chart=ingress-nginx-4.11.3
                      pod-template-hash=67fbb67c7b
    Annotations:      cni.projectcalico.org/containerID: 6b2b57de91e25a2c7dbdac5dc865f7c3c09ae62b4b1a1269a1eb4c3070328020
                      cni.projectcalico.org/podIP: 100.64.1.23/32
                      cni.projectcalico.org/podIPs: 100.64.1.23/32
    Status:           Running
    IP:               100.64.1.23
    IPs:
      IP:           100.64.1.23
    Controlled By:  ReplicaSet/ingress-nginx-controller-67fbb67c7b
    Containers:
      controller:
        Container ID:    containerd://cd4e18fc7e76caaabc2fed13acd26af7fef665f2e01a645503c3d8661a091831
        Image:           registry.k8s.io/ingress-nginx/controller:v1.11.3@sha256:d56f135b6462cfc476447cfe564b83a45e8bb7da2774963b00d12161112270b7
        Image ID:        registry.k8s.io/ingress-nginx/controller@sha256:d56f135b6462cfc476447cfe564b83a45e8bb7da2774963b00d12161112270b7
        Ports:           80/TCP, 443/TCP, 10254/TCP, 8443/TCP, 31080/TCP
        Host Ports:      0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
        SeccompProfile:  RuntimeDefault
        Args:
          /nginx-ingress-controller
          --publish-service=$(POD_NAMESPACE)/ingress-nginx-controller
          --election-id=ingress-nginx-leader
          --controller-class=k8s.io/ingress-nginx
          --ingress-class=nginx
          --configmap=$(POD_NAMESPACE)/ingress-nginx-controller
          --tcp-services-configmap=$(POD_NAMESPACE)/ingress-nginx-tcp
          --validating-webhook=:8443
          --validating-webhook-certificate=/usr/local/certificates/cert
          --validating-webhook-key=/usr/local/certificates/key
          --configmap=$(POD_NAMESPACE)/ingress-nginx-controller
          --controller-class=k8s.io/ingress-nginx
          --default-ssl-certificate=ingress-nginx/gtlconlycert
          --enable-ssl-passthrough=true
          --ingress-class=nginx
          --publish-service=$(POD_NAMESPACE)/ingress-nginx-controller
          --tcp-services-configmap=$(POD_NAMESPACE)/ingress-nginx-tcp
          --validating-webhook=:8443
          --validating-webhook-certificate=/usr/local/certificates/cert
          --validating-webhook-key=/usr/local/certificates/key
          --watch-ingress-without-class=true
        State:          Running
          Started:      Fri, 22 Nov 2024 16:13:05 +0800
        Ready:          True
        Restart Count:  0
        Requests:
          cpu:      100m
          memory:   90Mi
        Liveness:   http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=5        
        Readiness:  http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3        
        Environment:
          POD_NAME:                 ingress-nginx-controller-67fbb67c7b-tpfpt (v1:metadata.name)
          POD_NAMESPACE:            ingress-nginx (v1:metadata.namespace)
          LD_PRELOAD:               /usr/local/lib/libmimalloc.so
          KUBERNETES_SERVICE_HOST:  api.dylan-test.gtlcdevqa.internal.canary.k8s.ondemand.com
        Mounts:
          /usr/local/certificates/ from webhook-cert (ro)
          /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-d2v66 (ro)
    Conditions:
      Type                        Status
      PodReadyToStartContainers   True
      Initialized                 True
      Ready                       True
      ContainersReady             True
      PodScheduled                True
    Volumes:
      webhook-cert:
        Type:        Secret (a volume populated by a Secret)
        SecretName:  ingress-nginx-admission
        Optional:    false
      kube-api-access-d2v66:
        Type:                    Projected (a volume that contains injected data from multiple sources)
        TokenExpirationSeconds:  3607
        ConfigMapName:           kube-root-ca.crt
        ConfigMapOptional:       <nil>
        DownwardAPI:             true
    QoS Class:                   Burstable
    Node-Selectors:              kubernetes.io/os=linux
    Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
    Events:                      <none>
    • kubectl -n <ingresscontrollernamespace> describe svc <ingresscontrollerservicename>
    $ kubectl -n ingress-nginx describe svc ingress-nginx-controller
    Name:                     ingress-nginx-controller
    Namespace:                ingress-nginx
    Labels:                   app.kubernetes.io/component=controller
                              app.kubernetes.io/instance=ingress-nginx
                              app.kubernetes.io/managed-by=Helm
                              app.kubernetes.io/name=ingress-nginx
                              app.kubernetes.io/part-of=ingress-nginx
                              app.kubernetes.io/version=1.11.3
                              helm.sh/chart=ingress-nginx-4.11.3
    Annotations:              loadbalancer.openstack.org/load-balancer-address: 10.47.104.129
                              loadbalancer.openstack.org/load-balancer-id: 54ef842a-05c0-482a-b3bf-255012af91d8 
                              meta.helm.sh/release-name: ingress-nginx
                              meta.helm.sh/release-namespace: ingress-nginx
    Selector:                 app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx
    Type:                     LoadBalancer
    IP Family Policy:         SingleStack
    IP Families:              IPv4
    IP:                       100.111.24.47
    IPs:                      100.111.24.47
    LoadBalancer Ingress:     10.47.104.129
    Port:                     http  80/TCP
    TargetPort:               http/TCP
    NodePort:                 http  31686/TCP
    Endpoints:                100.64.1.23:80
    Port:                     https  443/TCP
    TargetPort:               https/TCP
    NodePort:                 https  32033/TCP
    Endpoints:                100.64.1.23:443
    Port:                     31080-tcp  31080/TCP
    TargetPort:               31080-tcp/TCP
    NodePort:                 31080-tcp  31568/TCP
    Endpoints:                100.64.1.23:31080
    Session Affinity:         None
    External Traffic Policy:  Cluster
    Events:                   <none>
  • Current state of ingress object, if applicable:

    • kubectl -n <appnamespace> get all,ing -o wide
    $ kubectl -n web-service get ingress -owide
    NAME                      CLASS    HOSTS                      ADDRESS         PORTS   AGE
    web-service-gin-ingress   <none>   my-host   10.47.104.129   80      8d
    • kubectl -n <appnamespace> describe ing <ingressname>
    $ kubectl describe ingress web-service-gin-ingress -n web-service 
    Name:             web-service-gin-ingress
    Labels:           <none>
    Namespace:        web-service
    Address:          10.47.104.129
    Ingress Class:    <none>
    Default backend:  <default>
    Rules:
      Host                      Path  Backends
      ----                      ----  --------
      my-host
                                /   web-service-gin-service:8080 (100.64.1.4:8080,100.64.1.5:8080,100.64.1.6:8080)
    Annotations:                nginx.ingress.kubernetes.io/configuration-snippet: more_set_headers "X-Ingress-Pod-Name: $HOSTNAME";
    Events:                     <none>
    • If applicable, then, your complete and exact curl/grpcurl command (redacted if required) and the response to the curl/grpcurl command with the -v flag
$ GUID=1
$ DATETIME=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
$ curl -X POST "http://my-host/v1/tests/post" -H "Content-Type: application/json" -d "{
    \"id\": \"$GUID\", 
    \"create_time\": \"$DATETIME\",
    \"sleep_time_ms\": 10
}"
{"id":"1","ingress_pod_name_form":"ingress-nginx-controller-67fbb67c7b-tpfpt","create_time":"2024-11-22T09:58:05Z","receive_time":"2024-11-22T09:58:52.425334756Z","finish_time":"2024-11-22T09:58:52.435498011Z","consume_sec":0.010163236}
$ curl -vX POST "http://my-host/v1/tests/post" -H "Content-Type: application/json" -d "{   
    \"id\": \"$GUID\",
    \"create_time\": \"$DATETIME\",
    \"sleep_time_ms\": 10
}"
Note: Unnecessary use of -X or --request, POST is already inferred.
* Host my-host:80 was resolved.
* IPv6: (none)
* IPv4: 10.47.104.129
*   Trying 10.47.104.129:80...
* Connected to dylan-test.gtlc.only.sap (10.47.104.129) port 80
* using HTTP/1.x
> POST /v1/tests/post HTTP/1.1
> Host: dylan-test.gtlc.only.sap
> User-Agent: curl/8.10.1
> Accept: */*
> Content-Type: application/json
> Content-Length: 87
>
* upload completely sent off: 87 bytes
< HTTP/1.1 200 OK
< Date: Fri, 22 Nov 2024 10:00:18 GMT
< Content-Type: application/json; charset=utf-8
< Content-Length: 236
< Connection: keep-alive
< Access-Control-Allow-Credentials: true
< Access-Control-Allow-Headers: Content-Type, Content-Length, Accept-Encoding, X-CSRF-Token, Authorization, accept, origin, Cache-Control, X-Requested-With
< Access-Control-Allow-Methods: POST, OPTIONS, GET, PUT, DELETE
< Access-Control-Allow-Origin: *
< X-Ingress-Pod-Name-From: ingress-nginx-controller-67fbb67c7b-tpfpt
< X-Ingress-Pod-Name: ingress-nginx-controller-67fbb67c7b-tpfpt
<
{"id":"1","ingress_pod_name_form":"ingress-nginx-controller-67fbb67c7b-tpfpt","create_time":"2024-11-22T09:58:05Z","receive_time":"2024-11-22T10:00:18.300485598Z","finish_time":"2024-11-22T10:00:18.310760665Z","consume_sec":0.010275065}* Connection #0 to host my-host left intact
  • Others:
    • Any other related information like ;
      • copy/paste of the snippet (if applicable)
      • kubectl describe ... of any custom configmap(s) created and in use
      • Any other related information that may help

How to reproduce this issue:

To reproduce it, you need only one web service (any pod that can receive HTTP requests will do). Then you can use this K6 script:

import http from 'k6/http';
import { uuidv4 } from 'https://jslib.k6.io/k6-utils/1.4.0/index.js';

export const options = {
  vus: 1024,
  duration: '120s',
};

function getFormattedDateTimeNow() {
  const now = new Date();
  const isoString = now.toISOString();

  return isoString;
}

function formattedResponseOutput(res) {
  const status = res.status;
  const statusText = res.status_text;
  const to = res.headers['X-Ingress-Pod-Name'];
  const from = res.headers['X-Ingress-Pod-Name-From'];

  if (res.status != 200) {
    console.log(`[${from}] --> [${to}] : { Status: ${status}, Status Text: ${statusText} }`);
  } else {
    console.log(`[${from}] --> [${to}] : { Status: ${status}, ResponseBody: ${res.body} }`);
  }
}

export default function () {
  const url = 'http://my-host/v1/tests/post';
  // Upper bound (ms) for the random processing delay requested from the test service.
  const sleep_upper_limit_ms = 5000;

  const payload = JSON.stringify({
    "id": uuidv4(),
    "create_time": getFormattedDateTimeNow(),
    "sleep_time_ms": Math.floor(Math.random() * (sleep_upper_limit_ms + 1)),
  });

  const params = {
    headers: {
      'Content-Type': 'application/json',
    },
  };

  const res = http.post(url, payload, params);
  formattedResponseOutput(res);
}
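
The script above logs only `res.status` and `res.status_text` for non-200 responses, so a dropped connection is hard to tell apart from an HTTP error in the output. A minimal sketch of a helper that separates the two cases follows. `classifyResponse` is a hypothetical name, not part of the original script, and the sketch assumes k6 reports `status` 0 when the request failed before any HTTP response arrived (as with the EOF and connection-reset warnings in the log).

```javascript
// Hypothetical helper (not part of the original script): label a k6 response
// so transport failures can be told apart from HTTP error statuses.
// Assumption: k6 sets status to 0 when no HTTP response was received,
// for example on EOF or a connection reset.
function classifyResponse(res) {
  if (res.status === 0) {
    return 'transport-failure';
  }
  if (res.status >= 500) {
    return 'server-error';
  }
  if (res.status >= 400) {
    return 'client-error';
  }
  return 'ok';
}
```

Inside `formattedResponseOutput`, the label could be logged next to `res.error`, which should make it clearer whether the failures during the upgrade are closed connections or responses generated by nginx.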

Anything else we need to know:

You can use my test image, which is implemented in Go: image: doublebiao/web-service-gin:v1.0-beta

@dbbDylan dbbDylan added the kind/bug Categorizes issue or PR as related to a bug. label Nov 22, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Nov 22, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@longwuyuan
Contributor

/remove-kind bug
/kind support

You have only one controller pod, so yes, you will get a brief disruption during the upgrade.

You can experiment with more than one replica and with values such as minAvailable.

@k8s-ci-robot k8s-ci-robot added kind/support Categorizes issue or PR as a support question. and removed kind/bug Categorizes issue or PR as related to a bug. labels Nov 22, 2024
@dbbDylan
Author

Thanks for your support, @longwuyuan!

Following your suggestion, I changed my value-specific.yaml:

+    replicaCount: 2

But the same error still occurred when the old pod switched to Terminating:

image

I also tried adding sleep 15 before executing wait-shutdown, but that did not work either.

@longwuyuan
Contributor

Those are not the only values. Please explore others.
Each use case is specific. For example, I made more than one suggestion, but your response says you tried only one of them. You could increase the replicas to maybe 3 and set minAvailable to 1 (https://kubernetes.io/docs/tasks/run-application/configure-pdb/). This keeps at least 1 pod available for new connections.

If it's about graceful draining of established connections, then please look at the other config options, such as timers. There is no well-documented use case with the controller for this. Each user finds their most suitable config by trial and error.

@dbbDylan
Author

dbbDylan commented Nov 25, 2024

I've tried several configurations:

  • replicaCount: 2 and minAvailable: 1
  • replicaCount: 3 and minAvailable: 1
  • replicaCount: 3 and minAvailable: 2
  • replicaCount: 1 and minAvailable: 1 and preStop: ["/bin/sh", "-c", "sleep 15s && /wait-shutdown"]

None of them works.

However, I have found that all the errors come from the old pod while it is executing the “wait-shutdown” script. The old pod still receives requests while the controller is shutting down and before nginx terminates, which is not what I expected:

image

So I don't think it's a configuration issue, but rather a brief service interruption during graceful termination. In my opinion, the expected process should be something like:

  1. Graceful termination started.
  2. Network traffic changed.
  3. (Old pod stops receiving requests) Nginx service stopped.
  4. Old pod deleted.

But the current behavior can't guarantee that the second step happens before the third step. Could you double-check it?
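
The ordering concern above can be written as a small timing model. This is only a sketch under stated assumptions, not measured behavior: `unsafeWindowMs` and both parameter names are hypothetical, and the model looks only at new connections, ignoring keep-alive connections that were opened before termination started.

```javascript
// Simplified timing model (an assumption, not measured behaviour).
// Both times are milliseconds after the old pod is marked Terminating.
//   preStopSleepMs        - delay before nginx begins shutting down (step 3)
//   endpointPropagationMs - time until the load balancer and kube-proxy
//                           stop routing new connections to the old pod (step 2)
// Returns how long new connections can still be routed to a pod whose nginx
// has already begun shutting down; 0 means step 2 finished before step 3.
function unsafeWindowMs(preStopSleepMs, endpointPropagationMs) {
  return Math.max(0, endpointPropagationMs - preStopSleepMs);
}
```

For example, with the 15-second preStop sleep tried in this thread and an assumed (not measured) propagation delay of 20 seconds, `unsafeWindowMs(15000, 20000)` evaluates to 5000, so requests could still reach the stopping pod for about 5 seconds. The actual propagation delay in this cluster is unknown.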

Thanks again for your strong support.

@dbbDylan
Author

func (srv *Server) ListenAndServe() error {
	if srv.shuttingDown() {
		return ErrServerClosed  // the fatal error
	}
	addr := srv.Addr
	if addr == "" {
		addr = ":http"
	}
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return err
	}
	return srv.Serve(ln)
}

@dbbDylan
Author

More information:

Once a pod transitions from Running to Terminating, the Endpoint associated with the ingress-nginx-controller Service should already have completed its IP change. Therefore, I suspect that the issue might lie not with the ingress-nginx-controller itself, but with the way the k6 load testing tool handles connections. Could you help me confirm this hypothesis?
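
One way to probe this hypothesis is to run the same load with keep-alive connection reuse turned off. The sketch below builds the options object for both variants. `buildOptions` is a hypothetical helper name; `noConnectionReuse` is the k6 option that disables keep-alive connections. If the failures disappear when every request opens a fresh connection, that would point toward reused connections being closed by the terminating pod, although this is an assumption to be tested rather than a confirmed cause.

```javascript
// Hypothetical helper: build k6 options for the same load as the original
// script, with or without keep-alive connection reuse.
function buildOptions(reuseConnections) {
  return {
    vus: 1024,
    duration: '120s',
    // k6 option: when true, each request uses a new TCP connection.
    noConnectionReuse: !reuseConnections,
  };
}

// In the k6 script this would replace the existing options export:
//   export const options = buildOptions(false);
```

Comparing `http_req_failed` between `buildOptions(true)` and `buildOptions(false)` runs during an upgrade should show whether connection reuse matters here.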


This is stale, but we won't close it automatically. Just bear in mind that the maintainers may be busy with other tasks and will get to your issue ASAP. If you have any questions or a request to prioritize this, please reach out in #ingress-nginx-dev on Kubernetes Slack.

@github-actions github-actions bot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Jan 21, 2025