failing e2e test jobs after ControlPlaneKubeletLocalMode enabled by default #3154

Closed · neolit123 opened this issue Jan 29, 2025 · 10 comments
Assignees: chrischdi
Labels: area/feature-gates, kind/failing-test, priority/critical-urgent
Milestone: v1.33

neolit123 (Member) commented Jan 29, 2025

i suspect it's

because the other PR after it is cosmetic (a klog change)

failing jobs:

https://storage.googleapis.com/kubernetes-ci-logs/logs/ci-kubernetes-e2e-kubeadm-kinder-dryrun-latest/1884502182036770816/build-log.txt

[etcd] Would wait for the new etcd member to join the cluster
[kubelet-check] Waiting for a healthy kubelet at http://127.0.0.1:10248/healthz. This can take up to 4m0s
[kubelet-check] The kubelet is not healthy after 4m0.001133273s

Unfortunately, an error has occurred:
	The HTTP call equal to 'curl -sSL http://127.0.0.1:10248/healthz' returned error: Get "http://127.0.0.1:10248/healthz": dial tcp 127.0.0.1:10248: connect: connection refused

https://storage.googleapis.com/kubernetes-ci-logs/logs/ci-kubernetes-e2e-kubeadm-kinder-external-ca-latest/1884261091186315264/build-log.txt

I0129 09:33:28.655624     245 loader.go:402] Config loaded from file:  /etc/kubernetes/kubelet.conf
I0129 09:33:28.656811     245 loader.go:402] Config loaded from file:  /etc/kubernetes/kubelet.conf
I0129 09:33:28.657129     245 kubelet.go:337] [kubelet-start] preserving the crisocket information for the node
I0129 09:33:28.657219     245 patchnode.go:32] [patchnode] Uploading the CRI socket "unix:///run/containerd/containerd.sock" to Node "kinder-external-ca-control-plane-2" as an annotation
...
I0128 15:30:34.908571     219 round_trippers.go:632] "Response" verb="GET" url="https://172.17.0.5:6443/api/v1/nodes/kinder-external-ca-worker-1?timeout=10s" status="" milliseconds=0
I0128 15:30:35.408661     219 round_trippers.go:632] "Response" verb="GET" url="https://172.17.0.5:6443/api/v1/nodes/kinder-external-ca-worker-1?timeout=10s" status="" milliseconds=0
Get "https://172.17.0.5:6443/api/v1/nodes/kinder-external-ca-worker-1?timeout=10s": dial tcp 172.17.0.5:6443: connect: connection refused
error writing CRISocket for this node
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/join.runKubeletWaitBootstrapPhase
	k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/join/kubelet.go:339
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
	k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:261
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
	k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:450
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run

both cases need investigation. in one case it seems it's not reaching the kubelet, and in the other the apiserver.
these don't seem like flakes, as they failed consistently N times. these jobs are a bit uncommon, i.e. they do custom actions like dry-run/external CA.

the regular job is green:

cc @chrischdi

neolit123 added the priority/critical-urgent, kind/failing-test, and area/feature-gates labels on Jan 29, 2025
neolit123 added this to the v1.33 milestone on Jan 29, 2025
neolit123 (Member, Author) commented Jan 29, 2025

@chrischdi and the dedicated fg=false job also started failing, oddly:
https://testgrid.k8s.io/sig-cluster-lifecycle-kubeadm#kubeadm-kinder-control-plane-kubelet-local-mode-latest

edit: actually this one is clearer. this task needs an update:


# task-09-post-upgrade
/bin/bash -c set -x

IP_ADDRESS="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' kinder-control-plane-local-kubelet-mode-lb)"

CMD="docker exec kinder-control-plane-local-kubelet-mode-control-plane-1"
${CMD} grep "server: https://${IP_ADDRESS}:6443" /etc/kubernetes/kubelet.conf || exit 1

CMD="docker exec kinder-control-plane-local-kubelet-mode-control-plane-2"
${CMD} grep "server: https://${IP_ADDRESS}:6443" /etc/kubernetes/kubelet.conf || exit 1

CMD="docker exec kinder-control-plane-local-kubelet-mode-control-plane-3"
${CMD} grep "server: https://${IP_ADDRESS}:6443" /etc/kubernetes/kubelet.conf || exit 1

# Ensure exit status of 0
exit 0


++ docker inspect '--format={{ .NetworkSettings.IPAddress }}' kinder-control-plane-local-kubelet-mode-lb
+ IP_ADDRESS=172.17.0.7
+ CMD='docker exec kinder-control-plane-local-kubelet-mode-control-plane-1'
+ docker exec kinder-control-plane-local-kubelet-mode-control-plane-1 grep 'server: https://172.17.0.7:6443' /etc/kubernetes/kubelet.conf
+ exit 1
 exit status 1
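
for context, dropping the IP filter in the failing check shows what the kubelet.conf on that node actually contains; a minimal sketch assuming the same container names as above (the printed address is illustrative, not taken from this job):

# same container as in task-09-post-upgrade, grep only for the server field
docker exec kinder-control-plane-local-kubelet-mode-control-plane-1 \
  grep 'server:' /etc/kubernetes/kubelet.conf
# with ControlPlaneKubeletLocalMode enabled this now prints the node's own
# address instead of the load balancer IP, e.g.
#     server: https://172.17.0.3:6443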

neolit123 (Member, Author) commented Jan 29, 2025

https://storage.googleapis.com/kubernetes-ci-logs/logs/ci-kubernetes-e2e-kubeadm-kinder-external-ca-latest/1884261091186315264/build-log.txt

I0129 09:33:28.655624     245 loader.go:402] Config loaded from file:  /etc/kubernetes/kubelet.conf
I0129 09:33:28.656811     245 loader.go:402] Config loaded from file:  /etc/kubernetes/kubelet.conf
I0129 09:33:28.657129     245 kubelet.go:337] [kubelet-start] preserving the crisocket information for the node
I0129 09:33:28.657219     245 patchnode.go:32] [patchnode] Uploading the CRI socket "unix:///run/containerd/containerd.sock" to Node "kinder-external-ca-control-plane-2" as an annotation
...
I0128 15:30:34.908571     219 round_trippers.go:632] "Response" verb="GET" url="https://172.17.0.5:6443/api/v1/nodes/kinder-external-ca-worker-1?timeout=10s" status="" milliseconds=0
I0128 15:30:35.408661     219 round_trippers.go:632] "Response" verb="GET" url="https://172.17.0.5:6443/api/v1/nodes/kinder-external-ca-worker-1?timeout=10s" status="" milliseconds=0
Get "https://172.17.0.5:6443/api/v1/nodes/kinder-external-ca-worker-1?timeout=10s": dial tcp 172.17.0.5:6443: connect: connection refused
error writing CRISocket for this node
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/join.runKubeletWaitBootstrapPhase
	k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/join/kubelet.go:339
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
	k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:261
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
	k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:450
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run

external ca calls a kinder action setup-external-ca
https://github.com/kubernetes/kubeadm/blob/main/kinder/ci/workflows/external-ca-tasks.yaml#L56

it needs to be updated because it uses a naive approach that generates the same kubelet.conf on both worker and CP nodes:
https://github.com/kubernetes/kubeadm/blob/main/kinder/pkg/cluster/manager/actions/setup-external-ca.go#L111

without that, the kubelet.conf will point to a non-existent local apiserver on worker nodes; instead it should point to the LB.
the culprit is "kubeadm init phase kubeconfig kubelet --control-plane-endpoint=%s --v=%d", where the CPE should be the LB.

i don't think there is a bigger issue here, i.e. we don't need to patch k/k.

edit: hmm, but --control-plane-endpoint=%s is already the LB IP according to the kinder source, yet the file ends up with 172.17.0.5, which is the worker IP, and there is no apiserver there on port 6443.

neolit123 (Member, Author) commented Jan 29, 2025

tested locally.

sudo kubeadm init phase certs ca
sudo kubeadm init phase kubeconfig all --control-plane-endpoint=foo.bar --v=5
sudo cat /etc/kubernetes/kubelet.conf | grep server
    server: https://192.168.0.101:6443

so that's a regression. we need to think about how the kubelet local mode will continue to respect the user-provided ClusterConfiguration.controlPlaneEndpoint or flag.
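
for completeness, the config-file equivalent of that flag (a minimal illustrative ClusterConfiguration; foo.bar is just the example endpoint from above):

cat <<EOF > kubeadm-config.yaml
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
controlPlaneEndpoint: "foo.bar:6443"
EOF
sudo kubeadm init phase kubeconfig kubelet --config kubeadm-config.yaml --v=5
# with the feature gate enabled, the resulting kubelet.conf still points to the
# local address rather than foo.bar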

i will send a revert PR for

until we fix all these issues.

edit: here it is:

neolit123 (Member, Author) commented Jan 29, 2025

https://storage.googleapis.com/kubernetes-ci-logs/logs/ci-kubernetes-e2e-kubeadm-kinder-external-ca-latest/1884261091186315264/build-log.txt

I0129 09:33:28.655624     245 loader.go:402] Config loaded from file:  /etc/kubernetes/kubelet.conf
I0129 09:33:28.656811     245 loader.go:402] Config loaded from file:  /etc/kubernetes/kubelet.conf
I0129 09:33:28.657129     245 kubelet.go:337] [kubelet-start] preserving the crisocket information for the node
I0129 09:33:28.657219     245 patchnode.go:32] [patchnode] Uploading the CRI socket "unix:///run/containerd/containerd.sock" to Node "kinder-external-ca-control-plane-2" as an annotation
...
I0128 15:30:34.908571     219 round_trippers.go:632] "Response" verb="GET" url="https://172.17.0.5:6443/api/v1/nodes/kinder-external-ca-worker-1?timeout=10s" status="" milliseconds=0
I0128 15:30:35.408661     219 round_trippers.go:632] "Response" verb="GET" url="https://172.17.0.5:6443/api/v1/nodes/kinder-external-ca-worker-1?timeout=10s" status="" milliseconds=0
Get "https://172.17.0.5:6443/api/v1/nodes/kinder-external-ca-worker-1?timeout=10s": dial tcp 172.17.0.5:6443: connect: connection refused
error writing CRISocket for this node
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/join.runKubeletWaitBootstrapPhase
	k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/join/kubelet.go:339
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
	k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:261
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
	k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:450
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run

this issue seems to be that runKubeletWaitBootstrapPhase assumes there is a real kubelet running
https://github.com/kubernetes/kubernetes/blob/3bc8f01c74e80cb85e6f3813db1b410adba22bfe/cmd/kubeadm/app/cmd/phases/join/kubelet.go#L285
yet, during a join dry-run, one is never started
https://github.com/kubernetes/kubernetes/blob/3bc8f01c74e80cb85e6f3813db1b410adba22bfe/cmd/kubeadm/app/cmd/phases/join/kubelet.go#L258

perhaps we should wrap the waiting:

if dryrun {
  // print a "would wait for the kubelet" message and return
} else {
  // wait for the kubelet as before
}

chrischdi (Member) commented:

I'm planning to take a look at this next week.

/assign

chrischdi (Member) commented:

Trying to iterate on the three issues which I call:

  1. kinder dry-run
  2. kinder external-ca
  3. kinder fg-disabled is failing

1. kinder dry-run

That is easily fixable and needs to be done in k/k.

With the feature-gate disabled and dry-run in effect, we:

With the feature-gate enabled we run runKubeletWaitBootstrapPhase directly, so I propose adding an early return to that function too.

Example fix: chrischdi/kubernetes@65839db

2. kinder external-ca

tested locally.

sudo kubeadm init phase certs ca
sudo kubeadm init phase kubeconfig all --control-plane-endpoint=foo.bar --v=5
sudo cat /etc/kubernetes/kubelet.conf | grep server
    server: https://192.168.0.101:6443

so that's a regression. we need to think about how the kubelet local mode will continue to respect the user-provided ClusterConfiguration.controlPlaneEndpoint or flag.


I'm not sure if this is a regression or rather the intended outcome of the feature gate.
In this example, kube-scheduler and kube-controller-manager also do not point to foo.bar:

$ kubeadm init phase certs ca
$ kubeadm init phase kubeconfig all --control-plane-endpoint=foo.bar --v=5
$ cat /etc/kubernetes/controller-manager.conf | grep server
    server: https://172.17.0.3:6443
$ cat /etc/kubernetes/scheduler.conf | grep server
    server: https://172.17.0.3:6443
$ cat /etc/kubernetes/admin.conf | grep server
    server: https://foo.bar:6443

3. kinder fg-disabled is failing

I'm now taking a look into this.

neolit123 (Member, Author) commented:

Example fix: chrischdi/kubernetes@65839db

makes sense.

I'm not sure if this is a regression or rather the intended outcome of the feature gate.
In this example, kube-scheduler and kube-controller-manager also do not point to foo.bar.

historically the kcm and scheduler kubeconfigs have been hardcoded to point to the local IP.
the admin.conf and kubelet.conf, however, received their server field value from the CPE passed by the user.
so given we are now planning to hardcode the kubelet.conf to also point to the local IP, that breaks users who assumed that calling the kubelet kubeconfig phase of init on demand would give them a kubeconfig with a CPE server.

it seems to me this breaking change is inevitable, but it should be mentioned in the release notes of the graduation PR.

one place where this breaks is kinder, like i mentioned earlier. so for the external CA workflow to pass, this must be fixed here:
https://github.com/kubernetes/kubeadm/blob/main/kinder/pkg/cluster/manager/actions/setup-external-ca.go#L111
(have two types of kubeconfigs: point to the local IP on CP nodes and point to the LB on worker nodes; a rough sketch follows below)
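
purely illustrative and not the actual setup-external-ca.go change; LB_IP is a placeholder for the load balancer address the action already knows:

# on control-plane nodes: keep the locally generated kubelet.conf, which with
# the feature gate now points at the node's own apiserver
kubeadm init phase kubeconfig kubelet --v=5

# on worker nodes: generate the file, then point it at the load balancer,
# because there is no local apiserver to talk to
kubeadm init phase kubeconfig kubelet --v=5
sed -i "s|server: https://.*:6443|server: https://${LB_IP}:6443|" /etc/kubernetes/kubelet.conf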

chrischdi (Member) commented Feb 4, 2025

neolit123 (Member, Author) commented:

thanks for the fixes
