CDI Upload Server pod gets terminated by OOM Killer on Talos v1.9.0 #3575

Open
IAMSolaara opened this issue Dec 28, 2024 · 23 comments
@IAMSolaara

What happened:
Using the command
k virt image-upload dv oi-hipster-gui-20240426-iso --image-path=./OI-hipster-gui-20240426.iso --size 3Gi --volume-mode filesystem --access-mode ReadWriteOnce --force-bind --uploadproxy-url=https://172.16.8.132:443/ --insecure
any upload gets terminated, returning
unexpected return value 502, error in upload-proxy: http: proxy error: write tcp 10.244.0.168:33826->10.99.16.191:443: write: connection reset by peer.

Looking into Talos' dashboard I saw that the cdi-upload-server container was killed by oom_reaper.

I get this every single time I try to upload this image.

What you expected to happen:
I expected the transfer to complete successfully.

How to reproduce it (as minimally and precisely as possible):

  1. Issue the command I mentioned.
  2. The transfer starts and gets interrupted with the above-mentioned error.

Additional context:
The machine has more than enough RAM available to do the job (or at least I think so):
image

I have tried exposing the upload proxy both as a LoadBalancer and through a TLSRoute (since I initially thought the gateway controller was acting up and not OOM).

I have a dump of the error message:
dmesg-oom.txt

Environment:

  • CDI version (use kubectl get deployments cdi-deployment -o yaml): v1.61.0
  • Kubernetes version (use kubectl version): v1.29.11
  • DV specification: N/A
  • Cloud provider or hardware configuration: Intel Core i5-8600K, 24GB of DDR4 RAM, bare-metal Talos installation
  • OS (e.g. from /etc/os-release): Talos Linux v1.9.0
  • Kernel (e.g. uname -a): 6.12.5-talos
  • Install tools: I installed CDI following the guide at https://kubevirt.io/user-guide/storage/containerized_data_importer/#install-cdi
  • Others: N/A
@akalenyu
Collaborator

akalenyu commented Dec 29, 2024

Hey, any chance this setup is using cgroupsv1? With cgroupsv1, the host's total available memory gets taken into account, so overwhelming the pod limits was easy. Anyway, cgroupsv1 support is dropped by now from the k8s side IIRC.
If it's not, check out #3557 (comment).

@IAMSolaara
Author

I can confirm I'm not using cgroupsv1.

Talos defaults to always using the unified cgroup hierarchy (cgroupsv2), but cgroupsv1 can be forced with talos.unified_cgroup_hierarchy=0.

I don't have this flag set so I assume this is using v2.

As for the StorageClass, I'm using the Rancher hostPath CSI. I might try other ones like the openebs local pv CSI, though I doubt that changes anything.

I'll see if I can try applying that patch to Talos' kernel, so this might eventually get fixed in v1.10.0 whenever that comes out.

@akalenyu
Collaborator

Yeah, if you have the sources it should be easy enough to check whether the revert made it in; it's a super small change:
https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4523/commits

@akalenyu
Collaborator

Another question, what are the dirty_rate values for talos?
sudo sysctl -a | grep dirty
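
On a node without a `sysctl` binary, the same tunables can be read straight from procfs. A minimal sketch, assuming a POSIX shell; `print_dirty_sysctls` is a hypothetical helper name, not something from Talos or CDI, and the directory argument only exists so the helper can be pointed at a copy of the files:

```shell
# Print every vm.dirty_* tunable as "vm.<name> = <value>".
# Defaults to /proc/sys/vm; pass another directory to read a saved copy.
print_dirty_sysctls() {
    dir="${1:-/proc/sys/vm}"
    for f in "$dir"/dirty*; do
        [ -f "$f" ] || continue
        printf 'vm.%s = %s\n' "$(basename "$f")" "$(cat "$f")"
    done
}
```

Running `print_dirty_sysctls` with no argument on the node should list the same entries that `sysctl -a | grep dirty` would, as long as /proc is readable there.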

@IAMSolaara
Author

> Another question, what are the dirty_rate values for talos? sudo sysctl -a | grep dirty

Ran this command to check those values:

talosctl ls /proc/sys/vm -n 172.16.1.250 
	| parse '{node} {name}' | select name 
	| where {|it| $it.name | str contains 'dirty'} | str trim 
	| each {|it| insert value (talosctl cat /proc/sys/vm/($it.name) -n 172.16.1.250) } 

I got these values:
image

@akalenyu
Collaborator

> As for the StorageClass, I'm using the Rancher hostPath CSI

Usually I wouldn't expect that kernel issue with local storage on the node, but it's possible that the hostpath driver is configured to use a path that's backed by something else entirely. I know this is possible with https://github.com/kubevirt/hostpath-provisioner-operator.

Usually, the OOMs due to the kernel bug would occur with NFS/slow disks
https://lore.kernel.org/lkml/ZeEhvV15IWllPKvM@chrisdown.name/T/

@IAMSolaara
Author

IAMSolaara commented Dec 29, 2024

> Usually, the OOMs due to the kernel bug would occur with NFS/slow disks

I see. I'm using a kinda old SATA SSD as the disk backing that local-storage but I can try getting a localStorage SC that uses the NVMe boot drive, at least so I can get this to work.

Talos' build stuff doesn't seem to play nice with Podman and I'm currently not at home so fiddling around with that right now is a bit problematic 😅

@IAMSolaara
Author

Just tried using the NVMe and I still get issues. I'll see if I can get that kernel patch installed.

@akalenyu
Collaborator

> Just tried using the NVMe and I still get issues. I'll see if I can get that kernel patch installed.

Hmm interesting, didn't expect that. Did you set up rancher to use that disk?
BTW could you just talosctl ls /sys/fs/cgroup to make sure we're working with cgroupsv2

@IAMSolaara
Author

> > Just tried using the NVMe and I still get issues. I'll see if I can get that kernel patch installed.
>
> Hmm interesting, didn't expect that. Did you set up rancher to use that disk?

Yep, the files live on the NVMe and I could see disk activity reflect that:
image

> BTW could you just talosctl ls /sys/fs/cgroup to make sure we're working with cgroupsv2

Sure, here's the output:

talosctl ls /sys/fs/cgroup -n 172.16.1.250
NODE           NAME
172.16.1.250   .
172.16.1.250   cgroup.controllers
172.16.1.250   cgroup.max.depth
172.16.1.250   cgroup.max.descendants
172.16.1.250   cgroup.pressure
172.16.1.250   cgroup.procs
172.16.1.250   cgroup.stat
172.16.1.250   cgroup.subtree_control
172.16.1.250   cgroup.threads
172.16.1.250   cpu.pressure
172.16.1.250   cpu.stat
172.16.1.250   cpu.stat.local
172.16.1.250   cpuset.cpus.effective
172.16.1.250   cpuset.cpus.isolated
172.16.1.250   cpuset.mems.effective
172.16.1.250   init
172.16.1.250   io.pressure
172.16.1.250   io.stat
172.16.1.250   kubepods
172.16.1.250   memory.numa_stat
172.16.1.250   memory.pressure
172.16.1.250   memory.reclaim
172.16.1.250   memory.stat
172.16.1.250   podruntime
172.16.1.250   system
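
The entry that settles the question in this listing is `cgroup.controllers`: that file sits at the root of the unified (v2) hierarchy, whereas a v1 or hybrid setup mounts per-controller directories under /sys/fs/cgroup instead. A minimal sketch of that check, assuming a POSIX shell; `cgroup_version` is a hypothetical helper name:

```shell
# Report "v2" when the given directory looks like a unified cgroup root,
# "v1-or-hybrid" otherwise. The test is simply whether cgroup.controllers
# exists at the top level of the mount.
cgroup_version() {
    root="${1:-/sys/fs/cgroup}"
    if [ -f "$root/cgroup.controllers" ]; then
        echo v2
    else
        echo v1-or-hybrid
    fi
}
```

On a machine with a full userland, `stat -fc %T /sys/fs/cgroup` printing `cgroup2fs` is another common way to confirm the same thing.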

@akalenyu
Collaborator

Yeah definitely cgroupsv2. Can you reproduce this with dd and containers?
Something like

podman run -m 600m --mount type=bind,source=/mnt/oom-nfs,target=/disk --rm -it quay.io/centos/centos:stream9 bash
dd if=/dev/urandom of=/disk/2G.bin bs=32K count=65536 status=progress iflag=fullblock
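
For context on why this reproducer is sized the way it is: the dd invocation writes far more data through the page cache than the container is allowed to hold, so dirty pages have to be written back and reclaimed while the write is still running. A quick worked calculation, assuming podman's `-m 600m` means 600 MiB:

```shell
# Total bytes dd will write: block size (32 KiB) times block count.
bs=$((32 * 1024))
count=65536
total=$((bs * count))

# Memory limit of the container, assuming 600m is interpreted as 600 MiB.
limit=$((600 * 1024 * 1024))

echo "write size:   $total bytes"
echo "memory limit: $limit bytes"
echo "write is about $((total / limit))x the limit (integer division)"
```

The write works out to 2 GiB, a little over three times the limit.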

@IAMSolaara
Author

I don't have podman on Talos, but I put together a manifest that should be equivalent enough:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: example-pvc
spec:
  storageClassName: local-path-ephemeral
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 3Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dd-test-thingy
spec:
  selector:
    matchLabels:
      name: dd-test-thingy
  replicas: 1
  strategy:
    type: Recreate
    rollingUpdate: null
  template:
    metadata:
      labels:
        name: dd-test-thingy
    spec:
      containers:
        - name: thingy
          resources:
            limits:
              memory: 600M
          image: quay.io/centos/centos:stream9
          command: ["/bin/bash"]
          args: ["-c", "while true ;do sleep 50; done"] # I'll just kubectl exec into it and run dd so I can properly look at it do stuff
          volumeMounts:
            - mountPath: /datadir
              name: dd-test-thingy-nvme-vol
      volumes:
        - name: dd-test-thingy-nvme-vol
          persistentVolumeClaim:
            claimName: example-pvc

Will post results ASAP

@IAMSolaara
Author

Ok, dd also got killed by OOM:

[root@dd-test-thingy-799cd97974-zldtg /]# dd if=/dev/urandom of=/datadir/2G.bin bs=32K count=65536 status=progress iflag=fullblock
421593088 bytes (422 MB, 402 MiB) copied, 1 s, 422 MB/s
command terminated with exit code 137
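
Exit code 137 is 128 + 9, meaning the process was terminated by signal 9 (SIGKILL), which is the signal the OOM killer sends. A small sketch of that decoding, assuming a POSIX shell; `describe_exit` is a hypothetical helper name:

```shell
# Translate a shell exit status into the signal that caused it, if any.
# Statuses above 128 conventionally mean "killed by signal (status - 128)".
describe_exit() {
    status="$1"
    if [ "$status" -gt 128 ]; then
        sig=$((status - 128))
        echo "killed by signal $sig ($(kill -l "$sig"))"
    else
        echo "exited with status $status"
    fi
}

describe_exit 137   # prints: killed by signal 9 (KILL)
```

The exit code alone does not prove an OOM kill, since anything can send SIGKILL; the dmesg dump is what confirms it here.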

And here's the dmesg dump:
oom-2.txt

@akalenyu
Collaborator

Yup, looks exactly like that kernel bug. I am surprised it happens with storage that should be able to flush the data to disk quite quickly.
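
One way to see whether dirty page cache is what fills the container's memory is to watch the `file_dirty` and `file_writeback` counters in the cgroup v2 `memory.stat` file while dd runs. A minimal sketch, assuming a POSIX shell with awk available; `cgroup_dirty_stats` is a hypothetical helper name, and the default path assumes the container sees its own cgroup at /sys/fs/cgroup:

```shell
# Print the dirty and writeback page-cache byte counts from a cgroup v2
# memory.stat file, one "key value" pair per line.
cgroup_dirty_stats() {
    stat_file="${1:-/sys/fs/cgroup/memory.stat}"
    awk '$1 == "file_dirty" || $1 == "file_writeback" { print $1, $2 }' "$stat_file"
}
```

Calling it in a loop from a second `kubectl exec` session during the write, for example `while sleep 1; do cgroup_dirty_stats; done`, would show whether these counters climb toward the memory limit before the kill.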

@IAMSolaara
Author

I looked into it and it seems they released a new version of Talos running kernel 6.12.6.
It seems like they use a vanilla kernel? https://github.com/siderolabs/pkgs/blob/45c4ba4957b013015a5b1457162b1659a2149712/Pkgfile#L75-L78
I tried digging in and I can't quite tell whether they have that patch in...
I'm gonna try to update Talos and see if this is fixed.

@IAMSolaara
Author

Update done. Test passes and no OOM reapers in sight 🎉
image

Gonna try using CDI and report back but I think we're in the clear.

@IAMSolaara
Author

I guess I spoke too soon. Uploading an image still gives the same problem, both on SATA and the NVMe SSDs.

dmesg-191.txt

@akalenyu
Collaborator

> Update done. Test passes and no OOM reapers in sight 🎉 image
>
> Gonna try using CDI and report back but I think we're in the clear.

Are you sure it wasn't just a lucky pass? Or does it consistently not OOM?

@IAMSolaara
Author

I did a few passes and they consistently passed. Now I consistently get OOM'd...

I got a little too excited 😅

@akalenyu
Collaborator

> I did a few passes and they consistently passed. Now I consistently get OOM'd...
>
> I got a little too excited 😅

From a quick check, https://github.com/gregkh/linux does not have the revert commit that CentOS has.

@akalenyu
Collaborator

Huh, just bumped into a fresh 6.13 RC commit that probably tackles the same issue without a revert:
gregkh/linux@1bc542c

@IAMSolaara
Author

I'll look out for new Talos releases and report back in case anything changes. I will work around this using DataVolume's import feature for now.

@akalenyu
Collaborator

> I'll look out for new Talos releases and report back in case anything changes. I will work around this using DataVolume's import feature for now.

Yeah you can either do that, or, find a way to rate limit the upload (I am assuming it's blazing fast, so it being slow could work around the OOM)
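
This thread does not identify a rate-limit option for the upload itself, so the sketch below only illustrates the idea on a plain file copy, in the spirit of the dd reproducer: write in fixed-size chunks, flush each chunk to disk, and pause before the next one. It assumes a dd that supports `conv=fsync` (GNU dd does); `throttled_copy`, its defaults, and its argument order are all hypothetical, and whether this actually avoids the OOM on an affected kernel is untested.

```shell
# Copy src to dst in fixed-size chunks, flushing each chunk and pausing
# between chunks to limit how many dirty pages accumulate at once.
# Usage: throttled_copy SRC DST [CHUNK_BYTES] [PAUSE_SECONDS]
throttled_copy() {
    src="$1"; dst="$2"
    chunk_bytes="${3:-16777216}"   # assumed default: 16 MiB per chunk
    pause_secs="${4:-1}"           # assumed default: 1 s between chunks
    size=$(($(wc -c < "$src")))
    : > "$dst"                     # start from an empty destination
    i=0
    while [ $((i * chunk_bytes)) -lt "$size" ]; do
        # Copy chunk number i; conv=fsync flushes it to disk before dd
        # returns, and notrunc keeps the chunks already written.
        dd if="$src" of="$dst" bs="$chunk_bytes" skip="$i" seek="$i" count=1 \
            conv=notrunc,fsync 2>/dev/null
        sleep "$pause_secs"
        i=$((i + 1))
    done
}
```

With the defaults this caps throughput at roughly 16 MiB per second plus flush time, which is slow but keeps the amount of unflushed data per step small.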
