Autoscaler slow memory leak #15624
Hi @DavidR91, could you show more about the pod status (
Could you provide the logs of the autoscaler? You can enable profiling as follows. On one terminal:
On another terminal:
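A minimal sketch of such a two-terminal workflow, assuming profiling is toggled through the config-observability ConfigMap and the component's default profiling port 8008 (namespace and resource names may differ per install):

```sh
# Allow runtime profiling data to be served by Knative components
# (assumes a standard knative-serving install).
kubectl -n knative-serving patch configmap config-observability \
  --type merge -p '{"data":{"profiling.enable":"true"}}'

# Terminal 1: forward the autoscaler's profiling port.
kubectl -n knative-serving port-forward deploy/autoscaler 8008:8008

# Terminal 2: fetch a heap profile showing memory currently held.
go tool pprof -inuse_space http://localhost:8008/debug/pprof/heap
```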
We are just using Istio's gateways; we don't actually use the sidecar or any sidecar injection at all (we just have VirtualServices pointing at the Knative gateway with rewritten authority etc. for each service), so I don't think that one is connected.

Getting debug logs is a bit more work, so I will follow up with those, but I have managed to enable profiling and get the pprof dumps. I've attached two dumps taken only a few minutes apart; by the latter the memory use had grown by 1-2%. These were taken while the system was under load, but the autoscaler was already at ~93% memory vs. its limit, so it's very close to OOMing:

pprof.autoscaler.alloc_objects.alloc_space.inuse_objects.inuse_space.002.pb.gz

plus a PNG version of the first dump for convenience.

I notice there is a lot of exporter/metric stuff in here, and we do have Knative configured to send to OTel via OpenCensus. Is it enough of a presence in these dumps to suggest that the OTel integration is the cause?
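For anyone following along, a minimal sketch of inspecting such a dump locally (these are standard `go tool pprof` flags, nothing Knative-specific; the PNG output assumes Graphviz is installed, and `heap.png` is just a placeholder name):

```sh
# Summarize the largest in-use allocations in the attached dump.
go tool pprof -top -inuse_space \
  pprof.autoscaler.alloc_objects.alloc_space.inuse_objects.inuse_space.002.pb.gz

# Render a call-graph image similar to the attached PNG (requires Graphviz).
go tool pprof -png \
  pprof.autoscaler.alloc_objects.alloc_space.inuse_objects.inuse_space.002.pb.gz > heap.png
```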
It does not seem to be so, even if it uses a lot of the allocated memory. I did a diff (
The biggest increase is ~40MB at the streamwatcher. Could you take multiple snapshots and also check the diff during no load? Maybe it is related to this kubernetes/kubernetes#103789 (comment)? Do you have a lot of pods coming up during load times (the autoscaler has a filtered informer for service pods)?
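A sketch of that snapshot-and-diff workflow, assuming the profiling port-forward above is still running (`-diff_base` is a standard pprof flag):

```sh
# Take two heap snapshots some time apart, ideally during a quiet period.
curl -s http://localhost:8008/debug/pprof/heap > heap-quiet-1.pb.gz
sleep 1800   # wait ~30 minutes
curl -s http://localhost:8008/debug/pprof/heap > heap-quiet-2.pb.gz

# Show only what grew between the two snapshots.
go tool pprof -top -diff_base heap-quiet-1.pb.gz heap-quiet-2.pb.gz
```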
Btw, the default resync period is ~10h; see https://github.com/knative/serving/blob/main/vendor/knative.dev/pkg/controller/controller.go#L54.
What version of Knative?
Expected Behavior
The autoscaler is able to GC etc. and avoid OOMing.
Actual Behavior
There is a visible leak occurring every ~10 hours in the autoscaler in our environment. This creates a constant upward trend in memory use.
Although there is some attempt to GC and reduce this as the memory limit is reached, it's never quite enough, and it does eventually OOM and restart.
About our environment:
We set GOMEMLIMIT to 1.7GiB to see if this helps keep it under control, but it has no effect (it stays alive longer, but it does still eventually OOM).
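As an illustration, one way such a limit can be applied to the autoscaler container (a hypothetical sketch; the mechanism and value are placeholders, and GOMEMLIMIT expects an integer byte count with an optional suffix such as MiB or GiB):

```sh
# Hypothetical: set the Go soft memory limit a bit below the container's
# memory limit so the runtime GCs more aggressively near the ceiling.
kubectl -n knative-serving set env deploy/autoscaler GOMEMLIMIT=1740MiB
```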
Steps to Reproduce the Problem
What would be useful to repro/diagnose this? Is the minimum a debug-level log from the autoscaler over the ~10 hours where the issue occurs?