Autoscaler slow memory leak #15624

Open
DavidR91 opened this issue Nov 21, 2024 · 4 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

DavidR91 commented Nov 21, 2024

What version of Knative?

1.16.0

Expected Behavior

The autoscaler is able to GC and otherwise keep memory in check, avoiding an OOM kill.

Actual Behavior

There is a visible leak in the autoscaler in our environment, showing up roughly every 10 hours. This creates a constant upward trend in memory use.

Although there is some attempt to GC and reduce usage as the memory limit is approached, it is never quite enough, and the pod does eventually OOM and restart.

[image: graph of the autoscaler's memory usage showing the constant upward trend and eventual OOM]

About our environment:

  • GKE Kubernetes 1.30.6
  • knative 1.16, istio 1.23.3 and net-istio 1.16
  • The autoscaler is given a request and limit of 2 CPU and 2Gi of memory (Guaranteed QoS)
  • The autoscaler is configured in HA mode: we have it scaled so there are 3 replicas running at all times
    • Notably, when the primary autoscaler OOMs we see a significant spike in request errors, because the restart seems to negatively affect the activator; this is the main reason we want to solve this
    • The leak only seems to affect the primary/leader
  • We typically have about 200-300 different knative services. Most of them have on average ~3 revisions at any one time
    • The graph above was captured while the cluster was almost entirely idle; for most of that period there were no service pods running at all
  • We've set GOMEMLIMIT to 1.7GiB to see if this helps keep it under control, but it has no real effect (the pod stays alive longer but still eventually OOMs); see the sketch after this list
  • Nothing in particular happens in our environment on a 10-hour cycle (we have jobs and service creation/deletion occurring on 24-hour cycles, typically at 8 AM and midnight)
  • The same issue was observed in knative 1.9.2
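
For reference, the GOMEMLIMIT experiment above is just an extra environment variable on the autoscaler Deployment; roughly the following (an illustrative command, not our exact manifest):

# Set a soft Go runtime memory limit on the autoscaler (illustrative; 1740MiB is roughly 1.7GiB)
kubectl -n knative-serving set env deployment/autoscaler GOMEMLIMIT=1740MiB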

Steps to Reproduce the Problem

What would be useful to repro/diagnose this? Would the minimum be a debug-level log from the autoscaler over the ~10 hours where the issue occurs?

DavidR91 added the kind/bug label on Nov 21, 2024

skonto commented Nov 26, 2024

Hi @DavidR91,

Could you show more about the pod status (kubectl describe pod ...)? What is the behavior of the istio sidecar?
In the past there was a similar issue that came from the istio side.

What would be useful to repro/diagnose this?

Could you provide the logs of the autoscaler?
Could you take a heap dump during the time that the issue occurs?

You can enable profiling as follows.

On one terminal:

cat <<EOF | kubectl apply -f -
apiVersion: v1
data:
  profiling.enable: "true"
kind: ConfigMap
metadata:
  name: config-observability
  namespace: knative-serving
EOF

kubectl port-forward <pod-name> -n knative-serving 8008:8008

On another terminal:
$ go tool pprof http://localhost:8008/debug/pprof/heap
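
If you prefer to keep the raw profiles around for attaching here or for diffing later, you can also fetch the endpoint directly while the port-forward is running; the file name below is arbitrary:

# Save a heap snapshot to a file for later comparison (arbitrary file name)
curl -s -o autoscaler-heap-001.pb.gz http://localhost:8008/debug/pprof/heap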

DavidR91 commented Nov 26, 2024

In the past there was #8761 that came from the istio side.

We are only using istio's gateways; we don't use the sidecar or any sidecar injection at all (we just have VirtualServices pointing at the knative gateway with a rewritten authority etc. for each service), so I don't think that issue is connected.
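
To illustrate the pattern (this is a sketch only; the names, hosts and gateway/service references below are the net-istio defaults plus made-up values, not our exact manifests), each service has something along these lines:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-service            # hypothetical service name
  namespace: my-namespace
spec:
  gateways:
    - knative-serving/knative-ingress-gateway   # the knative gateway installed by net-istio
  hosts:
    - my-service.example.com                    # made-up external host
  http:
    - rewrite:
        authority: my-service.my-namespace.svc.cluster.local   # the "rewritten authority"
      route:
        - destination:
            host: knative-local-gateway.istio-system.svc.cluster.local
            port:
              number: 80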

Getting debug logs is a bit more work, so I will follow up with those, but I have managed to enable profiling and get the pprof dumps.

I've attached two dumps taken only a few minutes apart; by the later one, memory use had grown by 1-2%. These were taken while the system was under load and the autoscaler was already at ~93% of its memory limit, so it was very close to OOMing.

pprof.autoscaler.alloc_objects.alloc_space.inuse_objects.inuse_space.002.pb.gz
pprof.autoscaler.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz

and a PNG version of the first dump for convenience:
[image: pprof graph rendering of the first heap dump]

I notice there is a lot of exporter/metric stuff here, and we do have Knative configured to send metrics to OTel via opencensus; is it enough of a presence in these dumps to suggest that OTel integration is the cause?
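
For context, the metrics side of that integration is just the standard config-observability opencensus backend, roughly as below (the collector address is illustrative):

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-observability
  namespace: knative-serving
data:
  metrics.backend-destination: opencensus
  metrics.opencensus-address: "otel-collector.observability.svc:55678"   # illustrative collector address
  metrics.request-metrics-backend-destination: opencensus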

skonto commented Nov 27, 2024

is it enough of a presence in these dumps to suggest that OTel integration is the cause?

It does not seem so, even though it uses a lot of the allocated memory. I did a diff (go tool pprof -base prof1 prof2) of the profiles you posted. Here is the output:

[image: go tool pprof diff output between the two heap dumps]
Same if you pass inuse_objects:

[image: the same diff with inuse_objects]

The biggest increase, ~40MB, is in the streamwatcher. Could you take multiple snapshots and check the diff during no-load periods as well? Maybe it is related to kubernetes/kubernetes#103789 (comment)? Do you have a lot of pods coming up during load times (the autoscaler has a filtered informer for service pods)?
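
Something along these lines is enough to capture a snapshot every 30 minutes while the port-forward is up (the interval and file naming are arbitrary), and any pair can then be compared with -base as above:

# Capture a heap snapshot every 30 minutes via the port-forwarded profiling endpoint
while true; do
  curl -s -o "heap-$(date +%Y%m%d-%H%M).pb.gz" http://localhost:8008/debug/pprof/heap
  sleep 1800
done

# Compare any two snapshots
go tool pprof -base heap-<earlier>.pb.gz heap-<later>.pb.gz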

skonto commented Nov 27, 2024

By the way, the default resync period is ~10h; see https://github.com/knative/serving/blob/main/vendor/knative.dev/pkg/controller/controller.go#L54.
Is your cluster a large one? Is it slow?
