fix: Clean up interfaceLockMap entries on endpoint deletion #1249

Open
wants to merge 11 commits into
base: main

Conversation

byte-msft
Contributor

Description

The packetParser was creating entries in interfaceLockMap for each new interface
but failing to remove them when interfaces were deleted. In environments with
high pod counts and frequent churn, this caused a memory leak as the map grew
indefinitely.

Related Issue

Potential memory leak in packetparser's interfaceLockMap #1236

Checklist

  • I have read the contributing documentation.
  • I signed and signed-off the commits (git commit -S -s ...). See this documentation on signing commits.
  • I have correctly attributed the author(s) of the code.
  • I have tested the changes locally.
  • I have followed the project's style guidelines.
  • I have updated the documentation, if necessary.
  • I have added tests, if applicable.

Screenshots (if applicable) or Testing Completed


Additional Notes

Solution

  • Added cleanup of interfaceLockMap entries in the EndpointDeleted case
  • Improved mutex handling logic to prevent resource leaks
  • Updated test cases to verify proper cleanup of both tcMap and interfaceLockMap
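A minimal sketch of the cleanup described in the list above. The `parser` type, the event constants, and the string keys are hypothetical stand-ins rather than the actual packetparser code; only the `sync.Map` calls (`LoadOrStore`, `Load`, `Delete`, `Range`) are real standard-library API.

```go
package main

import (
	"fmt"
	"sync"
)

// eventType and parser are hypothetical stand-ins for the endpoint event
// type and the packetParser discussed in this PR; they are not the real types.
type eventType int

const (
	endpointCreated eventType = iota
	endpointDeleted
)

type parser struct {
	tcMap            sync.Map // ifaceKey -> attached tc/qdisc state (a string here)
	interfaceLockMap sync.Map // ifaceKey -> *sync.Mutex
}

// handle processes one endpoint event for the interface identified by key.
func (p *parser) handle(t eventType, key string) {
	switch t {
	case endpointCreated:
		// Get or create the per-interface mutex atomically.
		v, _ := p.interfaceLockMap.LoadOrStore(key, &sync.Mutex{})
		mu := v.(*sync.Mutex)
		mu.Lock()
		defer mu.Unlock()
		p.tcMap.Store(key, "qdisc attached")
	case endpointDeleted:
		// Only work with a mutex that already exists; never create one here.
		v, ok := p.interfaceLockMap.Load(key)
		if !ok {
			return
		}
		mu := v.(*sync.Mutex)
		mu.Lock()
		defer mu.Unlock()
		p.tcMap.Delete(key)
		// The fix: drop the lock entry too, so the map does not grow forever.
		p.interfaceLockMap.Delete(key)
	}
}

// entries counts the keys currently stored in a sync.Map.
func entries(m *sync.Map) int {
	n := 0
	m.Range(func(_, _ any) bool { n++; return true })
	return n
}

func main() {
	p := &parser{}
	p.handle(endpointCreated, "veth0")
	p.handle(endpointDeleted, "veth0")
	fmt.Println(entries(&p.tcMap), entries(&p.interfaceLockMap)) // prints "0 0"
}
```

The delete path uses `Load` rather than `LoadOrStore` so that a delete event for an unknown interface does not itself create a new map entry.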

Testing

  • Added comprehensive test coverage for interface deletion scenario
  • Verified cleanup of both maps in test cases
  • Tested with high pod churn scenarios
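One way to picture the churn scenario from the list above is a small simulation that counts leftover map entries. This is an illustrative sketch, not the test added in this PR; `lockTable`, its `cleanup` flag, and `churn` are assumed names.

```go
package main

import (
	"fmt"
	"sync"
)

// lockTable is a hypothetical stand-in for interfaceLockMap.
type lockTable struct {
	locks   sync.Map // key -> *sync.Mutex
	cleanup bool     // when false, reproduces the leak described in the PR
}

// create registers a mutex for key if none exists yet.
func (t *lockTable) create(key string) {
	t.locks.LoadOrStore(key, &sync.Mutex{})
}

// remove drops the mutex for key, but only when cleanup is enabled.
func (t *lockTable) remove(key string) {
	if t.cleanup {
		t.locks.Delete(key)
	}
}

// size counts the entries currently held in the table.
func (t *lockTable) size() int {
	n := 0
	t.locks.Range(func(_, _ any) bool { n++; return true })
	return n
}

// churn simulates n pods being created and deleted, each with a unique
// interface key, and returns how many entries are left behind.
func churn(t *lockTable, n int) int {
	for i := 0; i < n; i++ {
		key := fmt.Sprintf("veth%d", i)
		t.create(key)
		t.remove(key)
	}
	return t.size()
}

func main() {
	fmt.Println(churn(&lockTable{cleanup: false}, 1000)) // prints 1000: one leaked entry per pod
	fmt.Println(churn(&lockTable{cleanup: true}, 1000))  // prints 0
}
```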

Impact

This fix prevents memory leaks in environments with frequent pod creation/deletion,
improving the overall stability and resource usage of the system.


…ions

Signed-off-by: Yerlan Baiturinov <ybaiturinov@microsoft.com>
Signed-off-by: Yerlan Baiturinov <ybaiturinov@microsoft.com>
Signed-off-by: Yerlan Baiturinov <ybaiturinov@microsoft.com>
Signed-off-by: Yerlan Baiturinov <ybaiturinov@microsoft.com>
@byte-msft byte-msft self-assigned this Jan 21, 2025
@byte-msft byte-msft requested a review from a team as a code owner January 21, 2025 10:52
@byte-msft byte-msft linked an issue Jan 21, 2025 that may be closed by this pull request
…r package

Signed-off-by: Yerlan Baiturinov <ybaiturinov@microsoft.com>
@byte-msft byte-msft requested a review from nddq January 21, 2025 11:03
Signed-off-by: Yerlan Baiturinov <ybaiturinov@microsoft.com>
Member

@SRodi SRodi left a comment

LGTM but I'll leave @nddq or @rbtr to review & approve

    p.l.Debug("Endpoint created", zap.String("name", iface.Name))
    p.createQdiscAndAttach(iface, Veth)
case endpoint.EndpointDeleted:
    // Get the mutex only if it exists
    lockMapVal, exists := p.interfaceLockMap.Load(ifaceKey)
Contributor

Bit of a nitpick / question since I'm still new to Go. Is it recommended to stick with the ok idiom? Or is it fine to use other variable names like exists?

Contributor Author

@byte-msft byte-msft Jan 21, 2025

Hey Kamil! Yeah, I initially used exists to make the code more readable, but you're right - we should stick with ok to follow Go conventions in the main code.

For the test case though, I deliberately kept tcMapExists and lockMapExists because test code is a bit different - clarity is super important there since we're verifying specific behaviors. The more explicit naming makes it immediately obvious what we're testing for, especially when someone's debugging failed tests.

Let me know if you'd like me to update the main code to use ok instead of exists!

Contributor

Sure yeah, let's stick with ok for the main code then :)

Member

Yeah, ok is still best in this circumstance. It's widely-understood what that means.

Contributor Author

Sure, it's resolved
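For readers new to the convention discussed in this thread, here is a short sketch of the comma-ok form applied to both a `sync.Map` lookup and a type assertion. The helper names `newLockMap` and `lookup` are made up for illustration and do not come from the PR.

```go
package main

import (
	"fmt"
	"sync"
)

// newLockMap builds a map holding one mutex per key (illustrative helper).
func newLockMap(keys ...string) *sync.Map {
	m := &sync.Map{}
	for _, k := range keys {
		m.Store(k, &sync.Mutex{})
	}
	return m
}

// lookup reports whether key has a mutex stored in m. It uses the comma-ok
// form twice: once for the map load and once for the type assertion.
func lookup(m *sync.Map, key string) (*sync.Mutex, bool) {
	v, ok := m.Load(key)
	if !ok {
		return nil, false
	}
	mu, ok := v.(*sync.Mutex)
	return mu, ok
}

func main() {
	m := newLockMap("veth0")
	if _, ok := lookup(m, "veth0"); ok {
		fmt.Println("veth0: found")
	}
	if _, ok := lookup(m, "veth1"); !ok {
		fmt.Println("veth1: not found")
	}
}
```

Because `ok` is scoped to the `if` statement, reusing the same short name at every call site does not cause collisions, which is part of why the convention is so common.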

@nddq
Contributor

nddq commented Jan 21, 2025

We should take this chance to examine whether we even need a sync.Mutex for each interface, since we are processing them sequentially and not concurrently anyway.


switch event.Type {
case endpoint.EndpointCreated:
    // Create mutex only when needed
    lockMapVal, _ := p.interfaceLockMap.LoadOrStore(ifaceKey, &sync.Mutex{})

I am curious about the need here. This seems a little complex; could a simpler approach be used?

Contributor Author

Sure, thanks

I've removed the Mutex mechanism since we're now using a sequential approach for adding and removing interfaces

Signed-off-by: Yerlan Baiturinov <ybaiturinov@microsoft.com>
@byte-msft
Contributor Author

we should take this chance to examine whether we even need a sync.Mutex for each interface, since we are processing them sequentially and not concurrently anyway

Yeah, you make a valid point! Since we're processing interfaces sequentially, we can safely remove the mutex mechanism. Let's keep it simple and avoid unnecessary complexity. I've updated the PR to remove the mutex-related code.

ifaceKey := ifaceToKey(iface)
lockMapVal, _ := p.interfaceLockMap.LoadOrStore(ifaceKey, &sync.Mutex{})
Contributor

Why are you deleting this? The interfaceLockMap lets us store a per-interface lock so that we can create and delete multiple qdiscs in parallel. This is necessary (in place of a single lock) because a large number of pods can come up at the same time, and we should start capturing packets as quickly as possible.

Contributor

I'm still not clear on this. Why is a per-interface mutex necessary for packetparser to handle concurrent create/delete qdisc operations? As far as I understand, operations on different interfaces shouldn't cause a data race.
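To make the two positions in this thread concrete, here is a hedged sketch of what a per-key lock provides: operations on the same key are serialized, while operations on different keys never wait on each other. `keyedLocks`, `withLock`, and `attachAll` are hypothetical names, and the counter stands in for the real qdisc work; the sketch does not settle whether packetparser actually needs the lock.

```go
package main

import (
	"fmt"
	"sync"
)

// keyedLocks hands out one mutex per key, in the spirit of interfaceLockMap.
// The type and its methods are hypothetical.
type keyedLocks struct {
	locks sync.Map // key -> *sync.Mutex
}

// withLock runs fn while holding the mutex for key. Calls with different
// keys do not block each other; calls with the same key are serialized.
func (k *keyedLocks) withLock(key string, fn func()) {
	v, _ := k.locks.LoadOrStore(key, &sync.Mutex{})
	mu := v.(*sync.Mutex)
	mu.Lock()
	defer mu.Unlock()
	fn()
}

// attachAll runs `repeats` concurrent operations per interface and returns
// how many operations completed for each interface name.
func attachAll(ifaces []string, repeats int) map[string]int {
	var k keyedLocks
	counts := make(map[string]*int, len(ifaces))
	for _, name := range ifaces {
		counts[name] = new(int)
	}
	var wg sync.WaitGroup
	for _, name := range ifaces {
		for i := 0; i < repeats; i++ {
			wg.Add(1)
			go func(name string) {
				defer wg.Done()
				// Only the per-key lock protects this plain int counter.
				k.withLock(name, func() { *counts[name]++ })
			}(name)
		}
	}
	wg.Wait()
	out := make(map[string]int, len(ifaces))
	for name, c := range counts {
		out[name] = *c
	}
	return out
}

func main() {
	fmt.Println(attachAll([]string{"veth0", "veth1"}, 100)) // prints map[veth0:100 veth1:100]
}
```

The lock only matters when two goroutines can touch the same interface at once. One caveat with this pattern: if an entry is deleted from the map while another goroutine is still waiting on that mutex, a later `LoadOrStore` creates a fresh mutex for the same key, so the two callers no longer exclude each other.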

if value, ok := p.tcMap.Load(ifaceKey); ok {
    v := value.(*tcValue)
    p.clean(v.tc, v.qdisc)
    // Delete from map.
    p.tcMap.Delete(ifaceKey)
}
Contributor

We need to delete the ifaceKey from interfaceLockMap when the interface is deleted.

Contributor Author

I reverted my latest changes: the interfaceLockMap is back, and the ifaceKey is now removed from the map when the interface is deleted.

Signed-off-by: Yerlan Baiturinov <ybaiturinov@microsoft.com>
    p.l.Debug("Endpoint created", zap.String("name", iface.Name))
    p.createQdiscAndAttach(iface, Veth)
    // Get or create mutex atomically
    lockMapVal, loaded := p.interfaceLockMap.LoadOrStore(ifaceKey, &sync.Mutex{})
Contributor

Why bring this inside the switch case? We are duplicating code in L399-L404.

Contributor Author

Hey Anubhab,

Yeah, I get the concern about code duplication, but in this case I think it makes sense to keep the mutex logic separate in each case. The create and delete paths need different mutex handling: creation needs to make a new mutex if one doesn't exist, while deletion only works with existing ones. That is why I added the extra check.

Handling it outside the switch would make the code more general but less clear. It would also mean taking the lock for any other event types that get added later.

What do you think? Happy to revert the code to the previous state if you think this is unnecessary.

Signed-off-by: Yerlan Baiturinov <ybaiturinov@microsoft.com>
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Potential memory leak in packetparser's interfaceLockMap
7 participants