
fix: Fix memory leak with dlpack when using python Tensor objects #421

Merged
merged 10 commits into main on Jan 6, 2025

Conversation

nnshah1
Contributor

@nnshah1 nnshah1 commented Dec 18, 2024

Update:

The root cause - thanks to @tanmayv25 - was twofold: the dynamically allocated memory for the shape array and the reference to the owning tensor. Although the reference count of the object was being incremented, the object wasn't being passed correctly via the void_p, so the corresponding DecRef never released it.

In order to increment the reference count and pass the object as a void pointer, we follow the pattern outlined here:

  https://groups.google.com/g/dev-python/c/QRRqVC7gkf4/m/zH7l1gTXBwAJ
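The referenced pattern can be sketched with ctypes in CPython. This is a minimal illustration, not the PR's actual code: a reference is taken on the object, its address is round-tripped through a plain void pointer (as `manager_ctx` would be), and the extra reference is dropped after recovery.

```python
import ctypes

# Declare argument types so ctypes converts correctly.
ctypes.pythonapi.Py_IncRef.argtypes = [ctypes.py_object]
ctypes.pythonapi.Py_DecRef.argtypes = [ctypes.py_object]

obj = [1, 2, 3]

# Take an extra reference so the object survives even if all
# Python-level references go away while the pointer is held.
ctypes.pythonapi.Py_IncRef(ctypes.py_object(obj))

# Store the object's address as an opaque void pointer
# (e.g. in DLManagedTensor.manager_ctx).
ptr = ctypes.c_void_p(id(obj))

# Later: recover the object from the void pointer and release
# the extra reference taken above.
recovered = ctypes.cast(ptr, ctypes.py_object).value
assert recovered is obj
ctypes.pythonapi.Py_DecRef(ctypes.py_object(recovered))
```

This only works in CPython, where `id()` is the object's address; the key point is that the same object the pointer refers to is the one being IncRef'd and DecRef'd.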

Also updated the code to create the shape array as part of a new object used for management - to simplify the deletion logic.

Updated the tests to include memory leak tests and tests for basic reference counting.


Initial investigation found that the mechanism for setting the shape field of the managed tensor was not being cleaned up correctly. This caused tensors to not be released even when the garbage collector was forced to run.

Once the shape was changed to be allocated via malloc and freed via free, the tests still required the garbage collector to be forced to run to remove tensors deterministically. That indicates a circular reference, but the exact root cause was not known at that point.

The Tensor object stores a reference to the original object (for example numpy).
When a numpy array is created from the Tensor object - a reference is created to the Tensor object (and implicitly the original numpy array).

When the objects are deleted or dereferenced - they should automatically dereference completely.

for index in range(50):

    # Create the original source, in this case on GPU
    tensor = cupy.ones(2**27)

    # Create a Triton Tensor object.
    # Zero copy - holds a reference to the original tensor object.
    dl_pack_tensor = tritonserver.Tensor.from_dlpack(tensor)

    # Create a second array from the Triton Tensor.
    # Zero copy - holds a reference to the Triton Tensor.
    array = cupy.from_dlpack(dl_pack_tensor)

    # Delete the array - also deletes the capsule.
    # Should dereference the Triton Tensor.
    del array

    # Delete the Triton Tensor.
    # Should dereference the original tensor.
    del dl_pack_tensor

    # Delete the original storage.
    # Should free the actual memory.
    del tensor

    print(index)
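A quick way to check whether such objects survive the loop above is to force a collection and count tracked objects by type name. This is a sketch of the kind of leak check used in investigations like this, not the test added by the PR:

```python
import gc

def count_live(type_name):
    """Force a full collection, then count gc-tracked objects
    whose type has the given name."""
    gc.collect()
    return sum(1 for o in gc.get_objects()
               if type(o).__name__ == type_name)

# With the fix in place, no Tensor objects should survive once all
# references in the loop have been deleted.
print(count_live("Tensor"))
```

If the count stays nonzero after `gc.collect()`, something outside Python's view (here, the un-released C-level reference) is still keeping the objects alive.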

@nnshah1 nnshah1 marked this pull request as ready for review January 3, 2025 23:18
@rmccorm4 rmccorm4 changed the title fix tensor memory fix: Fix memory leak with dlpack when using python Tensor objects Jan 6, 2025
@tanmayv25 tanmayv25 requested a review from kthui January 6, 2025 20:05
@kthui
Contributor

kthui commented Jan 6, 2025

If I understand correctly, the issue is calling

ctypes.pythonapi.Py_DecRef(...)
ctypes.pythonapi.Py_IncRef(...)

on

dl_managed_tensor.dl_tensor.shape

object directly has no effect, but it will work if the shape variable is referencing another object, i.e.

manager_ctx.shape

and the Inc/Dec ref is called on the managed_ctx?

@nnshah1
Contributor Author

nnshah1 commented Jan 6, 2025

If I understand correctly, the issue is calling

ctypes.pythonapi.Py_DecRef(...)
ctypes.pythonapi.Py_IncRef(...)

on

dl_managed_tensor.dl_tensor.shape

object directly has no effect, but it will work if the shape variable is referencing another object, i.e.

manager_ctx.shape

and the Inc/Dec ref is called on the managed_ctx?

yes essentially - tensor.shape itself is a raw pointer to the array, not a Python object. There are a couple of ways to solve that, but in this case the simplest seemed to be creating a single object that holds references to all required objects, and then incrementing/decrementing the reference count of that object as a whole.
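The "single object" idea can be sketched as follows. The class name and layout here are illustrative, not the PR's actual implementation: one Python object owns both the producing tensor and the shape storage, so a single IncRef/DecRef pair manages the lifetime of everything the DLManagedTensor needs.

```python
import ctypes

class _DLPackManagerCtx:
    """Illustrative sketch: one object that keeps alive everything a
    DLManagedTensor points into, so it can serve as manager_ctx."""

    def __init__(self, owner, shape):
        # Keeps the producing tensor (e.g. a numpy/cupy array) alive.
        self.owner = owner
        # The ctypes array owns the shape storage; dl_tensor.shape can
        # point at it for as long as this context object is alive, and
        # it is freed automatically when the object is destroyed.
        self.shape = (ctypes.c_int64 * len(shape))(*shape)

ctx = _DLPackManagerCtx(owner=object(), shape=(2, 128))
assert list(ctx.shape) == [2, 128]
```

Because the shape array is owned by the context object rather than malloc'd separately, the deleter only has to DecRef one object instead of tracking multiple allocations.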

@nnshah1 nnshah1 merged commit f3610e4 into main Jan 6, 2025
2 checks passed
@nnshah1 nnshah1 deleted the nnshah1-fix-tensor-memory branch January 6, 2025 22:20