Extract true gpu object from cudf.pandas proxy object #6273

Closed
wants to merge 1 commit into from

galipremsagar
Contributor

Fixes: #6232

This PR is WIP

@galipremsagar added the bug (Something isn't working) and non-breaking (Non-breaking change) labels Jan 29, 2025
@galipremsagar self-assigned this Jan 29, 2025

copy-pr-bot bot commented Jan 29, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions bot added the Cython / Python (Cython or Python issue) label Jan 29, 2025
@galipremsagar
Contributor Author

/okay to test

@@ -231,7 +233,9 @@ def fit_transform(self, y, z=None) -> cudf.Series:
This is functionally equivalent to (but faster than)
`LabelEncoder().fit(y).transform(y)`
"""

if cudf.get_option("mode.pandas_compatible"):
Contributor Author

@galipremsagar galipremsagar Jan 29, 2025

cc: @vyasr @mroeschke @bdice @dantegd I'm debating whether this check should live inside the cudf.Series (also DataFrame, Index, etc.) constructor itself. I know I reverted that change in rapidsai/cudf#17629, but after looking at how frequently cuml calls the cudf.Series/DataFrame constructors, I'm having second thoughts about a special utility in cuml (to check for and extract the true GPU object) vs. baking this utility into the cudf classic constructors. I'm leaning toward the latter, looking at cuml specifically.

Member

I think a function in cuML might not be a bad idea, was thinking we could encapsulate the functionality into a function like

def create_cudf_series(y):
    if cudf.get_option("mode.pandas_compatible"):
        if is_proxy_object(y):
            y = y.as_gpu_object()
    y = cudf.Series(y)
    return y

for cuDF objects that we could use as a one-line wrapper, though I'm not sure whether we are targeting other codebases with this?

Contributor Author

@galipremsagar galipremsagar Jan 29, 2025

> for cuDF objects that we could use as a one liner around, though I'm not sure if we are targeting other codebases with this?

Yes, this is a utility I was planning to add to cuml and other libraries, but we would also have to add create_cudf_DataFrame, create_cudf_Index, etc. On top of that, we would need to keep duplicating and maintaining all the Series and DataFrame constructor parameters in these utilities. I think this could end up being a headache for consumers of cudf, and libraries might push the cudf<->cudf.pandas interop back onto cudf classic as a technical detail it should handle.

I know keeping the cudf.pandas<->cudf interop inside cudf classic might look complex but it feels simpler than having to change many libraries and maintaining those utilities.

Contributor

Hmm, I don't really see why separate create_cudf_* methods would be needed?

At least for cuml, IIRC, a lot of the API accepts a generic argument and calls a cudf constructor on it, so we just need a utility that turns a cudf.pandas argument into a GPU argument before the cudf constructor is called (equivalent to doing e.g. cudf.Series(cudf.Series(...))).

So I think a function like

def maybe_extract_cudf_pandas(arg):
    if isinstance_cudf_pandas(arg, (pd.Series, pd.DataFrame, pd.Index, np.ndarray)):
        return arg.as_gpu_object()
    return arg

Would need to be defined in cuml (and possibly any RAPIDS library that follows the cuml approach to using cudf)
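
To make the "extract once at the API boundary, then call the constructor as usual" idea concrete, here is a minimal sketch that runs without a GPU. Everything in it is a hypothetical stand-in: `FakeGpuSeries` plays the role of cudf.Series, `FakeProxySeries` plays the role of a cudf.pandas proxy, and `maybe_extract_gpu_object` mirrors the shape of the helper proposed above. The only thing assumed about real proxies is that they expose `as_gpu_object()`, as used in the snippets in this thread.

```python
class FakeGpuSeries:
    """Hypothetical stand-in for cudf.Series."""

    def __init__(self, data, name=None):
        # Constructing from another FakeGpuSeries models the cheap
        # cudf.Series(cudf.Series(...)) path mentioned above.
        self.values = list(data.values) if isinstance(data, FakeGpuSeries) else list(data)
        self.name = name


class FakeProxySeries:
    """Hypothetical stand-in for a cudf.pandas proxy wrapping a GPU object."""

    def __init__(self, gpu_object):
        self._gpu_object = gpu_object

    def as_gpu_object(self):
        return self._gpu_object

    def __iter__(self):
        # Iterating the proxy models the slow fallback path we want to avoid.
        raise RuntimeError("proxy was iterated instead of unwrapped")


def maybe_extract_gpu_object(arg):
    """Return the wrapped GPU object for proxies, and `arg` unchanged otherwise."""
    if isinstance(arg, FakeProxySeries):
        return arg.as_gpu_object()
    return arg


def fit_transform(y):
    """Toy API entry point: normalize the input, then use the normal constructor."""
    return FakeGpuSeries(maybe_extract_gpu_object(y), name="labels")


gpu = FakeGpuSeries(["a", "b", "a"])
from_proxy = fit_transform(FakeProxySeries(gpu))
from_list = fit_transform(["a", "b", "a"])
```

Because the helper only touches the argument, the constructor call keeps its full signature and no per-class create_cudf_* wrappers are required.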

Contributor

After some offline discussion we're going to try and monkey-patch cudf inside cudf.pandas to support this use case.
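
For readers unfamiliar with the monkey-patching approach, a rough sketch of the general idea follows. It is not the actual cudf change; `ClassicSeries`, `Proxy` and `install_proxy_unwrapping` are hypothetical names used so the snippet runs without a GPU. The point it illustrates is that the constructor is wrapped once at install time, so downstream libraries keep calling it unchanged.

```python
import functools


class ClassicSeries:
    """Hypothetical stand-in for a cudf classic class such as cudf.Series."""

    def __init__(self, data, name=None):
        self.values = list(data.values) if isinstance(data, ClassicSeries) else list(data)
        self.name = name


class Proxy:
    """Hypothetical stand-in for a cudf.pandas proxy object."""

    def __init__(self, gpu_object):
        self._gpu_object = gpu_object

    def as_gpu_object(self):
        return self._gpu_object


def _unwrap(value):
    # Replace a proxy with the GPU object it wraps; pass anything else through.
    return value.as_gpu_object() if isinstance(value, Proxy) else value


def install_proxy_unwrapping(cls):
    """Patch cls.__init__ in place so Proxy arguments are unwrapped first."""
    original_init = cls.__init__

    @functools.wraps(original_init)
    def patched_init(self, *args, **kwargs):
        args = tuple(_unwrap(a) for a in args)
        kwargs = {key: _unwrap(val) for key, val in kwargs.items()}
        original_init(self, *args, **kwargs)

    cls.__init__ = patched_init


proxy = Proxy(ClassicSeries([3, 1, 2]))

# Before patching, the stand-in constructor cannot consume a Proxy at all.
try:
    ClassicSeries(proxy)
    failed_before_patch = False
except TypeError:
    failed_before_patch = True

install_proxy_unwrapping(ClassicSeries)
patched_result = ClassicSeries(proxy, name="y")
```

The trade-off in this sketch is that the patch changes behavior globally for the patched class, which is why it would only be applied when the accelerator mode is active.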

@galipremsagar
Contributor Author

closing this PR in favor of: rapidsai/cudf#17878

Labels
bug (Something isn't working), Cython / Python (Cython or Python issue), non-breaking (Non-breaking change)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] - cuML LabelEncoder is 200x slower with cuDF-Pandas vs cuDF
4 participants