TST(string dtype): Resolve xfails in test_from_dummies #60694

Open. rhshadrach wants to merge 4 commits into main.

Conversation

rhshadrach
Member

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

The current behavior assumes the provided default_category can be coerced to the dtype of the input's columns. When the input's column labels are strings and default_category is an integer, the result currently has object dtype and its values are a mix of strings and integers. With infer_string=True, where the input's columns are str dtype, we instead end up with all strings (the integer is coerced to a string).

It's not clear to me whether this case should result in object dtype with a mix of strings and integers, or str dtype. Thoughts here are welcome. A few cases to consider are below. Currently I'm going with backwards compatibility, but open to other directions.

import pandas as pd

# Integer column labels with a float default_category
df = pd.DataFrame({3: [1, 0, 0], 4: [0, 1, 0]})
result = pd.from_dummies(df, default_category=0.5)
# ValueError: Trying to coerce float values to integers

# String column labels with an integer default_category
df = pd.DataFrame({"x": [1, 0, 0], "y": [0, 1, 0]})
result = pd.from_dummies(df, default_category=5)
print(type(result.iloc[2, 0]))
# <class 'int'>
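
To make the two candidate outcomes easier to compare side by side, here is a rough pure-Python model of the decoding step. It is only a sketch and not the pandas implementation: `decode_dummies` and its `coerce_to_str` flag are hypothetical names introduced for illustration, and the flag merely stands in for "result is str dtype" versus "result is object dtype".

```python
# Pure-Python model of the decoding step; NOT the pandas implementation.
# `decode_dummies` and `coerce_to_str` are hypothetical names for illustration.


def decode_dummies(columns, rows, default_category, coerce_to_str=False):
    """Map each row of 0/1 indicators to a category label.

    A row with no 1 gets `default_category`. With `coerce_to_str=True` every
    label is passed through `str`, modelling a str-dtype result; otherwise
    labels keep their original Python types, modelling an object-dtype result.
    """
    labels = []
    for row in rows:
        if sum(row) > 1:
            raise ValueError("row is assigned to more than one category")
        label = columns[row.index(1)] if 1 in row else default_category
        labels.append(str(label) if coerce_to_str else label)
    return labels


# Same data as the string-label case: the third row is all zeros.
columns = ["x", "y"]
rows = [[1, 0], [0, 1], [0, 0]]

mixed = decode_dummies(columns, rows, default_category=5)
coerced = decode_dummies(columns, rows, default_category=5, coerce_to_str=True)

print(mixed, [type(v).__name__ for v in mixed])
print(coerced, [type(v).__name__ for v in coerced])
```

In this model the first call keeps the integer 5 next to the string labels, while the second turns it into the string "5". That is the choice under discussion: preserve the caller's value at the cost of mixed types, or keep a uniform string result at the cost of silently changing the value's type.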

@rhshadrach added the Testing, Reshaping, Strings, and Needs Discussion labels on Jan 11, 2025
@rhshadrach added this to the 2.3 milestone on Jan 11, 2025
@rhshadrach marked this pull request as draft on January 11, 2025 17:51
@rhshadrach marked this pull request as ready for review on January 25, 2025 12:13
@rhshadrach
Member Author

@jorisvandenbossche @WillAyd friendly ping

@WillAyd
Member

WillAyd commented Jan 25, 2025

> The current behavior assumes the default_category provided can be coerced to the dtype of the input's columns. When the input's columns labels are strings, and the default_category is an integer, currently with object dtype we end up with values that are a mix of strings and integers. With infer_string=True where the input's columns are str dtype, we end up instead with all strings (coercing the integer to a string).
>
> It's not clear to me whether this case should result in object dtype with a mix of strings and integers, or str dtype. Thoughts here are welcome. A few cases to consider are below. Currently I'm going with backwards compatibility, but open to other directions.

This is a tough one, but I don't think we should do any special-casing in this method, so we should just stick with what the different string types do (even though the coercion may not be consistent across them).
