ENH(string dtype): Make str.decode return str dtype #60709

rhshadrach · 2025-01-12T21:16:51Z

closes #xxxx (Replace xxxx with the GitHub issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

rhshadrach · 2025-01-12T21:40:15Z

pandas/tests/strings/test_strings.py

@@ -566,7 +566,7 @@ def test_string_slice_out_of_bounds(any_string_dtype):
 def test_encode_decode(any_string_dtype):
    ser = Series(["a", "b", "a\xe4"], dtype=any_string_dtype).str.encode("utf-8")
    result = ser.str.decode("utf-8")
-    expected = ser.map(lambda x: x.decode("utf-8")).astype(object)
+    expected = Series(["a", "b", "a\xe4"], dtype="str")


The change from ser.map to using Series is just to make this test a bit more explicit. Using ser.map(...).astype("str") also passes.
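For reference, a minimal sketch of why both ways of building `expected` agree. It compares values only, because the resulting dtype depends on the pandas version and on the `future.infer_string` option. The names `raw` and `mapped` are illustrative and not part of the test.

```python
import pandas as pd

raw = ["a", "b", "a\xe4"]

# Encode to bytes first, mirroring the setup in test_encode_decode.
ser = pd.Series(raw).str.encode("utf-8")

# Vectorized decode versus the element-wise map used by the old expected value.
result = ser.str.decode("utf-8")
mapped = ser.map(lambda x: x.decode("utf-8"))

# Compare values only; the dtype is what this PR changes.
assert result.tolist() == raw
assert mapped.tolist() == raw
```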

jorisvandenbossche

Looks good!

jorisvandenbossche · 2025-01-13T09:35:08Z

pandas/io/pytables.py

+            if get_option("future.infer_string"):
+                data = ser.to_numpy()
+            else:
+                data = ser._values


You can probably simplify this and always do .to_numpy()? (or np.asarray(..))
For object dtype (the else branch), I think that returns the same data as _values, and is just as cheap.
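A rough sketch of the "as cheap" claim for an object-dtype Series, using only public calls. It checks that `to_numpy()` and `np.asarray` hand back arrays over the same buffer rather than copies. This is an assumption-laden illustration, not the pytables code path, and the variable names are made up.

```python
import numpy as np
import pandas as pd

ser = pd.Series(["a", None, "b"], dtype=object)

via_to_numpy = ser.to_numpy()
via_asarray = np.asarray(ser)

# Both results are object-dtype arrays.
same_dtype = via_to_numpy.dtype == np.dtype(object) and via_asarray.dtype == np.dtype(object)

# If neither call copied, the two arrays overlap in memory.
no_copy = np.shares_memory(via_to_numpy, via_asarray)
```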

Confirmed - thanks.

jorisvandenbossche · 2025-01-24T20:13:37Z

@rhshadrach can you update this?

rhshadrach · 2025-01-25T12:32:57Z

@jorisvandenbossche - the issue with .to_numpy on NumPy-backed Series is that we set the underlying data to read-only. In pytables, we switch out NA values in libwriters.string_array_replace_from_nan_rep, which is causing the tests to fail.

Perhaps there could be a way (e.g. Series._to_numpy) to always get a corresponding NumPy array that isn't read-only? Barring this, it seems to me we could either always (and unnecessarily) make a copy, or use my original branching logic. Open to other ideas too.
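The failure mode can be reproduced with NumPy alone. In this sketch a read-only view stands in for what `.to_numpy()` returns, and the in-place assignment stands in for the NA replacement done in pytables; both stand-ins are assumptions for illustration.

```python
import numpy as np

owner = np.array(["a", None, "b"], dtype=object)

# Stand-in for the array handed back by to_numpy(): a read-only view.
readonly = owner.view()
readonly.flags.writeable = False

try:
    readonly[1] = "nan"  # in-place replacement, as the writer would do
    failed = False
except ValueError:
    failed = True

# Always copying avoids the error at the cost of an extra allocation.
copied = readonly.copy()
copied[1] = "nan"
```

The copy is writeable and leaves `owner` untouched, which is the "always (and unnecessarily) make a copy" option.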

jorisvandenbossche · 2025-01-26T11:40:09Z

Hmm, good point. Ideally we would be able to solve this without using private APIs, I think, because it is a good case study for what other people (external code) could also run into.

So I think what we have said before is that downstream users could do data.flags.writeable = True on the result of to_numpy() if they know what they are doing (and in this case we know that we indeed own the memory, because we are reading a file and created that data and not yet returned it to the user).

But this also makes me wonder whether we should re-discuss adding a keyword to to_numpy() to get this (e.g. something like writeable=True).
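A sketch of the flag-flipping approach, wrapped in a hypothetical helper. `writeable_numpy` is not a pandas API and the `writeable=True` keyword mentioned above does not exist here; the helper only uses `Series.to_numpy()` and NumPy's `flags.writeable`, falling back to a copy if NumPy refuses to flip the flag.

```python
import numpy as np
import pandas as pd


def writeable_numpy(ser: pd.Series) -> np.ndarray:
    """Hypothetical helper (not part of pandas): return a writeable array."""
    arr = ser.to_numpy()
    if not arr.flags.writeable:
        try:
            # NumPy allows this only when the owner of the memory is writeable.
            arr.flags.writeable = True
        except ValueError:
            # Otherwise fall back to an explicit copy.
            arr = arr.copy()
    return arr


ser = pd.Series(["a", None, "b"], dtype=object)
data = writeable_numpy(ser)
data[1] = "nan"  # in-place replacement no longer raises
```

Because the returned array may share memory with the Series, this is only appropriate when the caller owns the data, as in the file-reading case described above.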

TST(string dtype): Make str.decode return str dtype

60a8eee

rhshadrach added Enhancement Strings String extension data type and string data labels Jan 12, 2025

rhshadrach marked this pull request as draft January 12, 2025 21:17

rhshadrach changed the title ~~TST(string dtype): Make str.decode return str dtype~~ ENH(string dtype): Make str.decode return str dtype Jan 12, 2025

Test fixups

513e3c3

rhshadrach commented Jan 12, 2025

pytables fixup

c1d9e6d

jorisvandenbossche approved these changes Jan 13, 2025

jorisvandenbossche added this to the 2.3 milestone Jan 22, 2025

rhshadrach added 3 commits January 24, 2025 20:27

Simplify

9a6a231

Merge branch 'main' of https://github.com/pandas-dev/pandas into str_…

7afd274

…decode

whatsnew

45aa4ae

rhshadrach marked this pull request as ready for review January 25, 2025 01:32

rhshadrach marked this pull request as draft January 25, 2025 02:31
