Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: execution of struct literals in the pandas and dask backends returns the wrong value #8687

Closed
1 task done
NickCrews opened this issue Mar 18, 2024 · 8 comments · Fixed by #8693
Closed
1 task done
Labels
bug Incorrect behavior inside of ibis

Comments

@NickCrews
Copy link
Contributor

What happened?

import pandas as pd
from ibis.common.collections import frozendict
pd.Series([frozendict({"a": 5, "b": 6})])
# 0    (a, b)
# dtype: object

This is causing casting bugs in the pandas and dask backends:

import ibis

con = ibis.pandas.connect()

s = ibis.struct({"a": 1, "b": 2})
s2 = s.cast("struct<a: float64, b: float64>")
con.execute(s2)
# {'a': 'a'}

found this in #8666

I think we want to make it so our builtin frozendict looks exactly like a dict to pandas. Then we get the right behavior here

What version of ibis are you using?

main

What backend(s) are you using, if any?

No response

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@NickCrews NickCrews added the bug Incorrect behavior inside of ibis label Mar 18, 2024
@NickCrews
Copy link
Contributor Author

NickCrews commented Mar 18, 2024

hmmm, actually maybe this is just a repr thing?? pd.Series([frozendict({"a": 5, "b": 6})])[0] actually shows the right thing...

EDIT: while the repr is confusing, the actual result from .execute() is also wrong.

NickCrews added a commit to NickCrews/ibis that referenced this issue Mar 18, 2024
fixes ibis-project#8289

This does a lot of changes. It was hard for me to separate them out as I implemented them. But now that it's all hashed out, I can try to split this up into separate commits if you want. But that might be sorta hard in
some cases.

Several of the backends were always broken here, they just weren't getting caught. I marked them as broken,
we can fix them in a followup.

You can test this locally with eg
`pytest -m duckdb -k factory ibis/backends/tests/test_array.py  ibis/backends/tests/test_map.py ibis/backends/tests/test_struct.py`

Also, fix a typing bug: map() can accept ArrayValues, not just ArrayColumns

Also, fix executing NULL arrays on pandas.

Also, fix casting structs on pandas.
See ibis-project#8687

Also, support passing in None.

Also, error when the value type can't be inferred from empty python literals
(eg what is the value type for the elements of []?)

Also, make the type argument for struct() always have an effect, not just when passing in python literals.
So basically it can act like a cast.

Also, make these constructors idempotent.
@NickCrews
Copy link
Contributor Author

I added a workaround at https://github.com/ibis-project/ibis/pull/8666/files#diff-7760dac05d92d4bd73e3b729a90b7002cc3920f03abd1dac5fc7ec18154f9fb2R87-R93. Would love someone's help with debugging and then fixing this at the root of the problem, it's too deep in ibis for me.

@jcrist
Copy link
Member

jcrist commented Mar 19, 2024

hmmm, actually maybe this is just a repr thing??

Yeah, this is just a repr thing - if you make a custom collections.abc.Mapping and use it like you have, you'll see the same thing. My guess is pandas has a custom formatter for any non-builtin sequence, and that's what you're seeing here.

@NickCrews
Copy link
Contributor Author

Ah, not specific enough, in my first examples its just a repr thing. But in the 2nd example above with the .execute(), it is a real problem that needs to get fixed in ibis.

@cpcloud cpcloud changed the title bug: our custom frozendict doesn't play well with pandas bug: execution of struct literals in the pandas and dask backends returns a subset of the struct Mar 19, 2024
@cpcloud
Copy link
Member

cpcloud commented Mar 19, 2024

it is a real problem that needs to get fixed in ibis.

Indeed, and PRs welcome. The pandas and dask backends are very low priority right now. Are you using them for something in particular?

@jcrist
Copy link
Member

jcrist commented Mar 19, 2024

I've pushed up a PR in #8693 that makes frozendict a subclass of dict. This fixes both the repr-ing issue and the struct execution issue (it's also simpler and fixed a bug in the existing code). Should be able to revert the convert_Struct changes you added in #8666 once that's merged.

NickCrews added a commit to NickCrews/ibis that referenced this issue Mar 19, 2024
fixes ibis-project#8289

This does a lot of changes. It was hard for me to separate them out as I implemented them. But now that it's all hashed out, I can try to split this up into separate commits if you want. But that might be sorta hard in
some cases.

Several of the backends were always broken here, they just weren't getting caught. I marked them as broken,
we can fix them in a followup.

You can test this locally with eg
`pytest -m duckdb -k factory ibis/backends/tests/test_array.py  ibis/backends/tests/test_map.py ibis/backends/tests/test_struct.py`

Also, fix a typing bug: map() can accept ArrayValues, not just ArrayColumns

Also, fix executing NULL arrays on pandas.

Also, fix casting structs on pandas.
See ibis-project#8687

Also, support passing in None.

Also, error when the value type can't be inferred from empty python literals
(eg what is the value type for the elements of []?)

Also, make the type argument for struct() always have an effect, not just when passing in python literals.
So basically it can act like a cast.

Also, make these constructors idempotent.
@NickCrews
Copy link
Contributor Author

@cpcloud I see you changed the issue title to "subset of the struct". But just to make sure you see the difference, the current output is {"a": "a"}, not {"a": 5}

@NickCrews
Copy link
Contributor Author

The pandas and dask backends are very low priority right now. Are you using them for something in particular?

No not at all. I just thought it would be better to get the the tests passing instead of marking them as xfail. If we want to punt on this, then I would think it would be better to just error on this cast, instead of what we do now which is silently return the wrong result.

@cpcloud cpcloud changed the title bug: execution of struct literals in the pandas and dask backends returns a subset of the struct bug: execution of struct literals in the pandas and dask backends returns the wrong struct value Mar 19, 2024
@cpcloud cpcloud changed the title bug: execution of struct literals in the pandas and dask backends returns the wrong struct value bug: execution of struct literals in the pandas and dask backends returns the wrong value Mar 19, 2024
kszucs added a commit that referenced this issue Mar 21, 2024
This was motivated to work around pandas not repr'ing `frozendict`
elements properly (seen in #8687), but while poking at that I found:

- The existing code was unnecessarily nested (the actual dict was boxed
in a `MappingProxyType` which was boxed in a `FrozenDict` - we can do
better by just using storing the data in the `FrozenDict` itself).
- There was a bug in the hash implementation where the order mattered,
meaning that `hash(frozendict(a=1, b=2)) != hash(frozendict(b=2, a=1))`.
This has since been fixed.

Fixes #8687.

---------

Co-authored-by: Phillip Cloud <417981+cpcloud@users.noreply.github.com>
Co-authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
@github-project-automation github-project-automation bot moved this from backlog to done in Ibis planning and roadmap Mar 21, 2024
NickCrews added a commit to NickCrews/ibis that referenced this issue Mar 22, 2024
fixes ibis-project#8289

This does a lot of changes. It was hard for me to separate them out as I implemented them. But now that it's all hashed out, I can try to split this up into separate commits if you want. But that might be sorta hard in
some cases.

The big structural change is that now the core Operations for Array and Structs have a different internal representation, so they can distringuish between
- the entire value is NULL
- the contained values are NULL

Before, ops.Array held onto a `VarTuple[Value]`. So the contained Values could be NULL, but there was no way to say the entire thing was null. Now, ops.Array stores a `None | VarTuple[Value]`. The same thing for ops.StructValue. ops.Map didn't suffer from this, because it stores a `ops.Array`s internally, so since `ops.Array` can distinguish between entirely-NULL and contains-NULL, so can ops.Map

A fallout of this is that ops.Array needs a way to explicitly store its dtype. Before, it derived its dtype based on the dtype of its args. But
now that `None` is a valid value, it is now possible for there to be no values to inspect! So the Op actually stores its dtype explicitly. If you pass in values, then supplying the dtype on construction is optional, we go back to the old behavior of deriving it from the inputs.

This requires the backend compilers to now deal with that case.

Several of the backends were always broken here, they just weren't getting caught. I marked them as broken,
we can fix them in a followup.

You can test this locally with eg
`pytest -m <backend> -k factory ibis/backends/tests/test_array.py  ibis/backends/tests/test_map.py ibis/backends/tests/test_struct.py`

Also, fix a typing bug: map() can accept ArrayValues, not just ArrayColumns

Also, fix executing NULL arrays on pandas.

Also, fixing converting dtypes on clickhouse, Structs should be converted to nonnullable dtypes.

Also, fix casting structs on pandas.
See ibis-project#8687

Also, support passing in None to all these constructors.

Also, error when the value type can't be inferred from empty python literals
(eg what is the value type for the elements of []?)

Also, make the type argument for struct() always have an effect, not just when passing in python literals.
So basically it can act like a cast.

Also, make these constructors idempotent.
NickCrews added a commit to NickCrews/ibis that referenced this issue Mar 22, 2024
fixes ibis-project#8289

This does a lot of changes. It was hard for me to separate them out as I implemented them. But now that it's all hashed out, I can try to split this up into separate commits if you want. But that might be sorta hard in
some cases.

The big structural change is that now the core Operations for Array and Structs have a different internal representation, so they can distringuish between
- the entire value is NULL
- the contained values are NULL

Before, ops.Array held onto a `VarTuple[Value]`. So the contained Values could be NULL, but there was no way to say the entire thing was null. Now, ops.Array stores a `None | VarTuple[Value]`. The same thing for ops.StructValue. ops.Map didn't suffer from this, because it stores a `ops.Array`s internally, so since `ops.Array` can distinguish between entirely-NULL and contains-NULL, so can ops.Map

A fallout of this is that ops.Array needs a way to explicitly store its dtype. Before, it derived its dtype based on the dtype of its args. But
now that `None` is a valid value, it is now possible for there to be no values to inspect! So the Op actually stores its dtype explicitly. If you pass in values, then supplying the dtype on construction is optional, we go back to the old behavior of deriving it from the inputs.

This requires the backend compilers to now deal with that case.

Several of the backends were always broken here, they just weren't getting caught. I marked them as broken,
we can fix them in a followup.

You can test this locally with eg
`pytest -m <backend> -k factory ibis/backends/tests/test_array.py  ibis/backends/tests/test_map.py ibis/backends/tests/test_struct.py`

Also, fix a typing bug: map() can accept ArrayValues, not just ArrayColumns

Also, fix executing NULL arrays on pandas.

Also, fixing converting dtypes on clickhouse, Structs should be converted to nonnullable dtypes.

Also, fix casting structs on pandas.
See ibis-project#8687

Also, support passing in None to all these constructors.

Also, error when the value type can't be inferred from empty python literals
(eg what is the value type for the elements of []?)

Also, make the type argument for struct() always have an effect, not just when passing in python literals.
So basically it can act like a cast.

Also, make these constructors idempotent.
NickCrews added a commit to NickCrews/ibis that referenced this issue May 1, 2024
fixes ibis-project#8289

This does a lot of changes. It was hard for me to separate them out as I implemented them. But now that it's all hashed out, I can try to split this up into separate commits if you want. But that might be sorta hard in
some cases.

The big structural change is that now the core Operations for Array and Structs have a different internal representation, so they can distringuish between
- the entire value is NULL
- the contained values are NULL

Before, ops.Array held onto a `VarTuple[Value]`. So the contained Values could be NULL, but there was no way to say the entire thing was null. Now, ops.Array stores a `None | VarTuple[Value]`. The same thing for ops.StructValue. ops.Map didn't suffer from this, because it stores a `ops.Array`s internally, so since `ops.Array` can distinguish between entirely-NULL and contains-NULL, so can ops.Map

A fallout of this is that ops.Array needs a way to explicitly store its dtype. Before, it derived its dtype based on the dtype of its args. But
now that `None` is a valid value, it is now possible for there to be no values to inspect! So the Op actually stores its dtype explicitly. If you pass in values, then supplying the dtype on construction is optional, we go back to the old behavior of deriving it from the inputs.

This requires the backend compilers to now deal with that case.

Several of the backends were always broken here, they just weren't getting caught. I marked them as broken,
we can fix them in a followup.

You can test this locally with eg
`pytest -m <backend> -k factory ibis/backends/tests/test_array.py  ibis/backends/tests/test_map.py ibis/backends/tests/test_struct.py`

Also, fix a typing bug: map() can accept ArrayValues, not just ArrayColumns

Also, fix executing NULL arrays on pandas.

Also, fixing converting dtypes on clickhouse, Structs should be converted to nonnullable dtypes.

Also, fix casting structs on pandas.
See ibis-project#8687

Also, support passing in None to all these constructors.

Also, error when the value type can't be inferred from empty python literals
(eg what is the value type for the elements of []?)

Also, make the type argument for struct() always have an effect, not just when passing in python literals.
So basically it can act like a cast.

Also, make these constructors idempotent.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior inside of ibis
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants