Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: improve map(), struct(), array() #8666

Closed
wants to merge 1 commit into from

Conversation

NickCrews
Copy link
Contributor

@NickCrews NickCrews commented Mar 15, 2024

fixes #8289

This does a lot of changes. It was hard for me to separate them out as I implemented them. But now that it's all hashed out, I can try to split this up into separate commits if you want. But that might be sorta hard in
some cases.

Changes:

This is adding support for passing in None to all these constructors.
These use the new ibis.null(<type>) API to return op.Literal(None, <type>)s

Make these constructors idempotent: you can pass in existing Expressions into array(), etc.
The type argument for all of these now always has an effect, not just when passing in python literals. So basically it acts like a cast.

A big structural change is that now ops.Array has an optional
attribute "dtype", so if you pass in a 0-length sequence
of values the op still knows what dtype it is.

Several of the backends were always broken here, they just weren't getting caught. I marked them as broken, we can fix them in a followup.

You can test this locally with eg
pytest -m <backend> -k factory ibis/backends/tests/test_array.py ibis/backends/tests/test_map.py ibis/backends/tests/test_struct.py

Also, fix executing Literal(None) on pandas and polars, 0-length arrays on polars

Next steps

A possible nice follow-up would be to completely remove supporting structs, arrays, and maps inside ops.Literal.
Currently, when someone calls ibis.array([1,2]), that now results in ops.Array, but if they do ibis.literal([1,2]), it is represented as ops.Literal. It would be nice if inside ibis.literal() there was some logic like

...
if isinstance(val, Mapping):
    return ibis.map(val)
if is_iterable(val):
    return ibis.array(val)
...

and then all the backends wouldn't have to even worry about supporting these nested types in visit_Literal()

Copy link
Member

@jcrist jcrist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me. Can you add a few tests in the appropriate ibis/tests/expr/test_{dtype}.py files?

Also curious - what was the use case that led to you wanting this change?

@NickCrews
Copy link
Contributor Author

in my first attempt at #8649, I went about it the wrong way, but I needed to ensure something was an Array, but the incoming thing might already have been an array.

@NickCrews NickCrews force-pushed the idempotent-factories branch from b6d983a to a9ca09b Compare March 15, 2024 19:15
@NickCrews
Copy link
Contributor Author

OK, as I added those tests I found some problems, and this snowballed into also being a fix for #8289. The implementations for both of these things are somewhat intertwined. It would be a pain to fix one, and then just change it again to fixx the other. So I just did everything at once.

@NickCrews NickCrews force-pushed the idempotent-factories branch from a9ca09b to 654bcda Compare March 15, 2024 19:18
ibis/expr/types/arrays.py Outdated Show resolved Hide resolved
ibis/backends/tests/test_struct.py Outdated Show resolved Hide resolved
ibis/backends/tests/test_array.py Outdated Show resolved Hide resolved
ibis/backends/tests/test_map.py Outdated Show resolved Hide resolved
ibis/backends/tests/test_map.py Outdated Show resolved Hide resolved
ibis/expr/types/maps.py Outdated Show resolved Hide resolved
ibis/expr/types/maps.py Show resolved Hide resolved
ibis/expr/types/structs.py Outdated Show resolved Hide resolved
ibis/formats/pandas.py Outdated Show resolved Hide resolved
ibis/backends/tests/test_array.py Outdated Show resolved Hide resolved
ibis/backends/pandas/convert.py Outdated Show resolved Hide resolved
ibis/backends/pandas/convert.py Outdated Show resolved Hide resolved
@NickCrews NickCrews force-pushed the idempotent-factories branch 4 times, most recently from 2fcf850 to 283fd6b Compare March 19, 2024 18:27
@NickCrews NickCrews force-pushed the idempotent-factories branch from 283fd6b to 530fad0 Compare March 22, 2024 04:10
@NickCrews
Copy link
Contributor Author

NickCrews commented Mar 22, 2024

My goodness this gotten a bit out of hand.

EDIT: actually in the latest version it's not so bad. I didn't need to overhaul the internal representation after all.

OK I just pushed an absolute doozy of a commit in 530fad0. Curious how many backends that fails on. It required really overhauling the internal representation of ops.Array and ops.Struct to be able to distinguish between

  • the overall value is NULL
  • the contained values is NULL
    Take a look at that commit description.

@NickCrews NickCrews force-pushed the idempotent-factories branch from 530fad0 to 55497bb Compare March 22, 2024 04:19
@ncclementi
Copy link
Contributor

Hi @NickCrews, checking in here. What's the status on this one? It looks like there are a lot of conflicts and red on CI 😅 .

@NickCrews
Copy link
Contributor Author

Yeeesssss this needs a bit of love :) Once my other pr with .cases lands I figured I would pick this up again.

@NickCrews NickCrews force-pushed the idempotent-factories branch from 55497bb to a00c26d Compare May 1, 2024 18:17
@NickCrews NickCrews force-pushed the idempotent-factories branch from a00c26d to 32e7636 Compare May 9, 2024 01:24
@NickCrews NickCrews changed the title feat: make map(), struct(), array() idempotent feat: improve map(), struct(), array() behave nicer May 9, 2024
@NickCrews NickCrews changed the title feat: improve map(), struct(), array() behave nicer feat: improve map(), struct(), array() May 10, 2024
@NickCrews NickCrews force-pushed the idempotent-factories branch 3 times, most recently from 2b7d36e to 4d5c6e0 Compare May 11, 2024 08:04
@NickCrews NickCrews force-pushed the idempotent-factories branch 2 times, most recently from 25c3e7b to b8dd34d Compare June 4, 2024 18:16
NickCrews added a commit to NickCrews/ibis that referenced this pull request Jun 6, 2024
jcrist pushed a commit that referenced this pull request Jun 7, 2024
fixes ibis-project#8289

This does a lot of changes. It was hard for me to separate them out as I implemented them. But now that it's all hashed out, I can try to split this up into separate commits if you want. But that might be sorta hard in
some cases.

One this is adding support for passing in None to all these constructors.
These use the new `ibis.null(<type>)` API to return `op.Literal(None, <type>)`s

Make these constructors idempotent: you can
pass in existing Expressions into array(), etc.
The type argument for all of these now always has an effect, not just when passing in python literals. So basically it acts like a cast.

A big structural change is that now ops.Array has an optional
attribute "dtype", so if you pass in a 0-length sequence
of values the op still knows what dtype it is.

Several of the backends were always broken here, they just weren't getting caught. I marked them as broken, we can fix them in a followup.

You can test this locally with eg
`pytest -m <backend> -k factory ibis/backends/tests/test_array.py  ibis/backends/tests/test_map.py ibis/backends/tests/test_struct.py`

Also, fix executing Literal(None) on pandas and polars, 0-length arrays on polars

Also, fixing converting dtypes on clickhouse, Structs should be converted to nonnullable dtypes.
def visit_Array(self, op, *, exprs):
return self.f.array(*exprs)
def visit_Array(self, op, *, exprs, dtype):
return self.cast(self.f.array(*exprs), dtype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems extraordinarily heavy-handed and overly broad.

Do we really need to cast the entire on constructing an array, for all SQL backends?

@cpcloud
Copy link
Member

cpcloud commented Jun 13, 2024

This PR is still too big. Can you split the array, map and struct implementations into three separate PRs?

@NickCrews
Copy link
Contributor Author

Yeah splitting it probably is good.

  • map is easy, no semantic changes going on there
  • for struct, I want to add tests for empty structs. Therefore I think we need to decide on and merge my other PR regarding this behavior
  • for arrays, it's gonna be a little hairy too, because I am implementing empty arrays at the same time as adjusting the API. I don't really want to do them separately or it will just be churn. Unless you see a decent way to address those two things separately

@cpcloud
Copy link
Member

cpcloud commented Jun 27, 2024

@NickCrews Just checking in here, can you split these up some more?

@NickCrews
Copy link
Contributor Author

OK, I just tried to split the struct and the map stuff into their own PRs, but both of those rely on the new ibis.array(arg, typ) API. So we need to get the array stuff in first. I'll split that out.

NickCrews added a commit to NickCrews/ibis that referenced this pull request Jun 27, 2024
@NickCrews
Copy link
Contributor Author

@NickCrews NickCrews closed this Jun 27, 2024
NickCrews added a commit to NickCrews/ibis that referenced this pull request Jun 27, 2024
NickCrews added a commit to NickCrews/ibis that referenced this pull request Jun 27, 2024
NickCrews added a commit to NickCrews/ibis that referenced this pull request Jun 28, 2024
NickCrews added a commit to NickCrews/ibis that referenced this pull request Jun 28, 2024
NickCrews added a commit to NickCrews/ibis that referenced this pull request Jun 29, 2024
NickCrews added a commit to NickCrews/ibis that referenced this pull request Jun 29, 2024
NickCrews added a commit to NickCrews/ibis that referenced this pull request Jun 29, 2024
Picking out the array stuff from ibis-project#8666

Instead of trying to fit the two cases of 0-length and 1+ length arrays into the same op, I split them up into separate ones.
By doing this, if we guarantee that all the elements of ops.Array() have the right type before construction,
we don't have to do any fancy casting during compilation, all the elements will
already have been casted as needed.

This allows for the compilation of array<structs> on some sql backends like postgres.
If we tried to cast the entire array, you end up with SQL like `cast [..] as STRUCT<...>[]`,
which postgres freaks about.
Instead, if we cast each individual element,
such as `[cast({...} as ROW..., cast({...} as ROW...]`, then this is valid SQL.

I added a Length annotation to ops.Array to verify the length is 1+. IDK, this isn't really needed, since if you ever did construct one, then the rlz.highest_precedence_dtype([]) would fail. But that might fail at a later time,
and I wanted to raise an error at construction time. But, end users should never be constructing ops.Arrays directly,
so this is a guardrail just for us ibis devs.
So IDK, we could remove it, but I think it is a nice hint for future us.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat: Support type arg to ibis.array and ibis.map
4 participants