Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: type error when accessing length() without schema information #10651

Open
1 task done
choucavalier opened this issue Jan 2, 2025 · 11 comments
Open
1 task done
Labels
bug Incorrect behavior inside of ibis

Comments

@choucavalier
Copy link

What happened?

The following example is self-explanatory

By default, it's assumed that t.some_col is of type Column, while it could be of type ArrayColumn (or other more specific types for that matter).

This results in typing issues.

import ibis
import pandas as pd


d = pd.DataFrame(
  {
    "some_col": [
      [1, 2, 3],
      [4, 5],
    ]
  }
)
t = ibis.memtable(d)

t.some_col.length()  # ■ Cannot access attribute "length" for class "Column"

What version of ibis are you using?

9.5.0

What backend(s) are you using, if any?

DuckDB

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@choucavalier choucavalier added the bug Incorrect behavior inside of ibis label Jan 2, 2025
@cpcloud
Copy link
Member

cpcloud commented Jan 2, 2025

I can't reproduce this on main:

In [4]: import ibis
   ...: import pandas as pd
   ...:
   ...:
   ...: d = pd.DataFrame(
   ...:   {
   ...:     "some_col": [
   ...:       [1, 2, 3],
   ...:       [4, 5],
   ...:     ]
   ...:   }
   ...: )
   ...: t = ibis.memtable(d)

In [5]: t.some_col.length()
Out[5]:
r0 := InMemoryTable
  data:
    PandasDataFrameProxy:
          some_col
      0  [1, 2, 3]
      1     [4, 5]

ArrayLength(some_col): ArrayLength(r0.some_col)

@choucavalier
Copy link
Author

The code works. It's the static type checking that outputs an error while it should not.

@cpcloud
Copy link
Member

cpcloud commented Jan 2, 2025

Schema information generally can't be made static in Python.

A type checker would need access to information from a running database.

@choucavalier
Copy link
Author

choucavalier commented Jan 2, 2025

Of course, but in this case I think the DX would be much better if instead of assuming that the column can't be a list, we assume it could be, and prevent the type checker from erroring

Right now I have to do

from typing import cast
from ibis.expr.types import ArrayColumn

t.select(t).filter(
  cast(ArrayColumn, t.my_list_column).length() > ibis.literal(0)
)

(type checking also requires me to write ibis.literal(...) for any constant value.

More and more projects use type checking, and not having it in ibis will prevent all of them from having a good DX and reduce ibis' adoption.

@cpcloud
Copy link
Member

cpcloud commented Jan 2, 2025

It sounds like you effectively want every type-specific method to exist on Column. Is that correct?

@choucavalier
Copy link
Author

That could be a solution. Maybe there's another one which does not imply modifying the Column definition, but instead provide some type hints that could help the static type checkers understand that "by default this could actually be an ArrayColumn, or a StructColumn (or others) so let's assume that these collection-specific methods might exist because we don't have access to the schema of the table so let's trust this guy that he knows what he's doing when querying his own data. if he's trying to access the method length on a column that's actually not an ArrayColumn we'll throw an error at runtime. but let's not ask the guy to write type casting code everywhere or disable the type checker on all the lines that actually use ibis"

@cpcloud
Copy link
Member

cpcloud commented Jan 2, 2025

In that case a giant ir.StringColumn | ir.IntegerColum | ... with every single column subclass as the return value of Table.__getattr__ would probably suffice for this specific use case.

If this pattern shows up in a bunch of places maybe we can make a type alias to simplify our lives.

@NickCrews
Copy link
Contributor

a giant ir.StringColumn | ir.IntegerColum | ...

Wouldn't pyright still complain in this case? if you try to do .upper() on something that is ir.StringColumn | ir.IntegerColumn, then it could be an ir.IntegerColumn, which would fail at runtime. Or is pyright optimistic?

I started #10682 as an attempt to slightly mitigate this pain point for some workflows. eg if/after that lands, then my_table.my_int_column.cast(int) now correctly types as a ir.IntegerValue (our type system doesn't currently allow us to be more specific and say ir.IntegerColumn ☹ ). But that PR can't work for complex dtypes like arrays, because we can't enumerate all possible array<int>, array<array<int>>, array<array<array<int>>>, etc options to @typing.overload

@cpcloud
Copy link
Member

cpcloud commented Jan 17, 2025

eg if/after that lands

It's been merged into main!

our type system doesn't currently allow us to be more specific

That is by design. ir.IntegerColumn is an implementation detail, not a type that represents values of a column in a database.

can't work for complex dtypes like arrays

I think this is true of any type system that supports recursive types, because there are infinitely many types in such systems.

@NickCrews
Copy link
Contributor

ir.IntegerColumn is an implementation detail

Are you sure? ir.IntegerScalar does not have a .max() method, ir.IntegerColumn does. We talked about this here as well. I think if it were possible, I think it would be better if ibis's type system could discern between Columns and Scalars.

@cpcloud
Copy link
Member

cpcloud commented Jan 18, 2025

Are you sure?

Yes.

By implementation detail here, I mean that we don't expose those objects as inputs to expression APIs.

Casting is an in-database operation, and our backends don't and probably won't (and shouldn't) ever have knowledge of how Ibis models expressions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior inside of ibis
Projects
Status: backlog
Development

No branches or pull requests

3 participants