Starter ideas for using xarray and Awkward Array together #4
Comments
Thanks a lot for this @jpivarski. Very helpful. Storing each array in coordinates is smart, albeit a tad awkward. But if it ends up being an implementation detail and the user gets a nice syntax, I don't see an issue with it. A few questions for now:
Obviously this ragged DataArray is not taking up 16 gigs. Is this because I'm impressed that you could even do this much. :)
Ideally, we'd want to write a subclass for I'm not surprised that xarray has such a mechanism: Awkward Array also has an override mechanism (ak.behavior) that has some features in common, which arose from similar needs. So for (1), it wouldn't be through For (2), that's NumPy:
>>> array = np.array([[1, 2], [3, 4]])
>>> array.nbytes
32
>>> not_actually_more_memory = np.lib.stride_tricks.as_strided(array, (1000000, 1000000), (0, 8))
>>> not_actually_more_memory.nbytes
8000000000000
The
Using (3) which coordinate is zero-length? The coordinates that correspond to list offsets kinda make sense (though the outermost one would make more sense if it were 1 shorter; then it would have the length of the array it represents). The last thing that I put in For instance, to represent
>>> offsets = np.array([0, 3, 3, 5])
>>> content = np.array([1.1, 2.2, 3.3, 4.4, 5.5])
we could pack them like
>>> x = xr.DataArray(
... np.lib.stride_tricks.as_strided(content, (3, 5), (0, 8)),
... {"stops": offsets[1:],
... "": np.lib.stride_tricks.as_strided(np.int64(0), (5,), (0,)) # need *some* coordinate of length 5
... }
... )
>>> x
<xarray.DataArray (stops: 3, : 5)>
array([[1.1, 2.2, 3.3, 4.4, 5.5],
       [1.1, 2.2, 3.3, 4.4, 5.5],
       [1.1, 2.2, 3.3, 4.4, 5.5]])
Coordinates:
  * stops (stops) int64 3 3 5
  * () int64 0 0 0 0 0
That way, the data part of the xarray isn't full of In the above, I also sliced the On the other hand, the original
>>> x.coords["stops"].values.base
array([0, 3, 3, 5])
So, there are a lot of different ways to go, but they each have their downsides. Despite the admonitions in the documentation, maybe it would be possible and reasonable to make a subclass of
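For the stops-plus-content packing shown earlier in this comment, here is a minimal sketch of how the ragged lists could be read back out using plain NumPy. It does not use xarray or Awkward Array, and the list it reconstructs is inferred from the offsets and content values above rather than quoted from the thread.

```python
import numpy as np

content = np.array([1.1, 2.2, 3.3, 4.4, 5.5])
stops = np.array([3, 3, 5])  # same values as offsets[1:] above

# Each list starts where the previous one stopped; the first starts at 0.
starts = np.concatenate(([0], stops[:-1]))

# Slice the flat content once per (start, stop) pair.
ragged = [content[a:b].tolist() for a, b in zip(starts, stops)]
print(ragged)  # [[1.1, 2.2, 3.3], [], [4.4, 5.5]]
```

The middle list is empty because its start and stop are both 3.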
I realize that you're collecting use-case ideas right now, but eventually you'll need implementations and here's an idea to start.
Efficiently encoding ragged data in xarray
An Awkward Array can be broken down into a set of different-length one-dimensional arrays, and xarray coordinates can all have different lengths. The data block needs to be a product of those dimensions, but what if the data block is a zero-strided array, so that it can have arbitrary shape but take no memory?
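As a minimal sketch of the zero-strided idea, assuming np.broadcast_to is an acceptable way to build the block (the thread's own snippets use np.lib.stride_tricks.as_strided instead): a single NaN in memory can be presented as a block of any shape.

```python
import numpy as np

# One NaN in memory, presented as a (3, 5) block via zero strides.
# np.broadcast_to returns a read-only view, which suits a placeholder.
placeholder = np.broadcast_to(np.float64(np.nan), (3, 5))

print(placeholder.shape)    # (3, 5)
print(placeholder.strides)  # (0, 0)
print(placeholder.nbytes)   # 120: the nominal size, not the memory actually held
```

Note that nbytes reports shape times item size, so it overstates what a zero-strided block really occupies.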
Here's a converter from Awkward list-type arrays (of arbitrary depth, but only lists) to xarray:
The xarrays made this way don't look like normal arrays, and they shouldn't. The wall of nan is a hint that this is not a normal array.
Doing Awkward-style slices (and other methods)
Now here's an accessor that reconstructs the Awkward Array (maybe it should only be allowed to succeed if the data consist of zero-strided NaNs?). It also provides methods like __getitem__ that lift from xarray into Awkward, perform the slice, and then convert back to xarray.
So any slice that could have been performed on the Awkward Array can now be performed on the xarray as well, as long as we go through the ak accessor:
Since these conversions between xarray and Awkward would be happening frequently, it's important that they are zero-copy.
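One way to check the zero-copy property, sketched here with plain NumPy buffers rather than the accessor itself (np.shares_memory is my choice of test, not something from the thread): a slice of an offsets array is a view, while rebuilding offsets from stops by concatenation allocates a new buffer.

```python
import numpy as np

offsets = np.array([0, 3, 3, 5])

stops_view = offsets[1:]                     # slicing: a view, no copy
rebuilt = np.concatenate(([0], stops_view))  # concatenation: a new buffer

print(np.shares_memory(offsets, stops_view))  # True
print(np.shares_memory(offsets, rebuilt))     # False
```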
Similarly,
and
Extensions
Now I'm getting greedy again: I don't want to be limited to only lists of (lists of...) numbers, but also regular-length dimensions, nested records, missing data, and all that. Since xarrays can hold any number of different-length coordinates, maybe we can unpack arbitrary arrays:
but then the coordinate names would have less meaning. These three can be identified as nested-list dimensions, but if the array has any missing data, there would be additional "coords" for the masks, if it has any regular dimensions, there wouldn't be corresponding "coords" for those dimensions, if it has nested record fields, there would be a lot of "coords", etc.
So it's a question of how closely the "coords" needs to correspond to actual coordinates. In the previous example, xarray x had as many dimensions as the array it represented, but the lengths of those dimensions didn't have a direct relationship with the ragged array. Other than the first dimension, it can't because the array is ragged. (And if you make the first dimension be just the stopping indexes of each list, rather than fence-posts between all the lists, then converting back to an Awkward Array can't be zero-copy.)
Metadata
The example I showed above preserves the names of some of the axes by putting the xarray axis names into Awkward parameters, performing the slice, and then pulling them back out. The name "second" was lost (replaced with "dim_1" by the code written above) because it was rewritten by the [0, -1] part of the slice. Although that's a natural consequence of one layout node being replaced by another, it's probably not what we want here.
scikit-hep/awkward#1391 is a still-open request for Awkward to handle metadata better: to preserve it and propagate it through calculations in a way that is appropriate for xarray. Actually adding that Awkward feature depends on how it will be used in conversions to and from xarray, so that PR is interdependent with this project.
Thoughts?
What's good and what's bad?
Cc: @TomNicholas, @joshmoore, who were also on the email.