Efficient `reduceby` #1

mrocklin · 2014-04-05T18:04:57Z

Reduceby is pytoolz' implementation of the split-apply-combine strategy, a very common data analysis pattern. It's equivalent to pandas'

`df.groupby(index)[selection].apply(reduction)`

It's also equivalent to the `by` operation in Julia's DataFrames: http://juliastats.github.io/DataFrames.jl/split_apply_combine.html

To my knowledge no one uses reduceby directly. Its interface, while simple, is hard enough to scare away most non-experts.

The current interface accepts Python functions to split (often a get), and pushes the apply-combine steps into a single associative binary operator. This use of external functions is idiomatic for toolz but limits efficiency for cytoolz. How can we modify the API (or create an entirely new API) to hit this application with great efficiency?
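For readers unfamiliar with the interface, here is a minimal pure-Python sketch of the behavior described above: a key function does the split, and a single binary operator folds each item into a per-key accumulator in one streaming pass. The name `reduceby_sketch` and the sample data are illustrative assumptions, not the toolz or cytoolz implementation.

```python
from operator import add


def reduceby_sketch(key, binop, seq, init):
    """Group items of seq by key(item) and fold each group with binop.

    Illustrative only: one pass over seq, one accumulator per key.
    """
    acc = {}
    for item in seq:
        k = key(item)
        # A key seen for the first time starts from init.
        acc[k] = binop(acc.get(k, init), item)
    return acc


def iseven(x):
    return x % 2 == 0


# Sum the odd and even numbers separately.
totals = reduceby_sketch(iseven, add, [1, 2, 3, 4, 5], 0)
print(totals)  # {False: 9, True: 6}
```

Both `key` and `binop` are called once per item, which is the per-element Python function call overhead that a compiled implementation cannot remove on its own.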

A fast, intuitive, streaming split-apply-combine operation on core data structures would be a serious motivator for some.


eriknw · 2014-04-08T03:16:23Z

Good problem statement. I think this may require a decent amount of consideration to achieve the goals you stated.

Split (key argument):

- Check if key is callable. This is the most generic case.
- Check if key is a list. This is interpreted as a list of indices (or keys for mapping).
  - What about default for missing values? Added complexity; dirtier API.
- If key is not callable or a list, then use key as an index (or key for mapping).

Is this the sort of thing you have in mind?
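As a rough sketch of the dispatch described above, the key argument could be normalized into a function before the main loop. The helper name `normalize_key` is hypothetical, and this version deliberately leaves out the open question of defaults for missing values.

```python
from operator import itemgetter


def normalize_key(key):
    """Turn the key argument into a function of one item (hypothetical helper)."""
    if callable(key):
        # Most generic case: use the function as given.
        return key
    if isinstance(key, list):
        # A list is read as several indices (or mapping keys);
        # the group key becomes a tuple of the selected values.
        return lambda item: tuple(item[k] for k in key)
    # Anything else is a single index (or mapping key).
    return itemgetter(key)


by_first = normalize_key(0)
by_pair = normalize_key([0, 1])
by_name = normalize_key("name")

print(by_first(("a", "b", "c")))    # a
print(by_pair(("a", "b", "c")))     # ('a', 'b')
print(by_name({"name": "Alice"}))   # Alice
```

Knowing that the key is a plain index rather than an arbitrary callable is what would let a compiled implementation skip the Python function call for the split step.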

I am also considering making the init argument optional. If it isn't provided, we could:

- Use the first value received for the key as the initial value (like builtin reduce).
- Reduce using a unary operator (like merge_with).
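The two options could look roughly like this. Everything here is a sketch under assumptions: the `no_default` sentinel and both function names are hypothetical and not part of any existing API.

```python
no_default = object()  # hypothetical sentinel meaning "init was not given"


def reduceby_optional_init(key, binop, seq, init=no_default):
    """Option 1: without init, the first item seen for a key seeds its accumulator."""
    acc = {}
    for item in seq:
        k = key(item)
        if k in acc:
            acc[k] = binop(acc[k], item)
        elif init is no_default:
            acc[k] = item  # same convention as builtin reduce
        else:
            acc[k] = binop(init, item)
    return acc


def reduceby_unary(key, func, seq):
    """Option 2: collect each group, then apply a unary function to it."""
    groups = {}
    for item in seq:
        groups.setdefault(key(item), []).append(item)
    return {k: func(values) for k, values in groups.items()}


def parity(x):
    return x % 2


print(reduceby_optional_init(parity, max, [3, 1, 4, 1, 5, 9, 2, 6]))  # {1: 9, 0: 6}
print(reduceby_unary(parity, sum, [1, 2, 3, 4, 5]))                   # {1: 9, 0: 6}
```

One trade-off to note: the unary variant holds every group in memory before reducing, so it gives up the streaming property that the binary-operator form keeps.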

eriknw mentioned this issue Apr 8, 2014

toolz.merge_with not lazy, breaks with Clojure interface pytoolz/toolz#153
