Efficient reduceby #1

Open
mrocklin opened this issue Apr 5, 2014 · 1 comment
Comments

@mrocklin
Member

mrocklin commented Apr 5, 2014

Reduceby is pytoolz' implementation of the split-apply-combine strategy, a very common data analysis pattern. It's equivalent to pandas'

df.groupby(index)[selection].apply(reduction)

It's also equivalent to the by operation in Julia's DataFrames: http://juliastats.github.io/DataFrames.jl/split_apply_combine.html

To my knowledge no one uses reduceby directly; its interface, while simple, is hard enough to scare away most non-experts.

The current interface accepts Python functions to split (often a get), and pushes the apply-combine steps into a single associative binary operator. This use of external functions is idiomatic for toolz but limits efficiency for cytoolz. How can we modify the API (or create an entirely new API) to hit this application with great efficiency?
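For readers unfamiliar with the interface, here is a rough pure-Python sketch of the behavior described above. The name `reduceby_sketch` and the helpers `iseven` and `add` are made up for illustration; this is not the toolz source, only an approximation of the (key, binop, seq, init) shape.

```python
def reduceby_sketch(key, binop, seq, init):
    """Streaming split-apply-combine: keep one accumulator per group."""
    acc = {}
    for item in seq:
        k = key(item)                 # split: one Python-level call per item
        if k not in acc:
            acc[k] = init
        acc[k] = binop(acc[k], item)  # apply-combine: a second call per item
    return acc


def iseven(n):
    return n % 2 == 0


def add(x, y):
    return x + y


result = reduceby_sketch(iseven, add, [1, 2, 3, 4, 5], 0)
print(result)  # {False: 9, True: 6}
```

The two calls per item into arbitrary Python functions are the part that a compiled implementation cannot easily speed up, which is the efficiency concern raised here.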

A fast, intuitive, streaming split-apply-combine operation on core data structures would be a serious motivator for some.

@eriknw
Member

eriknw commented Apr 8, 2014

Good problem statement. I think this may require a decent amount of consideration to achieve the goals you stated.

Split (key argument):

  • Check if key is callable. This is the most generic case.
  • Check if key is a list. This is interpreted as a list of indices (or keys, for a mapping).
    • What about a default for missing values? Added complexity; dirtier API.
  • If key is neither callable nor a list, then use key as a single index (or key, for a mapping).

Is this the sort of thing you have in mind?
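The dispatch on key described above could be sketched roughly as follows. `make_key` is a hypothetical helper name, not an existing toolz or cytoolz function, and the sketch leaves the question of defaults for missing values unhandled.

```python
from operator import itemgetter


def make_key(key):
    """Turn a key argument into a function that extracts the group key."""
    if callable(key):
        # Most generic case: use the function as given.
        return key
    if isinstance(key, list):
        # A list of indices (or mapping keys): group by a tuple of fields.
        return lambda item: tuple(item[k] for k in key)
    # Anything else is treated as a single index (or mapping key).
    return itemgetter(key)


print(make_key(0)(("a", 1)))                     # a
print(make_key([0, 1])(("a", 1, "x")))           # ('a', 1)
print(make_key("name")({"name": "Alice"}))       # Alice
```

The non-callable branches never need to call back into user code, so in principle they are the cases a compiled implementation could specialize.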

I am also considering making the init argument optional. If it isn't provided, we could:

  1. Use the first value received for the key as the initial value (like builtin reduce)
  2. Reduce using a unary operator (like merge_with).
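A minimal sketch of the two options, under the assumption that a private sentinel marks "no init given". The names `_no_default`, `reduceby_optional_init`, and `reduceby_unary` are illustrative only.

```python
_no_default = object()  # hypothetical sentinel meaning "init was not passed"


def reduceby_optional_init(key, binop, seq, init=_no_default):
    """Option 1: seed each group with its first value when init is omitted."""
    acc = {}
    for item in seq:
        k = key(item)
        if k in acc:
            acc[k] = binop(acc[k], item)
        elif init is _no_default:
            acc[k] = item              # first value for this key is the seed
        else:
            acc[k] = binop(init, item)
    return acc


def reduceby_unary(key, func, seq):
    """Option 2: collect each group's values, then reduce with a unary func."""
    groups = {}
    for item in seq:
        groups.setdefault(key(item), []).append(item)
    return {k: func(values) for k, values in groups.items()}


def iseven(n):
    return n % 2 == 0


print(reduceby_optional_init(iseven, max, [1, 2, 3, 4, 5]))  # {False: 5, True: 4}
print(reduceby_unary(iseven, sum, [1, 2, 3, 4, 5]))          # {False: 9, True: 6}
```

Note that the second option, as sketched, holds every group's values in memory before reducing, so it gives up the streaming property that the first option keeps.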
