Add reservoir class #1

Open
mrocklin opened this issue Nov 21, 2013 · 5 comments

Comments

@mrocklin

Reservoir sampling is important. It's probably complex enough to require a separate data structure. As I picture it, it has an add method and a set property. Maybe it follows set semantics, at least for operations like union.

@eriknw
Owner

eriknw commented Nov 21, 2013

Can you expand on the API that you expect reservoir sampling to have, and why you recommend a class with add and set? I expect you are more familiar with this than I.

A basic API would be a function that takes an iterable and the number of samples to draw, and returns a list of n samples. Instead of choosing a fixed n, a convenience might be to specify a percentage or rate of sampling (such as a third).
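Something like this minimal sketch of the function form, perhaps (this is just the classic "Algorithm R"; the name and signature are placeholders, not a proposed API):

```python
import random

def reservoir_sample(seq, n):
    """Draw a uniform sample of n items from an iterable (classic Algorithm R)."""
    reservoir = []
    for i, item in enumerate(seq):
        if i < n:
            # The first n elements always enter the reservoir.
            reservoir.append(item)
        else:
            # Element i replaces a random slot with probability n / (i + 1).
            j = random.randrange(i + 1)
            if j < n:
                reservoir[j] = item
    return reservoir
```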

@mrocklin
Author

Dropping elements at a fixed rate is another useful operation. This could happen either deterministically (keep every third element) or probabilistically (keep each element with probability 1/3). These could both be generator functions.
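Rough sketches of both, just to illustrate (the names here are made up, not a proposed API):

```python
import random
from itertools import islice

def take_every(n, seq):
    """Deterministic thinning: keep every n-th element of seq."""
    return islice(seq, 0, None, n)

def sample_with_probability(p, seq):
    """Probabilistic thinning: keep each element independently with probability p."""
    for item in seq:
        if random.random() < p:
            yield item
```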

In reservoir sampling you want a uniform sample of the stream seen so far. So if our reservoir has size ten, then after seeing 100 elements of the stream each element should have probability 1/10 of being in the reservoir. After seeing 1000 elements, every element should have probability 1/100 of being in the reservoir. In principle the input stream could be infinite.

One solution is to make a generator function that yields the state of the reservoir after receiving each new element from the stream. This is a pure though highly inefficient solution. The other solution is to make a collection-like data structure (res = Reservoir(n=10)) to which we can add elements (res.add(elem)). At any point we can ask for the current state of the reservoir (res.set or set(res)). By adding elements in a clever way we can guarantee that the reservoir is a uniform sample of the stream seen so far. Whenever we keep a newly added element we kick out some other element currently residing in the reservoir; this occurs with decreasing probability as we see more and more elements.
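Here's a rough sketch of what that collection could look like (names are illustrative; under the hood it's just the standard replace-with-decreasing-probability trick, i.e. Algorithm R):

```python
import random

class Reservoir:
    """A uniform sample of size n over all elements added so far."""

    def __init__(self, n):
        self.n = n
        self._items = []
        self._seen = 0  # number of elements added so far

    def add(self, elem):
        self._seen += 1
        if len(self._items) < self.n:
            # The first n elements always enter the reservoir.
            self._items.append(elem)
        else:
            # Keep the new element with probability n / seen; if kept,
            # it evicts a uniformly chosen current resident.
            j = random.randrange(self._seen)
            if j < self.n:
                self._items[j] = elem

    @property
    def set(self):
        return set(self._items)

    def __iter__(self):
        return iter(self._items)
```

Usage would be res = Reservoir(n=10), then res.add(elem) as elements arrive, and res.set (or set(res)) whenever the current sample is wanted.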

I can also build this some time in the future. It's a nice thing to know about though. It's one of the core structures for streaming data.

@mrocklin
Author

@eriknw
Owner

eriknw commented Nov 21, 2013

[...] seen so far.

Ah, that's what I was missing. Thanks. The user may want the intermediate states. The Wikipedia article and other explanations I've seen in recent days describe fully exhausting the initial iterator or stream and then returning the final list.

I like "seen so far", as this will actually work on infinite(-ish) streams.

@mrocklin
Author

Yeah, streaming is a nice application. There are a number of fun variations.
