Add reservoir class #1

Open
mrocklin opened this issue Nov 21, 2013 · 5 comments

Comments

@mrocklin

Reservoir sampling is important. It's probably complex enough to require a separate data structure. As I picture it, it has an add method and a set property. Maybe it follows set semantics, at least for operations like union.

@eriknw
Owner

eriknw commented Nov 21, 2013

Can you expand on the API that you expect reservoir sampling to have, and why you recommend a class with add and set? I expect you are more familiar with this than I.

A basic API would be a function that takes an iterable and the number of samples to draw, and returns a list of n samples. Instead of choosing a fixed n, a convenience might be to specify a percentage or rate of sampling (such as a third).
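Something like this minimal sketch of the function form, perhaps (this is just the classic "Algorithm R"; the name and signature are placeholders, not a proposed API):

```python
import random

def reservoir_sample(seq, n):
    """Draw a uniform sample of n items from an iterable (classic Algorithm R)."""
    reservoir = []
    for i, item in enumerate(seq):
        if i < n:
            # The first n elements always enter the reservoir.
            reservoir.append(item)
        else:
            # Element i replaces a random slot with probability n / (i + 1).
            j = random.randrange(i + 1)
            if j < n:
                reservoir[j] = item
    return reservoir
```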

@mrocklin
Author

Dropping elements at a fixed rate is another useful operation. This could happen either deterministically (keep every third element) or probabilistically (keep each element with probability 1/3). These could both be generator functions.
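Rough sketches of both, just to illustrate (the names here are made up, not a proposed API):

```python
import random
from itertools import islice

def take_every(n, seq):
    """Deterministic thinning: keep every n-th element of seq."""
    return islice(seq, 0, None, n)

def sample_with_probability(p, seq):
    """Probabilistic thinning: keep each element independently with probability p."""
    for item in seq:
        if random.random() < p:
            yield item
```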

In reservoir sampling you want a uniform sample of the stream seen so far. So if our reservoir has size ten, then after seeing 100 elements of the stream each element should have probability 1/10 of being in the reservoir. After seeing 1000 elements, every element should have probability 1/100 of being in the reservoir. In principle the input stream could be infinite.

One solution is to make a generator function that yields the state of the reservoir after receiving each new element from the stream. This is a pure though highly inefficient solution. The other solution is to make a collection-like data structure (res = Reservoir(n=10)) to which we can add elements (res.add(elem)). At any point we can ask for the current state of the reservoir (res.set or set(res)). By adding elements in a clever way we can guarantee that the reservoir is a uniform sample of the stream seen so far. Whenever we keep a newly added element we kick out some other element currently residing in the reservoir; this occurs with decreasing probability as we see more and more elements.
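Here's a rough sketch of what that collection could look like (names are illustrative; under the hood it's just the standard replace-with-decreasing-probability trick, i.e. Algorithm R):

```python
import random

class Reservoir:
    """A uniform sample of size n over all elements added so far."""

    def __init__(self, n):
        self.n = n
        self._items = []
        self._seen = 0  # number of elements added so far

    def add(self, elem):
        self._seen += 1
        if len(self._items) < self.n:
            # The first n elements always enter the reservoir.
            self._items.append(elem)
        else:
            # Keep the new element with probability n / seen; if kept,
            # it evicts a uniformly chosen current resident.
            j = random.randrange(self._seen)
            if j < self.n:
                self._items[j] = elem

    @property
    def set(self):
        return set(self._items)

    def __iter__(self):
        return iter(self._items)
```

Usage would be res = Reservoir(n=10), then res.add(elem) as elements arrive, and res.set (or set(res)) whenever the current sample is wanted.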

I can also build this some time in the future. It's a nice thing to know about though. It's one of the core structures for streaming data.

@mrocklin
Author

@eriknw
Owner

eriknw commented Nov 21, 2013

[...] seen so far.

Ah, that's what I was missing. Thanks. The user may want the intermediate states. The Wikipedia article and other explanations I've seen in recent days describe fully exhausting the initial iterator or stream and then returning the final list.

I like "seen so far", as this will actually work on infinite(-ish) streams.

@mrocklin
Author

Yeah, streaming is a nice application. There are a number of fun variations.
