-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add reservoir class #1
Comments
Can you expand on the API that you expect reservoir sampling to have, and why you recommend a class with A basic API would be a function that inputs an iterable and the number of samples to draw, and the output would be a list of length n of the samples. Instead of choosing fixed |
Choosing a fixed rate to drop off is another useful activity. This could happen either deterministically (keep every third element) or probabilistically (keep an element with probability 1/3). These could both be generator functions. In reservoir sampling you want a uniform sample of the stream seen so far. So if our reservoir is of size ten then after seeing 100 elements of the stream each element should have probability 1/10 of being in the reservoir. After seeing 1000 elements then every element should have probability 1/100 of being in the reservoir. In principle the input stream could be infinite. One solution to this is to make a generator function that yields the state of the reservoir after receiving each new element from the stream. This is a pure though highly inefficient solution. The other solution is to make a collection-like datastructure ( I can also build this some time in the future. It's a nice thing to know about though. It's one of the core structures for streaming data. |
Ah, that's what I was missing. Thanks. The user may want the intermediate states. The wikipedia article and other explanations I've seen in recent days describe the initial iterator or stream being fully exhausted, then returning the final list. I like "seen so far", as this will actually work on infinite(-ish) streams. |
Yeah, streaming is a nice application. There are a number of fun variations. |
Reservoir sampling is important. It's probably complex enough to require a separate data structure. My view of it has an
add
method and aset
property. Maybe it follows set semantics at least for operations like union.The text was updated successfully, but these errors were encountered: