The true count is greater than the lower bound and less than the max, with the given probability.
Implements the algorithms and data structures from "Hokusai -- Sketching Streams in Real Time", by Sergiy Matusevych, Alexander Smola, Amr Ahmed. http://www.auai.org/uai2012/papers/231.pdf
Aggregates state, so this is a mutable class.
Since we are all still learning Scala, I thought I'd explain the use of implicits in this file. TimeAggregation takes an implicit constructor parameter:
TimeAggregation[T]()(implicit val cmsMonoid: CMSMonoid[T])
The purpose of that is:
+ In Algebird, a CMSMonoid[T] is a factory for creating instances of CMS[T].
+ TimeAggregation occasionally needs to make new CMS instances, so it uses the factory.
+ Because the factory is an implicit (in a curried parameter list), the outer context of the TimeAggregation can create it or ensure that it is there.
+ Hokusai[T] will be the "outer context", so it can handle that for TimeAggregation.
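The pattern can be sketched without Algebird. The snippet below is a minimal illustration only: Sketch, SketchFactory, TimeAggregationSketch and ImplicitFactoryDemo are hypothetical stand-ins, not the real CMS[T], CMSMonoid[T] or TimeAggregation[T], and an exact Map replaces the probabilistic sketch to keep the example short.

```scala
// Hypothetical stand-in for CMS[T]: an exact counter, not a real sketch.
final case class Sketch[T](counts: Map[T, Long]) {
  def add(item: T): Sketch[T] =
    Sketch(counts.updated(item, counts.getOrElse(item, 0L) + 1L))
}

// Hypothetical stand-in for CMSMonoid[T]: a factory for empty sketches.
trait SketchFactory[T] {
  def empty: Sketch[T]
}

// The factory arrives through a second, implicit parameter list,
// so callers do not have to pass it explicitly.
class TimeAggregationSketch[T]()(implicit val factory: SketchFactory[T]) {
  private var current: Sketch[T] = factory.empty

  def addData(item: T): Unit = { current = current.add(item) }

  def count(item: T): Long = current.counts.getOrElse(item, 0L)

  // Start a fresh sketch, e.g. when a time interval rolls over.
  def reset(): Unit = { current = factory.empty }
}

// The "outer context" (the role Hokusai[T] plays) supplies the factory.
object ImplicitFactoryDemo {
  implicit val stringFactory: SketchFactory[String] =
    new SketchFactory[String] { def empty: Sketch[String] = Sketch(Map.empty) }

  def main(args: Array[String]): Unit = {
    val agg = new TimeAggregationSketch[String]() // factory resolved implicitly
    agg.addData("a")
    agg.addData("a")
    println(agg.count("a"))
  }
}
```

If no implicit SketchFactory[String] is in scope, the constructor call fails at compile time, which is how the outer context is forced to provide the factory.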
TODO
1. Decide if the underlying CMS should be mutable (saves memory) or functional (Algebird). I'm afraid that with the functional approach, and with so many sketches, every merge of two CMS creates a third, which is wasteful of memory or may take too much memory. If we go with a mutable CMS, we have to either make stream-lib's serializable or make our own.
2. Clean up the intrusion of Algebird shenanigans in the code (implicit factories etc.).
3. Decide on an API for managing data and time. Do we increment time in a separate operation or add a time parameter to addData()?
4. Decide if we want this data structure to be mutable or functional. The current version is mutable.
Created by vivekb on 21/10/16.
Perform stratified sampling given a Query-Column-Set (QCS). This variant can also sample a fixed fraction instead of a fixed total number of samples, since it is also designed to be used with streaming data.
A stratified sampling implementation that uses a fraction and an initial cache size. The latter is used as the initial reservoir size per stratum for reservoir sampling. It primarily tries to satisfy the fraction of the total data, repeatedly filling up the cache as required (and expanding the cache size for a bigger reservoir in subsequent rounds if required). It attempts to satisfy the fraction while ensuring that the selected rows are divided equally among the current strata (those that received any rows, that is).
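The equal-division rule can be illustrated with a small calculation. This is a simplified sketch under stated assumptions, not the implementation described above: FractionTargetSketch and perStratumTarget are hypothetical names, and any shortfall from a stratum with too few rows is simply dropped rather than redistributed.

```scala
object FractionTargetSketch {
  // rowsPerStratum: rows seen so far for each stratum key.
  // Returns how many rows each non-empty stratum should contribute.
  def perStratumTarget[K](rowsPerStratum: Map[K, Long], fraction: Double): Map[K, Long] = {
    // Only strata that received any rows take part in the division.
    val active = rowsPerStratum.filter(_._2 > 0L)
    if (active.isEmpty) Map.empty[K, Long]
    else {
      val total = active.values.sum
      val wanted = math.ceil(total * fraction).toLong
      // Equal share per active stratum, at least one row each.
      val share = math.max(1L, wanted / active.size)
      // A stratum cannot contribute more rows than it has seen.
      active.map { case (key, seen) => key -> math.min(seen, share) }
    }
  }
}
```

For example, with 900 rows in one stratum, 100 in another and a fraction of 0.5, the equal share is 250 rows each, capped at 100 for the smaller stratum.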
Created by vivekb on 14/10/16.
A stratified sampling implementation that uses an error limit with confidence on a numerical column to sample as much as required to maintain the expected error within the limit. An optional initial cache size can be specified, which is used as the initial reservoir size per stratum for reservoir sampling. The error limit is honoured for each stratum independently where possible, and the sampling rate is increased or decreased accordingly. It uses a standard closed-form estimate of the sampling error, increasing or decreasing the sampling as required (and expanding the cache size for a bigger reservoir in subsequent rounds if required).
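The text does not say which closed form is used. As an assumption, the sketch below uses the textbook margin of error for a sample mean, z * stdDev / sqrt(n), and inverts it to get the sample count needed for a given limit. ErrorLimitSketch and its methods are hypothetical names, and the z-score for the desired confidence level is passed in by the caller.

```scala
object ErrorLimitSketch {
  // Margin of error for the mean of n samples with sample standard
  // deviation stdDev, at the z-score for the chosen confidence level.
  def marginOfError(stdDev: Double, n: Long, z: Double): Double = {
    require(n > 0L, "n must be positive")
    z * stdDev / math.sqrt(n.toDouble)
  }

  // Smallest n for which the margin of error stays within errorLimit.
  def requiredSamples(stdDev: Double, errorLimit: Double, z: Double): Long = {
    require(errorLimit > 0.0, "errorLimit must be positive")
    val root = z * stdDev / errorLimit
    math.max(1L, math.ceil(root * root).toLong)
  }
}
```

Applied per stratum, a sampler could compare requiredSamples against the current reservoir size and raise or lower the sampling rate for the next round.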
A simple reservoir-based stratified sampler that uses the provided reservoir size for every stratum present in the incoming rows.
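A minimal sketch of the idea, assuming classic reservoir sampling (Algorithm R) applied independently to each stratum. StratumReservoirSketch, SimpleStratifiedSampler, the qcs key function and the seed parameter are hypothetical names for illustration, not the classes documented here.

```scala
import scala.collection.mutable
import scala.util.Random

// One fixed-size reservoir for a single stratum.
final class StratumReservoirSketch[R](capacity: Int) {
  val samples: mutable.ArrayBuffer[R] = mutable.ArrayBuffer.empty[R]
  var seen: Long = 0L // total rows observed for this stratum

  def offer(row: R, rng: Random): Unit = {
    seen += 1
    if (samples.size < capacity) {
      samples += row // reservoir not yet full: always keep the row
    } else {
      // Keep the new row with probability capacity / seen by
      // overwriting a uniformly chosen slot.
      val j = (rng.nextDouble() * seen).toLong
      if (j < capacity) samples(j.toInt) = row
    }
  }
}

// Routes each row to the reservoir of its stratum; qcs extracts the
// stratum key (the QCS values) from a row.
final class SimpleStratifiedSampler[K, R](reservoirSize: Int, qcs: R => K, seed: Long = 42L) {
  private val rng = new Random(seed)
  private val strata = mutable.Map.empty[K, StratumReservoirSketch[R]]

  def add(row: R): Unit =
    strata
      .getOrElseUpdate(qcs(row), new StratumReservoirSketch[R](reservoirSize))
      .offer(row, rng)

  def samplesFor(key: K): Seq[R] =
    strata.get(key).map(_.samples.toSeq).getOrElse(Seq.empty[R])

  def seenFor(key: K): Long =
    strata.get(key).map(_.seen).getOrElse(0L)
}
```

Every stratum keeps at most reservoirSize rows regardless of how many rows it receives, so rare strata are fully retained while frequent ones are down-sampled.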
An extension to StratumReservoir that also tracks the total samples seen since the last time slot and the shortfall from previous rounds.
For each stratum (i.e. a unique set of values for the QCS), keep a set of metadata including the number of samples collected, the total number of rows in the stratum seen so far, the QCS key, the reservoir of samples, etc.