Package org.apache.spark.sql

package execution

Linear Supertypes
AnyRef, Any

Type Members

  1. class Approximate extends Comparable[Approximate] with Ordered[Approximate] with Serializable

The true count is greater than the lower bound and less than the max, with the given probability.

    Annotations
    @SQLUserDefinedType()
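The bound semantics above can be illustrated with a minimal, self-contained sketch; the field names and the `+` operation here are illustrative assumptions, not the actual class's API:

```scala
// Hypothetical sketch of an approximate count: the true count lies between
// lowerBound and max with the stated probability. Field names are illustrative.
case class ApproximateCount(lowerBound: Long, estimate: Long, max: Long,
    probabilityWithinBounds: Double) {
  require(lowerBound <= estimate && estimate <= max)

  // combining two independent approximations: the bounds add, and (assuming
  // independence) the probability that both intervals hold is the product
  def +(other: ApproximateCount): ApproximateCount =
    ApproximateCount(lowerBound + other.lowerBound,
      estimate + other.estimate, max + other.max,
      probabilityWithinBounds * other.probabilityWithinBounds)
}
```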
  2. class ApproximateType extends UserDefinedType[Approximate]
  3. class CMSParams extends Serializable
  4. class Hokusai[T] extends AnyRef

    Implements the algorithms and data structures from "Hokusai -- Sketching Streams in Real Time", by Sergiy Matusevych, Alexander Smola, Amr Ahmed. http://www.auai.org/uai2012/papers/231.pdf

    Aggregates state, so this is a mutable class.

    A note on the use of implicits in this file. TimeAggregation takes an implicit constructor parameter: TimeAggregation[T]()(implicit val cmsMonoid: CMSMonoid[T]). The purpose of that is:

    + In Algebird, a CMSMonoid[T] is a factory for creating instances of CMS[T].
    + TimeAggregation needs to occasionally make new CMS instances, so it uses the factory.
    + By making the factory an implicit (in the curried parameter list), the outer context of the TimeAggregation can create/ensure that the factory is there.
    + Hokusai[T] is the "outer context", so it handles that for TimeAggregation.

    TODO:

    1. Decide if the underlying CMS should be mutable (saves memory) or functional (algebird). With the functional approach, every time we merge two CMS we create a third, and having so many may waste or exhaust memory. If we go with a mutable CMS, we have to either make stream-lib's serializable, or make our own.

    2. Clean up the intrusion of algebird shenanigans in the code (implicit factories etc.).

    3. Decide on an API for managing data and time. Do we increment time in a separate operation or add a time parameter to addData()?

    4. Decide if we want this data structure to be mutable or functional. The current version is mutable.
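The implicit-factory pattern described above can be sketched in a self-contained form; `CMSLike` / `CMSMonoidLike` are stand-ins for Algebird's `CMS` / `CMSMonoid`, and the class names are illustrative, not the actual API:

```scala
// Stand-in for Algebird's CMS: a frequency sketch
trait CMSLike[T] {
  def add(item: T): Unit
  def count(item: T): Long
}

// Stand-in for Algebird's CMSMonoid: acts as a factory for fresh sketches
class CMSMonoidLike[T] {
  def create: CMSLike[T] = new CMSLike[T] {
    private val counts = scala.collection.mutable.Map.empty[T, Long]
    def add(item: T): Unit = counts(item) = counts.getOrElse(item, 0L) + 1L
    def count(item: T): Long = counts.getOrElse(item, 0L)
  }
}

// TimeAggregation takes the factory as an implicit curried parameter,
// so the outer context supplies it once instead of threading it everywhere
class TimeAggregationSketch[T]()(implicit val cmsMonoid: CMSMonoidLike[T]) {
  // occasionally needs a fresh sketch; uses the implicit factory
  def freshSketch(): CMSLike[T] = cmsMonoid.create
}

// Hokusai is the "outer context": it ensures the factory is in implicit scope
class HokusaiSketch[T] {
  implicit val cmsMonoid: CMSMonoidLike[T] = new CMSMonoidLike[T]
  val timeAggregation = new TimeAggregationSketch[T]()
}
```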

  5. class IntervalTracker extends AnyRef
  6. final class KeyFrequencyWithTimestamp[T] extends AnyRef
  7. class ReservoirRegionSegmentMap[V] extends ReentrantReadWriteLock with SegmentMap[Row, V] with Serializable

    Created by vivekb on 21/10/16.

  8. final case class SampleOptions(qcs: Array[Int], name: String, fraction: Double, stratumSize: Int, errorLimitColumn: Int, errorLimitPercent: Double, memBatchSize: Int, timeSeriesColumn: Int, timeInterval: Long, concurrency: Int, schema: StructType, bypassSampling: Boolean, qcsPlan: Option[(CodeAndComment, ArrayBuffer[Any], Int, Array[DataType])]) extends Serializable with Product

  9. final class SamplePartition extends Partition with Serializable with Logging
  10. class SnappyContextAQPFunctions extends SnappyContextFunctions with Logging
  11. case class StratifiedSample(options: Map[String, Any], child: LogicalPlan, baseTable: Option[TableIdentifier] = None)(qcs: (Array[Int], Array[String]) = ..., weightedColumn: AttributeReference = ...) extends UnaryNode with Product with Serializable
  12. case class StratifiedSampleExecute(child: SparkPlan, output: Seq[Attribute], options: Map[String, Any], qcs: Array[Int]) extends SparkPlan with UnaryExecNode with Product with Serializable

    Perform stratified sampling given a Query-Column-Set (QCS). This variant can also use a fixed fraction to be sampled instead of a fixed number of total samples, since it is also designed to be used with streaming data.

  13. final class StratifiedSampledRDD extends RDD[InternalRow] with Serializable

  14. abstract class StratifiedSampler extends Serializable with Cloneable with Logging
  15. class StratifiedSamplerCached extends StratifiedSampler with CastLongTime

    A stratified sampling implementation that uses a fraction and an initial cache size. The latter is used as the initial reservoir size per stratum for reservoir sampling. It primarily tries to satisfy the fraction of the total data, repeatedly filling up the cache as required (and expanding the cache size for a bigger reservoir in later rounds if needed). It attempts to satisfy the fraction while ensuring that the selected rows are divided equally among the current strata (for those that received any rows, that is).

  16. class StratifiedSamplerCachedInRegion extends StratifiedSamplerCached

    Created by vivekb on 14/10/16.

    Attributes
    protected
  17. final class StratifiedSamplerErrorLimit extends StratifiedSampler with CastLongTime

    A stratified sampling implementation that uses an error limit, with confidence, on a numerical column, sampling as much as required to keep the expected error within the limit. An optional initial cache size can be specified that is used as the initial reservoir size per stratum for reservoir sampling. The error limit is honoured, as far as possible, for each stratum independently, and the sampling rate is increased or decreased accordingly. It uses standard closed-form estimation of the sampling error, increasing or decreasing the sampling as required (and expanding the cache size for a bigger reservoir in later rounds if needed).
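The closed-form estimation referred to above can be sketched as follows. This is the textbook formula for a confidence interval on a sample mean with finite-population correction, not necessarily the exact formula this class uses; the object and parameter names are illustrative:

```scala
// For a stratum with population size N, sample size n, and sample standard
// deviation s, the standard error of the estimated mean is
//   SE = (s / sqrt(n)) * sqrt(1 - n / N)   (finite-population correction)
// and the confidence-interval half-width is z * SE. Solving z * SE <= e
// for n gives the required sample size.
object ClosedFormSampleSize {
  /** Required sample size so that z * SE <= errorLimit (z = 1.96 for ~95%). */
  def requiredSampleSize(populationSize: Long, stdDev: Double,
      errorLimit: Double, z: Double = 1.96): Long = {
    // without the correction: n0 = (z * s / e)^2
    val n0 = math.pow(z * stdDev / errorLimit, 2)
    // with finite-population correction: n = n0 / (1 + n0 / N)
    math.ceil(n0 / (1.0 + n0 / populationSize)).toLong
  }
}
```

A sampler built on this can grow or shrink each stratum's sample toward the computed `n` as the observed standard deviation changes.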

  18. final class StratifiedSamplerReservoir extends StratifiedSampler


    A simple reservoir based stratified sampler that will use the provided reservoir size for every stratum present in the incoming rows.
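Per-stratum reservoir sampling, as described above, can be sketched with the classic Algorithm R; the class and method names here are illustrative, not the actual API:

```scala
import scala.collection.mutable
import scala.util.Random

// One fixed-size reservoir: after seeing m rows, each row is retained
// with probability reservoirSize / m (Algorithm R)
class ReservoirSampler[T](reservoirSize: Int, rng: Random = new Random(42)) {
  private val reservoir = mutable.ArrayBuffer.empty[T]
  private var seen = 0L

  def add(row: T): Unit = {
    seen += 1
    if (reservoir.size < reservoirSize) reservoir += row
    else {
      // replace a random existing sample with probability reservoirSize / seen
      val j = (rng.nextDouble() * seen).toLong
      if (j < reservoirSize) reservoir(j.toInt) = row
    }
  }

  def samples: Seq[T] = reservoir.toSeq
}

// One reservoir per stratum, keyed by the QCS values of each row
object StratifiedReservoir {
  def sample[T, K](rows: Iterator[T], stratumOf: T => K,
      reservoirSize: Int): Map[K, Seq[T]] = {
    val perStratum = mutable.Map.empty[K, ReservoirSampler[T]]
    rows.foreach { row =>
      perStratum.getOrElseUpdate(stratumOf(row),
        new ReservoirSampler[T](reservoirSize)).add(row)
    }
    perStratum.map { case (k, s) => k -> s.samples }.toMap
  }
}
```

This guarantees the same reservoir size for every stratum regardless of how skewed the strata are, which is the point of the class above.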

  19. final class StratumCache extends StratumReservoir

    An extension to StratumReservoir that also tracks the total samples seen since the last time slot and the shortfall from previous rounds.

  20. class StratumReservoir extends DataSerializable with Sizeable with Serializable

    For each stratum (i.e. a unique set of values for QCS), keep a set of metadata including the number of samples collected, the total number of rows in the stratum seen so far, the QCS key, the reservoir of samples, etc.

  21. final class TopKHokusai[T] extends Hokusai[T] with TopK

  22. final class TopKWrapper extends ReadWriteLock with CastLongTime with Serializable
    Attributes
    protected[org.apache.spark.sql]

Value Members

  1. object Approximate extends Serializable
  2. object ApproximateType extends ApproximateType
  3. object CMSParams extends Serializable
  4. object Hokusai
  5. object IntervalTracker
  6. object SnappyContextAQPFunctions
  7. object SnappyQueryExecution
  8. object StratifiedSampler extends Serializable
  9. object TopKHokusai extends Serializable
  10. object TopKWrapper extends Serializable
  11. package bootstrap
  12. package closedform
  13. package cms
  14. package command
  15. package common
  16. package serializer
  17. package streamsummary
