Packages

package splits


Type Members

  1. case class BoltzmannSplitter(temperature: Double, rng: Random = Random) extends Splitter[Double] with Product with Serializable

    Find a split for a regression problem

    The splits are picked with a probability related to the reduction in variance: P(split) ~ exp[ -{remaining variance} / ({temperature} * {total variance}) ]. Recall that the "variance" here is weighted by the sample size (so it's really the sum of squared differences from the mean on that side of the split). This is analogous to simulated annealing and Metropolis-Hastings.

    The motivation here is to reduce the correlation of the trees by making random choices between splits that are almost just as good as the strictly optimal one. Reducing the correlation between trees will reduce the variance in an ensemble method (e.g. random forests): the variance will both decrease more quickly with the tree count and will reach a lower floor. In this paragraph, we're using "variance" as in "bias-variance trade-off".

    Division by the local total variance makes the splitting behavior invariant to the data size and the scale of the labels. It also means, however, that you can't set the temperature based on a known absolute noise scale; for that, you'd want to divide by the total weight rather than the total variance.
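    The selection rule above can be sketched as follows. This is an illustrative sketch, not the actual internals; the object and argument names are assumptions:

```scala
// Illustrative sketch of the Boltzmann split-selection weight:
//   P(split) ~ exp[ -remainingVariance / (temperature * totalVariance) ]
// Names here are hypothetical, not part of the actual API.
object BoltzmannWeightSketch {
  // remainingVariance and totalVariance are weighted sums of squared
  // deviations; temperature is the hyperparameter described above
  def splitWeight(remainingVariance: Double, totalVariance: Double, temperature: Double): Double =
    math.exp(-remainingVariance / (temperature * totalVariance))

  def main(args: Array[String]): Unit = {
    // a split that leaves less variance behind gets a larger weight,
    // and lowering the temperature sharpens that preference
    val good = splitWeight(remainingVariance = 2.0, totalVariance = 10.0, temperature = 0.1)
    val poor = splitWeight(remainingVariance = 8.0, totalVariance = 10.0, temperature = 0.1)
    assert(good > poor)
  }
}
```

    Because both variances enter only as a ratio, rescaling the labels leaves the weights unchanged, which is the scale invariance noted above.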

    TODO: allow the rescaling to happen based on the total weight instead of the total variance, as an option

    Created by maxhutch on 11/29/16.

    temperature

    controls how sensitive the probability of a split is to its change in variance; it can be treated as a hyperparameter

  2. class CategoricalSplit extends Split

    Split based on inclusion in a set

  3. case class ClassificationSplitter(randomizedPivotLocation: Boolean = false, rng: Random = Random) extends Splitter[Char] with Product with Serializable

    Find the best split for classification problems.

    Created by maxhutch on 12/2/16.

  4. case class ExtraRandomClassificationSplitter(rng: Random = Random) extends Splitter[Char] with Product with Serializable

    A splitter that defines Extremely Randomized Trees

    This is based on Geurts, P., Ernst, D. & Wehenkel, L. Extremely randomized trees. Mach Learn 63, 3–42 (2006). https://doi.org/10.1007/s10994-006-6226-1.

    rng

    random number generator

  5. case class ExtraRandomRegressionSplitter(rng: Random = Random) extends Splitter[Double] with Product with Serializable

    A splitter that defines Extremely Randomized Trees

    This is based on Geurts, P., Ernst, D. & Wehenkel, L. Extremely randomized trees. Mach Learn 63, 3–42 (2006). https://doi.org/10.1007/s10994-006-6226-1.

    rng

    random number generator

  6. case class MultiTaskSplitter(randomizePivotLocation: Boolean = false, rng: Random = Random) extends Splitter[Array[AnyVal]] with Product with Serializable

    Created by maxhutch on 11/29/16.

  7. class NoSplit extends Split

    If no split was found

  8. class RealSplit extends Split

    Split based on a real value in the index position

  9. case class RegressionSplitter(randomizePivotLocation: Boolean = false, rng: Random = Random) extends Splitter[Double] with Product with Serializable

    Find the best split for regression problems.

    The best split is the one that reduces the total weighted variance:

      totalVariance = N_left * sigma_left^2 + N_right * sigma_right^2

    which, in scala-ish, would be:

      totalVariance = leftWeight * (leftSquareSum / leftWeight - (leftSum / leftWeight)^2)
                    + rightWeight * (rightSquareSum / rightWeight - (rightSum / rightWeight)^2)

    Because we are only comparing splits, we can subtract off the constant leftSquareSum + rightSquareSum, which after some simplification yields:

      totalVariance = -leftSum * leftSum / leftWeight - Math.pow(totalSum - leftSum, 2) / (totalWeight - leftWeight)

    which depends only on updates to leftSum and leftWeight (since totalSum and totalWeight are constant).
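    The algebra above can be checked numerically. This sketch (with hypothetical names, assuming unit sample weights, and not the actual internals) confirms that the shortcut objective differs from the full weighted variance only by a constant, so both rank candidate splits identically:

```scala
// Sketch checking the simplification described above. Unit weights are
// assumed (weight = count); names are illustrative, not the real internals.
object VarianceShortcutSketch {
  // full objective: N_left * sigma_left^2 + N_right * sigma_right^2,
  // i.e. the sum of squared deviations on each side
  def fullObjective(left: Seq[Double], right: Seq[Double]): Double = {
    def weightedVariance(xs: Seq[Double]): Double = {
      val mean = xs.sum / xs.size
      xs.map(x => (x - mean) * (x - mean)).sum // N * sigma^2
    }
    weightedVariance(left) + weightedVariance(right)
  }

  // shortcut objective: the full objective minus the constant
  // leftSquareSum + rightSquareSum
  def shortcut(left: Seq[Double], right: Seq[Double]): Double = {
    val leftSum = left.sum
    val leftWeight = left.size.toDouble
    val totalSum = leftSum + right.sum
    val totalWeight = leftWeight + right.size
    -leftSum * leftSum / leftWeight -
      math.pow(totalSum - leftSum, 2) / (totalWeight - leftWeight)
  }

  def main(args: Array[String]): Unit = {
    // two candidate splits of the same data differ from the full
    // objective by the same constant (the total square sum, 55 here)
    val d1 = fullObjective(Seq(1.0, 2.0), Seq(3.0, 4.0, 5.0)) -
      shortcut(Seq(1.0, 2.0), Seq(3.0, 4.0, 5.0))
    val d2 = fullObjective(Seq(1.0, 2.0, 3.0), Seq(4.0, 5.0)) -
      shortcut(Seq(1.0, 2.0, 3.0), Seq(4.0, 5.0))
    assert(math.abs(d1 - d2) < 1e-9)
    assert(math.abs(d1 - 55.0) < 1e-9) // 1 + 4 + 9 + 16 + 25
  }
}
```

    The shortcut matters because, while scanning candidate pivots, only leftSum and leftWeight change incrementally, so each candidate can be scored in constant time.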

    Created by maxhutch on 11/29/16.

  10. trait Split extends Serializable

    Splits are used by decision trees to partition the input space

  11. trait Splitter[T] extends AnyRef

    Created by maxhutch on 7/5/17.

Value Members

  1. object BoltzmannSplitter extends Serializable
  2. object Splitter
