
Profiling data

The choice of a pre-processing, clustering, or classification algorithm depends heavily on the quality and profile of the input data (observations and, whenever available, expected values). The Step 3 – pre-processing data subsection in the Let's kick the tires section of Chapter 1, Getting Started, introduced the MinMax class for normalizing a dataset using its minimum and maximum values.

Immutable statistics

The mean and standard deviation are the most commonly used statistics.

Note

Mean and variance

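Arithmetic mean:

$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Variance:

$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2$$

Variance adjusted for sampling bias:

$$\sigma^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \mu)^2$$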

Let's extend the MinMax class with some basic statistics capabilities, Stats:

import scala.math.sqrt

class Stats[T: ToDouble](values: Vector[T]) 
extends MinMax[T](values) {

  // Accumulator for (sum of values, sum of squared values)
  val zero = (0.0, 0.0)
  val sums = values./:(zero)((s, x) => (s._1 + x, s._2 + x*x)) //1
  lazy val mean = sums._1/values.size //2
  // Variance adjusted for sampling bias (n - 1 denominator)
  lazy val variance = 
     (sums._2 - mean*mean*values.size)/(values.size-1)
  lazy val stdDev = sqrt(variance)
…
}

The Stats class implements immutable statistics. Its constructor computes the sum of the values and the sum of their squares in a single pass and stores both in sums (line 1). Statistics such as the mean and the variance are declared lazy, so each is computed at most once, on first access (line 2). The variance uses the n - 1 denominator, that is, the variance adjusted for sampling bias. The Stats class inherits the normalization functions of MinMax.
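To make the single-pass computation concrete, here is a minimal, self-contained sketch. It is not the book's Stats class: SimpleStats is a hypothetical name, it works on Double values only, and it does not extend MinMax or rely on the ToDouble type class, which are defined elsewhere.

```scala
// Hypothetical, simplified stand-in for Stats (Double only, no MinMax parent).
final class SimpleStats(values: Vector[Double]) {
  require(values.size > 1, "SimpleStats needs at least two values")

  // Single pass over the data: accumulate (sum of x, sum of x*x)
  private val sums: (Double, Double) =
    values.foldLeft((0.0, 0.0)) { case ((s, sq), x) => (s + x, sq + x * x) }

  // Each lazy val is evaluated on first access and cached afterwards
  lazy val mean: Double = sums._1 / values.size

  // Variance adjusted for sampling bias (n - 1 denominator)
  lazy val variance: Double =
    (sums._2 - mean * mean * values.size) / (values.size - 1)

  lazy val stdDev: Double = math.sqrt(variance)
}

object SimpleStatsDemo {
  def main(args: Array[String]): Unit = {
    val stats = new SimpleStats(Vector(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0))
    println(s"mean = ${stats.mean}, variance = ${stats.variance}, stdDev = ${stats.stdDev}")
  }
}
```

The sum-of-squares formulation needs only one traversal, but it subtracts two potentially large numbers, so it can lose precision when the mean is large relative to the spread of the data.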

Z-score and Gauss

The Gaussian distribution of input data is implemented by the gauss method of the Stats class:

Note

Gaussian distribution

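M1: Gaussian density for a mean μ and a standard deviation σ:

$$p(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$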

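// 1/sqrt(2*pi), defined here so the snippet stands on its own
val INV_SQRT_2PI = 1.0/Math.sqrt(2.0*Math.PI)

def gauss(mu: Double, sigma: Double, x: Double): Double = {
   val y = (x - mu)/sigma
   INV_SQRT_2PI*Math.exp(-0.5*y*y)/sigma
}
// Standard normal: mean 0.0 and standard deviation 1.0
// (a standard deviation of 0.0 would divide by zero)
val normal = gauss(0.0, 1.0, _: Double)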

The normal distribution is defined as a partially applied function of gauss: the mean and the standard deviation are fixed, which leaves x as the only argument. The Z-score normalizes the raw data by centering it on the mean and scaling it by the standard deviation.
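As an illustration of the partial application, the following self-contained sketch fixes the first two arguments of gauss and evaluates the resulting one-argument functions. The names GaussSketch, standardNormal, and shifted are hypothetical, and the parameter check on sigma is an addition, not part of the original method.

```scala
object GaussSketch {
  private val InvSqrt2Pi = 1.0 / math.sqrt(2.0 * math.Pi)

  // Gaussian density for mean mu and standard deviation sigma, evaluated at x
  def gauss(mu: Double, sigma: Double, x: Double): Double = {
    require(sigma > 0.0, "sigma must be strictly positive")
    val y = (x - mu) / sigma
    InvSqrt2Pi * math.exp(-0.5 * y * y) / sigma
  }

  // Partial application: mu and sigma are fixed, x stays open
  val standardNormal: Double => Double = gauss(0.0, 1.0, _: Double)
  val shifted: Double => Double = gauss(2.0, 0.5, _: Double)

  def main(args: Array[String]): Unit =
    Seq(-1.0, 0.0, 1.0).foreach { x =>
      println(f"x = $x%.1f  density = ${standardNormal(x)}%.4f")
    }
}
```

Each partially applied function has the type Double => Double, so it can be passed directly to map over a collection of observations.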

Note

Z-score normalization

M2: Z-score for a mean μ and a standard deviation σ:

$$z = \frac{x - \mu}{\sigma}$$

The computation of the Z-score is implemented by the method zScore of Stats:

def zScore: DblVec = values.map(x => (x - mean)/stdDev )

The following chart illustrates the relative behavior of the normalization, zScore, and normal transformation:

Comparative analysis of linear, Gaussian, and Z-score normalization
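The three transformations can also be compared numerically on a small sample. The sketch below is only an approximation of what the chart shows: the object NormalizationComparison, its function names, and the sample values are hypothetical, the linear normalization is assumed to scale to [0, 1], and the Gaussian is assumed to use the sample mean and standard deviation.

```scala
object NormalizationComparison {
  private val InvSqrt2Pi = 1.0 / math.sqrt(2.0 * math.Pi)

  // Sample mean and bias-adjusted standard deviation (n - 1 denominator)
  def meanAndStdDev(xs: Vector[Double]): (Double, Double) = {
    val mean = xs.sum / xs.size
    val variance = xs.map(x => (x - mean) * (x - mean)).sum / (xs.size - 1)
    (mean, math.sqrt(variance))
  }

  // Linear (min-max) scaling to [0, 1]; assumes at least two distinct values
  def minMax(xs: Vector[Double]): Vector[Double] = {
    val (lo, hi) = (xs.min, xs.max)
    xs.map(x => (x - lo) / (hi - lo))
  }

  // Z-score: center on the mean, scale by the standard deviation
  def zScore(xs: Vector[Double]): Vector[Double] = {
    val (mean, stdDev) = meanAndStdDev(xs)
    xs.map(x => (x - mean) / stdDev)
  }

  // Gaussian density evaluated with the sample mean and standard deviation
  def gaussian(xs: Vector[Double]): Vector[Double] = {
    val (mean, stdDev) = meanAndStdDev(xs)
    xs.map { x =>
      val y = (x - mean) / stdDev
      InvSqrt2Pi * math.exp(-0.5 * y * y) / stdDev
    }
  }

  def main(args: Array[String]): Unit = {
    val xs = Vector(1.0, 2.0, 3.0, 4.0, 10.0)
    val rows = xs.zip(minMax(xs)).zip(zScore(xs)).zip(gaussian(xs))
    println("x\tminMax\tzScore\tgauss")
    rows.foreach { case (((x, m), z), g) =>
      println(f"$x%.1f\t$m%.3f\t$z%.3f\t$g%.4f")
    }
  }
}
```

The linear and Z-score transformations preserve the ordering of the observations, whereas the Gaussian peaks at the mean and decreases symmetrically on both sides of it.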
