- Scala for Machine Learning(Second Edition)
- Patrick R. Nicolas
- 332字
- 2021-07-08 10:43:05
Profiling data
The selection of a pre-processing, clustering, or classification algorithm depends highly on the quality and profile of input data (observations and expected values whenever available). The Step 3 – pre-processing data subsection in the Let's kick the tires section of Chapter 1, Getting Started introduced the MinMax
class for normalizing a dataset using the minimum and maximum values
.
Immutable statistics
The mean and standard deviation are the most commonly used statistics.
Note
Mean and variance
Arithmetic mean:

Variance:

Variance adjusted for sampling bias:

Let's extend the MinMax
class with some basic statistics capabilities, Stats
:
class Stats[T: ToDouble](values: Vector[T]) extends MinMax[T](values) { val zero = (0.0. 0.0) val sums= values./:(zero)((s,x) =>(s._1 + x,s._2 + x*x)) //1 lazy val mean = sums._1/values.size //2 lazy val variance = (sums._2 - mean*mean*values.size)/(values.size-1) lazy val stdDev = sqrt(variance) … }
The class Stats
implements immutable statistics. Its constructor computes the sum of values
and sum of square values, sums
(line 1). The statistics such as mean
and variance
are computed once when needed by declaring these values lazy (line 2). The class Stats
inherits the normalization functions of MinMax
.
Z-score and Gauss
The Gaussian distribution of input data is implemented by the gauss
method of the Stats
class:
Note
Gaussian distribution
M1: Gaussian for a mean μ and a standard deviation σ transformation:

def gauss(mu: Double, sigma: Double, x: Double): Double = {
val y = (x - mu)/sigma
INV_SQRT_2PI*Math.exp(-0.5*y*y)/sigma
}
val normal = gauss(1.0, 0.0, _: Double)
The computation of the normal distribution is computed as a partially applied function. The Z-score is computed as a normalization of the raw data taking into account the standard deviation.
Note
Z-score normalization
M2: Z-score for a mean μ and a standard deviation σ:

The computation of the Z-score is implemented by the method zScore
of Stats
:
def zScore: DblVec = values.map(x => (x - mean)/stdDev )
The following chart illustrates the relative behavior of the normalization, zScore
, and normal transformation:

Comparative analysis of linear, Gaussian, and Z-score normalization
- 基于粒計(jì)算模型的圖像處理
- Java程序設(shè)計(jì)與開發(fā)
- Modular Programming with Python
- Building Modern Web Applications Using Angular
- Mastering Predictive Analytics with Python
- Learning Network Forensics
- H5頁面設(shè)計(jì):Mugeda版(微課版)
- C++新經(jīng)典
- Learning Docker Networking
- Mastering Concurrency Programming with Java 9(Second Edition)
- Python硬件編程實(shí)戰(zhàn)
- iOS Development with Xamarin Cookbook
- HTML并不簡單:Web前端開發(fā)精進(jìn)秘籍
- Building Microservices with Go
- 系統(tǒng)分析師UML用例實(shí)戰(zhàn)