Outlier detection

Outliers must be taken into account in any analysis because they can bias the results. There are various ways to detect outliers in R, and the most common ones are discussed in this section.

Boxplot

Let us construct a boxplot for the Volume variable of Sampledata by executing the following code:

> boxplot(Sampledata$Volume, main="Volume", boxwex=0.1) 

The graph is as follows:

Figure 2.16: Boxplot for outlier detection

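An outlier is an observation that lies far from the rest of the data. By default, the whiskers of an R boxplot extend to the most extreme observations within 1.5 times the interquartile range of the box, and any point beyond the whiskers is drawn individually. In the preceding boxplot, the outliers are clearly visible as the points located outside the whiskers.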
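The points flagged by the boxplot can also be extracted numerically. The following is a minimal sketch using base R's boxplot.stats(); the volume vector is made-up data standing in for Sampledata$Volume, which is not reproduced here:

```r
# Hypothetical data: 50 evenly spaced values plus two extreme ones.
volume <- c(seq(80, 120, length.out = 50), 250, 300)

# boxplot.stats() applies the same 1.5 * IQR rule that boxplot() draws.
stats <- boxplot.stats(volume)

stats$out                                  # values beyond the whiskers: 250 300
outlier_rows <- which(volume %in% stats$out)
outlier_rows                               # their positions in the vector: 51 52
```

With the real data, replacing volume with Sampledata$Volume should return the values that appear as isolated points in the boxplot.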

LOF algorithm

The local outlier factor (LOF) is used for identifying density-based local outliers. In LOF, the local density of a point is compared with that of its nearest neighbors. A point with a score close to 1 lies in a region about as dense as its neighbors, whereas a point in a much sparser region than its neighbors gets a score well above 1 and is treated as an outlier. Let us consider some of the variables from Sampledata (columns 2 to 4) and execute the following code:

> library(DMwR) 
> Sampledata1<- Sampledata[,2:4] 
> outlier.scores <- lofactor(Sampledata1, k=4) 
> plot(density(outlier.scores)) 

Here, k is the number of neighbors used in the calculation of the local outlier factors.

The graph is as follows:

Figure 2.17: Plot showing outliers by LOF method
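To make the density comparison behind LOF more concrete, here is a rough base R sketch of the calculation. It is not the DMwR implementation: lof_sketch is a hypothetical helper written for illustration, it ignores ties in the k-distance, and it runs on simulated data rather than Sampledata.

```r
# Illustrative LOF in base R (hypothetical helper, not lofactor() itself).
lof_sketch <- function(x, k) {
  d <- as.matrix(dist(x))      # pairwise Euclidean distances
  n <- nrow(d)
  diag(d) <- Inf               # a point is not its own neighbor

  # Indices of the k nearest neighbors of each point.
  nn <- lapply(seq_len(n), function(i) order(d[i, ])[seq_len(k)])

  # k-distance: distance from each point to its k-th nearest neighbor.
  kdist <- sapply(seq_len(n), function(i) d[i, nn[[i]][k]])

  # Local reachability density: inverse of the mean reachability distance,
  # where reach(i, j) = max(k-distance of j, distance between i and j).
  lrd <- sapply(seq_len(n), function(i) {
    reach <- pmax(kdist[nn[[i]]], d[i, nn[[i]]])
    1 / mean(reach)
  })

  # LOF: mean density of the neighbors divided by the point's own density.
  sapply(seq_len(n), function(i) mean(lrd[nn[[i]]]) / lrd[i])
}

# Simulated data: a cluster of 20 points and one isolated point in row 21.
set.seed(1)
x <- rbind(matrix(rnorm(40), ncol = 2), c(10, 10))
scores <- lof_sketch(x, k = 4)
which.max(scores)              # row of the point with the highest score
```

The isolated point sits in a far sparser region than its four nearest neighbors, so its score is much larger than 1, while the points inside the cluster score close to 1.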

If you want the five observations with the highest outlier scores, execute the following code:

> order(outlier.scores, decreasing=TRUE)[1:5] 

This returns the row numbers of those observations, ordered from the highest score downward:

[1] 50 34 40 33 22 
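The row numbers can then be used to pull the corresponding records out of the data for inspection. A small sketch with made-up scores and a made-up data frame (dat and its columns are placeholders, not part of Sampledata):

```r
# Hypothetical data frame and outlier scores, for illustration only.
dat <- data.frame(id = 1:6, value = c(10, 11, 48, 12, 35, 9))
scores <- c(1.02, 0.98, 3.40, 1.10, 2.75, 0.95)

# Row numbers of the two highest scores, highest first.
top <- order(scores, decreasing = TRUE)[1:2]
top                      # 3 5

# Inspect the flagged rows, or mark them in the data frame.
dat[top, ]
dat$is_outlier <- seq_len(nrow(dat)) %in% top
```

Applied to the example above, Sampledata1[c(50, 34, 40, 33, 22), ] would show the five flagged rows.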