
Identifying data outliers

An outlier, or anomaly, is an observation that deviates significantly from the other observations in the dataset. Outliers can result from errors in data collection or from variability in measurement. Because they can significantly affect results, it is important to identify them during the EDA process.

Conventional outlier-detection techniques define outliers as points that do not lie in clusters. The user has to model the data points using statistical distributions, and outliers are identified by how they appear in relation to the underlying model. The main problem with these approaches is that, during EDA, the user typically does not know enough about the underlying data distribution.
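To make the distribution-based approach concrete, here is a minimal sketch in plain Python (not Spark) that flags values by their z-score. It assumes the data is roughly normally distributed, which is exactly the kind of prior knowledge the paragraph above says is often missing. The function name `zscore_outliers`, the threshold, and the sample durations are all hypothetical and exist only for illustration.

```python
from statistics import mean, stdev


def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds `threshold`.

    Assumption: the values are roughly normally distributed.
    """
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        # All values are identical, so nothing can deviate.
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Made-up call durations in seconds, with one extreme value.
durations = [120, 95, 130, 110, 105, 98, 125, 115, 102, 3000]

# With only ten points, a single extreme value cannot reach a z-score
# above roughly 2.85, so a threshold below the usual 3.0 is used here.
flagged = zscore_outliers(durations, threshold=2.5)
print(flagged)
```

The need to lower the threshold for a small sample shows how sensitive this kind of test is to assumptions about the data, which is one reason a clustering-based view can be more practical during EDA.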

Combining modeling with visualization during EDA is a good way to build a deeper intuition for our data. Spark MLlib supports a large (and growing) set of distributed machine learning algorithms that make this task simpler. For example, we can apply clustering algorithms and visualize the results to detect outliers in a combination of columns. In the following example, we compute two clusters in our data by applying the k-means clustering algorithm to four values: the last contact duration in seconds (duration), the number of contacts performed during this campaign for this client (campaign), the number of days that passed after the client was last contacted in a previous campaign (pdays), and the number of contacts performed before this campaign for this client (prev).
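The idea of clustering on these four columns can be sketched in plain Python as follows. This is not the Spark MLlib code: in Spark, the `KMeans` estimator from MLlib would do the clustering on a distributed DataFrame. The rows below are made-up values in the order (duration, campaign, pdays, prev). The helper `kmeans`, its deterministic initialization, and the rule of treating the smaller cluster as candidate outliers are assumptions of this sketch.

```python
import math
from collections import Counter


def kmeans(points, k=2, iterations=20):
    """Plain Lloyd's k-means. Returns (centroids, assignments)."""
    # Deterministic initialization (an assumption of this sketch): start
    # from the first point, then add the point farthest from all
    # centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignments


# Hypothetical rows: (duration, campaign, pdays, prev).
rows = [
    (120, 1, 5, 0),
    (95, 2, 6, 1),
    (130, 1, 4, 0),
    (110, 3, 7, 1),
    (105, 2, 5, 0),
    (2900, 30, 300, 20),
    (3100, 28, 310, 22),
]

centroids, assignments = kmeans(rows, k=2)

# Treat the members of the smallest cluster as candidate outliers.
sizes = Counter(assignments)
smallest = min(sizes, key=sizes.get)
candidates = [r for r, a in zip(rows, assignments) if a == smallest]
print(candidates)
```

The features are not scaled here, so duration dominates the distance. On real data it would usually be worth standardizing the columns first, and plotting the clusters rather than relying on cluster size alone.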

Other distributed algorithms useful for EDA include classification, regression, dimensionality reduction, correlation, and hypothesis testing. More details on using Spark SQL and these algorithms are covered in Chapter 6, Using Spark SQL in Machine Learning Applications.
