- Learning Spark SQL
- Aurobindo Sarkar
- 2021-07-02 18:23:47
Identifying data outliers
An outlier, or anomaly, is an observation that deviates significantly from the other observations in the Dataset. Outliers can result from errors in data collection or from variability in measurement. Because they can significantly affect results, it is important to identify them during the EDA process.
However, common detection techniques define outliers as points that do not lie in clusters. The user has to model the data points using statistical distributions, and outliers are identified by how they appear in relation to the underlying model. The main problem with these approaches is that, during EDA, the user typically does not know enough about the underlying data distribution.
EDA that combines modeling and visualization is a good way to build a deeper intuition for our data. Spark MLlib supports a large (and growing) set of distributed machine learning algorithms that make this task simpler. For example, we can apply clustering algorithms and visualize the results to detect outliers in a combination of columns. In this example, we apply the k-means clustering algorithm to compute two clusters in our data from four values: the last contact duration in seconds (duration), the number of contacts performed during this campaign for this client (campaign), the number of days since the client was last contacted in a previous campaign (pdays), and the number of contacts performed before this campaign for this client (prev).
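The clustering step described above might look roughly like the following. This is a minimal sketch, not the book's original listing: it assumes the DataFrame-based spark.ml API (VectorAssembler and KMeans), a local SparkSession, and a handful of made-up rows standing in for the real dataset. Only the four column names come from the text; the values and the application name are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.clustering.KMeans

// Local session for the sketch; in spark-shell, getOrCreate() reuses the existing session.
val spark = SparkSession.builder()
  .appName("kmeans-outlier-sketch") // hypothetical name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Made-up rows for illustration only (not taken from the real dataset):
// (duration, campaign, pdays, prev)
val df = Seq(
  (120.0, 1.0, 30.0, 0.0),
  (95.0, 2.0, 28.0, 0.0),
  (210.0, 1.0, 35.0, 1.0),
  (150.0, 3.0, 31.0, 0.0),
  (3100.0, 2.0, 6.0, 3.0),
  (2900.0, 1.0, 3.0, 2.0)
).toDF("duration", "campaign", "pdays", "prev")

// Combine the four numeric columns into the single vector column that KMeans expects.
val assembler = new VectorAssembler()
  .setInputCols(Array("duration", "campaign", "pdays", "prev"))
  .setOutputCol("features")
val features = assembler.transform(df)

// Fit k-means with k = 2; the seed is fixed only to make the run repeatable.
val model = new KMeans()
  .setK(2)
  .setSeed(1L)
  .setFeaturesCol("features")
  .fit(features)

// Print the two cluster centers.
model.clusterCenters.foreach(println)

// Attach the assigned cluster to each row and count the members of each cluster.
val clustered = model.transform(features)
clustered.groupBy("prediction").count().show()
```

A cluster with very few members, or rows that sit far from their assigned center, are candidates for closer inspection as outliers. Whether they really are outliers still depends on the data and on how the results look when visualized.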

Other distributed algorithms useful for EDA include classification, regression, dimensionality reduction, correlation, and hypothesis testing. More details on using Spark SQL and these algorithms are covered in Chapter 6, Using Spark SQL in Machine Learning Applications.