- Statistics for Data Science
- James D. Miller
- 2021-07-02 14:58:58
Outliers
The simplest way to describe outliers is to say that they are the data points that just don't fit the rest of your data. On inspection, any value that is very high, very low, or simply unusual (within the context of your project) is an outlier. As part of data cleansing, a data scientist typically identifies the outliers and then addresses them using a generally accepted method:
- Delete the outlier values or even the actual variable where the outliers exist
- Transform the values or the variable itself
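Both approaches can be sketched in a few lines of R. This is a minimal illustration on made-up values, not a prescribed procedure: the `coin_in` vector is invented, the 1.5 × IQR rule is just one common way to flag outliers, and capping and the log transform are only two of many possible transformations.

```r
# Made-up values for illustration only; 950 is the obvious outlier.
coin_in <- c(120, 135, 128, 142, 131, 950, 125, 138)

# Flag values more than 1.5 * IQR beyond the quartiles (an assumed rule).
q <- unname(quantile(coin_in, c(0.25, 0.75)))
fence <- 1.5 * IQR(coin_in)
lower <- q[1] - fence
upper <- q[2] + fence
is_outlier <- coin_in < lower | coin_in > upper

# Option 1: delete the outlier values.
coin_in_deleted <- coin_in[!is_outlier]

# Option 2a: transform the values by capping them at the fences.
coin_in_capped <- pmin(pmax(coin_in, lower), upper)

# Option 2b: transform the variable itself, here with a log scale.
coin_in_logged <- log(coin_in)

print(coin_in_deleted)
print(coin_in_capped)
```

Deleting discards the observation entirely, while capping keeps the row but limits its influence; which is appropriate depends on whether the extreme value is an error or a genuine reading.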
Let's look at a real-world example of using R to identify and then address data outliers.
In the world of gaming, slot machines (gambling machines operated by inserting coins into a slot and pulling a handle, which determines the payoff) are quite popular. Most slot machines today are electronic and are programmed so that all of their activity is continuously tracked. In our example, investors in a casino want to use this data (along with various supplementary data) to drive adjustments to their profitability strategy. In other words, what makes a slot machine profitable? Is it the machine's theme or its type? Are newer machines more profitable than older or retro machines? What about the physical location of the machine? Are lower-denomination machines more profitable? We will try to find our answers using the outliers.
We are given a pool of gaming data, formatted as a comma-delimited (CSV) text file, which includes data points such as the location of the slot machine, its denomination, month, day, year, machine type, age of the machine, promotions, coupons, weather, and coin-in (the total amount inserted into the machine less pay-outs). The first step for us as data scientists is to review (sometimes called profile) the data to determine whether any outliers exist. The second step is to address those outliers.
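As a rough sketch of the profiling step, the following R snippet loads a small CSV sample and lists the rows whose coin-in looks out of line. The column names (`Location`, `Denomination`, `Type`, `Age`, `Coin.in`) and every value are invented for illustration, since the actual file layout is not shown here; `boxplot.stats()` is used as one convenient base R way to flag candidates, not as the only option.

```r
# Hypothetical sample: column names and values are assumptions, not the real file.
csv_text <- "Location,Denomination,Type,Age,Coin.in
Lobby,0.25,Reel,3,1210
Lobby,1.00,Video,1,1185
Entrance,0.05,Reel,7,1302
Bar,0.25,Video,2,1250
Bar,1.00,Reel,5,9800
Entrance,0.25,Video,4,1275
Lobby,0.05,Reel,6,40
Bar,0.05,Video,2,1230"

# With a real file, read.csv() would be given the file path instead of text =.
slots <- read.csv(text = csv_text)

# Step 1: profile the field of interest.
print(summary(slots$Coin.in))

# boxplot.stats() returns, in $out, the values lying more than
# 1.5 * IQR beyond the hinges of the box plot.
outlier_values <- boxplot.stats(slots$Coin.in)$out

# List the full records behind those values for review.
outlier_rows <- slots[slots$Coin.in %in% outlier_values, ]
print(outlier_rows)
```

Reviewing the whole record, rather than the flagged value alone, helps in deciding whether an extreme coin-in is a data error or a machine that genuinely behaves differently.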