- Mastering Machine Learning with R
- Cory Lesmeister
- 2021-07-02 13:46:19
Handling duplicate observations
The easiest way to get started is to use the base R duplicated() function, which returns a logical vector with one value per observation. A value of TRUE marks a row that repeats an earlier row, so the first occurrence of any repeated row stays FALSE. We'll then create a table of those values and their counts and identify which of the rows are dupes:
dupes <- duplicated(gettysburg)  # TRUE for each row that repeats an earlier row
table(dupes)
dupes
FALSE  TRUE
  587     3
which(dupes)  # row indices of the duplicates; dupes is already logical
[1] 588 589
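To make the behavior of duplicated() concrete, here is a minimal sketch on a small toy data frame. The toy object and its unit and men columns are assumptions made up for illustration; they are not part of the gettysburg data.

```r
# Hypothetical toy data: rows 3 and 5 repeat row 1
toy <- data.frame(
  unit = c("A", "B", "A", "C", "A"),
  men  = c(100, 200, 100, 300, 100),
  stringsAsFactors = FALSE
)

dupes <- duplicated(toy)    # one logical per row; the first occurrence is FALSE
dupe_rows <- which(dupes)   # row indices of the repeated rows
deduped <- toy[!dupes, ]    # base R way to drop the repeats

print(table(dupes))
print(dupe_rows)
print(nrow(deduped))
```

Only the later copies are flagged, so row 1 is kept and rows 3 and 5 are reported as duplicates.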
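To rid ourselves of these duplicate rows, we put the distinct() function from the dplyr package to good use. Because no columns are named in the call, distinct() compares entire rows and returns all of the features in the new tibble. The .keep_all argument, which defaults to FALSE, only changes the result when specific columns are named; it is set to TRUE here to make the intent explicit: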
gettysburg <- dplyr::distinct(gettysburg, .keep_all = TRUE)
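The effect of .keep_all is easier to see on a small example. The following is a sketch that assumes dplyr is installed; the toy tibble and its columns are hypothetical and stand in for the real data.

```r
library(dplyr)

# Hypothetical toy tibble: row 3 is an exact copy of row 1
toy <- tibble(
  unit = c("A", "B", "A", "A"),
  men  = c(100, 200, 100, 150)
)

# No columns named: whole rows are compared, all columns are returned
whole_rows <- distinct(toy)

# A column is named: only that column is returned by default
by_unit <- distinct(toy, unit)

# Same comparison, but .keep_all = TRUE keeps the first row per unit with all columns
by_unit_all <- distinct(toy, unit, .keep_all = TRUE)

print(whole_rows)
print(by_unit)
print(by_unit_all)
```

When whole rows are compared, as in the call on gettysburg, the result already contains every column, which is why .keep_all makes no difference there.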
Notice that, in the Global Environment, the tibble now has dimensions of 587 observations of 26 variables/features.
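The same check can be made from the console rather than by eye. This sketch uses a hypothetical toy data frame and base R's unique() as a stand-in for dplyr::distinct(), so it runs without any packages.

```r
# Hypothetical toy data: row 3 repeats row 1
toy <- data.frame(unit = c("A", "B", "A"), men = c(100, 200, 100))

deduped <- unique(toy)              # base R stand-in for dplyr::distinct()

dims <- dim(deduped)                # number of rows, then number of columns
remaining <- anyDuplicated(deduped) # 0 means no repeated rows remain

print(dims)
print(remaining)
```

Running dim() and anyDuplicated() on the deduplicated tibble gives a quick confirmation that the row count dropped and that no duplicates are left.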
With the duplicate observations out of the way, it's time to drill down into the data and understand its structure a little better by exploring the descriptive statistics of the quantitative features.