
  • Advanced Machine Learning with R
  • Cory Lesmeister, Dr. Sunil Kumar Chinnamgari
  • 2021-06-24 14:24:33

Handling duplicate observations

The easiest way to get started is to use the base R duplicated() function, which returns a logical vector with one value per observation. A value of TRUE marks a row that repeats an earlier row; the first occurrence of each row is FALSE. We'll then tabulate those values to get their counts and identify which rows are the dupes:

dupes <- duplicated(gettysburg)

table(dupes)
dupes
FALSE  TRUE
  587     3

which(dupes)
[1] 588 589

If you want to see the actual rows, and even collect them in their own tibble, the janitor package has the get_dupes() function. The code for that is simply: df_dupes <- janitor::get_dupes(gettysburg).
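As a minimal sketch of how duplicated() decides which rows to flag, the following uses a small made-up data frame. The name toy and its columns are hypothetical and are not part of the gettysburg data. It also shows one base R way to flag every copy in a duplicate set, using the fromLast argument, in case the first occurrence should be inspected as well.

```r
# Hypothetical toy data, not the gettysburg data: rows 3 and 5 repeat row 1.
toy <- data.frame(
  unit = c("A", "B", "A", "C", "A"),
  men  = c(100, 200, 100, 300, 100),
  stringsAsFactors = FALSE
)

# duplicated() flags only the second and later copies of a row,
# so row 1 (the first occurrence) stays FALSE.
toy_dupes <- duplicated(toy)
which(toy_dupes)

# Scanning from the end flags the earlier copies instead; combining both
# scans marks every member of a duplicate set, first occurrence included.
all_copies <- toy_dupes | duplicated(toy, fromLast = TRUE)
toy[all_copies, ]
```

Here which(toy_dupes) picks out rows 3 and 5, while all_copies also covers row 1.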

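To rid ourselves of these duplicate rows, we put the distinct() function from the dplyr package to good use. We specify .keep_all = TRUE, which defaults to FALSE, to make sure all of the features are returned in the new tibble. Strictly speaking, this argument only matters when specific columns are passed to distinct(); with no columns named, as here, entire rows are compared and every column is returned either way: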

gettysburg <- dplyr::distinct(gettysburg, .keep_all = TRUE)

Notice that, in the Global Environment, the tibble now has dimensions of 587 observations of 26 variables/features.
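The effect of .keep_all is easiest to see when distinct() is given specific columns. The following is an illustrative sketch on a made-up tibble, assuming dplyr is installed; toy, unit, and men are hypothetical names and do not come from the gettysburg data.

```r
library(dplyr)

# Hypothetical toy data: row 3 is an exact copy of row 1,
# and row 4 shares its unit with row 1 but differs in men.
toy <- tibble(
  unit = c("A", "B", "A", "A"),
  men  = c(100, 200, 100, 150)
)

# No columns named: whole rows are compared and all columns are returned.
whole_rows <- distinct(toy)

# A column named, default .keep_all = FALSE: only that column is returned.
unit_only <- distinct(toy, unit)

# A column named with .keep_all = TRUE: the first row for each unit is kept,
# along with all of its other columns.
unit_all <- distinct(toy, unit, .keep_all = TRUE)

dim(whole_rows)
dim(unit_only)
dim(unit_all)
```

Comparing the three dim() results shows that .keep_all changes the returned columns only in the column-specific calls.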

With the duplicate observations out of the way, it's time to start drilling down into the data and understand its structure a little better by exploring the descriptive statistics of the quantitative features.
