- Advanced Machine Learning with R
- Cory Lesmeister Dr. Sunil Kumar Chinnamgari
- 2021-06-24 14:24:33
Handling duplicate observations
The easiest way to get started is to use the base R duplicated() function to create a logical vector with one value per observation. Each value is either TRUE or FALSE, where TRUE marks a row that repeats an earlier one (the first occurrence of a row is not flagged). Then, we'll create a table of those values and their counts and identify which rows are duplicates:
dupes <- duplicated(gettysburg)
table(dupes)
dupes
FALSE  TRUE
  587     3
which(dupes)
[1] 588 589
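The behavior of duplicated() is easier to see on a small example. The following is a minimal sketch on a made-up data frame; the name toy and its values are assumptions for illustration and are not part of the gettysburg data. It also shows the fromLast argument, which helps when you want to inspect every copy of a repeated row rather than only the later ones.

```r
# Toy data frame (hypothetical values, not the gettysburg data)
toy <- data.frame(
  unit = c("A", "B", "A", "C", "B", "A"),
  men  = c(100, 200, 100, 300, 200, 100),
  stringsAsFactors = FALSE
)

# duplicated() flags the second and later copies of a row, never the first
dupes <- duplicated(toy)
print(dupes)

# Counts of first occurrences (FALSE) and repeats (TRUE)
print(table(dupes))

# Row indices of the repeats: rows 3, 5, and 6 in this toy data
dupe_rows <- which(dupes)
print(dupe_rows)

# To see every copy, including the first occurrence,
# combine a forward scan with a backward scan
all_copies <- duplicated(toy) | duplicated(toy, fromLast = TRUE)
print(toy[all_copies, ])
```

Because a logical vector can be passed to which() directly, there is no need to compare it against TRUE first.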
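To rid ourselves of these duplicate rows, we put the distinct() function from the dplyr package to good use, specifying .keep_all = TRUE to make sure all of the features are returned in the new tibble. Note that .keep_all defaults to FALSE, although it only changes the result when specific columns are passed to distinct(); with no columns named, as here, rows are compared on every feature and every feature is returned: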
gettysburg <- dplyr::distinct(gettysburg, .keep_all = TRUE)
Notice that, in the Global Environment, the tibble now has 587 observations of 26 variables/features.
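To make the role of .keep_all concrete, here is a small sketch on a made-up tibble. The name toy and its values are assumptions for illustration only, and the sketch assumes the dplyr package is installed. It compares distinct() with no columns named against distinct() on a single column, with and without .keep_all = TRUE.

```r
library(dplyr)

# Toy tibble (hypothetical values, not the gettysburg data)
toy <- tibble(
  unit = c("A", "B", "A", "C", "A"),
  men  = c(100, 200, 100, 300, 150)
)

# No columns named: rows are compared on every column,
# and every column is returned
all_cols <- distinct(toy)
print(all_cols)

# One column named, default .keep_all = FALSE:
# only the named column is returned
unit_only <- distinct(toy, unit)
print(unit_only)

# One column named with .keep_all = TRUE:
# the first row for each unit is kept, with all columns
unit_keep <- distinct(toy, unit, .keep_all = TRUE)
print(unit_keep)
```

In the last call, the row with unit "A" and men 150 is dropped because only unit decides uniqueness, so naming columns in distinct() is a stricter filter than removing fully identical rows.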
With the duplicate observations out of the way, it's time to start drilling down into the data and understand its structure a little better by exploring the descriptive statistics of the quantitative features.