  • Advanced Machine Learning with R
  • Cory Lesmeister Dr. Sunil Kumar Chinnamgari
  • 2021-06-24 14:24:33

Descriptive statistics

Traditionally, we could use the base R summary() function to identify some basic statistics. Recently, I've come to prefer the descr() function from the sjmisc package. It produces more readable output, and you can assign that output to a dataframe. What works well is to create that dataframe, save it as a .csv file, and explore it at your leisure. The function automatically selects numeric features only. It also fits well with the tidyverse, so you can incorporate dplyr functions such as group_by() and filter(). Here's an example in our case, where we examine the descriptive stats for the infantry of the Confederate Army. The output will consist of the following:

  • var: feature name
  • type: data type of the feature (for example, integer)
  • n: number of observations
  • NA.prc: percent of missing values
  • mean
  • sd: standard deviation
  • se: standard error
  • md: median
  • trimmed: trimmed mean
  • range
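  • skew

# Assumes the gettysburg data is loaded and the pipe (%>%) is available,
# for example via library(magrittr) or library(dplyr)
gettysburg %>%
  dplyr::filter(army == "Confederate" & type == "Infantry") %>%
  sjmisc::descr() -> descr_stats

# Save the summary table so that it can be explored in a spreadsheet
readr::write_csv(descr_stats, 'descr_stats.csv')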
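
To make the column definitions listed earlier more concrete, here is a minimal base R sketch that computes rough equivalents by hand for a single toy vector. The values are invented, and the 10% trim, the moment-based skewness formula, and the choice to count only non-missing values in n are all assumptions. descr() may define some of these statistics slightly differently, so treat this as an illustration and not as a reimplementation.

```r
# Toy vector standing in for one numeric feature (hypothetical values)
x <- c(2, 4, 4, 5, 7, 9, NA, 11)

x_obs <- x[!is.na(x)]                    # keep only the observed values

n           <- length(x_obs)             # n: number of non-missing observations (assumed)
na_prc      <- 100 * mean(is.na(x))      # NA.prc: percent of missing values
avg         <- mean(x_obs)               # mean
sdev        <- sd(x_obs)                 # sd: standard deviation
se          <- sdev / sqrt(n)            # se: standard error of the mean
md          <- median(x_obs)             # md: median
trimmed     <- mean(x_obs, trim = 0.1)   # trimmed mean; the 10% trim is an assumption
value_range <- diff(range(x_obs))        # range, expressed here as max minus min

# Simple moment-based skewness; one of several common definitions
skew <- mean((x_obs - avg)^3) / mean((x_obs - avg)^2)^1.5

by_hand <- data.frame(n, na_prc, avg, sdev, se, md, trimmed, value_range, skew)
print(by_hand)
```

With one missing value out of eight, na_prc comes out at 12.5, which is the kind of figure to look for in the NA.prc column of the real output.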

The following is abbreviated output from the preceding code saved to a spreadsheet:

In this one table, we can discern some rather interesting tidbits. Of particular interest is the percent of missing values per feature. If you modify the preceding code to examine the Union Army, you'll find that there are no missing values. The usurpers from the South had missing values for one of two likely reasons: either shoddy staff work in compiling the numbers on July 3rd, or records that were lost over the years. Note that, for the number of men captured, if you remove the missing value, all other values are zero, so we could simply replace the missing value with zero. The Rebels did not report troops as captured, but rather as missing, in contrast with the Union.
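
The two checks described above (comparing missingness between the armies, then filling the missing captured count with zero) can be sketched on a small stand-in dataframe. Everything in toy is invented for illustration, and the column name captured is an assumption about how the real gettysburg data is laid out.

```r
library(dplyr)

# Hypothetical stand-in for the gettysburg data; names and values are invented
toy <- data.frame(
  army     = c("Confederate", "Confederate", "Confederate", "Union", "Union"),
  type     = rep("Infantry", 5),
  captured = c(0, NA, 0, 12, 30),
  stringsAsFactors = FALSE
)

# Percent of missing captured values per army
na_by_army <- toy %>%
  group_by(army) %>%
  summarise(na_prc = 100 * mean(is.na(captured)))

print(na_by_army)

# Replace the missing Confederate captured count with zero
toy_filled <- toy %>%
  mutate(captured = if_else(army == "Confederate" & is.na(captured), 0, captured))

print(toy_filled)
```

Restricting the replacement to the Confederate rows keeps the fix narrow: a missing value that later turns up in the Union records would still be visible instead of being silently zeroed.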

Once you feel comfortable with the descriptive statistics, move on to exploring the categorical features in the next section.
