官术网_书友最值得收藏!

Hypothyroid

The hypothyroid dataset Hypothyroid.csv is available in the book's code bundle packet, located at /…/Chapter01/Data. While we have 26 variables in the dataset, we will only be using seven of these variables. Here, the number of observations is n = 3163. The dataset is downloaded from http://archive.ics.uci.edu/ml/datasets/thyroid+disease and the filename is hypothyroid.data (http://archive.ics.uci.edu/ml/machine-learning-databases/thyroid-disease/hypothyroid.data). After some tweaks to the order of relabeling certain values, the CSV file is made available in the book's code bundle. The purpose of the study is to classify a patient with a thyroid problem based on the information provided by other variables. There are multiple variants of the dataset and the reader can delve into details at the following web page: http://archive.ics.uci.edu/ml/machine-learning-databases/thyroid-disease/HELLO. Here, the column representing the variable of interest is named Hypothyroid, which shows that we have 151 patients with thyroid problems. The remaining 3012 tested negative for it. Clearly, this dataset is an example of unbalanced data, which means that one of the two cases is outnumbered by a huge number; for each thyroid case, we have about 20 negative cases. Such problems need to be handled differently, and we need to get into the subtleties of the algorithms to build meaningful models. The additional variables or covariates that we will use while building the predictive models include Age, Gender, TSH, T3, TT4, T4U, and FTI. The data is first imported into an R session and is subset according to the variables of interest as follows:

> HT <- read.csv("../Data/Hypothyroid.csv",header = TRUE,stringsAsFactors = F)
> HT$Hypothyroid <- as.factor(HT$Hypothyroid)
> HT2 <- HT[,c("Hypothyroid","Age","Gender","TSH","T3","TT4","T4U","FTI")]

The first line of code imports the data from the Hypothyroid.csv file using the read.csv function. The dataset now has a lot of missing data in the variables, as seen here:

> sapply(HT2,function(x) sum(is.na(x)))
Hypothyroid         Age      Gender         TSH          T3         TT4 
          0         446          73         468         695         249 
        T4U         FTI 
        248         247 

Consequently, we remove all the rows that have a missing value, and then split the data into training and testing datasets. We will also create a formula for the classification problem:

> HT2 <- na.omit(HT2)
> set.seed(12345)
> Train_Test <- sample(c("Train","Test"),nrow(HT2),replace=TRUE, prob=c(0.7,0.3))
> head(Train_Test)
[1] "Test"  "Test"  "Test"  "Test"  "Train" "Train"
> HT2_Train <- HT2[Train_Test=="Train",]
> HT2_TestX <- within(HT2[Train_Test=="Test",],rm(Hypothyroid))
> HT2_TestY <- HT2[Train_Test=="Test",c("Hypothyroid")]
> HT2_Formula <- as.formula("Hypothyroid~.")

The set.seed function ensures that the results are reproducible each time we run the program. After removing the missing observations with the na.omit function, we split the hypothyroid data into training and testing parts. The former is used to build the model and the latter is used to validate it, using data that has not been used to build the model. Quinlan – the inventor of the popular tree algorithm C4.5 – used this dataset extensively.

主站蜘蛛池模板: 贡山| 威信县| 兴和县| 长沙市| 红原县| 武宣县| 安吉县| 伊金霍洛旗| 来安县| 湟源县| 漯河市| 方正县| 青海省| 新巴尔虎右旗| 绥德县| 陆丰市| 新兴县| 沁阳市| 万全县| 辉县市| 金昌市| 柘荣县| 略阳县| 桦甸市| 封丘县| 娄底市| 山丹县| 吉林市| 文水县| 新乡市| 林州市| 浑源县| 临安市| 霍山县| 玛多县| 民丰县| 天镇县| 沙田区| 黄梅县| 天津市| 多伦县|