
Discriminant analysis application

LDA is implemented in the MASS package, which we have already loaded so that we can access the biopsy data. The syntax of lda() is very similar to that of the lm() and glm() functions.

We can now begin fitting our LDA model, which is as follows:

    > lda.fit <- lda(class ~ ., data = train)
    > lda.fit
    Call:
    lda(class ~ ., data = train)

    Prior probabilities of groups:
       benign malignant
    0.6371308 0.3628692

    Group means:
               thick  u.size u.shape   adhsn  s.size    nucl   chrom
    benign    2.9205 1.30463 1.41390 1.32450 2.11589 1.39735 2.08278
    malignant 7.1918 6.69767 6.68604 5.66860 5.50000 7.67441 5.95930
                n.nuc     mit
    benign    1.22516 1.09271
    malignant 5.90697 2.63953

    Coefficients of linear discriminants:
                    LD1
    thick    0.19557291
    u.size   0.10555201
    u.shape  0.06327200
    adhsn    0.04752757
    s.size   0.10678521
    nucl     0.26196145
    chrom    0.08102965
    n.nuc    0.11691054
    mit     -0.01665454

This output shows us that Prior probabilities of groups are approximately 64 percent for benign and 36 percent for malignant. Next is Group means, which is the average of each feature within each class. Coefficients of linear discriminants are the weights of the standardized linear combination of the features that is used to determine an observation's discriminant score. The higher the score, the more likely it is that the observation is classified as malignant.
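To make the link between the coefficients and the discriminant score concrete, here is a minimal sketch that rebuilds the scores by hand. It rests on a few assumptions that are not part of the text: the model is fitted on the whole cleaned biopsy data rather than on the train split, the names dat, fit, and manual.scores are made up for the illustration, and the centering step mirrors what predict() in MASS appears to do (centering at the prior-weighted average of the group means).

```r
library(MASS)

# Assumption: a cleaned copy of biopsy with the short column names used in
# the output above; the full data set stands in for the train split.
dat <- na.omit(biopsy[, -1])   # drop the ID column and incomplete rows
names(dat) <- c("thick", "u.size", "u.shape", "adhsn", "s.size",
                "nucl", "chrom", "n.nuc", "mit", "class")

fit <- lda(class ~ ., data = dat)

# Center each feature at the prior-weighted average of the group means,
# then take the linear combination defined by the LD1 coefficients.
X <- as.matrix(dat[, rownames(fit$scaling)])
center <- colSums(fit$prior * fit$means)
manual.scores <- scale(X, center = center, scale = FALSE) %*% fit$scaling

# Compare with the scores that predict() returns in its x element
pred.scores <- predict(fit)$x
max.diff <- max(abs(manual.scores - pred.scores))
print(max.diff)

# Average discriminant score per class
print(tapply(as.vector(pred.scores), dat$class, mean))
```

If the two sets of scores agree, max.diff should be zero up to floating-point error, and the per-class averages show how far apart the two groups sit on the discriminant axis.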

The plot() method for LDA objects provides a histogram and/or a density plot of the discriminant scores for each group, as follows:

    > plot(lda.fit, type = "both")

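The following is the output of the preceding command:

[Figure: histograms with density curves of the discriminant scores for the benign and malignant groups]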

We can see that there is some overlap in the groups, indicating that there will be some incorrectly classified observations.

The predict() function available with LDA returns a list of three elements: class, posterior, and x. The class element is the predicted class (benign or malignant), posterior holds the posterior probability of each observation belonging to each class, and x is the linear discriminant score. Let's just extract the probability of an observation being malignant:
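As a quick illustration of how the three elements relate to one another, the following sketch checks that the predicted class corresponds to the larger of the two posterior probabilities. As before, the fit on the whole cleaned biopsy data and the names dat, fit, and pred are assumptions made for the example, not the objects from the text.

```r
library(MASS)

# Assumption: cleaned biopsy data with short column names, fitted on all rows
dat <- na.omit(biopsy[, -1])
names(dat) <- c("thick", "u.size", "u.shape", "adhsn", "s.size",
                "nucl", "chrom", "n.nuc", "mit", "class")

fit <- lda(class ~ ., data = dat)
pred <- predict(fit)

# The three elements described above
print(names(pred))

# Each row of the posterior matrix sums to one across the two classes
row.sums <- rowSums(pred$posterior)

# With two classes, the predicted class is the one whose posterior
# probability exceeds 0.5
prob.malignant <- pred$posterior[, "malignant"]
class.from.prob <- ifelse(prob.malignant > 0.5, "malignant", "benign")
agreement <- mean(class.from.prob == as.character(pred$class))
print(agreement)
```

Working with the malignant column of posterior rather than the class element keeps the 0.5 cutoff explicit, which is useful if a different threshold is wanted later.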

    > train.lda.probs <- predict(lda.fit)$posterior[, 2]
    > misClassError(trainY, train.lda.probs)
    [1] 0.0401
    > confusionMatrix(trainY, train.lda.probs)
        0   1
    0 296  13
    1   6 159

Well, unfortunately, it appears that our LDA model has performed much worse than the logistic regression models. The primary question is how it will perform on the test data:

    > test.lda.probs <- predict(lda.fit, newdata = test)$posterior[, 2]
    > misClassError(testY, test.lda.probs)
    [1] 0.0383
    > confusionMatrix(testY, test.lda.probs)
        0   1
    0 140   6
    1   2  61

That's actually not as bad as I thought, given the weaker performance on the training data. In terms of correctly classified observations, it still did not perform as well as logistic regression (96 percent versus almost 98 percent).
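The error rates and the 96 percent figure can be recovered directly from the confusion matrices. The sketch below uses only base R and the counts printed above; the object names are made up for the illustration, and the layout (rows as predicted classes, columns as actual classes) is an assumption, although it does not affect the overall error rate.

```r
# Counts copied from the LDA confusion matrices printed above.
# Assumption: rows are predicted classes, columns are actual classes
# (0 = benign, 1 = malignant). matrix() fills column by column.
lda.train <- matrix(c(296, 6, 13, 159), nrow = 2,
                    dimnames = list(predicted = c("0", "1"),
                                    actual = c("0", "1")))
lda.test <- matrix(c(140, 2, 6, 61), nrow = 2,
                   dimnames = list(predicted = c("0", "1"),
                                   actual = c("0", "1")))

# Misclassification error = off-diagonal counts / all observations
error.rate <- function(cm) 1 - sum(diag(cm)) / sum(cm)

train.error <- round(error.rate(lda.train), 4)
test.error <- round(error.rate(lda.test), 4)
test.accuracy <- 1 - test.error

print(c(train.error = train.error,
        test.error = test.error,
        test.accuracy = test.accuracy))
```

The test accuracy works out to 201 correct predictions out of 209, which is the roughly 96 percent quoted above.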

We will now move on to fit a QDA model. In R, QDA is also part of the MASS package and the function is qda(). Building the model is rather straightforward again, and we will store it in an object called qda.fit, as follows:

    > qda.fit <- qda(class ~ ., data = train)
    > qda.fit
    Call:
    qda(class ~ ., data = train)

    Prior probabilities of groups:
       benign malignant
    0.6371308 0.3628692

    Group means:
               thick u.size u.shape  adhsn s.size   nucl  chrom  n.nuc
    benign    2.9205 1.3046  1.4139 1.3245 2.1158 1.3973 2.0827 1.2251
    malignant 7.1918 6.6976  6.6860 5.6686 5.5000 7.6744 5.9593 5.9069
                   mit
    benign    1.092715
    malignant 2.639535

As with LDA, the output includes Group means, but it does not include coefficients, because the QDA decision rule is a quadratic function of the features, as discussed previously.
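Although no coefficient table is printed, the fitted object still stores what it needs for each class. The sketch below inspects the scaling component of a QDA fit, which according to the MASS documentation holds one transformation matrix per group that makes that group's covariance spherical. The fit on the whole cleaned biopsy data and the names dat, fit.q, and X.mal are assumptions for the illustration.

```r
library(MASS)

# Assumption: cleaned biopsy data with short column names, fitted on all rows
dat <- na.omit(biopsy[, -1])
names(dat) <- c("thick", "u.size", "u.shape", "adhsn", "s.size",
                "nucl", "chrom", "n.nuc", "mit", "class")

fit.q <- qda(class ~ ., data = dat)

# Instead of one coefficient vector, QDA keeps a 9 x 9 matrix per class
scaling.dim <- dim(fit.q$scaling)
print(scaling.dim)

# Check the sphering property for the malignant group: after applying that
# group's matrix, the sample covariance should be close to the identity.
k <- which(fit.q$lev == "malignant")
X.mal <- as.matrix(dat[dat$class == "malignant", colnames(fit.q$means)])
whitened.cov <- cov(X.mal %*% fit.q$scaling[, , k])
max.dev <- max(abs(whitened.cov - diag(ncol(X.mal))))
print(max.dev)
```

Because each class has its own covariance estimate, there is no single set of weights to report, which is why the printed summary stops at the group means.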

The predictions for the train and test data follow the same flow of code as with LDA:

    > train.qda.probs <- predict(qda.fit)$posterior[, 2]
    > misClassError(trainY, train.qda.probs)
    [1] 0.0422
    > confusionMatrix(trainY, train.qda.probs)
        0   1
    0 287   5
    1  15 167
    > test.qda.probs <- predict(qda.fit, newdata = test)$posterior[, 2]
    > misClassError(testY, test.qda.probs)
    [1] 0.0526
    > confusionMatrix(testY, test.qda.probs)
        0   1
    0 132   1
    1  10  66

We can quickly tell from the confusion matrix that QDA has performed the worst on the training data, and it has also classified the test set poorly, with 11 incorrect predictions. In particular, it has a high rate of false positives: 10 of the 11 errors are benign observations predicted as malignant.
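To put numbers on the comparison, the following sketch computes the overall error, false positive rate, and false negative rate for both models from the test-set counts printed above. It uses base R only; the helper names make.cm and rates are made up for the example, and it assumes that rows are predicted classes and columns are actual classes.

```r
# Test-set counts copied from the confusion matrices printed above.
# Assumption: rows = predicted, columns = actual (0 = benign, 1 = malignant).
make.cm <- function(counts) {
  matrix(counts, nrow = 2,
         dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))
}
lda.test <- make.cm(c(140, 2, 6, 61))
qda.test <- make.cm(c(132, 10, 1, 66))

rates <- function(cm) {
  c(error = (cm["1", "0"] + cm["0", "1"]) / sum(cm),
    # benign cases predicted as malignant, out of all benign cases
    false.pos = cm["1", "0"] / sum(cm[, "0"]),
    # malignant cases predicted as benign, out of all malignant cases
    false.neg = cm["0", "1"] / sum(cm[, "1"]))
}

comparison <- round(rbind(LDA = rates(lda.test), QDA = rates(qda.test)), 4)
print(comparison)
```

Under that reading of the matrices, QDA flags 10 of the 142 benign test cases as malignant compared with 2 for LDA, while missing only 1 of the 67 malignant cases compared with 6 for LDA.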
