- Python:Advanced Predictive Analytics
- Ashish Kumar Joseph Babcock
- 715 words
- 2021-07-02 20:09:26
Chi-square tests
The chi-square test is a statistical test commonly used to compare observed data with the data expected under a given hypothesis. It is therefore a form of hypothesis test: you assume a hypothesis that the data should follow, calculate the expected values under that hypothesis, and then measure how far the observed values you already have deviate from the expected ones using the following statistic:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

Here, O is the observed value, E is the expected value, and the summation runs over all the data points.
The chi-square test can be used to do the following things:
- Test for an association (or independence) between an input variable and an output variable. We assume that the two are independent (the null hypothesis), calculate the expected values under that assumption, and then calculate the chi-square value. If the null hypothesis is rejected, the data suggest a relationship between the two variables that is statistically significant rather than a product of chance. Note that the test indicates association, not causation.
- Check whether the observed data comes from a fair/unbiased source. If the observed data deviates strongly from the expected data, it is unlikely to come from a fair source. If it is close to the expected values, the data is consistent with a fair source.
- Check whether the data is too good to be true. In a random experiment, we do not expect the observed values to match the assumed hypothesis exactly. If they match it almost perfectly, the data may have been tampered with to make it look good.
Let us create a hypothetical experiment where a coin is tossed 10 times. How many heads would you expect? Five, right? Now suppose the coin is tossed 1,000 times and the number of heads and tails is recorded. Suppose we observe heads 553 times and tails in the remaining 447 tosses, whereas a fair coin would be expected to give 500 heads and 500 tails.
Let us calculate the chi-square value:

$$\chi^2 = \frac{(553 - 500)^2}{500} + \frac{(447 - 500)^2}{500} = 5.618 + 5.618 = 11.236$$
This chi-square value is compared with the critical value of the chi-square distribution for the given degrees of freedom and significance level. The degrees of freedom equal the number of categories minus 1. In this case, that is 2 - 1 = 1. Let us assume a significance level of 0.05.
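The coin calculation can be reproduced in a few lines of plain Python. This is a minimal sketch using only the counts given in the text (553 heads out of 1,000 tosses); the variable names are illustrative.

```python
# Observed counts from the coin experiment in the text.
observed = {"heads": 553, "tails": 447}
total = sum(observed.values())  # 1000 tosses

# Under the null hypothesis of a fair coin, every face is equally likely.
expected = {face: total / len(observed) for face in observed}

# Chi-square statistic: sum of (O - E)^2 / E over all categories.
chi_square = sum(
    (observed[face] - expected[face]) ** 2 / expected[face] for face in observed
)

# Goodness-of-fit test: degrees of freedom = number of categories - 1.
degrees_of_freedom = len(observed) - 1

print(f"chi-square = {chi_square:.3f}, degrees of freedom = {degrees_of_freedom}")
```

Each face contributes (53)^2 / 500 = 5.618 to the statistic, which is why the two terms are identical.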
The chi-square distribution looks different from the normal distribution. It is defined only for non-negative values, so it lies on one side of zero, and it has a much longer right tail than the normal distribution. As the degrees of freedom increase, it starts to resemble a normal distribution:
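To get a feel for how the shape changes, the density can be evaluated directly. The sketch below assumes the standard chi-square density formula and the standard results that a chi-square distribution with k degrees of freedom has mean k, variance 2k, and skewness sqrt(8/k); the function name `chi_square_pdf` is illustrative, not from the book.

```python
import math


def chi_square_pdf(x, k):
    """Density of the chi-square distribution with k degrees of freedom, for x > 0."""
    return x ** (k / 2 - 1) * math.exp(-x / 2) / (2 ** (k / 2) * math.gamma(k / 2))


# The skewness shrinks toward 0 as k grows, which is why the curve
# looks increasingly like a (symmetric) normal distribution.
for k in (1, 2, 5, 10, 50):
    mean, variance = k, 2 * k
    skewness = math.sqrt(8 / k)
    print(f"k={k:>2}: mean={mean}, variance={variance}, skewness={skewness:.2f}, "
          f"density at the mean={chi_square_pdf(mean, k):.4f}")
```

Plotting `chi_square_pdf` over a grid of x values for each k would give curves like those described for the figure.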

Fig. 4.6: Chi-square distribution with different degrees of freedom
When we look at the chi-square distribution table for 1 degree of freedom, the critical value is 3.841 at a significance level of 0.05 and 6.635 at a significance level of 0.01. In both cases, the calculated chi-square statistic (11.236) is greater than the critical value, meaning that it lies to the right of the critical value on the distribution.
Hence, the null hypothesis is rejected at both levels. The data give strong evidence that the coin is not fair.
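The decision rule can be sketched as follows, using the critical values quoted from the table. As an extra check, the p-value is computed with a shortcut that holds only for 1 degree of freedom: a chi-square variable with 1 degree of freedom is the square of a standard normal variable, so P(X > x) = erfc(sqrt(x / 2)). The variable names are illustrative.

```python
import math

# Statistic for 553 heads and 447 tails against an expected 500/500 split.
chi_square = (553 - 500) ** 2 / 500 + (447 - 500) ** 2 / 500

# Critical values for 1 degree of freedom, as quoted from the chi-square table.
critical_values = {0.05: 3.841, 0.01: 6.635}

for alpha, critical in critical_values.items():
    decision = "reject" if chi_square > critical else "fail to reject"
    print(f"alpha={alpha}: critical value={critical}, {decision} the null hypothesis")

# Valid for 1 degree of freedom only (see the note above).
p_value = math.erfc(math.sqrt(chi_square / 2))
print(f"p-value = {p_value:.5f}")
```

A p-value below the chosen significance level leads to the same decision as a statistic above the corresponding critical value.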

Fig. 4.7: The null hypothesis is rejected because the critical value at the chosen significance level is less than the calculated chi-square statistic
Let us look at another example, where we want to test whether the gender of a student and the subject they choose are independent.
Suppose that, in a group of students, the following table shows the observed number of boys and girls who have taken Maths, Arts, and Commerce as their main subject:

On calculating and summing up all the values, the chi-square value comes out to be 5.05. For a test of independence on a contingency table, the degrees of freedom are (number of rows - 1) x (number of columns - 1), which here amounts to (2-1) x (3-1) = 2. Let us assume a significance level of 0.05.
Looking at the chi-square distribution table, the critical value for 2 degrees of freedom at a significance level of 0.05 is 5.991.
The calculated chi-square statistic (5.05) < the critical value at significance level 0.05 (5.991).
Since the chi-square statistic lies to the left of the critical value, the null hypothesis can't be rejected. Hence, the data are consistent with the choice of subject being independent of gender.
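The mechanics of a test of independence can be sketched as below. The counts are hypothetical and chosen only for illustration; they are not the book's table, so the resulting statistic differs from 5.05. Under independence, the expected count in each cell is (row total x column total) / grand total, and the degrees of freedom are (rows - 1) x (columns - 1), the standard rule for contingency tables.

```python
# HYPOTHETICAL counts for illustration only (not the table from the book).
subjects = ["Maths", "Arts", "Commerce"]
observed = {
    "Boys": [30, 20, 25],
    "Girls": [20, 30, 25],
}

row_totals = {gender: sum(counts) for gender, counts in observed.items()}
column_totals = [sum(column) for column in zip(*observed.values())]
grand_total = sum(row_totals.values())

# Sum (O - E)^2 / E over every cell, with E computed under independence.
chi_square = 0.0
for gender, counts in observed.items():
    for j, o in enumerate(counts):
        e = row_totals[gender] * column_totals[j] / grand_total
        chi_square += (o - e) ** 2 / e

degrees_of_freedom = (len(observed) - 1) * (len(subjects) - 1)
critical_value = 5.991  # chi-square table: 2 degrees of freedom, alpha = 0.05

decision = "reject" if chi_square > critical_value else "fail to reject"
print(f"chi-square = {chi_square:.2f}, df = {degrees_of_freedom}: "
      f"{decision} the null hypothesis of independence")
```

With these made-up counts every expected cell is 25, which keeps the arithmetic easy to verify by hand.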