
Computing basic statistics

Now we can use the DataFrame that we created to get some basic numbers; we will use the following steps to do so:

  1. We can count how many non-missing values each column contains by calling the count() method.

We can see that there are 150 observations in each column. Note that count() excludes NA values (that is, missing values), so in a dataset with missing data, the counts will not necessarily all equal the number of rows.
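Since the screenshot is not reproduced here, the following is a minimal sketch of this step. It assumes a tiny hypothetical stand-in for the iris DataFrame (four illustrative rows, assumed column names, and one deliberately missing value) rather than the full 150-row dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the chapter's `iris` DataFrame; the column names
# and values are illustrative, and one value is deliberately missing.
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, np.nan, 6.4],
    "sepal_width": [3.5, 3.0, 3.2, 3.2],
    "species": ["setosa", "setosa", "versicolor", "versicolor"],
})

# count() reports the number of non-missing values in each column
counts = iris.count()
print(counts)

# The number of rows includes the row that has a missing value
n_rows = len(iris)
print(n_rows)
```

In this sketch, the count for sepal_length is one less than the number of rows, because the NA value is excluded.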

  1. We can also compute the sample mean, which is the arithmetic average of the values in each column, by simply calling the mean() method.
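A minimal sketch of this call, again using a small hypothetical stand-in for the iris DataFrame (the column names and the six rows are assumptions). Passing numeric_only=True asks pandas to skip the non-numeric species column; recent pandas releases do not skip it by default:

```python
import pandas as pd

# Hypothetical stand-in for the chapter's `iris` DataFrame (illustrative rows)
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Column-wise sample means; numeric_only=True leaves out `species`
means = iris.mean(numeric_only=True)
print(means)
```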

Here, we can see the arithmetic means for the numeric columns. For n observations x_1, ..., x_n, the sample mean is given by the following formula:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

  1. Next, we can compute the sample median using the median() method:
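As a sketch of this step, the snippet below calls median() on a small hypothetical stand-in for the iris DataFrame (assumed column names, six illustrative rows) and checks one column by hand. With an even number of values, the median is the average of the two middle values:

```python
import pandas as pd

# Hypothetical stand-in for the chapter's `iris` DataFrame (illustrative rows)
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Column-wise sample medians of the numeric columns
medians = iris.median(numeric_only=True)
print(medians)

# By hand: sort the six values and average the 3rd and 4th smallest
ordered = iris["sepal_length"].sort_values()
by_hand = (ordered.iloc[2] + ordered.iloc[3]) / 2
print(by_hand)
```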

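Here, we can see the median values; the sample median is the middle data point, which we get after ordering the dataset. For n observations, it can be computed by using the following formula:

$$\tilde{x} = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ is odd} \\ \frac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right) & \text{if } n \text{ is even} \end{cases}$$

Here, x_(k) represents the ordered data, that is, the k-th smallest value in the dataset.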

  1. We can compute the variance as follows:
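One way to do this in pandas is the var() method, sketched below on a small hypothetical stand-in for the iris DataFrame (assumed column names and illustrative rows). By default, var() divides by n - 1, which the by-hand check confirms for one column:

```python
import pandas as pd

# Hypothetical stand-in for the chapter's `iris` DataFrame (illustrative rows)
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Column-wise sample variances (default ddof=1, so the divisor is n - 1)
variances = iris.var(numeric_only=True)
print(variances)

# By hand for one column: sum of squared deviations divided by n - 1
x = iris["sepal_length"]
by_hand = ((x - x.mean()) ** 2).sum() / (len(x) - 1)
print(by_hand)
```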

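The sample variance is a measure of dispersion and is roughly the average squared distance of a data point from the mean. With n observations x_1, ..., x_n and sample mean \bar{x}, it can be calculated as follows:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$$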

  1. The most interesting quantity is the sample standard deviation, which is the square root of the variance. It is computed as follows:
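A minimal sketch using the std() method on a small hypothetical stand-in for the iris DataFrame (assumed column names and illustrative rows). It also confirms that the result matches the square root of what var() returns:

```python
import pandas as pd

# Hypothetical stand-in for the chapter's `iris` DataFrame (illustrative rows)
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Column-wise sample standard deviations of the numeric columns
stds = iris.std(numeric_only=True)
print(stds)

# The standard deviation is the square root of the variance
roots = iris.var(numeric_only=True) ** 0.5
print(roots)
```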

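The standard deviation is measured in the same units as the data and can be interpreted, roughly, as the typical distance of a data point from the mean. With n observations x_1, ..., x_n and sample mean \bar{x}, it can be represented as follows:

$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$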

  1. We can also compute percentiles; we do that by passing the fraction p (a number between 0 and 1) that defines the percentile we want to see, using the following command:
iris.quantile(p)

So, here, roughly 100p% of the data is less than the value returned.
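As a concrete sketch, the snippet below computes the 25th percentile (p = 0.25) on a small hypothetical stand-in for the iris DataFrame (assumed column names and illustrative rows). By default, pandas interpolates linearly between neighboring data points when the requested quantile falls between two of them:

```python
import pandas as pd

# Hypothetical stand-in for the chapter's `iris` DataFrame (illustrative rows)
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# p is a fraction between 0 and 1; 0.25 requests the 25th percentile
p = 0.25
q = iris.quantile(p, numeric_only=True)
print(q)

# The 50th percentile coincides with the median
halfway = iris.quantile(0.5, numeric_only=True)
print(halfway)
```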

  1. Let's find out the 1st, 3rd, 10th, and 95th percentiles as an example, as follows:
  1. Now, we will compute the interquartile range (IQR), which is the difference between the 3rd and 1st quartiles, as follows:
  1. Other interesting quantities include the maximum value of the dataset, and the minimum value of the dataset. Both of these values can be computed as follows:
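The three preceding steps can be sketched together as follows. This is one possible approach rather than necessarily the code shown in the original screenshots: it uses a small hypothetical stand-in for the iris DataFrame (assumed column names and illustrative rows), and it computes the IQR by subtracting two quantile() results instead of calling a dedicated function:

```python
import pandas as pd

# Hypothetical stand-in for the chapter's `iris` DataFrame (illustrative rows)
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Work on the numeric columns only
numeric = iris[["sepal_length", "sepal_width"]]

# Several percentiles at once: pass a list of fractions
percentiles = numeric.quantile([0.01, 0.03, 0.10, 0.95])
print(percentiles)

# IQR: the 3rd quartile (0.75) minus the 1st quartile (0.25)
iqr = numeric.quantile(0.75) - numeric.quantile(0.25)
print(iqr)

# Largest and smallest value in each column
maxima = numeric.max()
minima = numeric.min()
print(maxima)
print(minima)
```

Passing a list to quantile() returns a DataFrame with one row per requested fraction, whereas a single fraction returns a Series with one entry per column.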

Most of the methods mentioned here also work with grouped data. As an exercise, try summarizing the data that we grouped in the previous section, using the previous methods.

  1. Another useful method is describe(). It is handy if all you want is a basic statistical summary of the dataset:

Note that this method includes the count, the mean, the standard deviation, and the five-number summary (the minimum, the maximum, and the quartiles in between). This will also work for grouped data. As an exercise, why don't you try finding the summary of the grouped data?
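A minimal sketch of the call, using a small hypothetical stand-in for the iris DataFrame (assumed column names and illustrative rows). For a DataFrame with mixed column types, describe() summarizes only the numeric columns by default:

```python
import pandas as pd

# Hypothetical stand-in for the chapter's `iris` DataFrame (illustrative rows)
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# One row per statistic, one column per numeric variable
summary = iris.describe()
print(summary)
```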

  1. Now, if we want a custom numerical summary, we can write a function that works on a pandas Series and then apply it to the columns of a DataFrame. For example, there isn't a built-in pandas method that computes the range of a dataset, which is the difference between its maximum and its minimum. So, we will define a function that computes the range of a pandas Series; by passing that function to apply(), we get the range of each selected column:
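A sketch of such a function follows. The name value_range is hypothetical (the original function name is not shown), and the DataFrame is a small stand-in for iris with assumed column names and illustrative rows:

```python
import pandas as pd

# Hypothetical stand-in for the chapter's `iris` DataFrame (illustrative rows)
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})


def value_range(series):
    """Return the range (maximum minus minimum) of a pandas Series."""
    return series.max() - series.min()


# Select the numeric columns explicitly, then apply the function column by column
ranges = iris[["sepal_length", "sepal_width"]].apply(value_range)
print(ranges)
```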

Notice that I was more selective about which columns to work with. Previously, many of the methods were able to weed out non-numeric columns on their own (depending on your pandas version, this may require passing numeric_only=True); to use apply(), however, you need to explicitly select the numeric columns, otherwise you may end up with an error.

  1. We can't directly use the preceding code if we want to summarize grouped data. Instead, we can use the .aggregate() method, as follows:
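The following sketch shows one way this could look. It assumes the data is grouped by a species column (the grouping used in the previous section is not shown here), reuses the hypothetical value_range function, and works on a small stand-in for the iris DataFrame:

```python
import pandas as pd

# Hypothetical stand-in for the chapter's `iris` DataFrame (illustrative rows)
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})


def value_range(series):
    """Return the range (maximum minus minimum) of a pandas Series."""
    return series.max() - series.min()


# Assumed grouping column: `species`
grouped = iris.groupby("species")

# aggregate() calls the function once per group and per selected column
group_ranges = grouped[["sepal_length", "sepal_width"]].aggregate(value_range)
print(group_ranges)
```

The result has one row per group and one column per selected variable.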

Thus, we have learned all about computing various statistics using the methods present in pandas. In the next section, we will look at classical statistical inference, specifically with inference for a population proportion.
