
  • Python Data Analysis
  • Ivan Idris
  • 529 words
  • 2021-08-05 17:31:54

Basic descriptive statistics with NumPy

In this book, we will use as wide a variety of datasets as the availability of data allows. Unfortunately, this means that the subject of the data might not exactly match your interests. Every dataset has its own quirks, but the general skills you acquire in this book should transfer to your own field. In this chapter, we will load a number of comma-separated values (CSV) files into NumPy arrays in order to analyze the data.

To load the data, we will use the NumPy loadtxt() function as follows:

Note

The code for this example can be found in basic_stats.py in the code bundle.

import numpy as np
from scipy.stats import scoreatpercentile

# Load the second column (index 1) and skip the header row
data = np.loadtxt("mdrtb_2012.csv", delimiter=',', usecols=(1,), skiprows=1, unpack=True)

print("Max method", data.max())
print("Max function", np.max(data))

print("Min method", data.min())
print("Min function", np.min(data))

print("Mean method", data.mean())
print("Mean function", np.mean(data))

print("Std method", data.std())
print("Std function", np.std(data))

print("Median", np.median(data))
print("Score at percentile 50", scoreatpercentile(data, 50))

The preceding code computes the mean, median, maximum, minimum, and standard deviation of a NumPy array.

Note

If these terms sound unfamiliar to you, please take some time to learn about them from Wikipedia or any other source. As mentioned in the Preface, we will assume familiarity with basic mathematical and statistical concepts.
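As a quick refresher, the following minimal sketch computes the mean, the standard deviation, and the median by hand and compares them with the NumPy results. The values are made up for illustration and are not taken from the dataset used in this chapter.

```python
import numpy as np

# Made-up values for illustration only (not from mdrtb_2012.csv)
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Mean: sum of the values divided by their count
mean = values.sum() / values.size

# Standard deviation: square root of the mean squared deviation.
# This is the population form, which is what np.std() computes by default.
std = np.sqrt(((values - mean) ** 2).mean())

# Median: middle of the sorted values; with an even count,
# the average of the two middle values
ordered = np.sort(values)
mid = ordered.size // 2
median = (ordered[mid - 1] + ordered[mid]) / 2

print("Mean", mean, np.mean(values))
print("Std", std, np.std(values))
print("Median", median, np.median(values))

# Passing ddof=1 gives the sample standard deviation, which divides by n - 1
print("Sample std", np.std(values, ddof=1))
```

Note that np.std() divides by n unless ddof is given, so its result can differ slightly from tools that default to the sample standard deviation.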

The data comes from the mdrtb_2012.csv file, which can be found in the code bundle. It is an edited version of a CSV file that can be downloaded from the WHO website at https://extranet.who.int/tme/generateCSV.asp?ds=mdr_estimates. The data concerns multidrug-resistant tuberculosis (MDR-TB). The file we are going to use is a reduced version of the original, containing only two columns: the country and the percentage of new cases. Here are the first two lines of the file:

country,e_new_mdr_pcnt
Afghanistan,3.5
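To see how loadtxt() handles a file laid out like this without needing the file itself, here is a small sketch that reads the same layout from an in-memory string. The first two lines match the file shown above; the remaining rows (Examplestan and Samplia) are invented for illustration.

```python
import io
import numpy as np

# Header and first row as shown in the text; the other rows are invented.
sample = io.StringIO(
    "country,e_new_mdr_pcnt\n"
    "Afghanistan,3.5\n"
    "Examplestan,1.0\n"
    "Samplia,2.5\n"
)

# usecols=(1,) keeps only the numeric column, so the country names
# are never converted to floats; skiprows=1 drops the header line.
values = np.loadtxt(sample, delimiter=",", usecols=(1,), skiprows=1)

print(values.shape)
print(values)
```

Because only one column is selected, the result is a one-dimensional array with one entry per data row.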

Now, let's walk through the computation step by step:

  1. First, we will load the data with the following function call:
    data = np.loadtxt("mdrtb_2012.csv", delimiter=',', usecols=(1,), skiprows=1, unpack=True)

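    In the preceding call, we specify a comma as the delimiter, select the second column with usecols=(1,) (column indices start at 0), and skip the header row with skiprows=1. Because only one column is loaded, the result is a one-dimensional array, so unpack=True, which transposes the result, has no effect here. The file name is given as a relative path, so the file is assumed to be in the current directory; otherwise, we have to specify the correct path.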

  2. The maximum of an array can be obtained either with a method of the ndarray or with a NumPy function. The same goes for the minimum, mean, and standard deviation. The following code snippet prints the various statistics:
    print("Max method", data.max())
    print("Max function", np.max(data))

    print("Min method", data.min())
    print("Min function", np.min(data))

    print("Mean method", data.mean())
    print("Mean function", np.mean(data))

    print("Std method", data.std())
    print("Std function", np.std(data))

    The output is as follows:

    Max method 50.0
    Max function 50.0
    Min method 0.0
    Min function 0.0
    Mean method 3.2787037037
    Mean function 3.2787037037
    Std method 5.76332073654
    Std function 5.76332073654
    
  3. The median can be retrieved with the NumPy median() function, or with the SciPy scoreatpercentile() function by asking for the 50th percentile of the data:
    print("Median", np.median(data))
    print("Score at percentile 50", scoreatpercentile(data, 50))

    The following is printed:

    Median 1.8
    Score at percentile 50 1.8
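The agreement between the median and the 50th percentile in step 3 can also be checked with NumPy alone. The following sketch uses np.percentile() in place of the SciPy function, which is a substitution and not the call used in this chapter, and the values are made up for illustration.

```python
import numpy as np

# Made-up values: one with an odd count, one with an even count
odd_values = np.array([0.0, 0.5, 1.8, 3.5, 50.0])
even_values = np.array([0.0, 1.0, 3.0, 50.0])

for values in (odd_values, even_values):
    median = np.median(values)
    # With the default linear interpolation, the 50th percentile
    # coincides with the median
    p50 = np.percentile(values, 50)
    print("Median", median, "Percentile 50", p50)

# A single large value pulls the mean well above the median
print("Mean", odd_values.mean(), "Median", np.median(odd_values))
```

The last line illustrates why the median of the tuberculosis data is lower than its mean: a few large values raise the mean but leave the median almost unchanged.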
    