In this book, we will use as wide a variety of datasets as the availability of data allows. Unfortunately, this means that the subject of the data might not exactly match your interests. Every dataset has its own quirks, but the general skills you acquire in this book should transfer to your own field. In this chapter, we will load a number of comma-separated values (CSV) files into NumPy arrays in order to analyze the data.
To load the data, we will use the NumPy loadtxt() function.
Note
The code for this example can be found in basic_stats.py in the code bundle.
Next, we will compute the mean, median, maximum, minimum, and standard deviation of a NumPy array.
Note
If these terms sound unfamiliar to you, please take some time to learn about them from Wikipedia or any other source. As mentioned in the Preface, we will assume familiarity with basic mathematical and statistical concepts.
The data comes from the mdrtb_2012.csv file, which can be found in the code bundle. This is an edited version of the CSV file that can be downloaded from the WHO website at https://extranet.who.int/tme/generateCSV.asp?ds=mdr_estimates. It contains data about multidrug-resistant tuberculosis (MDR-TB). The file we are going to use is a reduced version of the original file containing only two columns: the country and the percentage of new cases. Here are the first two lines of the file:
country,e_new_mdr_pcnt
Afghanistan,3.5
Now, let's compute the mean, median, maximum, minimum, and standard deviation of a NumPy array:
First, we will load the data with the following function call:
data = np.loadtxt("mdrtb_2012.csv", delimiter=',', usecols=(1,), skiprows=1, unpack=True)
In the preceding call, we specify a comma as the delimiter, select the second column (index 1) with usecols, and skip the header row with skiprows=1. We also specify the name of the file and assume that the file is in the current directory; otherwise, we will have to specify the correct path.
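As a minimal sketch of the same call, the following example reads an in-memory stand-in for the file rather than mdrtb_2012.csv itself. Only the header and the Afghanistan row come from the file shown earlier; the other two rows are made-up placeholders, and the use of io.StringIO is an assumption made so that the example runs without the code bundle.

```python
import io

import numpy as np

# In-memory stand-in for mdrtb_2012.csv. Only the header and the
# Afghanistan row are taken from the real file; the remaining rows
# are hypothetical placeholders.
sample_csv = io.StringIO(
    "country,e_new_mdr_pcnt\n"
    "Afghanistan,3.5\n"
    "CountryA,1.0\n"
    "CountryB,2.5\n"
)

# Same arguments as the call in the text: comma delimiter,
# second column only, header row skipped.
data = np.loadtxt(sample_csv, delimiter=',', usecols=(1,), skiprows=1,
                  unpack=True)

print(data.shape)
print(data)
```

Because only one numeric column is selected, the result is a one-dimensional array of floats; the country names in the first column are never converted.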
The maximum of an array can be obtained either with an ndarray method or with a NumPy function. The same goes for the minimum, mean, and standard deviation. Printing each statistic both ways gives the following output:
Max method 50.0
Max function 50.0
Min method 0.0
Min function 0.0
Mean method 3.2787037037
Mean function 3.2787037037
Std method 5.76332073654
Std function 5.76332073654
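The method and function forms can be compared side by side. The following is a minimal sketch that uses a small made-up array instead of the WHO data, so its numbers will differ from the output shown for the real file; the print labels simply mirror that output.

```python
import numpy as np

# Hypothetical values standing in for the loaded column.
data = np.array([0.0, 2.5, 3.5, 50.0])

# Each statistic is available as an ndarray method and as a NumPy function.
print("Max method", data.max())
print("Max function", np.max(data))
print("Min method", data.min())
print("Min function", np.min(data))
print("Mean method", data.mean())
print("Mean function", np.mean(data))
# Both std forms default to the population standard deviation (ddof=0).
print("Std method", data.std())
print("Std function", np.std(data))
```

Pass ddof=1 to either std form if the sample standard deviation is wanted instead.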
The median can be retrieved with the NumPy median() function or with the SciPy scoreatpercentile() function (from scipy.stats), which estimates the 50th percentile of the data, as in the following lines:
from scipy.stats import scoreatpercentile
print("Median", np.median(data))
print("Score at percentile 50", scoreatpercentile(data, 50))
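To see that the median and the 50th percentile agree, here is a minimal sketch on a small made-up array. It uses np.percentile in place of SciPy's scoreatpercentile so that NumPy is the only dependency; that substitution is an assumption of this example, not the approach used in the text.

```python
import numpy as np

# Hypothetical values; the count is even, so the median is the
# average of the two middle values.
data = np.array([0.0, 2.5, 3.5, 50.0])

median = np.median(data)
percentile_50 = np.percentile(data, 50)

print("Median", median)
print("50th percentile", percentile_50)
```

With its default linear interpolation, np.percentile at 50 lands halfway between the two middle values, which is the same point np.median reports.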