- The Data Wrangling Workshop
- Brian Lipp Shubhadeep Roychowdhury Dr. Tirthajyoti Sarkar
- 2021-06-18 18:11:52
Statistics and Visualization with NumPy and Pandas
One of the great advantages of using libraries such as NumPy and pandas is that they ship with a wide range of built-in statistical and visualization methods, so we don't have to search for or write that code ourselves. Furthermore, most of these subroutines are written in C or Fortran (and pre-compiled), which makes them extremely fast to execute.
Refresher on Basic Descriptive Statistics
For any data wrangling task, it is useful to extract basic descriptive statistics that summarize the data, such as the mean, median, and mode, and to create some simple visualizations or plots. These plots are often the first step in identifying fundamental patterns as well as oddities (if present) in the data. In any statistical analysis, descriptive statistics is the first step, followed by inferential statistics, which tries to infer the underlying distribution or process that might have generated the data. In other words, descriptive statistics tells us about the basic characteristics of the data at hand, while inferential statistics helps us understand not only the data we are working with but also other data that the same process might produce.
Since inferential statistics is intimately coupled with the machine learning/predictive modeling stage of a data science pipeline, descriptive statistics naturally becomes associated with the data wrangling aspect.
There are two broad approaches to descriptive statistical analysis:
- Graphical techniques: Bar plots, scatter plots, line charts, box plots, histograms, and so on
- The calculation of the central tendency and spread: Mean, median, mode, variance, standard deviation, range, and so on
In this section, we will demonstrate how you can accomplish both of these tasks using Python. Apart from NumPy and pandas, we will need to learn the basics of another great package, matplotlib, which is one of the most widely used and versatile visualization libraries in Python.
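As a rough illustration of the second approach, the sketch below computes a few measures of central tendency and spread with NumPy and pandas. The weight values are borrowed from the exercise that follows, and the variable names are chosen here purely for illustration.

```python
import numpy as np
import pandas as pd

# Weights (in kg) borrowed from the exercise later in this section
weight = [55, 35, 77, 68, 70, 60, 72, 69, 18, 65, 82, 48]

w = np.array(weight)
mean_w = w.mean()                # arithmetic mean
median_w = np.median(w)          # middle value of the sorted data
range_w = w.max() - w.min()      # spread between the extremes
var_pop = w.var()                # population variance (NumPy default, ddof=0)
std_pop = w.std()                # population standard deviation (ddof=0)

s = pd.Series(weight)
std_sample = s.std()             # sample standard deviation (pandas default, ddof=1)

print(mean_w, median_w, range_w)
print(var_pop, std_pop, std_sample)
print(s.describe())              # count, mean, std, min, quartiles, max in one call
```

Note that NumPy and pandas use different defaults for the standard deviation (ddof=0 versus ddof=1), so the two results differ slightly on a small sample such as this one.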
Exercise 3.17: Introduction to Matplotlib through a Scatter Plot
In this exercise, we will demonstrate the power and simplicity of matplotlib by creating a simple scatter plot from self-created data about the age, weight, and height of a few people. To do so, let's go through the following steps:
- First, we will define simple lists of the names of people, along with their age, weight (in kilograms), and height (in centimeters):
people = ['Ann','Brandon','Chen','David','Emily',\
'Farook','Gagan','Hamish','Imran',\
'Joseph','Katherine','Lily']
age = [21,12,32,45,37,18,28,52,5,40,48,15]
weight = [55,35,77,68,70,60,72,69,18,65,82,48]
height = [160,135,170,165,173,168,175,159,105,\
171,155,158]
- Import the most important module from matplotlib, called pyplot:
import matplotlib.pyplot as plt
- Create a simple scatter plot of age versus weight:
plt.scatter(age,weight)
plt.show()
The output is as follows:
Figure 3.20: A screenshot of a scatter plot containing age and weight
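Since pandas also offers built-in plotting methods, a similar basic scatter plot could be produced from a DataFrame. The following is a minimal sketch under a few assumptions: the DataFrame name df and its column names are made up for illustration, and the Agg backend is selected only so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit this line in a notebook
import pandas as pd

age = [21, 12, 32, 45, 37, 18, 28, 52, 5, 40, 48, 15]
weight = [55, 35, 77, 68, 70, 60, 72, 69, 18, 65, 82, 48]

# Hypothetical DataFrame holding the two lists as columns
df = pd.DataFrame({"age": age, "weight": weight})

# pandas delegates the drawing to matplotlib and returns the Axes object
ax = df.plot.scatter(x="age", y="weight")

# In an interactive session, matplotlib.pyplot.show() would display the figure
```

The returned Axes object can be customized further with the usual matplotlib calls.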
The preceding plot can be improved by enlarging the figure size, customizing the aspect ratio, adding a title with a proper font size, adding x-axis and y-axis labels with a customized font size, adding grid lines, changing the y-axis limit to be between 0 and 100, adding x and y tick marks, customizing the scatter plot's color, and changing the size of the scatter dots.
- The code for the improved plot is as follows:
plt.figure(figsize=(8,6))
plt.title("Plot of Age vs. Weight (in kgs)",\
fontsize=20)
plt.xlabel("Age (years)",fontsize=16)
plt.ylabel("Weight (kgs)",fontsize=16)
plt.grid(True)
plt.ylim(0,100)
plt.xticks([i*5 for i in range(12)],fontsize=15)
plt.yticks(fontsize=15)
plt.scatter(x=age,y=weight,c='orange',s=150,\
edgecolors='k')
plt.text(x=20,y=85,s="Weights after 18-20 years of age",\
fontsize=15)
plt.vlines(x=20,ymin=0,ymax=80,linestyles='dashed',\
color='blue',lw=3)
plt.legend(['Weight in kgs'],loc=2,fontsize=12)
plt.show()
The output is as follows:
Figure 3.21: A screenshot of a scatter plot showing age versus weight
We can observe the following things:
- A tuple (8,6) is passed as an argument for the figure size.
- A list comprehension is used inside xticks to create a customized list of tick positions: 0, 5, 10, …, 55.
- The string passed to plt.text() is shown on a single line here; a newline (\n) character can be inserted in it to break the annotation across two lines.
- The plt.show() function is used at the very end. The idea is to keep on adding various graphics properties (font, color, axis limits, text, legend, grid, and so on) until you are satisfied and then show the plot with one function call. In a plain Python script, the plot will not be displayed without this last call; a Jupyter notebook using the inline backend renders the figure at the end of the cell even without it.
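To make the tick positions concrete, the short sketch below evaluates the list comprehension passed to xticks on its own and compares it with an equivalent range() call. The variable names are illustrative only.

```python
# The list comprehension used for the x-axis ticks in the exercise
ticks = [i * 5 for i in range(12)]

# An equivalent way to build the same list with a step of 5
same_ticks = list(range(0, 60, 5))

print(ticks)
print(ticks == same_ticks)

# figsize is a (width, height) tuple measured in inches
figsize = (8, 6)
```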
The preceding plot is quite self-explanatory. We can observe that the variations in weight are reduced after 18-20 years of age.
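The visual impression can be checked roughly with numbers. The sketch below splits the weights at age 20, the position of the dashed line in the plot, and compares the standard deviation of the two groups. The cutoff and the variable names are assumptions made for illustration, and with only 12 data points this is a sanity check rather than a statistical result.

```python
import numpy as np

age = np.array([21, 12, 32, 45, 37, 18, 28, 52, 5, 40, 48, 15])
weight = np.array([55, 35, 77, 68, 70, 60, 72, 69, 18, 65, 82, 48])

# Boolean masks select the weights on each side of the assumed cutoff
younger = weight[age <= 20]
older = weight[age > 20]

# Population standard deviation (NumPy default, ddof=0) of each group
std_younger = younger.std()
std_older = older.std()

print(len(younger), std_younger)
print(len(older), std_older)
```

On this data, the group older than 20 shows the smaller spread, which agrees with what the plot suggests.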
Note
To access the source code for this specific section, please refer to https://packt.live/3hFzysK.
You can also run this example online at https://packt.live/3eauxWP.
In this exercise, we have gone through the basics of using matplotlib, a popular charting library. In the next section, we will look at the definition of statistical measures.