So far, we have seen data collected from multiple individuals at a single point in time (cross-sectional) or from a single entity over multiple points in time (time series). If we observe multiple entities over multiple points in time, we get panel data, also known as longitudinal data. Extending our earlier example about military expenditure, let us now consider four countries over the same period of 1960-2010. The resulting data is a panel dataset. The following figure illustrates the panel data in this scenario. Rows with missing values, corresponding to the period 1960 to 1987, have been dropped before plotting the data.
Figure 1.4: Example of panel data
A generic panel data regression model can be stated as $y_{it} = W x_{it} + b + \epsilon_{it}$, which expresses the dependent variable $y_{it}$ as a linear function of the explanatory variable $x_{it}$, where $W$ are the weights of $x_{it}$, $b$ is the bias term, and $\epsilon_{it}$ is the error. Here, $i$ indexes the individuals for whom data is collected, and $t$ indexes the points in time. This type of panel data analysis seeks to model the variations across both multiple individuals and multiple points in time. The variations are reflected by $\epsilon_{it}$, and the assumptions made about it determine the necessary mathematical treatment. For example, if $\epsilon_{it}$ is assumed to vary non-stochastically with respect to $i$ and $t$, it reduces to a dummy variable for each individual or time period. This type of analysis is known as a fixed effects model. On the other hand, if $\epsilon_{it}$ varies stochastically over $i$ and $t$, the error requires special treatment, which is handled in a random effects model.
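To make the fixed effects idea concrete, the following is a minimal sketch on synthetic data (not the military expenditure data). It assumes one common formulation in which each entity gets its own intercept, estimated through dummy variables and ordinary least squares. All names (`true_w`, `true_alpha`, `entity`) and the generated values are illustrative assumptions.

```python
import numpy as np

# Synthetic panel: 4 entities observed over 20 periods (illustrative values only)
rng = np.random.default_rng(0)
n_entities, n_periods = 4, 20
true_w = 2.0                                   # assumed slope shared by all entities
true_alpha = np.array([1.0, -0.5, 3.0, 0.0])   # assumed entity-specific intercepts

# entity[k] tells which entity observation k belongs to
entity = np.repeat(np.arange(n_entities), n_periods)
x = rng.normal(size=n_entities * n_periods)
noise = rng.normal(scale=0.1, size=x.size)
y = true_w * x + true_alpha[entity] + noise

# Design matrix: the regressor plus one dummy column per entity.
# No separate global intercept is added, which avoids perfect collinearity
# with the full set of dummies.
dummies = np.eye(n_entities)[entity]
X = np.column_stack([x, dummies])

# Ordinary least squares on the dummy-augmented design matrix
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
w_hat, alpha_hat = coef[0], coef[1:]

print('estimated slope:', w_hat)
print('estimated entity intercepts:', alpha_hat)
```

The estimated slope and intercepts should land close to the assumed values because the noise is small. A random effects treatment would instead model the entity-specific component as a random draw, which this sketch does not attempt.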
Let us prepare the data that is required to plot the preceding figure. We will continue to expand the code we have used for the cross-sectional and time series data previously in this chapter. We start by creating a DataFrame having the data for the four countries shown in the preceding plot. This is done as follows:
Now that the data is ready for all four countries, we will plot them using the following code:
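A minimal sketch of this preparation step could look like the following. It assumes a wide table with the hypothetical columns `Country Code` and `Indicator Name` plus one column per year labelled `'1960'`, `'1961'`, and so on. The helper name `country_series` and this column layout are assumptions for illustration and may differ from the notebook.

```python
import pandas as pd

def country_series(data, country_code,
                   indicator='Military expenditure (% of GDP)'):
    """Return one country's yearly values for an indicator, indexed by year."""
    # Select the single row for this country and indicator (assumed column names)
    mask = ((data['Country Code'] == country_code) &
            (data['Indicator Name'] == indicator))
    # Year columns are assumed to be labelled with digits only, e.g. '1988'
    year_cols = [c for c in data.columns if str(c).isdigit()]
    row = data.loc[mask, year_cols]
    if row.empty:
        raise KeyError(f'no row for {country_code!r} / {indicator!r}')
    series = row.iloc[0].astype(float)
    series.index = [int(c) for c in series.index]
    # Drop years that have no reported value
    return series.dropna()

# Usage, assuming `data` is the wide table loaded earlier in the chapter
# and that the countries are identified by these three-letter codes:
# usa = country_series(data, 'USA')
# chn = country_series(data, 'CHN')
# gbr = country_series(data, 'GBR')
# ind = country_series(data, 'IND')
```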
import matplotlib.pyplot as plt

# Draw all four country series on the same figure
plt.figure(figsize=(5.5, 5.5))
usa.plot(linestyle='-', marker='*', color='b')
chn.plot(linestyle='-', marker='*', color='r')
gbr.plot(linestyle='-', marker='*', color='g')
ind.plot(linestyle='-', marker='*', color='y')
# Legend entries follow the order in which the series were plotted
plt.legend(['USA', 'CHINA', 'UK', 'INDIA'], loc=1)
plt.title('Military expenditure of 4 countries')
plt.ylabel('Military expenditure (% of GDP)')
plt.xlabel('Years')
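The plot treats each country as a separate series. To work with the same data as a single panel, one option is to stack the series into a long-format DataFrame indexed by entity and time. The sketch below assumes each country is a pandas Series indexed by year. The helper name `to_panel` and the column name `mil_exp_pct_gdp` are hypothetical.

```python
import pandas as pd

def to_panel(series_by_country, value_name='mil_exp_pct_gdp'):
    """Stack per-country Series into one DataFrame indexed by (country, year)."""
    # Dict keys become the outer index level, each Series' own index the inner level
    panel = pd.concat(series_by_country, names=['country', 'year'])
    return panel.rename(value_name).to_frame()

# Usage with the four series plotted above:
# panel = to_panel({'USA': usa, 'CHINA': chn, 'UK': gbr, 'INDIA': ind})
# panel.loc['USA']   # the time series of one entity
```

Each row of the result is one observation for one country in one year, which is the (i, t) structure a panel regression works with. Countries may contribute different numbers of years, in which case the panel is unbalanced.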
The Jupyter notebook that has the code used for generating all the preceding figures is Chapter_1_Different_Types_of_Data.ipynb under the code folder in the GitHub repo.
The discussion of the different types of data sets the stage for a closer look at time series. We will start by understanding the special properties typically found in time series data, or in panel data that contains inherent time series.