- The Data Analysis Workshop
- Gururajan Govindan Shubhangi Hora Konstantin Palagachev
- 2021-06-18 18:18:26
Initial Analysis of the Reason for Absence
Let's start with a simple analysis of the Reason for absence column. We will try to address questions such as: What is the most common reason for absence? Does being a drinker or smoker affect the reasons? Does the distance to work affect them? Starting with questions like these is often useful in data analysis, as it is a good way to build confidence in, and an understanding of, the data.
The first thing we are interested in is the overall distribution of the absence reasons in the data, that is, how many entries the dataset contains for each reason for absence. We can answer this question with the countplot() function from the seaborn package:
# get the number of entries for each reason for absence,
# split by the Disease column (ICD-encoded reason or not)
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 5))
ax = sns.countplot(data=preprocessed_data, x="Reason for absence",
                   hue="Disease")
ax.set_ylabel("Number of entries per reason for absence")
plt.savefig('figs/absence_reasons_distribution.png',
            format='png', dpi=300)
The output will be as follows:

Figure 2.6: Number of entries for all reasons for absence
Note that we also used the Disease column as the hue parameter. This helps us to distinguish between disease-related reasons (those listed in the ICD encoding) and those that are not. Following Figure 2.3, we can assert that the most frequent reasons for absence are related to medical consultations (23), dental consultations (28), and physiotherapy (27). Among the ICD-encoded reasons, on the other hand, the most frequent are diseases of the musculoskeletal system and connective tissue (13) and injury, poisoning, and certain other consequences of external causes (19).
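The counts behind such a plot can also be checked numerically. The following is a minimal sketch rather than the book's own code: toy_data is a hypothetical stand-in for preprocessed_data with invented values, and it assumes that the Disease column holds "Yes" for ICD-encoded reasons and "No" otherwise.

```python
import pandas as pd

# Hypothetical toy data standing in for preprocessed_data; the values are
# invented for illustration and are not taken from the real dataset.
toy_data = pd.DataFrame({
    "Reason for absence": [23, 23, 23, 28, 28, 27, 13, 13, 19, 0],
    "Disease": ["No", "No", "No", "No", "No", "No",
                "Yes", "Yes", "Yes", "No"],
})

# Number of entries for each (Disease, reason) pair; this is the same
# quantity that countplot() draws as bar heights when hue="Disease".
counts = toy_data.groupby(["Disease", "Reason for absence"]).size()

# Most frequent reason within each group.
top_disease_reason = counts.loc["Yes"].idxmax()
top_other_reason = counts.loc["No"].idxmax()

print(counts)
print("Most frequent ICD-encoded reason:", top_disease_reason)
print("Most frequent non-ICD reason:", top_other_reason)
```

Running the same groupby on the real preprocessed_data would give the exact numbers behind each bar, which is useful when two bars in the figure are too close in height to compare by eye.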
To perform a more accurate and in-depth analysis of the data, we will investigate the impact of the various features on the Reason for absence and Absenteeism in hours columns in the following sections, starting with the data on social drinkers and smokers.