- Feature Engineering Made Easy
- Sinan Ozdemir Divya Susarla
- 2021-06-25 22:45:55
Mathematical operations allowed
We have a few new abilities to work with at the ordinal level compared to the nominal level. At the ordinal level, we may still do basic counts as we did at the nominal level, but we can also introduce comparisons and orderings into the mix. For this reason, we may utilize new graphs at this level. We may use bar and pie charts like we did at the nominal level, but because we now have ordering and comparisons, we can calculate medians and percentiles. With medians and percentiles, stem-and-leaf plots, as well as box plots, are possible.
Some examples of data at the ordinal level include:
- Using a Likert scale (rating something on a scale from one to ten, for example)
- Grade levels on an exam (F, D, C, B, A)
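To make the allowed operations concrete, here is a minimal sketch using the exam-grade example. The grade values are invented for illustration, and the names `grades` and `grade_type` are hypothetical. It assumes pandas' ordered categorical dtype as one way to record the ordering; other representations would work as well.

```python
import pandas as pd

# hypothetical exam grades, invented purely for illustration
grades = pd.Series(["B", "A", "C", "F", "B", "D", "A", "B"])

# declare the order explicitly: F < D < C < B < A
grade_type = pd.CategoricalDtype(categories=["F", "D", "C", "B", "A"], ordered=True)
grades = grades.astype(grade_type)

# counts still work, exactly as at the nominal level
counts = grades.value_counts(sort=False)

# comparisons and orderings are new at the ordinal level
at_least_b = (grades >= "B").sum()
lowest, highest = grades.min(), grades.max()

# the median is the category found in the middle of the sorted data
sorted_grades = grades.sort_values().reset_index(drop=True)
median_grade = sorted_grades.iloc[len(sorted_grades) // 2]

print(counts)
print(at_least_b, lowest, highest, median_grade)
```

Note that the median here is a category (a grade), not an average of numbers. That is why a median is available at the ordinal level while a mean is not.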
For a real-world example of data at the ordinal level, let's bring in a new dataset. This dataset records how much people enjoy San Francisco International Airport (SFO). It is publicly available on SF's open database (https://data.sfgov.org/Transportation/2013-SFO-Customer-Survey/mjr8-p6m5):
# import pandas, which provides read_csv and the DataFrame type
import pandas as pd
# load in the data set
customer = pd.read_csv('../data/2013_SFO_Customer_survey.csv')
This CSV has many, many columns:
customer.shape
(3535, 95)
There are 95 columns, to be exact. For more information on the columns available in this dataset, check out the data dictionary on the website (https://data.sfgov.org/api/views/mjr8-p6m5/files/FHnAUtMCD0C8CyLD3jqZ1-Xd1aap8L086KLWQ9SKZ_8?download=true&filename=AIR_DataDictionary_2013-SFO-Customer-Survey.pdf).
For now, let's focus on a single column, Q7A_ART. As described by the publicly available data dictionary, Q7A_ART is about artwork and exhibitions. The possible values are 0 through 6, and each number has a meaning:
- 1: Unacceptable
- 2: Below Average
- 3: Average
- 4: Good
- 5: Outstanding
- 6: Have Never Used or Visited
- 0: Blank
We can represent it as follows:
art_ratings = customer['Q7A_ART']
art_ratings.describe()
count    3535.000000
mean        4.300707
std         1.341445
min         0.000000
25%         3.000000
50%         4.000000
75%         5.000000
max         6.000000
Name: Q7A_ART, dtype: float64
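As a small illustration of the coding scheme, the numeric codes can be translated into their data-dictionary labels before counting. This is only a sketch: the `codes` Series is a hypothetical stand-in for `customer['Q7A_ART']`, and `ART_LABELS` is a name introduced here, filled with the labels listed above.

```python
import pandas as pd

# labels copied from the data dictionary entries for Q7A_ART
ART_LABELS = {
    0: "Blank",
    1: "Unacceptable",
    2: "Below Average",
    3: "Average",
    4: "Good",
    5: "Outstanding",
    6: "Have Never Used or Visited",
}

# hypothetical sample of codes standing in for customer['Q7A_ART']
codes = pd.Series([4, 5, 0, 3, 6, 4, 1])

# translate each code into its label
labels = codes.map(ART_LABELS)

# basic counts per label
label_counts = labels.value_counts()
print(label_counts)
```

Seeing the labels makes it clear that codes 0 and 6 are not points on the rating scale, so summary statistics such as the mean above mix ratings with non-answers.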
pandas treats the column as numerical because it is full of numbers. However, we must remember that even though the cells' values are numbers, those numbers represent categories, so this data is qualitative and, more specifically, ordinal. If we remove the 0 and 6 categories, we are left with five ordinal categories that resemble the star ratings given to restaurants:
# only consider ratings 1-5
art_ratings = art_ratings[(art_ratings >= 1) & (art_ratings <= 5)]
We will then cast the values as strings:
# cast the values as strings
art_ratings = art_ratings.astype(str)
art_ratings.describe()
count     2656
unique       5
top          4
freq      1066
Name: Q7A_ART, dtype: object
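Casting to strings makes pandas treat the values as categories, but plain strings carry no declared order. As an alternative sketch (not the approach used in this chapter), an ordered categorical dtype keeps the categorical treatment and also records that 1 < 2 < 3 < 4 < 5. The `ratings` Series below is a hypothetical stand-in for the filtered column, and `rating_type` is a name introduced here.

```python
import pandas as pd

# hypothetical ratings standing in for the filtered Q7A_ART column
ratings = pd.Series([4, 5, 3, 4, 1, 2, 4, 5])

# an ordered categorical dtype records that 1 < 2 < 3 < 4 < 5
rating_type = pd.CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)
ordinal_ratings = ratings.astype(rating_type)

# describe() reports count/unique/top/freq, as it does for strings
summary = ordinal_ratings.describe()
print(summary)

# comparisons respect the declared order
good_or_better = (ordinal_ratings >= 4).sum()
print(good_or_better)
```

The design choice is a trade-off: strings are the simplest way to stop pandas from computing a mean, while an ordered categorical also preserves the ordering that defines the ordinal level.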
Now that we have our ordinal data in the right format, let's look at some visualizations:
# Can use pie charts, just like in nominal level
art_ratings.value_counts().plot(kind='pie')
The following is the result of the preceding code:
We can also visualize this as a bar chart as follows:
# Can use bar charts, just like in nominal level
art_ratings.value_counts().plot(kind='bar')
The following is the output of the preceding code:
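One detail worth knowing when charting ordinal data: `value_counts()` orders its result by frequency, so the bars appear from most to least common rather than in rating order. A minimal sketch of restoring the rating order with `sort_index()` follows. The `ratings` Series is hypothetical sample data standing in for `art_ratings`.

```python
import pandas as pd

# hypothetical string ratings standing in for art_ratings
ratings = pd.Series(["4", "5", "3", "4", "1", "2", "4", "5"])

# value_counts() orders categories by frequency, most common first
by_frequency = ratings.value_counts()

# sort_index() puts the categories back in rating order, 1 through 5
by_rating = ratings.value_counts().sort_index()

print(by_frequency)
print(by_rating)
# chaining .plot(kind='bar') onto by_rating would draw the bars in rating order
```

Sorting string labels works here only because every rating is a single digit. With labels such as '10', an ordered categorical or integer values would be a safer basis for sorting.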
However, now we can also introduce box plots since we are at the ordinal level:
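# Box plots are available at the ordinal level
# convert the string labels back to numbers so the plot summarizes the ratings themselves, not their counts
art_ratings.astype(float).plot(kind='box')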
The following is the output of the preceding code:
This box plot would not be possible for the Grade column in the salary data. That column is nominal, so its values have no order and no median can be found.
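The quantities a box plot summarizes (the median and the quartiles) can also be computed directly. The sketch below uses hypothetical sample ratings, not the survey data. It assumes the `interpolation='lower'` option of `Series.quantile`, so that every result is an actual rating rather than a value interpolated between two ratings, which would have no meaning on an ordinal scale.

```python
import pandas as pd

# hypothetical ratings on the 1-5 scale, invented for illustration
ratings = pd.Series([1, 2, 3, 3, 4, 4, 4, 5, 5, 5])

# 'lower' picks an observed value instead of averaging two neighbours
quartiles = ratings.quantile([0.25, 0.5, 0.75], interpolation="lower")

q1 = quartiles.loc[0.25]
median = quartiles.loc[0.5]
q3 = quartiles.loc[0.75]

print(q1, median, q3)
```

With the default linear interpolation, a percentile can land between two categories (3.5, for example), which is a reasonable number but not a valid rating.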