Title: Feature Engineering Made Easy
Authors: Sinan Ozdemir, Divya Susarla
Last updated: 2021-06-25 22:45:55

Mathematical operations allowed

We gain a few new abilities at the ordinal level compared to the nominal level. We may still do basic counts as we did at the nominal level, but we can also introduce comparisons and orderings into the mix. For this reason, we may utilize new graphs at this level. Bar and pie charts are still available, and because we now have ordering and comparisons, we can also calculate medians and percentiles. With medians and percentiles, stem-and-leaf plots, as well as box plots, become possible.

Some examples of data at the ordinal level include:

Ratings on a Likert scale (rating something on a scale from one to ten, for example)
Grade levels on an exam (F, D, C, B, A)
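
As a minimal sketch of what comparisons and orderings look like in practice, the exam grades above can be stored as an ordered categorical in pandas. The grade values below are hypothetical and only serve as an illustration:

```python
import pandas as pd

# Hypothetical exam grades, for illustration only
grades = pd.Series(['B', 'A', 'F', 'C', 'B', 'D'])

# Declare the order explicitly: F < D < C < B < A
grade_type = pd.CategoricalDtype(categories=['F', 'D', 'C', 'B', 'A'], ordered=True)
grades = grades.astype(grade_type)

# Counts still work, just as at the nominal level
counts = grades.value_counts()

# Comparisons and orderings are now meaningful
passed = grades >= 'C'                        # element-wise comparison against a category
lowest, highest = grades.min(), grades.max()  # lowest and highest grade by the declared order
ranked = grades.sort_values()                 # sorted by the declared order, not alphabetically

print(passed.tolist())
print(lowest, highest)
print(ranked.tolist())
```

Without the ordered flag, pandas would refuse the comparison and the min/max calls, which mirrors the restriction we face at the nominal level.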

For a real-world example of data at the ordinal level, let's bring in a new dataset. This dataset holds key insights into how much people enjoy San Francisco International Airport (SFO). It is also publicly available on SF's open database (https://data.sfgov.org/Transportation/2013-SFO-Customer-Survey/mjr8-p6m5):

import pandas as pd

# load in the data set
customer = pd.read_csv('../data/2013_SFO_Customer_survey.csv')

This CSV has many, many columns:

customer.shape

(3535, 95)

95 columns, to be exact. For more information on the columns available in this dataset, check out the data dictionary on the website (https://data.sfgov.org/api/views/mjr8-p6m5/files/FHnAUtMCD0C8CyLD3jqZ1-Xd1aap8L086KLWQ9SKZ_8?download=true&filename=AIR_DataDictionary_2013-SFO-Customer-Survey.pdf).

For now, let's focus on a single column, Q7A_ART. As described by the publicly available data dictionary, Q7A_ART is about artwork and exhibitions. The possible choices are 0, 1, 2, 3, 4, 5, 6 and each number has a meaning:

1: Unacceptable
2: Below Average
3: Average
4: Good
5: Outstanding
6: Have Never Used or Visited
0: Blank
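
The code list above can be kept next to the data as a lookup table. The following is a small sketch, assuming a hypothetical handful of responses in place of the real column; the name ART_LABELS is ours, not part of the dataset:

```python
import pandas as pd

# Code-to-label mapping copied from the list above
ART_LABELS = {
    0: 'Blank',
    1: 'Unacceptable',
    2: 'Below Average',
    3: 'Average',
    4: 'Good',
    5: 'Outstanding',
    6: 'Have Never Used or Visited',
}

# Hypothetical responses standing in for customer['Q7A_ART']
sample = pd.Series([4, 5, 0, 3, 6, 4], name='Q7A_ART')

# Translate each numeric code into its human-readable meaning
labeled = sample.map(ART_LABELS)
print(labeled.tolist())
```

Seeing the labels makes it clear that 0 and 6 are not points on the rating scale, which is why they are set aside later.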

We can pull out the column and summarize it as follows:

art_ratings = customer['Q7A_ART']
art_ratings.describe()


count    3535.000000
mean        4.300707
std         1.341445
min         0.000000
25%         3.000000
50%         4.000000
75%         5.000000
max         6.000000
Name: Q7A_ART, dtype: float64

pandas treats the column as numerical because it is full of numbers. However, we must remember that even though the cells' values are numbers, those numbers represent categories, so this data belongs to the qualitative side and, more specifically, to the ordinal level. If we remove the 0 and 6 categories, we are left with five ordinal categories that closely resemble the star ratings used for restaurants:

# only consider ratings 1-5
art_ratings = art_ratings[(art_ratings >= 1) & (art_ratings <= 5)]

We will then cast the values as strings:

# cast the values as strings
art_ratings = art_ratings.astype(str)

art_ratings.describe()

count     2656
unique       5
top          4
freq      1066
Name: Q7A_ART, dtype: object
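
The describe() output reports only the most frequent rating. As a rough sketch, using a hypothetical sample in place of the real column, the full distribution can be listed in rating order rather than in order of frequency:

```python
import pandas as pd

# Hypothetical ratings standing in for the filtered, string-cast art_ratings
sample = pd.Series(['4', '5', '3', '4', '2', '5', '4', '1'], name='Q7A_ART')

# value_counts() sorts by frequency; sort_index() restores the rating order.
# Sorting the string labels matches the numeric order here only because
# every label is a single digit.
counts = sample.value_counts().sort_index()
shares = sample.value_counts(normalize=True).sort_index()

print(counts.to_dict())
print(shares.round(3).to_dict())
```

Keeping the categories in their natural order is what separates an ordinal summary from a nominal one, where any arrangement of the categories is equally valid.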

Now that we have our ordinal data in the right format, let's look at some visualizations:

# Can use pie charts, just like in nominal level
art_ratings.value_counts().plot(kind='pie')

The following is the result of the preceding code:

We can also visualize this as a bar chart as follows:

# Can use bar charts, just like in nominal level
art_ratings.value_counts().plot(kind='bar')

The following is the output of the preceding code:

However, now we can also introduce box plots since we are at the ordinal level:

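# Box plots are available at the ordinal level
# Convert the string labels back to numbers so the box summarizes the ratings themselves, not the five category counts
pd.to_numeric(art_ratings).plot(kind='box')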

The following is the output of the preceding code:

A box plot like this would not be possible for the Grade column in the salary data: without an ordering, there is no median to find.
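
A box plot is built from the minimum, the quartiles, and the maximum. The sketch below computes those five numbers directly, assuming a hypothetical sample of ratings rather than the real survey column:

```python
import pandas as pd

# Hypothetical ratings standing in for the filtered art_ratings (values 1-5)
sample = pd.Series([4, 5, 3, 4, 2, 5, 4, 1, 3])

# interpolation='nearest' makes every quantile an actual rating instead of
# a value in between two ratings, which would have no meaning on this scale
five_number = sample.quantile([0, 0.25, 0.5, 0.75, 1], interpolation='nearest')

print(five_number.tolist())
```

Each of these values depends only on the order of the ratings, which is why they are available at the ordinal level but not at the nominal level.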
