Book: Learning Spark SQL
Author: Aurobindo Sarkar
Updated: 2021-07-02 18:23:52

Analyzing missing data

To get a sense of how many rows contain one or more missing fields, we can create an RDD of just those rows:
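The filtering logic can be illustrated with a minimal sketch in plain Python rather than the Spark API. The sample lines and the helper name `has_missing_field` are hypothetical and stand in for the comma-separated weather records; this is not the book's own listing.

```python
# Hypothetical sample: each string mimics one comma-separated weather record.
lines = [
    "2014-01-01,10,Rain,45",
    "2014-01-02,,Fog,",
    "2014-01-03,8,,30",
    "2014-01-04,12,Snow,50",
]


def has_missing_field(line):
    """Return True if any comma-separated field is empty or whitespace only."""
    return any(field.strip() == "" for field in line.split(","))


# Analogous to filtering an RDD down to the rows with missing fields.
rows_with_missing = [line for line in lines if has_missing_field(line)]
print(len(rows_with_missing))  # prints 2
```

Python's `split(",")` keeps trailing empty strings, so a record ending in a comma is still detected. When porting this idea to Scala or Java, note that `String.split` drops trailing empty strings unless a negative limit is passed.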

We can do the same if our data is available in a DataFrame, as shown:

A quick check of the Dataset reveals that most of the rows with missing data also have missing values for the Events and Max Gust Speed Km/h columns. Filtering on these two columns actually captures all the rows with missing field values, and the count matches the result obtained for the RDD.
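The comparison described above can be sketched in plain Python with records keyed by column name. Only `Events` and `Max Gust Speed Km/h` come from the text; the `Date` and `Temp` columns, the sample values, and the assumption that the filter means "either column is missing" are illustrative.

```python
# Hypothetical records; only "Events" and "Max Gust Speed Km/h" are named in the text.
records = [
    {"Date": "2014-01-01", "Temp": "10", "Events": "Rain", "Max Gust Speed Km/h": "45"},
    {"Date": "2014-01-02", "Temp": "", "Events": "Fog", "Max Gust Speed Km/h": ""},
    {"Date": "2014-01-03", "Temp": "8", "Events": "", "Max Gust Speed Km/h": "30"},
    {"Date": "2014-01-04", "Temp": "12", "Events": "Snow", "Max Gust Speed Km/h": "50"},
]


def is_missing(value):
    """Treat None, empty strings, and whitespace-only strings as missing."""
    return value is None or str(value).strip() == ""


# Rows with a missing value in any column.
any_missing = [r for r in records if any(is_missing(v) for v in r.values())]

# Rows with a missing value in either of the two columns of interest (assumed OR).
two_col_missing = [
    r for r in records
    if is_missing(r["Events"]) or is_missing(r["Max Gust Speed Km/h"])
]

# The two filters agree here only because of how the sample data is shaped.
print(len(any_missing), len(two_col_missing), any_missing == two_col_missing)
```

The two filters agree only when every incomplete row is missing at least one of the two columns. That is a property of the data, so it is worth re-checking whenever the input changes.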

Because many rows contain one or more missing fields, we choose to retain them so that we do not lose valuable information. In the following function, we insert 0 into every missing field of an RDD.
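A minimal plain-Python sketch of such a fill function is shown here, assuming comma-separated records. The name `fill_missing_with_zero` and the sample line are hypothetical, and the function in the book may differ.

```python
def fill_missing_with_zero(line):
    """Replace every empty comma-separated field in a record with "0"."""
    fields = line.split(",")
    filled = [field if field.strip() != "" else "0" for field in fields]
    return ",".join(filled)


# Hypothetical record with two missing fields.
print(fill_missing_with_zero("2014-01-02,,Fog,"))  # prints 2014-01-02,0,Fog,0
```

Applied to every record, for example through a map over the collection, this keeps the row count unchanged while guaranteeing that no field is empty.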

We can replace the 0 inserted in the previous step with NA in the string fields, as follows:
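The replacement can be sketched in plain Python as follows. Which columns count as string fields is an assumption here (`STRING_COLUMNS` is hypothetical and lists only `Events`); numeric columns keep their 0.

```python
# Assumption: "Events" is the only string-typed column in this sample.
STRING_COLUMNS = {"Events"}


def zero_to_na(record):
    """Replace the placeholder "0" with "NA", but only in string columns."""
    return {
        col: ("NA" if col in STRING_COLUMNS and val == "0" else val)
        for col, val in record.items()
    }


# Hypothetical record after the zero-fill step.
filled = {"Date": "2014-01-03", "Events": "0", "Max Gust Speed Km/h": "0"}
print(zero_to_na(filled))
```

Restricting the replacement to string columns matters because a 0 in a numeric column may later be used in arithmetic, whereas a 0 in a text column such as Events would be read as a category label.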

We can now combine the rows of the four Datasets into a single Dataset using the union operation.
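As a rough plain-Python illustration, a union of this kind appends rows from each input and keeps duplicates, so all inputs need the same column layout. The four sample datasets (`part_1` to `part_4`), the `COLUMNS` tuple, and the helper `union_all` are hypothetical.

```python
from itertools import chain

# Hypothetical shared column layout for all four inputs.
COLUMNS = ("Date", "Events")

part_1 = [("2014-01-01", "Rain")]
part_2 = [("2014-01-02", "Fog")]
part_3 = [("2014-01-03", "NA")]
part_4 = [("2014-01-01", "Rain")]  # duplicate of a row in part_1, kept on purpose


def union_all(*datasets):
    """Concatenate datasets row by row, keeping duplicates."""
    for dataset in datasets:
        for row in dataset:
            if len(row) != len(COLUMNS):
                raise ValueError(f"row {row!r} does not match columns {COLUMNS!r}")
    return list(chain.from_iterable(datasets))


combined = union_all(part_1, part_2, part_3, part_4)
print(len(combined))  # prints 4
```

Because duplicates are kept, the combined row count equals the sum of the input counts, which gives a quick sanity check after the union.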

At this stage, the processing of our second Dataset containing weather data is complete. In the next section, we combine these pre-processed Datasets using a join operation.
