
Analyzing missing data

If we want to get a sense of how many rows in the RDD contain one or more missing fields, we can create an RDD with these rows:
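As a rough, Spark-free illustration of this step, the following plain Python sketch selects the rows that have at least one missing field. It assumes that each row is a comma-separated string and that a missing field appears as an empty string; the sample rows and the helper name `has_missing_field` are hypothetical. With Spark, a predicate of this kind would be passed to the RDD's `filter` method.

```python
# Hypothetical sample rows; the real weather data has many more columns.
rows = [
    "2014-03-01,12,Rain,45",
    "2014-03-02,14,,",
    "2014-03-03,,Fog,30",
    "2014-03-04,11,Rain,50",
]


def has_missing_field(line: str) -> bool:
    """Return True if any comma-separated field is empty or blank."""
    return any(field.strip() == "" for field in line.split(","))


# Keep only the rows with at least one missing field, then count them.
rows_with_missing = [line for line in rows if has_missing_field(line)]
missing_count = len(rows_with_missing)
print(missing_count)
```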

We can do the same if our data is available in a DataFrame, as shown:

A quick check of the Dataset reveals that most of the rows with missing data also have missing values for the Events and Max Gust Speed Km/h columns. Filtering on these two columns actually captures all the rows with missing field values. The result also matches the one obtained for missing values in the RDD.
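A check of this kind can be sketched in plain Python by comparing two filters: one that looks for a missing value in any column, and one that looks only at Events and Max Gust Speed Km/h. The sample rows are made up, the `Mean Temperature C` column is hypothetical, and `None` is assumed to mark a missing value.

```python
# Hypothetical rows keyed by column name; None marks a missing value.
rows = [
    {"Date": "2014-03-01", "Mean Temperature C": 12,
     "Events": "Rain", "Max Gust Speed Km/h": 45},
    {"Date": "2014-03-02", "Mean Temperature C": 14,
     "Events": None, "Max Gust Speed Km/h": None},
    {"Date": "2014-03-03", "Mean Temperature C": None,
     "Events": None, "Max Gust Speed Km/h": 30},
    {"Date": "2014-03-04", "Mean Temperature C": 11,
     "Events": "Fog", "Max Gust Speed Km/h": None},
]

# Rows with a missing value in any column.
any_missing = [r for r in rows if any(v is None for v in r.values())]

# Rows with a missing value in one of the two columns of interest.
two_column_missing = [
    r for r in rows
    if r["Events"] is None or r["Max Gust Speed Km/h"] is None
]

# If both filters select the same rows, the two columns are sufficient.
same_rows = any_missing == two_column_missing
print(len(any_missing), len(two_column_missing), same_rows)
```

On real data the two filters agree only if every incomplete row is missing at least one of these two columns, which is what the check is meant to confirm.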

Because many rows contain one or more missing fields, we choose to retain them so that we do not lose valuable information. In the following function, we insert 0 in all the missing fields of an RDD.
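The original function is not shown here. A minimal sketch of the same idea in plain Python, assuming comma-separated rows in which a missing field is an empty string, might look as follows; the name `fill_missing_with_zero` and the sample rows are hypothetical. With Spark, a function like this would typically be applied to every row through the RDD's `map` method.

```python
def fill_missing_with_zero(line: str) -> str:
    """Replace every empty or blank comma-separated field with "0"."""
    fields = line.split(",")
    return ",".join(f if f.strip() != "" else "0" for f in fields)


# Hypothetical rows with missing fields.
rows = ["2014-03-02,14,,", "2014-03-03,,Fog,30"]

filled = [fill_missing_with_zero(line) for line in rows]
print(filled)
```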

We can replace the 0 inserted in the previous step with NA in the string fields, as follows:
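One way to sketch this replacement in plain Python is to list the positions of the string columns and rewrite a 0 only in those positions, so that numeric fields keep their 0. The column index in `STRING_COLUMNS`, the helper name `zero_to_na`, and the sample rows are assumptions made for the example.

```python
# Hypothetical: index 2 holds the Events column, the only string field here.
STRING_COLUMNS = {2}


def zero_to_na(line: str) -> str:
    """Replace "0" with "NA" in string columns only; leave other fields as-is."""
    fields = line.split(",")
    return ",".join(
        "NA" if i in STRING_COLUMNS and f == "0" else f
        for i, f in enumerate(fields)
    )


# Rows in which missing fields were previously filled with 0.
rows = ["2014-03-02,14,0,0", "2014-03-03,0,Fog,30"]

cleaned = [zero_to_na(line) for line in rows]
print(cleaned)
```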

We can now combine the rows of the four Datasets into a single Dataset using the union operation.
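Conceptually, a union appends the rows of one collection to another without removing duplicates, which is also how Spark's `union` behaves. The plain Python sketch below shows this with four small, made-up lists standing in for the four Datasets; the variable names are hypothetical.

```python
from itertools import chain

# Four hypothetical pre-processed datasets with the same column layout.
part_1 = ["2013-01-01,5,NA,20"]
part_2 = ["2014-01-01,6,Rain,25", "2014-01-02,7,NA,0"]
part_3 = ["2015-01-01,4,Snow,30"]
part_4 = ["2016-01-01,8,NA,15"]

# Concatenate the rows in order; duplicates, if any, are kept.
combined = list(chain(part_1, part_2, part_3, part_4))
print(len(combined))
```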

At this stage, the processing of our second Dataset containing weather data is complete. In the next section, we combine these pre-processed Datasets using a join operation.
