- Learning Spark SQL
- Aurobindo Sarkar
- 2021-07-02 18:23:52
Analyzing missing data
To get a sense of how many rows in the RDD contain one or more missing fields, we can create a new RDD that holds only those rows.
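A minimal sketch of this step, assuming the weather data is an RDD of comma-separated lines in which a missing field shows up as an empty string. The sample rows, the `weatherRDD` name, and the `hasMissing` helper are hypothetical and only illustrate the idea.

```scala
import org.apache.spark.sql.SparkSession

object MissingRowsRdd {
  // True if any comma-separated field is empty. The limit of -1 makes
  // split keep trailing empty strings, so "a,b,," yields four fields.
  def hasMissing(line: String): Boolean =
    line.split(",", -1).exists(_.trim.isEmpty)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("missing-rows-rdd")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical sample rows: date, max temperature, max gust speed, events
    val weatherRDD = spark.sparkContext.parallelize(Seq(
      "2014-01-01,7,45,Rain",
      "2014-01-02,9,,",
      "2014-01-03,8,39,",
      "2014-01-04,6,,Fog"
    ))

    // New RDD holding only the rows with at least one missing field
    val rowsWithMissing = weatherRDD.filter(hasMissing)
    println(s"Rows with missing fields: ${rowsWithMissing.count()}")

    spark.stop()
  }
}
```

Passing `-1` as the split limit matters here: without it, trailing empty fields are dropped and rows that are missing only their last columns would go undetected.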

We can do the same if our data is available in a DataFrame.
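For a DataFrame, one possible approach is to build a single predicate that is true whenever any column is null. The sketch below assumes that missing fields are represented as nulls (as they typically are after reading a CSV file); the sample data and the `anyNull` helper are hypothetical.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object MissingRowsDf {
  // Predicate that holds for rows in which at least one column is null
  def anyNull(df: DataFrame): Column =
    df.columns.map(c => col(c).isNull).reduce(_ || _)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("missing-rows-df")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample; None becomes null in the DataFrame
    val rows: Seq[(String, Option[Int], Option[Int], Option[String])] = Seq(
      ("2014-01-01", Some(7), Some(45), Some("Rain")),
      ("2014-01-02", Some(9), None, None),
      ("2014-01-03", Some(8), Some(39), None),
      ("2014-01-04", Some(6), None, Some("Fog"))
    )
    val weatherDF = rows.toDF("Date", "Max TemperatureC", "Max Gust Speed Km/h", "Events")

    val rowsWithMissing = weatherDF.filter(anyNull(weatherDF))
    rowsWithMissing.show()
    println(s"Rows with missing fields: ${rowsWithMissing.count()}")

    spark.stop()
  }
}
```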

A quick check of the Dataset reveals that most of the rows with missing data also have missing values in the Events and Max Gust Speed Km/h columns. Filtering on these two columns alone actually captures all the rows with missing field values, and the resulting count matches the number of rows with missing values found in the RDD.
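The comparison described above can be sketched as follows, again assuming that missing values are nulls. The column names Events and Max Gust Speed Km/h come from the text; the remaining columns, the sample rows, and the `eventsOrGustMissing` name are hypothetical.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.col

object MissingByTwoColumns {
  // Rows where either of the two columns named in the text is null
  val eventsOrGustMissing: Column =
    col("Events").isNull || col("Max Gust Speed Km/h").isNull

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("missing-by-two-columns")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample; None becomes null in the DataFrame
    val rows: Seq[(String, Option[Int], Option[Int], Option[String])] = Seq(
      ("2014-01-01", Some(7), Some(45), Some("Rain")),
      ("2014-01-02", Some(9), None, None),
      ("2014-01-03", Some(8), Some(39), None),
      ("2014-01-04", Some(6), None, Some("Fog"))
    )
    val weatherDF = rows.toDF("Date", "Max TemperatureC", "Max Gust Speed Km/h", "Events")

    // Count rows with a null in any column, for comparison
    val anyNull = weatherDF.columns.map(c => col(c).isNull).reduce(_ || _)
    val viaAnyColumn = weatherDF.filter(anyNull).count()

    // Count rows with a null in one of the two columns only
    val viaTwoColumns = weatherDF.filter(eventsOrGustMissing).count()

    println(s"Any column null: $viaAnyColumn, Events or gust speed null: $viaTwoColumns")
    spark.stop()
  }
}
```

If the two counts agree, as the text reports for the weather data, the two-column filter is sufficient to identify every incomplete row.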

Because many rows contain one or more missing fields, we choose to retain these rows so that we do not lose valuable information. To do so, we define a function that inserts 0 into all the missing fields of an RDD.
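One way such a function could look, assuming each RDD element is a comma-separated line and a missing field is an empty string. The `fillMissing` name and the sample rows are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object FillMissingRdd {
  // Replace every empty field with "0". The limit of -1 keeps trailing
  // empty fields so they are filled as well.
  def fillMissing(line: String): String =
    line.split(",", -1)
      .map(field => if (field.trim.isEmpty) "0" else field)
      .mkString(",")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("fill-missing-rdd")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical sample rows: date, max temperature, max gust speed, events
    val weatherRDD = spark.sparkContext.parallelize(Seq(
      "2014-01-01,7,45,Rain",
      "2014-01-02,9,,",
      "2014-01-04,6,,Fog"
    ))

    // Every row is retained; only the empty fields change
    val filledRDD = weatherRDD.map(fillMissing)
    filledRDD.collect().foreach(println)

    spark.stop()
  }
}
```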

We can then replace the 0 inserted in the previous step with NA in the string fields.
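A sketch of this replacement, assuming we know by position which fields are strings. Here only the last field (Events, index 3) is treated as a string field; that layout, the `stringFieldIndexes` set, and the `zeroToNa` helper are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object ZeroToNa {
  // Assumption: index 3 (Events) is the only string field in a row
  val stringFieldIndexes: Set[Int] = Set(3)

  // Turn "0" into "NA" in string fields; numeric fields keep their 0
  def zeroToNa(line: String): String =
    line.split(",", -1).zipWithIndex.map {
      case (field, i) if stringFieldIndexes.contains(i) && field == "0" => "NA"
      case (field, _) => field
    }.mkString(",")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("zero-to-na")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical rows in which missing fields were already filled with 0
    val filledRDD = spark.sparkContext.parallelize(Seq(
      "2014-01-01,7,45,Rain",
      "2014-01-02,9,0,0",
      "2014-01-04,6,0,Fog"
    ))

    val cleanedRDD = filledRDD.map(zeroToNa)
    cleanedRDD.collect().foreach(println)

    spark.stop()
  }
}
```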

At this stage, we can combine the rows of the four Datasets into a single Dataset using the union operation.
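The union step can be sketched as follows, assuming the four Datasets share the same schema. The `Weather` case class, its fields, and the sample records are hypothetical stand-ins for the real weather data.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type shared by all four Datasets
case class Weather(date: String, maxTempC: Int, events: String)

object UnionDatasets {
  // Fold a non-empty sequence of Datasets into one with repeated union
  def unionAll[T](parts: Seq[Dataset[T]]): Dataset[T] =
    parts.reduce(_ union _)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("union-datasets")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val ds1 = Seq(Weather("2011-01-01", 5, "Rain")).toDS()
    val ds2 = Seq(Weather("2012-01-01", 6, "NA")).toDS()
    val ds3 = Seq(Weather("2013-01-01", 4, "Fog")).toDS()
    val ds4 = Seq(Weather("2014-01-01", 7, "Rain")).toDS()

    val weatherDS = unionAll(Seq(ds1, ds2, ds3, ds4))
    println(s"Combined rows: ${weatherDS.count()}")

    spark.stop()
  }
}
```

Note that `union` matches columns by position rather than by name and does not remove duplicate rows, so the inputs should have identical column order.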

This completes the processing of our second Dataset, which contains the weather data. In the next section, we combine these preprocessed Datasets using a join operation.