Title: Learning Spark SQL
Author: Aurobindo Sarkar
Updated: 2021-07-02 18:23:50

Pre-processing of the household electric consumption Dataset

Create a case class called HouseholdEPC to represent one household electric power consumption record.
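The original listing for this step is not reproduced here, so the following is only a minimal sketch. It assumes the input has nine semicolon-separated columns laid out like the UCI household power consumption file (date, time, and seven numeric readings). The short field names are illustrative assumptions; rename them to suit your data.

```scala
// Sketch of the record type. Field names and units are assumptions based on
// a nine-column layout (date, time, seven numeric readings).
case class HouseholdEPC(
  date: String,     // reading date, kept as text
  time: String,     // reading time, kept as text
  gap: Double,      // global active power
  grp: Double,      // global reactive power
  voltage: Double,  // voltage
  gi: Double,       // global intensity
  sm_1: Double,     // sub-metering 1
  sm_2: Double,     // sub-metering 2
  sm_3: Double      // sub-metering 3
)
```

Keeping date and time as strings avoids committing to a timestamp format at this stage; they can be parsed later once the format is confirmed.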

Read the input Dataset into an RDD and count the number of rows in it.
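As a rough sketch of this step, the snippet below loads a text file with `textFile` and counts its lines. So that it runs anywhere, it first writes a three-line stand-in file to a temporary location; the sample lines and the `inputPath` name are illustrative assumptions, and in practice you would point `inputPath` at your real input file.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// In spark-shell, `spark` and `sc` already exist and this setup can be skipped.
val spark = SparkSession.builder().appName("HouseholdEPC").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Stand-in input (made-up sample): a header, one complete row, and one row
// with missing values. Replace inputPath with the location of your file.
val sampleLines = Seq(
  "Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3",
  "16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000",
  "28/4/2007;00:21:00;?;?;?;?;?;?;?"
)
val inputPath = Files
  .write(Files.createTempFile("hhepc", ".txt"), sampleLines.mkString("\n").getBytes("UTF-8"))
  .toUri.toString

val rdd = sc.textFile(inputPath)  // RDD[String], one element per line
val totalLines = rdd.count()      // this count still includes the header line
println(s"Lines read: $totalLines")
```

Note that the count at this point covers every line in the file, including the header and any incomplete rows.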

Next, remove the header row and every row that contains missing values (represented as ? in the input).
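One way to sketch this filtering is shown below. It assumes the header is the first line of the RDD and that a `?` never appears in a valid row; the in-memory sample lines are made up and stand in for the RDD read from the input file.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell, `spark` and `sc` already exist and this setup can be skipped.
val spark = SparkSession.builder().appName("HouseholdEPC").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Made-up sample lines standing in for the RDD[String] read from the input.
val rdd = sc.parallelize(Seq(
  "Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3",
  "16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000",
  "28/4/2007;00:21:00;?;?;?;?;?;?;?"
))

val header = rdd.first()                // the first line holds the column names
val cleaned = rdd
  .filter(line => line != header)       // drop the header row
  .filter(line => !line.contains("?"))  // drop rows with any missing value

val cleanedCount = cleaned.count()
println(s"Rows kept: $cleanedCount")
```

Comparing `cleanedCount` with the raw line count shows how many rows were discarded, which is worth recording before moving on.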

In the next step, convert the RDD[String] to an RDD of the case class we defined earlier, and then convert that RDD to a DataFrame of HouseholdEPC objects.

Display a few sample records in the DataFrame, and count the number of rows in it to verify that the number of rows in the DataFrame matches the expected number of rows in your input Dataset.
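The conversion and the verification can be sketched together as follows. The snippet repeats the HouseholdEPC case class so that it stands alone, and it uses two made-up, already-cleaned lines in place of the filtered RDD. The field names and the semicolon delimiter are assumptions about the input layout.

```scala
import org.apache.spark.sql.SparkSession

// Assumed record layout: date, time, and seven numeric readings.
case class HouseholdEPC(date: String, time: String, gap: Double, grp: Double,
  voltage: Double, gi: Double, sm_1: Double, sm_2: Double, sm_3: Double)

// In spark-shell, `spark` already exists and this setup can be skipped.
val spark = SparkSession.builder().appName("HouseholdEPC").master("local[*]").getOrCreate()
import spark.implicits._  // brings toDF() into scope for RDDs of case classes

// Made-up lines standing in for the cleaned RDD[String] (no header, no "?").
val cleaned = spark.sparkContext.parallelize(Seq(
  "16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000",
  "16/12/2006;17:25:00;5.360;0.436;233.630;23.000;0.000;1.000;16.000"
))

// Split each line on ';' and build one HouseholdEPC per row.
val hhEPCRdd = cleaned.map(_.split(";")).map { f =>
  HouseholdEPC(f(0).trim, f(1).trim, f(2).toDouble, f(3).toDouble, f(4).toDouble,
    f(5).toDouble, f(6).toDouble, f(7).toDouble, f(8).toDouble)
}
val hhEPCDF = hhEPCRdd.toDF()

hhEPCDF.show(5)               // print up to five sample records
val dfRows = hhEPCDF.count()  // row count of the DataFrame
assert(dfRows == cleaned.count(), "DataFrame row count should match the cleaned RDD")
```

In a compiled application (as opposed to spark-shell), the case class needs to be defined outside the method that calls `toDF()` so that Spark can derive an encoder for it.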
