
  • Learning Spark SQL
  • Aurobindo Sarkar
  • 181 words
  • 2021-07-02 18:23:46

Identifying missing data

Missing data can occur in Datasets for reasons ranging from negligence to a refusal on the part of respondents to provide a specific data point. Whatever the cause, missing data is common in real-world Datasets. It can create problems in data analysis and sometimes lead to wrong decisions or conclusions. Hence, it is very important to identify missing data and devise effective strategies to deal with it.

In this section, we analyze the number of records with missing data fields in our sample Dataset. To simulate missing data, we edit our sample Dataset by replacing fields containing "unknown" values with empty strings.
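The edit described above can be done in any text tool. As a minimal sketch, assuming a semicolon-delimited file whose missing entries are spelled exactly "unknown", the following plain-Python helper (not Spark) performs the replacement. The function name `blank_out_unknowns`, the delimiter, and the three-column sample are hypothetical and chosen only for illustration.

```python
import csv
import io


def blank_out_unknowns(src, dst, delimiter=";", marker="unknown"):
    """Copy delimited text from src to dst, replacing every field that is
    exactly equal to `marker` with an empty string.

    Returns the number of fields that were replaced.
    """
    reader = csv.reader(src, delimiter=delimiter)
    writer = csv.writer(dst, delimiter=delimiter, lineterminator="\n")
    replaced = 0
    for row in reader:
        cleaned = []
        for field in row:
            if field == marker:
                cleaned.append("")  # simulate a missing value
                replaced += 1
            else:
                cleaned.append(field)
        writer.writerow(cleaned)
    return replaced


# Made-up sample rows; the column names are assumptions, not the real schema.
sample = "age;job;education\n30;unknown;basic.9y\n41;admin.;unknown\n"
out = io.StringIO()
n_replaced = blank_out_unknowns(io.StringIO(sample), out)
edited = out.getvalue()
print(n_replaced)
print(edited)
```

For a real file, the two `io.StringIO` objects would be replaced by file handles opened with `newline=""`, as the `csv` module documentation recommends.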

First, we create a DataFrame/Dataset from our edited file, as shown:

The following two statements give us a count of rows with certain fields having missing data:
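As a rough cross-check outside Spark, the same kind of count can be sketched in plain Python over the edited file. This is not the Spark code the text refers to. It assumes a semicolon-delimited file with a header row, and the helper `count_rows_missing` and the column names `job` and `education` are hypothetical.

```python
import csv
import io


def count_rows_missing(src, columns, delimiter=";"):
    """Count data rows in which at least one of the named columns is empty."""
    reader = csv.DictReader(src, delimiter=delimiter)
    return sum(
        1
        for row in reader
        # Treat absent, None, and whitespace-only values as missing.
        if any((row.get(c) or "").strip() == "" for c in columns)
    )


# Made-up edited data: empty strings stand in for the former "unknown" values.
edited = (
    "age;job;education\n"
    "30;;basic.9y\n"
    "41;admin.;\n"
    "52;services;high.school\n"
)

# Two counts, one per question: a single field, then either of two fields.
missing_job = count_rows_missing(io.StringIO(edited), ["job"])
missing_any = count_rows_missing(io.StringIO(edited), ["job", "education"])
print(missing_job, missing_any)
```

Comparing such counts with the ones Spark reports is a quick way to confirm that the empty fields were read as missing values rather than as literal strings.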

In Chapter 4, Using Spark SQL for Data Munging, we will look at effective ways of dealing with missing data. In the next section, we will compute some basic statistics for our sample Dataset to improve our understanding of the data.
