
Using Spark SQL for basic data analysis

Interactively processing and visualizing large data is challenging, as queries can take a long time to execute and the visual interface cannot accommodate as many pixels as data points. Spark supports in-memory computations and a high degree of parallelism to achieve interactivity with large distributed data. In addition, Spark is capable of handling petabytes of data and provides a set of versatile programming interfaces and libraries. These include SQL, Scala, Python, Java, and R APIs, as well as libraries for distributed statistics and machine learning.

For data that fits into a single computer, there are many good tools available, such as R, MATLAB, and others. However, if the data does not fit into a single machine, or if it is very complicated to get the data to that machine, or if a single computer cannot easily process the data, then this section will offer some good tools and techniques for data exploration.

In this section, we will go through some basic data exploration exercises to understand a sample Dataset. We will use a Dataset containing data related to the direct marketing campaigns (phone calls) of a Portuguese banking institution. Specifically, we'll use the bank-additional-full.csv file, which contains 41,188 records with 20 input fields, ordered by date (from May 2008 to November 2010). The Dataset was contributed by S. Moro, P. Cortez, and P. Rita, and can be downloaded from https://archive.ics.uci.edu/ml/datasets/Bank+Marketing.

  1. As a first step, let's define a schema and read in the CSV file to create a DataFrame. You can use the :paste command to paste the initial set of statements into your Spark shell session (use Ctrl+D to exit paste mode), as shown:
  2. After the DataFrame has been created, we first verify the number of records:
  3. We can also define a case class called Call for our input records, and then create a strongly-typed Dataset, as follows:
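The code for the three steps above was not reproduced here, so the following is a sketch of one way to carry them out in the Spark shell. The column names, their types, and the input file path are assumptions based on the UCI Bank Marketing data dictionary, not taken from the original text:

```scala
import org.apache.spark.sql.types._

// Step 1: define a schema matching the 20 input fields plus the label column.
// Field names and types are assumed from the UCI data dictionary.
val recordSchema = StructType(Seq(
  StructField("age", IntegerType),        StructField("job", StringType),
  StructField("marital", StringType),     StructField("education", StringType),
  StructField("default", StringType),     StructField("housing", StringType),
  StructField("loan", StringType),        StructField("contact", StringType),
  StructField("month", StringType),       StructField("day_of_week", StringType),
  StructField("duration", DoubleType),    StructField("campaign", DoubleType),
  StructField("pdays", DoubleType),       StructField("previous", DoubleType),
  StructField("poutcome", StringType),    StructField("emp_var_rate", DoubleType),
  StructField("cons_price_idx", DoubleType), StructField("cons_conf_idx", DoubleType),
  StructField("euribor3m", DoubleType),   StructField("nr_employed", DoubleType),
  StructField("y", StringType)
))

// Read the CSV file into a DataFrame. The fields in this file are
// semicolon-separated; adjust the path to wherever you saved the file.
val df = spark.read
  .format("csv")
  .option("sep", ";")
  .option("header", "true")
  .schema(recordSchema)
  .load("file:///<path-to-data>/bank-additional-full.csv")

// Step 2: verify the number of records (the file has 41,188 rows).
df.count()

// Step 3: define a case class mirroring the schema, then convert the
// DataFrame to a strongly-typed Dataset[Call].
case class Call(age: Int, job: String, marital: String, education: String,
  default: String, housing: String, loan: String, contact: String,
  month: String, day_of_week: String, duration: Double, campaign: Double,
  pdays: Double, previous: Double, poutcome: String, emp_var_rate: Double,
  cons_price_idx: Double, cons_conf_idx: Double, euribor3m: Double,
  nr_employed: Double, y: String)

// In spark-shell, spark.implicits._ is imported automatically; in a
// standalone application you would need: import spark.implicits._
val ds = df.as[Call]
```

Supplying an explicit schema avoids a separate schema-inference pass over the file and guarantees the column types the case class expects.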

In the next section, we will begin our data exploration by identifying missing data in our Dataset.
