- Learning Spark SQL
- Aurobindo Sarkar
Using Spark SQL for basic data analysis
Interactively processing and visualizing large data is challenging: queries can take a long time to execute, and the visual interface cannot accommodate as many pixels as there are data points. Spark supports in-memory computation and a high degree of parallelism to achieve interactivity with large distributed data. In addition, Spark can handle petabytes of data and provides a set of versatile programming interfaces and libraries, including SQL, Scala, Python, Java, and R APIs, as well as libraries for distributed statistics and machine learning.
For data that fits on a single computer, there are many good tools available, such as R and MATLAB. However, if the data does not fit on a single machine, if it is complicated to move the data to that machine, or if a single computer cannot easily process it, then this section offers some good tools and techniques for data exploration.
In this section, we will go through some basic data exploration exercises to understand a sample Dataset. We will use a Dataset containing data related to the direct marketing campaigns (phone calls) of a Portuguese banking institution. We'll use the bank-additional-full.csv file, which contains 41,188 records and 20 input fields, ordered by date (from May 2008 to November 2010). The Dataset was contributed by S. Moro, P. Cortez, and P. Rita, and can be downloaded from https://archive.ics.uci.edu/ml/datasets/Bank+Marketing.
- As a first step, let's define a schema and read in the CSV file to create a DataFrame. You can use the :paste command to paste the initial set of statements into your Spark shell session (use Ctrl+D to exit paste mode), as shown:
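A minimal sketch of this step, assuming the Spark 2.x shell (where `spark` is already in scope) and column names adapted from the UCI field descriptions; the exact schema and file path used in the book may differ:

```scala
import org.apache.spark.sql.types._

// Schema for bank-additional-full.csv; field names are adapted from the
// UCI description (dots replaced by underscores).
val recordSchema = StructType(Seq(
  StructField("age", IntegerType),
  StructField("job", StringType),
  StructField("marital", StringType),
  StructField("education", StringType),
  StructField("default", StringType),
  StructField("housing", StringType),
  StructField("loan", StringType),
  StructField("contact", StringType),
  StructField("month", StringType),
  StructField("day_of_week", StringType),
  StructField("duration", IntegerType),
  StructField("campaign", IntegerType),
  StructField("pdays", IntegerType),
  StructField("previous", IntegerType),
  StructField("poutcome", StringType),
  StructField("emp_var_rate", DoubleType),
  StructField("cons_price_idx", DoubleType),
  StructField("cons_conf_idx", DoubleType),
  StructField("euribor3m", DoubleType),
  StructField("nr_employed", DoubleType),
  StructField("y", StringType)
))

// The file is semicolon-delimited and includes a header row.
val df = spark.read
  .format("csv")
  .option("sep", ";")
  .option("header", "true")
  .schema(recordSchema)
  .load("bank-additional-full.csv")
```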

- After the DataFrame has been created, we first verify the number of records:
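A quick check, assuming the `df` DataFrame created above:

```scala
// Count the input records; the full file should yield 41,188.
df.count()
```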

- We can also define a case class called Call for our input records, and then create a strongly-typed Dataset, as follows:
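A sketch of the typed conversion, assuming the field names from the schema above; the book's Call case class may name or select fields differently:

```scala
// Already imported in spark-shell; needed for the case-class Encoder.
import spark.implicits._

// Case class mirroring the schema fields defined earlier.
case class Call(age: Int, job: String, marital: String, education: String,
  default: String, housing: String, loan: String, contact: String,
  month: String, day_of_week: String, duration: Int, campaign: Int,
  pdays: Int, previous: Int, poutcome: String, emp_var_rate: Double,
  cons_price_idx: Double, cons_conf_idx: Double, euribor3m: Double,
  nr_employed: Double, y: String)

// Convert the untyped DataFrame into a strongly-typed Dataset[Call].
val ds = df.as[Call]
```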

In the next section, we will begin our data exploration by identifying missing data in our Dataset.