- Learning Spark SQL
- Aurobindo Sarkar
- 2021-07-02 18:23:44
Using Spark with JSON data
JSON is a simple, flexible, and compact format used extensively as a data-interchange format in web services. Spark SQL reads and writes JSON natively. You do not need to define a schema for JSON data, because Spark infers it automatically by scanning the input. In addition, Spark greatly simplifies the query syntax required to access fields in complex JSON data structures. We will present detailed examples of JSON data in Chapter 12, Spark SQL in Large-Scale Application Architectures.
The dataset for this example contains approximately 1.69 million Amazon reviews for the electronics category, and can be downloaded from: http://jmcauley.ucsd.edu/data/amazon/.
We can directly read a JSON dataset to create a Spark SQL DataFrame. By default, spark.read.json expects one JSON object per line. We will read in the review records from the downloaded JSON file:
scala> val reviewsDF = spark.read.json("file:///Users/aurobindosarkar/Downloads/reviews_Electronics_5.json")
You can print the schema of the newly created DataFrame to verify the fields and their characteristics using the printSchema method.
scala> reviewsDF.printSchema()
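To see schema inference without downloading the full dataset, the following is a minimal sketch on two made-up records. The record values, the application name, and the local master setting are assumptions for illustration only; the sketch also assumes Spark 2.2 or later, where spark.read.json accepts a Dataset[String]. In spark-shell, the spark session already exists and the builder lines can be skipped.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("json-schema-sketch") // hypothetical name
  .getOrCreate()
import spark.implicits._

// Two made-up records, one JSON object per line, loosely shaped like the reviews.
val sampleJson = Seq(
  """{"asin":"A1","overall":5.0,"helpful":[2,3],"reviewerID":"R1"}""",
  """{"asin":"A2","overall":2.0,"helpful":[0,1],"reviewerID":"R2"}"""
).toDS()

// No schema is supplied; Spark infers field names and types from the data.
val sampleDF = spark.read.json(sampleJson)
sampleDF.printSchema()
```

With this input, overall should come back as a double and helpful as an array of integral values, which is the same kind of information printSchema reports for the full reviews file.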

Once the JSON dataset is converted to a Spark SQL DataFrame, you can work with it extensively in a standard way. Next, we will execute an SQL statement to select certain columns from the reviews that have an overall rating of 3 or higher:
scala> reviewsDF.createOrReplaceTempView("reviewsTable")
scala> val selectedDF = spark.sql("SELECT asin, overall, reviewTime, reviewerID, reviewerName FROM reviewsTable WHERE overall >= 3")
Display the results of the SQL execution (stored in another DataFrame) using the show method, as follows:
scala> selectedDF.show()
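The same filter can also be written with the DataFrame DSL instead of SQL. The following sketch runs both forms against three made-up records so the results can be compared; the sample values, the view name sampleReviews, and the local session settings are assumptions for illustration, not part of the original example.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("json-sql-sketch") // hypothetical name
  .getOrCreate()
import spark.implicits._

// Made-up ratings standing in for the review data.
val sampleDF = spark.read.json(Seq(
  """{"asin":"A1","overall":5.0,"reviewerID":"R1"}""",
  """{"asin":"A2","overall":2.0,"reviewerID":"R2"}""",
  """{"asin":"A3","overall":3.0,"reviewerID":"R3"}"""
).toDS())

sampleDF.createOrReplaceTempView("sampleReviews")

// SQL form, mirroring the shape of the query in the text.
val viaSql = spark.sql("SELECT asin, overall FROM sampleReviews WHERE overall >= 3")

// DSL form of the same projection and filter.
val viaDsl = sampleDF.select($"asin", $"overall").where($"overall" >= 3)

viaSql.show()
viaDsl.show()
```

Both DataFrames should contain the same rows; which form to use is largely a matter of preference.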
We can access the array elements of the helpful column in the reviewsDF DataFrame (using the DSL) as shown. Here, getItem(0) returns the first element of the array:
scala> val selectedJSONArrayElementDF = reviewsDF.select($"asin", $"overall", $"helpful").where($"helpful".getItem(0) < 3)
scala> selectedJSONArrayElementDF.show()
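As a small sketch of the same technique, getItem can also be used to project individual array elements into their own named columns. The records below are made up, and the column names helpfulVotes and totalVotes reflect an assumed reading of the helpful array (helpful votes first, total votes second) rather than anything stated in this section.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("json-array-sketch") // hypothetical name
  .getOrCreate()
import spark.implicits._

// Made-up records with a two-element helpful array.
val sampleDF = spark.read.json(Seq(
  """{"asin":"A1","overall":5.0,"helpful":[2,3]}""",
  """{"asin":"A2","overall":4.0,"helpful":[7,9]}"""
).toDS())

// getItem(i) reads the i-th element of an array column (zero-based).
// The aliases are assumptions about what each position means.
val fewVotesDF = sampleDF
  .select(
    $"asin",
    $"helpful".getItem(0).as("helpfulVotes"),
    $"helpful".getItem(1).as("totalVotes"))
  .where($"helpfulVotes" < 3)

fewVotesDF.show()
```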

An example of writing out a DataFrame as a JSON file was presented in an earlier section where we selected the columns of interest from the DataFrame (containing data other than for the current month), and wrote them out to the HDFS filesystem in JSON format.
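For reference, writing a DataFrame out as JSON can be sketched as follows. This is not the earlier example itself: it uses made-up records and a temporary local directory (an assumption for illustration) instead of HDFS, and then reads the output back to confirm the round trip.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("json-write-sketch") // hypothetical name
  .getOrCreate()
import spark.implicits._

// Made-up records standing in for the review data.
val sampleDF = spark.read.json(Seq(
  """{"asin":"A1","overall":5.0,"reviewerID":"R1"}""",
  """{"asin":"A2","overall":2.0,"reviewerID":"R2"}"""
).toDS())

// Hypothetical output location; an hdfs:// path would be used on a cluster.
val outDir = Files.createTempDirectory("reviews-json-sketch").resolve("out").toString

// Spark writes a directory of part files, one JSON object per line.
sampleDF.select($"asin", $"overall").write.mode("overwrite").json(outDir)

// Read the directory back to check what was written.
val roundTrip = spark.read.json(outDir)
roundTrip.show()
```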