Title: Learning Spark SQL
Author: Aurobindo Sarkar
Chapter word count: 310
Last updated: 2021-07-02 18:23:44

Using Spark with JSON data

JSON is a simple, flexible, and compact format used extensively for data interchange in web services. Spark SQL supports JSON natively: there is no need to define a schema for the JSON data, because the schema is inferred automatically. In addition, Spark greatly simplifies the query syntax required to access fields in complex JSON data structures. We will present detailed examples of JSON data in Chapter 12, Spark SQL in Large-Scale Application Architectures.
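
As a small illustration of schema inference and nested-field access, the following sketch builds a DataFrame from two in-memory JSON strings. The records, the `address` field, and the application name are made up for this example and are not part of the Amazon dataset; the session setup is only needed outside spark-shell, where `spark` already exists.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session named `spark` already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().master("local[*]").appName("json-nested-sketch").getOrCreate()
import spark.implicits._

// Two hypothetical records with a nested "address" object.
val rawJson = Seq(
  """{"name": "Ann", "address": {"city": "Pune", "zip": "411001"}}""",
  """{"name": "Bob", "address": {"city": "Oslo", "zip": "0150"}}"""
).toDS()

// No schema is supplied; Spark infers it, including the nested struct.
val peopleDF = spark.read.json(rawJson)
peopleDF.printSchema()

// Nested fields are addressed with dot notation.
val cities = peopleDF.select($"name", $"address.city").as[(String, String)].collect()
```

The same dot notation works inside SQL statements once the DataFrame is registered as a temporary view.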

The dataset for this example contains approximately 1.69 million Amazon reviews for the electronics category, and can be downloaded from: http://jmcauley.ucsd.edu/data/amazon/.

We can read a JSON dataset directly to create a Spark SQL DataFrame. Here, we read the review records from the downloaded JSON file:

scala> val reviewsDF = spark.read.json("file:///Users/aurobindosarkar/Downloads/reviews_Electronics_5.json")

You can print the schema of the newly created DataFrame to verify the fields and their characteristics using the printSchema method.

scala> reviewsDF.printSchema()
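
Inference is convenient, but it requires Spark to scan the input to work out the field types. If the structure is known in advance, a schema can be supplied instead. The following is a minimal sketch under stated assumptions: the field subset and types in `reviewSchema` are assumed rather than taken from the book and should be checked against the `printSchema()` output, and the single sample record is made up.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

// In spark-shell a session named `spark` already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().master("local[*]").appName("json-schema-sketch").getOrCreate()
import spark.implicits._

// Assumed subset of the review fields and their types.
val reviewSchema = StructType(Seq(
  StructField("asin", StringType),
  StructField("helpful", ArrayType(LongType)),
  StructField("overall", DoubleType),
  StructField("reviewerID", StringType)
))

// One made-up record standing in for a line of the reviews file.
val sample = Seq(
  """{"asin": "A1", "helpful": [2, 5], "overall": 4.0, "reviewerID": "R1", "summary": "ok"}"""
).toDS()

// With an explicit schema, Spark skips inference; fields not listed (here, summary) are dropped.
val typedDF = spark.read.schema(reviewSchema).json(sample)
typedDF.printSchema()
val fieldNames = typedDF.schema.fieldNames.toSeq
```

For the full dataset, the same `schema(...)` call would precede `json(...)` with the file path used earlier.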

Once the JSON dataset is converted to a Spark SQL DataFrame, you can work with it in the standard way. Next, we register the DataFrame as a temporary view and execute an SQL statement that selects certain columns from the reviews with an overall rating of 3 or higher:

scala> reviewsDF.createOrReplaceTempView("reviewsTable")
scala> val selectedDF = spark.sql("SELECT asin, overall, reviewTime, reviewerID, reviewerName FROM reviewsTable WHERE overall >= 3")

Display the results of the SQL execution (stored in another DataFrame) using the show method, as follows:

scala> selectedDF.show()

We can access the elements of the helpful array column in the reviewsDF DataFrame (using the DSL) as shown:

scala> val selectedJSONArrayElementDF = reviewsDF.select($"asin", $"overall", $"helpful").where($"helpful".getItem(0) < 3)

scala> selectedJSONArrayElementDF.show()
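
Array elements obtained with getItem can also be combined in expressions. The sketch below computes a helpfulness ratio on a few made-up rows; it assumes that helpful holds [helpful votes, total votes], and the column name `helpfulRatio` is introduced here for illustration only.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session named `spark` already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().master("local[*]").appName("json-array-sketch").getOrCreate()
import spark.implicits._

// Made-up rows; helpful is assumed to be [helpful votes, total votes].
val sampleDF = Seq(
  ("A1", 5.0, Seq(2L, 4L)),
  ("A2", 3.0, Seq(0L, 0L)),
  ("A3", 4.0, Seq(3L, 4L))
).toDF("asin", "overall", "helpful")

val ratioDF = sampleDF
  .where($"helpful".getItem(1) > 0) // skip reviews with no votes to avoid dividing by zero
  .select($"asin", ($"helpful".getItem(0) / $"helpful".getItem(1)).as("helpfulRatio"))

ratioDF.show()
val ratios = ratioDF.as[(String, Double)].collect().toMap
```

Filtering on the second element first keeps reviews without any votes out of the division.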

An example of writing out a DataFrame as a JSON file was presented in an earlier section where we selected the columns of interest from the DataFrame (containing data other than for the current month), and wrote them out to the HDFS filesystem in JSON format.
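
For reference, a write of that kind can be sketched as follows. This is not the earlier example itself: the rows are made up, and a local temporary directory (a hypothetical location) stands in for the HDFS path.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// In spark-shell a session named `spark` already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().master("local[*]").appName("json-write-sketch").getOrCreate()
import spark.implicits._

// Made-up rows standing in for the selected columns.
val smallDF = Seq(("A1", 5.0), ("A2", 3.0)).toDF("asin", "overall")

// Hypothetical output location: a fresh temp directory instead of an HDFS path.
val outDir = Files.createTempDirectory("reviews-json").resolve("out").toUri.toString

// Spark writes a directory of part files, with one JSON object per line.
smallDF.write.mode("overwrite").json(outDir)

// Reading the directory back confirms the round trip.
val roundTrip = spark.read.json(outDir)
val rowCount = roundTrip.count()
```

Replacing `outDir` with an `hdfs://` URI would direct the same call at HDFS, assuming the cluster is configured for it.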
