
Using Spark with JSON data

JSON is a simple, flexible, and compact format used extensively for data interchange in web services. Spark has built-in support for JSON: you do not need to define a schema for JSON data, because Spark infers it automatically. In addition, Spark greatly simplifies the query syntax required to access fields in complex JSON data structures. We will present detailed examples of JSON data in Chapter 12, Spark SQL in Large-Scale Application Architectures.
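As a minimal sketch of schema inference and nested-field access, the following snippet reads two hypothetical order records (invented here for illustration, not part of the reviews dataset) and flattens them. It assumes Spark 2.2 or later, where spark.read.json accepts a Dataset[String], and it can be pasted into spark-shell.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the spark-shell session if one exists; otherwise starts a local one.
val spark = SparkSession.builder().appName("json-nested-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical records, invented for illustration only.
val orders = Seq(
  """{"orderId": 1, "customer": {"name": "Ann", "address": {"city": "Pune"}}, "items": ["disk", "cable"]}""",
  """{"orderId": 2, "customer": {"name": "Raj", "address": {"city": "Delhi"}}, "items": ["lamp"]}"""
).toDS()

// No schema is supplied; Spark infers it by scanning the JSON strings.
val ordersDF = spark.read.json(orders)
ordersDF.printSchema()

// Dot notation reaches into nested structs; getItem indexes into an array.
val flatDF = ordersDF.select(
  $"orderId",
  $"customer.name".as("customerName"),
  $"customer.address.city".as("city"),
  $"items".getItem(0).as("firstItem"))
flatDF.show()
```

The nested customer object is inferred as a struct and items as an array, so no parsing code is needed to reach either.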

The dataset for this example contains approximately 1.69 million Amazon reviews for the electronics category, and can be downloaded from: http://jmcauley.ucsd.edu/data/amazon/.

We can read a JSON dataset directly to create a Spark SQL DataFrame. We will read in the review records from a JSON file:

scala> val reviewsDF = spark.read.json("file:///Users/aurobindosarkar/Downloads/reviews_Electronics_5.json")

You can print the schema of the newly created DataFrame to verify the fields and their characteristics using the printSchema method. 

scala> reviewsDF.printSchema()

Once the JSON dataset is converted to a Spark SQL DataFrame, you can work with it in the standard way. Next, we will register the DataFrame as a temporary view and execute an SQL statement to select certain columns from the reviews that have an overall rating of 3 or higher:

scala> reviewsDF.createOrReplaceTempView("reviewsTable")
scala> val selectedDF = spark.sql("SELECT asin, overall, reviewTime, reviewerID, reviewerName FROM reviewsTable WHERE overall >= 3")

Display the results of the SQL execution (stored in another DataFrame) using the show method, as follows:

scala> selectedDF.show()
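The same projection and filter can also be written with the DataFrame DSL instead of SQL. The sketch below compares the two forms on a few made-up rows that mimic the review fields; the sample values and the view name sampleReviewsTable are assumptions for illustration, and the snippet is meant to be pasted into spark-shell (Spark 2.2 or later).

```scala
import org.apache.spark.sql.SparkSession

// Reuses the spark-shell session if one exists; otherwise starts a local one.
val spark = SparkSession.builder().appName("json-dsl-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Made-up rows shaped like the review fields used in the query above.
val sampleReviews = Seq(
  """{"asin":"B0001","overall":5.0,"reviewTime":"01 5, 2014","reviewerID":"U1","reviewerName":"Ann"}""",
  """{"asin":"B0002","overall":2.0,"reviewTime":"03 9, 2013","reviewerID":"U2","reviewerName":"Raj"}""",
  """{"asin":"B0003","overall":3.0,"reviewTime":"07 1, 2012","reviewerID":"U3","reviewerName":"Lee"}"""
).toDS()
val sampleDF = spark.read.json(sampleReviews)

// SQL form, as in the text.
sampleDF.createOrReplaceTempView("sampleReviewsTable")
val viaSQL = spark.sql(
  "SELECT asin, overall, reviewTime, reviewerID, reviewerName " +
  "FROM sampleReviewsTable WHERE overall >= 3")

// DSL form of the same projection and filter.
val viaDSL = sampleDF
  .select("asin", "overall", "reviewTime", "reviewerID", "reviewerName")
  .where($"overall" >= 3)

viaSQL.show()
viaDSL.show()
```

Both forms return the same rows here; which one to use is largely a matter of preference.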

We can access the array elements of the helpful column in the reviewsDF DataFrame (using DSL) as shown:

scala> val selectedJSONArrayElementDF = reviewsDF.select($"asin", $"overall", $"helpful").where($"helpful".getItem(0) < 3)

scala> selectedJSONArrayElementDF.show()
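Array elements can also be projected into named columns, which makes later filters easier to read. The following sketch uses made-up rows and assumes that helpful holds two numbers, the helpful votes followed by the total votes; that reading of the column, and the names helpfulVotes and totalVotes, are assumptions for illustration. It is meant to be pasted into spark-shell (Spark 2.2 or later).

```scala
import org.apache.spark.sql.SparkSession

// Reuses the spark-shell session if one exists; otherwise starts a local one.
val spark = SparkSession.builder().appName("json-array-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Made-up rows; helpful is assumed to be [helpful votes, total votes].
val sample = Seq(
  """{"asin":"B0001","overall":5.0,"helpful":[0,0]}""",
  """{"asin":"B0002","overall":4.0,"helpful":[7,9]}""",
  """{"asin":"B0003","overall":1.0,"helpful":[2,5]}"""
).toDS()
val sampleDF = spark.read.json(sample)

// Pull each array position into its own named column, then filter on it.
val votesDF = sampleDF.select(
  $"asin",
  $"helpful".getItem(0).as("helpfulVotes"),
  $"helpful".getItem(1).as("totalVotes")
).where($"helpfulVotes" < 3)
votesDF.show()

// Column.apply is shorthand for getItem on an array column.
val shorthandDF = sampleDF.where($"helpful"(0) < 3)
shorthandDF.show()
```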

An example of writing out a DataFrame as a JSON file was presented in an earlier section where we selected the columns of interest from the DataFrame (containing data other than for the current month), and wrote them out to the HDFS filesystem in JSON format.
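For reference, a write of that kind can be sketched as follows. This is not the earlier example itself: it uses a two-row made-up DataFrame and a temporary local directory in place of an HDFS path, both of which are assumptions for illustration. It is meant to be pasted into spark-shell.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Reuses the spark-shell session if one exists; otherwise starts a local one.
val spark = SparkSession.builder().appName("json-write-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Made-up rows standing in for the selected columns.
val sampleDF = Seq(("B0001", 5.0), ("B0002", 3.0)).toDF("asin", "overall")

// A temporary local directory stands in for an HDFS location.
val outputPath = Files.createTempDirectory("reviews-json-sketch").resolve("selected").toString

// Spark writes a directory of part files, with one JSON object per line.
sampleDF.write.mode("overwrite").json(outputPath)

// Reading the directory back confirms the round trip.
val roundTripDF = spark.read.json(outputPath)
roundTripDF.show()
```

Note that the output path names a directory rather than a single file, because each partition of the DataFrame is written as its own part file.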
