Spark SQL

Spark SQL is the Spark component for working with structured and semi-structured data, such as Hive tables, MySQL tables, Parquet files, Avro files, JSON files, CSV files, and more. Hive is an alternative for processing structured data: it processes data stored on HDFS using Hive Query Language (HQL) and internally uses MapReduce for its processing, and we shall see how Spark can deliver better performance than MapReduce. In early versions of Spark, structured data was represented as a SchemaRDD (a type of RDD that carries a schema). When data comes with a schema, SQL becomes the first choice for processing it, and Spark SQL enables developers to process such data with Structured Query Language (SQL).

Using Spark SQL, business logic can be written in SQL and HQL. This enables data warehouse engineers with a good knowledge of SQL to use Spark for their extract, transform, load (ETL) processing. Hive projects can be migrated to Spark using Spark SQL, often without changing the Hive scripts.

Spark SQL is also well suited to data analysis and data warehousing. It enables data analysts to write ad hoc queries for exploratory analysis. Spark provides a Spark SQL shell, where you can run SQL-like queries that are executed on Spark. Spark internally converts the query into a chain of RDD computations, whereas Hive converts an HQL job into a series of MapReduce jobs. Using Spark SQL, developers can also make use of caching (a Spark feature that keeps data in memory), which can significantly improve the performance of queries that reuse the same data.
