
Understanding data sources in Spark applications

Spark can connect to many different data sources, including files and both SQL and NoSQL databases. Some of the more popular data sources include files (CSV, JSON, Parquet, Avro), MySQL, MongoDB, HBase, and Cassandra.

Spark can also connect to special-purpose engines and data sources, such as Elasticsearch, Apache Kafka, and Redis. These engines enable specific functionality in Spark applications, such as search, streaming, and caching. For example, Redis enables deployment of cached machine learning models in high-performance applications. We discuss Redis-based application deployment further in Chapter 12, Spark SQL in Large-Scale Application Architectures. Kafka is extremely popular in Spark streaming applications, and we cover Kafka-based streaming applications in more detail in Chapter 5, Using Spark SQL in Streaming Applications, and Chapter 12, Spark SQL in Large-Scale Application Architectures. The DataSource API enables Spark connectivity to a wide variety of data sources, including custom data sources.
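
As a rough illustration of how one reader interface can cover both files and databases, the following Python sketch describes each read as a format name, a set of options, and an optional path. The helper names reader_spec and load_with_spark, the file path, and the JDBC URL are hypothetical and are not taken from this book; the format names and option keys are ones that Spark's built-in readers accept. Only load_with_spark needs a running SparkSession.

```python
from typing import Dict, Optional

# Format names that Spark's built-in file readers accept.
FILE_FORMATS = {"csv", "json", "parquet"}


def reader_spec(source: str, location: str,
                options: Optional[Dict[str, str]] = None) -> dict:
    """Describe a read as a format name, reader options, and an optional path.

    Hypothetical helper for illustration; it is not part of Spark.
    """
    opts = dict(options or {})
    if source in FILE_FORMATS:
        # File sources are addressed by a path.
        return {"format": source, "options": opts, "path": location}
    if source == "jdbc":
        # JDBC sources are addressed by the "url" option rather than a path.
        opts["url"] = location
        return {"format": "jdbc", "options": opts, "path": None}
    raise ValueError(f"unsupported source in this sketch: {source}")


def load_with_spark(spark, spec: dict):
    """Apply a spec to an existing SparkSession (requires pyspark)."""
    reader = spark.read.format(spec["format"]).options(**spec["options"])
    if spec["path"] is None:
        return reader.load()
    return reader.load(spec["path"])


# Assumed example locations; replace with real ones before loading.
csv_spec = reader_spec("csv", "data/people.csv", {"header": "true"})
jdbc_spec = reader_spec("jdbc", "jdbc:mysql://localhost:3306/demo",
                        {"dbtable": "people"})
print(csv_spec)
print(jdbc_spec)
```

With a SparkSession available, load_with_spark(spark, csv_spec) would return a DataFrame. The JDBC case additionally assumes that the matching database driver is on the classpath.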

Refer to the Spark Packages website (https://spark-packages.org/) to find packages for working with various data sources, algorithms, and specialized Datasets.

In Chapter 1, Getting Started with Spark SQL, we used CSV and JSON files on our filesystem as input data sources and used SQL to query them. However, using Spark SQL to query data residing in files is not a replacement for using databases. Initially, some people used HDFS as a data source because querying such data with Spark SQL is simple and easy. However, execution performance can vary significantly depending on the queries being executed and the nature of the workloads. Architects and developers need to understand which data stores best meet their processing requirements. We discuss some high-level considerations for selecting Spark data sources below.
