- Learning Spark SQL
- Aurobindo Sarkar
Selecting Spark data sources
Filesystems are a great place to dump large volumes of data and to support general-purpose processing of large Datasets. Some of the benefits of using files are inexpensive storage, flexible processing, and scale. The decision to store large-scale data in files is usually driven by the prohibitive cost of storing the same data in commercial databases. File storage is also preferred when the nature of the data does not benefit from typical database optimizations, for example, unstructured data. Finally, workloads such as machine learning applications, with iterative in-memory processing requirements and distributed algorithms, may be better suited to run on distributed filesystems.
The types of data you would typically store on filesystems are archival data, unstructured data, massive social media and other web-scale Datasets, and backup copies of primary data stores. The types of workloads best supported on files are batch workloads, exploratory data analysis, multistage processing pipelines, and iterative workloads. Popular use cases for files include ETL pipelines and splicing data across varied sources, such as log files, CSV, Parquet, zipped file formats, and so on. In addition, you can choose to store the same data in multiple formats, each optimized for your specific processing requirements.
Where Spark connected to a filesystem is not so great is in use cases involving frequent random access, frequent inserts, frequent/incremental updates, and reporting or search operations under heavy load from many concurrent users. These use cases are discussed in more detail as we move on.
Queries selecting a small subset of records from your distributed storage are supported in Spark but are not very efficient, because they typically require Spark to go through all of your files to find your result row(s). This may be acceptable for data exploration tasks but not for sustained processing loads from several concurrent users. If you need to access your data frequently and randomly, using a database can be a more effective solution. Making the data available to your users through a traditional SQL database and creating indexes on the key columns can better support this use case. Alternatively, key-value NoSQL stores can also retrieve the value of a key far more efficiently.
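The difference is easy to see in code. The following is a minimal sketch (the paths, database connection details, and the user_id column are hypothetical): the file-based read has to scan every file to satisfy the filter, whereas the JDBC read can push the same predicate down to an indexed database table.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("point-lookup").getOrCreate()

// Full scan over all Parquet files just to find a handful of rows
val fromFiles = spark.read.parquet("/data/events").filter("user_id = 42")

// The same lookup against an indexed database table via JDBC;
// the simple predicate is pushed down to the database
val fromDb = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/analytics")
  .option("dbtable", "events")
  .option("user", "spark")
  .option("password", "secret")
  .load()
  .filter("user_id = 42")
```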
As each insert creates a new file, inserts are reasonably fast; however, querying becomes slow, as Spark jobs need to open all of these files and read from them to answer queries. Again, a database may be a much better solution for supporting frequent inserts. Alternatively, you can routinely compact your Spark SQL table files to reduce the overall number of files: select * from the table to build a DataFrame over the many input files and use coalesce to write the data out to a single, combined output file, as in the sketch below.
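A minimal compaction sketch (the table name and output path are hypothetical; the same pattern works with spark.read over a directory of files):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("compact-files").getOrCreate()

// Build a DataFrame over the many small files behind the table
val events = spark.sql("SELECT * FROM events")

events
  .coalesce(1)                       // collapse to a single partition, hence a single output file
  .write
  .mode("overwrite")
  .parquet("/data/events_compacted")
```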
Other operations and use cases, such as frequent/incremental updates, reporting, and searching, are better handled using databases or specialized engines. Files are not optimized for updating random rows, whereas databases are ideal for executing efficient update operations. You can connect Spark to HDFS and use BI tools, such as Tableau, but it is better to dump the data into a database for serving concurrent users under load. Typically, it is better to use Spark to read the data, perform aggregations, and so on, and then write the results out to a database that serves end users, as sketched below. In the search use case, Spark would need to go through each row to find and return the search results, thereby impacting performance. Here, a specialized engine such as Elasticsearch or Apache Solr may be a better solution than Spark.
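A minimal sketch of that pattern (the paths, grouping columns, and JDBC connection details are hypothetical): Spark does the heavy aggregation over the files, and only the small result set lands in the database that BI users query.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("serve-aggregates").getOrCreate()

// Aggregate the raw files with Spark
val dailyCounts = spark.read.parquet("/data/events")
  .groupBy("event_date", "event_type")
  .count()

// Write the much smaller result to a database that serves concurrent end users
dailyCounts.write
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/reporting")
  .option("dbtable", "daily_event_counts")
  .option("user", "spark")
  .option("password", "secret")
  .mode("overwrite")
  .save()
```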
In cases where the data is heavily skewed, or to execute faster joins on a cluster, we can use CLUSTER BY or bucketing techniques to improve performance, as in the sketch that follows.
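A minimal sketch of both techniques (the table names, bucket count, and join key are hypothetical): bucketing both sides of a frequent join on the join key lets Spark join bucket-to-bucket instead of shuffling the full Datasets, while CLUSTER BY redistributes and sorts rows by the key within each partition.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bucketed-join").getOrCreate()

// Bucket (and sort) both sides of a frequent join on the join key
spark.read.parquet("/data/orders")
  .write
  .bucketBy(32, "customer_id")
  .sortBy("customer_id")
  .mode("overwrite")
  .saveAsTable("orders_bucketed")

spark.read.parquet("/data/customers")
  .write
  .bucketBy(32, "customer_id")
  .sortBy("customer_id")
  .mode("overwrite")
  .saveAsTable("customers_bucketed")

// CLUSTER BY groups and sorts rows with the same key together
val clustered = spark.sql(
  "SELECT customer_id, amount FROM orders_bucketed CLUSTER BY customer_id")
```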