Mastering Apache Spark 2.x (Second Edition)
Romeo Kienzler
Coding
Try to tune your code to improve the performance of your Spark application. For instance, filter your data as early as possible in the ETL cycle. For example, when working with raw HTML files, detag them and crop away the unneeded parts at an early stage. Also tune your degree of parallelism, identify the resource-expensive parts of your code, and look for alternatives.
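A minimal sketch of such early filtering follows, assuming a hypothetical JSON input of crawled pages with url and rawHtml columns and made-up HDFS paths: unusable rows are dropped and markup is stripped before any expensive transformation or shuffle takes place.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object EarlyFilterExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("early-filter")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical staging area holding crawled pages as JSON.
    val pages = spark.read.json("hdfs:///staging/pages.json")

    val cleaned = pages
      // Drop rows we will never need before doing any expensive work.
      .filter($"rawHtml".isNotNull && length($"rawHtml") > 0)
      // Crude detagging: strip HTML markup early so downstream stages
      // shuffle plain text instead of full HTML documents.
      .withColumn("text", regexp_replace($"rawHtml", "<[^>]*>", " "))
      // Keep only the columns the downstream job actually uses.
      .select($"url", $"text")
      // The degree of parallelism could be tuned here as well,
      // for example with repartition() or coalesce().

    cleaned.write.mode("overwrite").parquet("hdfs:///clean/pages.parquet")
    spark.stop()
  }
}
```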
ETL is one of the first things you do in an analytics project: you grab data from third-party systems, either by directly accessing relational or NoSQL databases or by reading exports in various file formats such as CSV, TSV, JSON, or even more exotic ones, from local or remote filesystems or from a staging area in HDFS. After some inspections and sanity checks on the files, an ETL process in Apache Spark basically reads them in and creates RDDs or DataFrames/Datasets out of them.
These are then transformed so that they fit the downstream analytics application running on top of Apache Spark, or other applications, and are stored back to filesystems as JSON, CSV, or Parquet files, or even back to relational or NoSQL databases.
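As a sketch of that read-transform-store shape, the following assumes a hypothetical CSV export with a header row, made-up columns (id, amount, ts), and made-up paths; a real job would declare an explicit schema rather than inferring one.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object CsvToParquetEtl {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-to-parquet-etl")
      .getOrCreate()
    import spark.implicits._

    // Read the third-party export; schema inference keeps the sketch short.
    val raw = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///staging/export.csv")

    // Sanity checks and transformations so the data fits the
    // downstream analytics application.
    val shaped = raw
      .filter($"amount".isNotNull)
      .withColumn("ts", to_timestamp($"ts"))

    // Store the result back as Parquet for downstream consumers.
    shaped.write.mode("overwrite").parquet("hdfs:///warehouse/export.parquet")
    spark.stop()
  }
}
```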
Finally, I can recommend the following resource for any performance-related problems with Apache Spark: https://spark.apache.org/docs/latest/tuning.html.