- PySpark Cookbook
- Denny Lee Tomasz Drabas
- 2021-06-18 19:06:42
Introduction
In this chapter, we will explore the current fundamental data structure in Spark: the DataFrame. DataFrames take advantage of the developments in Project Tungsten and the Catalyst Optimizer. These two improvements bring the performance of PySpark on par with that of either Scala or Java.
Project Tungsten is a set of improvements to the Spark engine aimed at bringing its execution process closer to the bare metal. The main deliverables include:
- Code generation at runtime: This leverages the optimizations implemented in modern compilers
- Taking advantage of the memory hierarchy: The algorithms and data structures exploit the memory hierarchy for fast execution
- Direct memory management: This removes the overhead associated with Java garbage collection and with JVM object creation and management
- Low-level programming: This speeds up memory access by keeping intermediate data in CPU registers
- Elimination of virtual function dispatches: This reduces the number of CPU calls needed to process each row
See this Databricks blog post for more information: https://www.databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html.
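To make the idea of runtime code generation and dispatch elimination from the list above more concrete, here is a minimal pure-Python sketch. It is a toy analogy only and not how Spark implements Tungsten: the node classes (`Col`, `Lit`, `Add`, `Mul`) and the helpers (`interpret`, `to_source`, `generate`) are hypothetical names invented for this illustration. The interpreter walks the expression tree and dispatches on node type for every row, while the generated version builds the source for one flat function once and then reuses it.

```python
# Toy illustration only: NOT Spark's actual Tungsten implementation.
# All class and function names below are hypothetical.
from dataclasses import dataclass


@dataclass
class Col:
    name: str


@dataclass
class Lit:
    value: float


@dataclass
class Add:
    left: object
    right: object


@dataclass
class Mul:
    left: object
    right: object


def interpret(expr, row):
    """Evaluate by walking the tree: one type dispatch per node, per row."""
    if isinstance(expr, Col):
        return row[expr.name]
    if isinstance(expr, Lit):
        return expr.value
    if isinstance(expr, Add):
        return interpret(expr.left, row) + interpret(expr.right, row)
    if isinstance(expr, Mul):
        return interpret(expr.left, row) * interpret(expr.right, row)
    raise TypeError(f"unknown node: {expr!r}")


def to_source(expr):
    """Translate the tree into a fragment of Python source code."""
    if isinstance(expr, Col):
        return f"row[{expr.name!r}]"
    if isinstance(expr, Lit):
        return repr(expr.value)
    if isinstance(expr, Add):
        return f"({to_source(expr.left)} + {to_source(expr.right)})"
    if isinstance(expr, Mul):
        return f"({to_source(expr.left)} * {to_source(expr.right)})"
    raise TypeError(f"unknown node: {expr!r}")


def generate(expr):
    """Compile the whole tree once into a single flat function."""
    source = f"lambda row: {to_source(expr)}"
    return eval(compile(source, "<generated>", "eval"))


# price * qty + 1.5, evaluated both ways over the same rows
expr = Add(Mul(Col("price"), Col("qty")), Lit(1.5))
rows = [{"price": 2.0, "qty": 3}, {"price": 10.0, "qty": 1}]

compiled = generate(expr)
interpreted_results = [interpret(expr, r) for r in rows]
generated_results = [compiled(r) for r in rows]
```

Both approaches return the same values; the difference is that the generated function performs no per-node type checks at evaluation time, which is the kind of overhead that matters when an expression is evaluated billions of times.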
The Catalyst Optimizer sits at the core of Spark SQL and powers both SQL queries and DataFrame operations. The process starts when a query is issued to the engine. The logical execution plan is optimized first. Based on the optimized logical plan, multiple physical plans are derived and passed through a cost optimizer. The most cost-efficient plan is then selected and translated (using the code generation optimizations implemented as part of Project Tungsten) into optimized RDD-based execution code.
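The flow described above (optimize the logical plan, derive several physical plans, pick the cheapest) can be sketched in a few lines of plain Python. This is a simplified model under stated assumptions, not Catalyst's actual API or rule set: the plan representation, the function names (`push_down_filter`, `physical_candidates`, `estimate_cost`), and the cost formulas are all made up for illustration.

```python
# Toy model of the optimizer flow: NOT Catalyst's real API, rules, or costs.
# Plans are plain lists of operator names; all names are hypothetical.
import math


def push_down_filter(plan):
    """Logical rule: run 'filter' before 'join' so the join sees fewer rows."""
    plan = list(plan)
    if ("filter" in plan and "join" in plan
            and plan.index("filter") > plan.index("join")):
        plan.remove("filter")
        plan.insert(plan.index("join"), "filter")
    return plan


def physical_candidates(logical_plan):
    """Derive one physical plan per (assumed) join strategy."""
    strategies = ["broadcast_hash_join", "sort_merge_join"]
    return [
        [strategy if op == "join" else op for op in logical_plan]
        for strategy in strategies
    ]


def estimate_cost(physical_plan, left_rows, right_rows, selectivity):
    """Made-up cost model: rows touched, with arbitrary weights."""
    cost = 0.0
    for op in physical_plan:
        if op == "filter":
            cost += left_rows                 # scan every row once
            left_rows = left_rows * selectivity
        elif op == "broadcast_hash_join":
            cost += left_rows + right_rows * 10   # arbitrary broadcast penalty
        elif op == "sort_merge_join":
            cost += left_rows * math.log2(max(left_rows, 2))
            cost += right_rows * math.log2(max(right_rows, 2))
    return cost


# 1. Start from the query as written, 2. optimize the logical plan,
# 3. enumerate physical plans, 4. keep the cheapest one.
logical = ["scan", "join", "filter"]
optimized = push_down_filter(logical)
candidates = physical_candidates(optimized)
best = min(candidates, key=lambda p: estimate_cost(p, 1_000_000, 100, 0.01))
```

With a large left table and a small right table, this toy cost model prefers the broadcast-style join. In PySpark itself, calling `explain(True)` on a DataFrame prints the logical and physical plans that the real optimizer produced for that query.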