
  • PySpark Cookbook
  • Denny Lee Tomasz Drabas
  • 239 words
  • 2021-06-18 19:06:42

Introduction

In this chapter, we will explore DataFrames, the current fundamental data structure in Spark. DataFrames take advantage of the developments in Project Tungsten and the Catalyst Optimizer. These two improvements bring the performance of PySpark on par with that of Scala or Java.

Project Tungsten is a set of improvements to the Spark engine aimed at bringing its execution closer to the bare metal. The main deliverables include:

  • Code generation at runtime: Leverages the optimizations implemented in modern compilers
  • Taking advantage of the memory hierarchy: Algorithms and data structures exploit the memory hierarchy for fast execution
  • Direct memory management: Removes the overhead associated with Java garbage collection and JVM object creation and management
  • Low-level programming: Speeds up memory access by keeping intermediate data in CPU registers
  • Elimination of virtual function dispatches: Reduces the number of CPU calls
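
The idea behind runtime code generation and the removal of virtual function dispatches can be illustrated with a toy analogy in plain Python. This is only a sketch, not how Spark itself works internally: the classes `Col`, `Lit`, `Add`, and `Mul` and the helper `compile_expr` are hypothetical names invented for this illustration. The interpreted path calls an `eval` method on every node for every row, while the generated path builds one flat function once and reuses it.

```python
from dataclasses import dataclass


@dataclass
class Col:
    """Reference to a named column in a row (hypothetical helper)."""
    name: str

    def eval(self, row):
        return row[self.name]

    def gen(self):
        return f"row[{self.name!r}]"


@dataclass
class Lit:
    """Constant value (hypothetical helper)."""
    value: int

    def eval(self, row):
        return self.value

    def gen(self):
        return repr(self.value)


@dataclass
class Add:
    left: object
    right: object

    def eval(self, row):
        # One method dispatch per child, per row.
        return self.left.eval(row) + self.right.eval(row)

    def gen(self):
        return f"({self.left.gen()} + {self.right.gen()})"


@dataclass
class Mul:
    left: object
    right: object

    def eval(self, row):
        return self.left.eval(row) * self.right.eval(row)

    def gen(self):
        return f"({self.left.gen()} * {self.right.gen()})"


def compile_expr(expr):
    """Generate source for the whole expression once and compile it."""
    source = f"lambda row: {expr.gen()}"
    return eval(compile(source, "<generated>", "eval"))


# price * qty + 5, evaluated over made-up rows.
expr = Add(Mul(Col("price"), Col("qty")), Lit(5))
rows = [{"price": 2, "qty": 3}, {"price": 10, "qty": 1}]

# Interpreted: walks the expression tree for every row.
interpreted = [expr.eval(r) for r in rows]

# Generated: a single flat function, no per-node dispatch.
fast = compile_expr(expr)
generated = [fast(r) for r in rows]
```

Both paths produce the same values. The difference is that the generated function evaluates the whole expression in one call, which is the general effect the code generation deliverable aims for.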

Check this blog from Databricks for more information: https://www.databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html.

The Catalyst Optimizer sits at the core of Spark SQL and powers both SQL queries and DataFrame operations executed against the data. The process starts when a query is issued to the engine. The logical execution plan is optimized first. Based on the optimized logical plan, multiple physical plans are derived and pushed through a cost optimizer. The most cost-efficient plan is then selected and translated (using the code generation optimizations implemented as part of Project Tungsten) into optimized RDD-based execution code.
