  • PySpark Cookbook
  • Denny Lee Tomasz Drabas

.take(...) method

Now that you have created your RDD (myRDD), we will use the take() method to return its values to the console (or notebook cell). take() is an RDD action (more information on actions in subsequent recipes). A common approach in PySpark is to use collect(), which returns all the values in your RDD from the Spark worker nodes to the driver. This has performance implications when working with large datasets, because it forces large volumes of data to be transferred from the worker nodes to the driver. For small amounts of data (such as in this recipe) this is perfectly fine, but as a matter of habit you should almost always use take(n) instead: it returns only the first n elements of the RDD rather than the whole dataset. It is also more efficient, because it first scans one partition and uses those statistics to determine how many partitions are needed to return the results.
