
Passing functions to Spark (Python)

Python provides a simple way to pass functions to Spark. The Spark programming guide available at spark.apache.org suggests there are three recommended ways to do this:

  • Lambda expressions are ideal for short functions that can be written as a single expression (a quick sketch follows this list)
  • Local defs inside the function calling into Spark work well for longer code
  • Top-level functions in a module
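As a quick reminder, a lambda-only version of the word count might look like the following minimal sketch; sc is assumed to be an existing SparkContext, and words.txt is a hypothetical input file:

lines = sc.textFile("words.txt")
# Count the words on each line, then sum the per-line counts.
totalWords = lines.map(lambda line: len(line.split(" "))).reduce(lambda a, b: a + b)
print(totalWords)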

While we have already looked at lambda functions in some of the previous examples, let's now look at local definitions of functions. We can encapsulate our business logic, splitting lines into words and counting them, in two separate functions, as shown below:

def splitter(lineOfText):
    words = lineOfText.split(" ")
    return len(words)

def aggregate(numWordsLine1, numWordsLineNext):
    totalWords = numWordsLine1 + numWordsLineNext
    return totalWords

Let's see the working code example:

Figure 2.15: Code example of Python word count (local definition of functions)
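Since the figure is a screenshot, here is a minimal sketch of the driver code it presumably shows, reusing the two functions defined above (sc and words.txt assumed as before):

lines = sc.textFile("words.txt")
# Map each line to its word count, then reduce to the grand total.
totalWords = lines.map(splitter).reduce(aggregate)
print(totalWords)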

Here's another way to implement this: define the functions as part of a UtilFunctions class and reference them within your map and reduce functions:

Figure 2.16: Code example of Python word count (Utility class)
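Again, the figure is a screenshot, so the following is only a sketch of what such a utility class might look like; the method names mirror the earlier functions, and sc and words.txt are assumed as before:

class UtilFunctions(object):
    def splitter(self, lineOfText):
        # Split a line on spaces and return the number of words.
        return len(lineOfText.split(" "))

    def aggregate(self, numWordsLine1, numWordsLineNext):
        # Sum two per-line word counts.
        return numWordsLine1 + numWordsLineNext

utils = UtilFunctions()
lines = sc.textFile("words.txt")
totalWords = lines.map(utils.splitter).reduce(utils.aggregate)
print(totalWords)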

You may want to be a bit cheeky here and try to add a countWords() method to UtilFunctions, so that it takes an RDD as input and returns the total number of words. This approach has potential performance implications, because the whole object needs to be sent to the cluster. Let's see how this can be implemented, and the results, in the following screenshot:

Figure 2.17: Code example of Python word count (Utility class - 2)
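As a rough, hypothetical sketch of what such a countWords() method might look like: referencing self inside the method ties the closure to the whole instance, which is why the entire object has to be serialized and shipped to the executors:

class UtilFunctions(object):
    def splitter(self, lineOfText):
        return len(lineOfText.split(" "))

    def aggregate(self, a, b):
        return a + b

    def countWords(self, rdd):
        # self.splitter and self.aggregate reference this instance, so
        # the whole UtilFunctions object is sent to the cluster.
        return rdd.map(self.splitter).reduce(self.aggregate)

totalWords = UtilFunctions().countWords(sc.textFile("words.txt"))
print(totalWords)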

This can be avoided by copying the referenced data field into a local variable inside the method, rather than accessing it through the object, so that only the value itself is captured by the closure.
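A minimal sketch of that pattern, with a hypothetical separator field: copy the field into a local variable before using it in the closure, so Spark serializes just that value instead of the whole object:

class WordCounter(object):
    def __init__(self):
        self.separator = " "

    def count_words(self, rdd):
        # Copy the field into a local variable; the closure then captures
        # just this string, not the whole WordCounter instance.
        separator = self.separator
        return rdd.map(lambda line: len(line.split(separator))).reduce(lambda a, b: a + b)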

Now that we have seen how to pass functions to Spark, and have already looked at some of the transformations and actions in the previous examples, including map, flatMap, and reduce, let's look at the most common transformations and actions used in Spark. The list is not exhaustive; you can find more examples in the programming guide section of the Apache Spark documentation (http://bit.ly/SparkProgrammingGuide). If you would like a comprehensive list of all the available functions, you might want to check the following API docs:

Table 2.1 - RDD and PairRDD API references
