
Operations on RDD

Two major operation types can be performed on an RDD. They are called:

  • Transformations
  • Actions

Transformations

Transformations are operations that create a new dataset, as RDDs are immutable. They are used to transform data from one form to another, which could result in amplification of the data, reduction of the data, or a totally different shape altogether. These operations do not return a value to the driver program; instead they are lazily evaluated, which is one of the main benefits of Spark.

An example of a transformation is the map function, which passes each element of the RDD through a user-supplied function and returns a totally new RDD representing the result of applying that function to the original dataset.
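To make this concrete, here is a minimal sketch, assuming a SparkContext is available as sc (as it is in the Spark shell); the variable names are purely illustrative. Both calls below return new RDDs without executing anything:

// Sketch: two common transformations on a small illustrative RDD
val numbers = sc.parallelize(1 to 10)
// map: one output element per input element
val squares = numbers.map(n => n * n)
// filter: a transformation that shrinks the dataset
val evens = squares.filter(n => n % 2 == 0)
// Nothing has executed yet; both calls simply returned new (lazy) RDDs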

Actions

Actions are operations that return a value to the driver program. As previously discussed, all transformations in Spark are lazy, which essentially means that Spark remembers all the transformations carried out on an RDD and applies them in the most optimal fashion when an action is called. For example, you might have a 1 TB dataset that you pass through a set of map functions by applying various transformations, and finally apply the reduce action. Apache Spark will return only the final result, which might be a few megabytes, rather than the entire 1 TB of mapped intermediate results.
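As a small sketch of this pattern (using a toy RDD rather than a real 1 TB dataset; the names data, doubled, and total are illustrative), the map transformation below is only executed when reduce asks for a result, and only that single value travels back to the driver:

// Sketch: transformations are deferred until an action runs
val data = sc.parallelize(1 to 1000000)
val doubled = data.map(n => n * 2)           // lazy; nothing runs yet
val total = doubled.reduce((a, b) => a + b)  // action: triggers the computation
// Only this single value is sent to the driver, not the whole dataset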

You should, however, remember to persist intermediate results; otherwise Spark will recompute the entire RDD graph each time an action is called. The persist() method on an RDD helps you avoid recomputation by saving the intermediate results. We'll look at this in more detail later.
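As a quick sketch of where persist() fits (the filter step and the names here are illustrative, not part of the example that follows), caching an intermediate RDD means a second action can be served from memory instead of re-reading the file:

// Sketch: persist an intermediate RDD that several actions will reuse
val lines = sc.textFile("README.md")
val nonEmpty = lines.filter(line => line.nonEmpty).persist()
nonEmpty.count()  // first action: computes the RDD and caches it
nonEmpty.first()  // second action: served from the cache, no recomputation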

Let's illustrate how transformations and actions work with a simple example, using the flatMap() transformation and the count() action. We'll use the README.md file from the local filesystem as our input, give a line-by-line explanation of the Scala version, and then provide the code for Python and Java. As always, you should try this example with your own piece of text and investigate the results:

//Loading the README.md file
val dataFile = sc.textFile("README.md")

Now that the data has been loaded, we'll need to run a transformation. Since we know that each line of the text is loaded as a separate element, we'll need to run a flatMap transformation and separate out individual words as separate elements, for which we'll use the split function with a space as the delimiter:

//Separate out a list of words from individual RDD elements
val words = dataFile.flatMap(line => line.split(" "))

Remember that up to this point, although you appear to have applied a transformation, nothing has actually been executed: the transformation has simply been added to the logical plan. Note also that the transformation function returns a new RDD. When we call the count() action on the words RDD, the computation is performed; the data is fetched from the file to create the base RDD, and the transformation function is applied to it. You might note that we have actually passed a function to Spark, an area that is covered in the Passing Functions to Spark section later in this chapter. Now that we have a new RDD of words, we can call count() on it to get the total number of elements within the RDD:

//Count the number of words in the RDD
words.count()

Upon calling the count() action, the RDD is evaluated and the result is sent back to the driver program. This is very neat, and especially useful in big data applications.

If you are Python savvy, you may want to run the following code in PySpark. You should note that lambda functions are passed to the Spark framework:

# Loading the data file, applying the transformation and the action
dataFile = sc.textFile("README.md")
words = dataFile.flatMap(lambda line: line.split(" "))
words.count()

Programming the same functionality in Java is also quite straightforward and looks pretty similar to the Scala program; note that, as in Scala, we need flatMap rather than map, since splitting a line produces multiple words:

import java.util.Arrays;

JavaRDD<String> lines = sc.textFile("README.md");
JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
long wordCount = words.count();

This might look like a simple program, but behind the scenes it is taking the line.split(" ") function and applying it to all the partitions of the data in parallel. The framework provides this simplicity and does all the background work of coordinating the execution, scheduling it across the cluster, and getting the results back.
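If you want to observe this parallelism, one option (a sketch, not part of the original example) is to request a minimum number of partitions when loading the file and then inspect the resulting RDD:

// Sketch: request a minimum number of partitions and inspect the result
val dataFile = sc.textFile("README.md", 4)  // second argument is minPartitions
val words = dataFile.flatMap(line => line.split(" "))
println(words.getNumPartitions)             // flatMap preserves the partition count
words.count()                               // one task per partition, run in parallel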
