
Creating and filtering an RDD

Let's start by creating an RDD of strings:

scala> val stringRdd = sc.parallelize(Array("Java", "Scala", "Python", "Ruby", "JavaScript", "Java"))
stringRdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24

Now, we will filter this RDD to keep only those strings that start with the letter J:

scala> val filteredRdd = stringRdd.filter(s => s.startsWith("J"))
filteredRdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:26

In the first chapter, we learned that if an operation on an RDD returns an RDD, it is a transformation; otherwise, it is an action.

The output of the preceding command shows that the filter operation returned an RDD, so filter is a transformation.
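
To make the distinction concrete, here is a minimal sketch, assuming a spark-shell session in which sc is already defined. The names langs, startsWithJ, lengths, howMany, and asArray are illustrative and do not appear in the original example; map and count are standard RDD operations used here only for contrast.

```scala
// Assumes spark-shell, where `sc` (the SparkContext) is predefined.
val langs = sc.parallelize(Seq("Java", "Scala", "Python", "Ruby", "JavaScript", "Java"))

// Transformations: each returns a new RDD and does not run a job by itself.
val startsWithJ = langs.filter(_.startsWith("J"))  // RDD[String]
val lengths = startsWithJ.map(_.length)            // RDD[Int]

// Actions: each triggers a job and returns a plain value to the driver.
val howMany: Long = startsWithJ.count()            // 3
val asArray: Array[Int] = lengths.collect()        // Array(4, 10, 4)
```

The return types tell the story: filter and map yield RDDs, while count and collect yield a Long and an Array.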

Now, we will run an action on filteredRdd to see its elements. Let's run collect on filteredRdd:

scala> val list = filteredRdd.collect
list: Array[String] = Array(Java, JavaScript, Java)

According to the output of the previous command, the collect operation returned an array of strings rather than an RDD, so it is an action.

Now, let's see the elements of the list variable:

scala> list
res5: Array[String] = Array(Java, JavaScript, Java)

We are left with only the elements that start with J, which was our desired outcome.
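
The same steps can also be run outside the shell. The following is a minimal sketch, assuming Spark 2.x or later is on the classpath; the object name FilterRddSketch, the helper keepPrefixed, and the local[*] master setting are illustrative assumptions and not part of the original example.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical object name, used only for this illustration.
object FilterRddSketch {

  // Transformation only: returns a new RDD and runs no job by itself.
  def keepPrefixed(rdd: RDD[String], prefix: String): RDD[String] =
    rdd.filter(_.startsWith(prefix))

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption for running on a single machine.
    val spark = SparkSession.builder()
      .appName("FilterRddSketch")
      .master("local[*]")
      .getOrCreate()
    try {
      val stringRdd = spark.sparkContext.parallelize(
        Seq("Java", "Scala", "Python", "Ruby", "JavaScript", "Java"))
      // collect is the action that triggers the job and returns an Array.
      val list = keepPrefixed(stringRdd, "J").collect()
      println(list.mkString(", ")) // prints: Java, JavaScript, Java
    } finally {
      spark.stop()
    }
  }
}
```

Keeping the filter in its own method separates the transformation from the action, so the helper can be reused on any RDD[String] without triggering a job.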
