Word count on RDD

Let's run a word count on stringRDD. Word count is the "Hello World" of the big data world: we count the occurrences of each word in the RDD.

First, we create a pair RDD as follows:

scala> val pairRDD = stringRdd.map(s => (s, 1))
pairRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[6] at map at <console>:26

pairRDD consists of (word, 1) pairs, where each word is a string from stringRDD and 1 is an integer count.

Now we run the reduceByKey operation on this RDD, which adds up the counts that share the same word:

scala> val wordCountRDD = pairRDD.reduceByKey((x, y) => x + y)
wordCountRDD: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[8] at reduceByKey at <console>:28
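
To make the two steps above more concrete, here is a minimal sketch of the same logic on plain Scala collections, with no Spark involved. It is only an illustration of what map followed by reduceByKey computes, not how Spark executes it across partitions. The object name WordCountSketch, the helper wordCount, and the order of the input words are assumptions made for this example; the words themselves are inferred from the output shown in this section.

```scala
// Plain Scala collections only; no Spark dependency is needed to run this.
object WordCountSketch {

  // Hypothetical helper that mirrors map + reduceByKey on a local sequence.
  def wordCount(words: Seq[String]): Map[String, Int] = {
    // Step 1: pair every word with the count 1, like stringRdd.map(s => (s, 1)).
    val pairs: Seq[(String, Int)] = words.map(s => (s, 1))

    // Step 2: bring pairs with the same key together, then combine their
    // counts with the same function that is passed to reduceByKey.
    pairs
      .groupBy { case (word, _) => word }
      .map { case (word, group) => (word, group.map(_._2).reduce((x, y) => x + y)) }
  }

  def main(args: Array[String]): Unit = {
    // Assumed input: the element order is a guess, the words come from the output above.
    val words = Seq("Java", "Scala", "Python", "Ruby", "JavaScript", "Java")

    // Sort by word so the printed order is stable.
    wordCount(words).toSeq.sortBy(_._1).foreach(println)
  }
}
```

The function passed to reduce here is the same (x, y) => x + y used with reduceByKey. Because addition is associative and commutative, the order in which the counts are combined does not affect the result.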

Now, let's run collect on wordCountRDD to see the result:

scala> val wordCountList = wordCountRDD.collect
wordCountList: Array[(String, Int)] = Array((Python,1), (JavaScript,1), (Java,2), (Scala,1), (Ruby,1))
scala> wordCountList
res3: Array[(String, Int)] = Array((Python,1), (JavaScript,1), (Java,2), (Scala,1), (Ruby,1))

According to the output of wordCountList, every string in stringRDD appears once except Java, which appears twice.

It is shown in the following screenshot:
