- Apache Spark 2.x for Java Developers
- Sourav Gulati, Sumit Kumar
- 2021-07-02 19:02:01
Counting the number of words in a file
Let's read the file people.txt placed in $SPARK_HOME/examples/src/main/resources:

scala> val file = sc.textFile("/usr/local/spark/examples/src/main/resources/people.txt")
file: org.apache.spark.rdd.RDD[String] = /usr/local/spark/examples/src/main/resources/people.txt MapPartitionsRDD[1] at textFile at <console>:24
The next step is to flatten the contents of the file: we create an RDD by splitting each line on a comma followed by a space (", ") and flattening the resulting words into a single collection, as follows:
scala> val flattenFile = file.flatMap(s => s.split(", "))
flattenFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[5] at flatMap at <console>:26
The contents of the flattenFile RDD look as follows:
scala> flattenFile.collect
res5: Array[String] = Array(Michael, 29, Andy, 30, Justin, 19)
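To see why flatMap is used here rather than map, the same split can be tried on a plain Scala collection, with no Spark involved. This is only a minimal sketch: the object name FlatMapSketch is hypothetical, and the three input lines are an assumption about the contents of people.txt, inferred from the collect output above.

```scala
object FlatMapSketch {
  // Assumed contents of people.txt, inferred from the collect output.
  val lines: Seq[String] = Seq("Michael, 29", "Andy, 30", "Justin, 19")

  // map keeps one element per input line, so the result is three arrays.
  val mapped: Seq[Array[String]] = lines.map(s => s.split(", "))

  // flatMap concatenates the per-line arrays into one flat sequence of words.
  val flattened: Seq[String] = lines.flatMap(s => s.split(", "))

  def main(args: Array[String]): Unit = {
    println(mapped.length)            // prints 3 (one array per line)
    println(flattened.length)         // prints 6 (one element per word)
    println(flattened.mkString(", ")) // prints Michael, 29, Andy, 30, Justin, 19
  }
}
```

With map, a later count would return the number of lines; with flatMap, it returns the number of words, which is what this section is after.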
Now, we can count all the words in this RDD as follows:
scala> val count = flattenFile.count
count: Long = 6
scala> count
res2: Long = 6
It is shown in the following screenshot:

Whenever an action such as count is called, Spark builds a directed acyclic graph (DAG) from the lineage dependencies of each RDD. Spark provides a debug method, toDebugString(), that shows the lineage dependencies of an RDD:
scala> flattenFile.toDebugString
It is shown in the following screenshot:

Each change in indentation marks a shuffle boundary, while the number in parentheses indicates the parallelism level (the number of partitions) at that stage.
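The flattenFile lineage contains no shuffle, so its debug string has a single indentation level. One way to see an indented lineage is to add a transformation that does shuffle, such as reduceByKey. The following is a minimal sketch rather than part of the original example: the object LineageSketch and the helper wordCounts are hypothetical names, the input lines are assumed from the collect output above, and the session runs in local mode with two threads.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object LineageSketch {
  // Hypothetical helper: counts occurrences of each word.
  // reduceByKey forces a shuffle, so the lineage gains a second stage.
  def wordCounts(sc: SparkContext): RDD[(String, Int)] = {
    // Assumed input lines, split across two partitions.
    val lines = sc.parallelize(Seq("Michael, 29", "Andy, 30", "Justin, 19"), numSlices = 2)
    lines
      .flatMap(_.split(", "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lineage-sketch")
      .master("local[2]") // local mode is an assumption made for this sketch
      .getOrCreate()
    try {
      // Print the lineage; the RDDs upstream of the shuffle appear indented.
      println(wordCounts(spark.sparkContext).toDebugString)
    } finally {
      spark.stop()
    }
  }
}
```

In the spark-shell, the body of wordCounts can be pasted directly using the existing sc, and toDebugString called on the result.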
In this section, we became familiar with some Spark CLI concepts. In the next section, we will discuss the various components of a Spark job.