
  • Learning Spark SQL
  • Aurobindo Sarkar

Processing multiple input data files

In the next few steps, we initialize a set of variables for defining the directory containing the input files, and an empty RDD. We also create a list of filenames from the input HDFS directory. In the following example, we will work with files contained in a single directory; however, the techniques can easily be extended across all 20 newsgroup sub-directories.
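The original listing is not reproduced here; the following is a minimal sketch of this setup, assuming a Spark shell session (so `sc` is in scope). The input path, the `wcRDD` and `fileNames` variable names, and the use of the Hadoop `FileSystem` API to list the directory are illustrative assumptions:

```scala
import org.apache.spark.rdd.RDD
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical input directory; substitute your own HDFS location.
val inputDir = "hdfs://localhost:9000/20news/alt.atheism"

// An empty RDD of (word, count) pairs to start from.
var wcRDD: RDD[(String, Int)] = sc.emptyRDD[(String, Int)]

// List the names of the files contained in the input directory.
val fs = new Path(inputDir).getFileSystem(sc.hadoopConfiguration)
val fileNames = fs.listStatus(new Path(inputDir)).map(_.getPath.toString)
```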

Next, we write a function to compute the word counts for each file and collect the results in an ArrayBuffer:
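A sketch of such a function, under the same assumptions: each file's word counts are computed as an RDD of (word, count) pairs and appended to an `ArrayBuffer` (the function and buffer names are illustrative; the `println` is discussed next):

```scala
import scala.collection.mutable.ArrayBuffer

val wcBuffer = new ArrayBuffer[RDD[(String, Int)]]()

// Compute the word counts for one file and collect the result.
def computeWordCounts(fileName: String): Unit = {
  println(s"Processing file: $fileName") // show the file being picked up
  val wc = sc.textFile(fileName)
    .flatMap(_.split("""\s+"""))         // split lines into words
    .filter(_.nonEmpty)
    .map(word => (word, 1))
    .reduceByKey(_ + _)                  // (word, count) pairs for this file
  wcBuffer += wc
}

fileNames.foreach(computeWordCounts)
```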

We have included a print statement (the `println` at the top of the function sketched above) to display the file names as they are picked up for processing.

We add the rows into a single RDD using the union operation:
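A sketch of this step, combining the buffered per-file RDDs with a single `SparkContext.union()` call (the reason for choosing this form is explained below; the `allWordCounts` name is illustrative):

```scala
// Combine all per-file word-count RDDs in one union operation.
val allWordCounts = sc.union(wcBuffer)
```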

We could have directly executed the union step as each file is processed, as follows:
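That alternative would look roughly as follows, applying a pairwise `RDD.union()` inside the processing loop:

```scala
// Alternative: union each file's word counts as it is processed.
fileNames.foreach { fileName =>
  val wc = sc.textFile(fileName)
    .flatMap(_.split("""\s+"""))
    .filter(_.nonEmpty)
    .map(word => (word, 1))
    .reduceByKey(_ + _)
  wcRDD = wcRDD.union(wc) // the lineage grows by one step per file
}
```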

However, each use of RDD.union() creates a new step in the lineage graph, requiring an extra set of stack frames for each new RDD; with many files this can easily lead to a stack overflow. Instead, we use SparkContext.union(), which executes the union operation all at once, without the extra memory overhead.

We can cache and print sample rows from our output RDD as follows:
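A sketch of this step, using the combined `allWordCounts` RDD produced above:

```scala
// Cache the combined RDD and print a few sample (word, count) rows.
allWordCounts.cache()
allWordCounts.take(5).foreach(println)
```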

In the next section, we show you ways of filtering out stop words. For simplicity, we focus only on well-formed words in the text. However, you can easily add conditions to filter out special characters and other anomalies in the data using String functions and regexes (for a detailed example, refer to Chapter 9, Developing Applications with Spark SQL).
