- Learning Spark SQL
- Aurobindo Sarkar
- 2021-07-02 18:23:53
Processing multiple input data files
In the next few steps, we initialize variables that define the directory containing the input files, along with an empty RDD. We also create a list of filenames from the input HDFS directory. The following example works with the files in a single directory; however, the same techniques extend easily to all 20 newsgroup sub-directories.
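
As a minimal sketch of this setup, intended for pasting into spark-shell, the following might serve. The HDFS URL, the helper name `listInputFiles`, and the `(fileName, word, count)` row shape are assumptions made for illustration, not names taken from the text.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation. In spark-shell,
// getOrCreate() simply returns the session that already exists.
val spark = SparkSession.builder()
  .appName("newsgroups-setup")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// Hypothetical helper: list the plain files directly under `dir`,
// skipping any sub-directories.
def listInputFiles(sc: SparkContext, dir: String): Array[String] = {
  val path = new Path(dir)
  val fs: FileSystem = path.getFileSystem(sc.hadoopConfiguration)
  fs.listStatus(path).filter(_.isFile).map(_.getPath.toString).sorted
}

// Assumed location of one newsgroup directory; substitute your own.
val inputDir = "hdfs://localhost:9000/20_newsgroups/comp.graphics"

// Empty RDD that will later hold (fileName, word, count) rows.
var wordCountRows: RDD[(String, String, Int)] =
  sc.emptyRDD[(String, String, Int)]

// With HDFS running, the list of input files would then be:
// val fileNames = listInputFiles(sc, inputDir)
```

Resolving the file system from the path itself lets the same helper work against HDFS or a local directory, which is convenient for trying the steps on a small sample first.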

Next, we write a function to compute the word counts for each file and collect the results in an ArrayBuffer:
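
One possible shape for such a function is sketched below. The names `wordCountsForFile` and `collectWordCounts`, the whitespace tokenizer, and the letters-only filter are all assumptions chosen to illustrate the idea; the print statement shows each file name as it is picked up.

```scala
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell the existing one is reused.
val spark = SparkSession.builder()
  .appName("word-counts-per-file")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// Hypothetical helper: word counts for one file as (fileName, word, count).
def wordCountsForFile(sc: SparkContext, fileName: String): RDD[(String, String, Int)] = {
  println(s"Processing file: $fileName")
  sc.textFile(fileName)
    .flatMap(line => line.split("\\s+"))
    // Keep only tokens made up entirely of letters (an assumed notion
    // of a "well-formed" word).
    .filter(token => token.matches("[A-Za-z]+"))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
    .map { case (word, count) => (fileName, word, count) }
}

// Hypothetical helper: one RDD per input file, collected in an ArrayBuffer.
def collectWordCounts(
    sc: SparkContext,
    fileNames: Seq[String]): ArrayBuffer[RDD[(String, String, Int)]] = {
  val perFile = ArrayBuffer.empty[RDD[(String, String, Int)]]
  fileNames.foreach(name => perFile += wordCountsForFile(sc, name))
  perFile
}
```

Because RDD transformations are lazy, the loop only records one RDD definition per file; no file is actually read until an action runs on the result.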

We have included a print statement to display the file names as they are picked up for processing, as follows:

We add the rows into a single RDD using the union operation:
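
A minimal sketch of this step, assuming the per-file results sit in an ArrayBuffer of RDDs. The small in-memory RDDs below are stand-ins for real per-file word counts, and the variable names are hypothetical.

```scala
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell the existing one is reused.
val spark = SparkSession.builder()
  .appName("union-all-files")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// Stand-ins for the per-file (fileName, word, count) RDDs.
val perFile = ArrayBuffer[RDD[(String, String, Int)]](
  sc.parallelize(Seq(("file1", "spark", 2), ("file1", "sql", 1))),
  sc.parallelize(Seq(("file2", "rdd", 3)))
)

// SparkContext.union takes the whole sequence and builds a single union
// over all of the inputs. toSeq keeps the call valid on Scala 2.13 as well.
val allRows: RDD[(String, String, Int)] = sc.union(perFile.toSeq)
```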

We could have directly executed the union step as each file is processed, as follows:

However, each call to RDD.union() adds a new step to the lineage graph, which requires an extra set of stack frames for each new RDD. With many input files, this can easily lead to a StackOverflowError. Instead, we use SparkContext.union(), which performs the union over all the RDDs at once and avoids that per-RDD overhead.
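
The difference between the two approaches can be made visible with a rough sketch such as the one below. The helper `lineageDepth` is hypothetical; it walks `RDD.dependencies` to measure the longest chain of parent RDDs, and the five tiny input RDDs are stand-ins for per-file results.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell the existing one is reused.
val spark = SparkSession.builder()
  .appName("union-lineage")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// Hypothetical helper: length of the longest chain of parent RDDs.
def lineageDepth(rdd: RDD[_]): Int =
  if (rdd.dependencies.isEmpty) 1
  else 1 + rdd.dependencies.map(dep => lineageDepth(dep.rdd)).max

// Stand-ins for five per-file RDDs.
val parts: Seq[RDD[Int]] = (1 to 5).map(i => sc.parallelize(Seq(i)))

// Incremental approach: every RDD.union call wraps the previous result,
// so the lineage gets one level deeper per input.
var chained: RDD[Int] = parts.head
parts.tail.foreach(p => chained = chained.union(p))

// Single call: one union node whose parents are all of the inputs.
val flat: RDD[Int] = sc.union(parts)

println(s"chained depth = ${lineageDepth(chained)}, flat depth = ${lineageDepth(flat)}")
```

Both RDDs hold the same elements; only the shape of the lineage differs. The chained depth grows with the number of inputs, while the depth of the single-call union does not.
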
We can cache and print sample rows from our output RDD as follows:
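
For instance, assuming an output RDD of `(fileName, word, count)` rows (the data below is a made-up stand-in), caching and sampling might look like this:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell the existing one is reused.
val spark = SparkSession.builder()
  .appName("cache-and-sample")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// Stand-in for the combined output RDD.
val allRows: RDD[(String, String, Int)] = sc.parallelize(Seq(
  ("file1", "spark", 2),
  ("file1", "sql", 1),
  ("file2", "rdd", 3)
))

// Mark the RDD for in-memory caching; it is materialized by the first action.
allRows.cache()

// take(n) returns the first n rows to the driver without collecting everything.
val sample = allRows.take(2)
sample.foreach(println)
```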

In the next section, we show you ways of filtering out stop words. For simplicity, we focus only on well-formed words in the text. However, you can easily add conditions to filter out special characters and other anomalies in the data using String functions and regexes (for a detailed example, refer to Chapter 9, Developing Applications with Spark SQL).