- Elasticsearch for Hadoop
- Vishal Shukla
Running the WordCount example
Now that our ES-Hadoop environment is tested and running, we are all set to run our first WordCount example. In the Hadoop world, WordCount has taken the place of the classic HelloWorld program, hasn't it?
Getting the examples and building the job JAR file
You can download the examples in the book from https://github.com/vishalbrevitaz/eshadoop/tree/master/ch01. Once you have got the source code, you can build the JAR file for this chapter using the steps mentioned in the readme file in the source code zip. The build process should generate a ch01-0.0.1-job.jar file under the <SOURCE_CODE_BASE_DIR>/ch01/target directory.
Importing the test file to HDFS
For our WordCount example, you can use any text file of your choice. To explain the example, we will use the sample.txt file that is part of the source zip. Perform the following steps:
- First, let's create a nice directory structure in HDFS to manage our input files with the following commands:
$ hadoop fs -mkdir /input
$ hadoop fs -mkdir /input/ch01
- Next, upload the sample.txt file to HDFS at the desired location by using the following command:
$ hadoop fs -put data/ch01/sample.txt /input/ch01/sample.txt
- Now, verify that the file is successfully imported to HDFS by using the following command:
$ hadoop fs -ls /input/ch01
Finally, when you execute the preceding command, it should show an output similar to the following:
Found 1 items
-rw-r--r--   1 eshadoop supergroup       2803 2015-05-10 15:18 /input/ch01/sample.txt
Running our first job
We have the job JAR file ready, and the sample file is imported to HDFS. Point your terminal to the <SOURCE_CODE_BASE_DIR>/ch01/target directory and run the following command:
$ hadoop jar ch01-0.0.1-job.jar /input/ch01/sample.txt
Now you'll get the following output:
15/05/10 15:21:33 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/05/10 15:21:34 WARN mr.EsOutputFormat: Speculative execution enabled for reducer - consider disabling it to prevent data corruption
15/05/10 15:21:34 INFO util.Version: Elasticsearch Hadoop v2.0.2 [ca81ff6732]
15/05/10 15:21:34 INFO mr.EsOutputFormat: Writing to [eshadoop/wordcount]
15/05/10 15:21:35 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/05/10 15:21:41 INFO input.FileInputFormat: Total input paths to process : 1
15/05/10 15:21:42 INFO mapreduce.JobSubmitter: number of splits:1
15/05/10 15:21:42 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1431251282365_0002
15/05/10 15:21:42 INFO impl.YarnClientImpl: Submitted application application_1431251282365_0002
15/05/10 15:21:42 INFO mapreduce.Job: The url to track the job: http://eshadoop:8088/proxy/application_1431251282365_0002/
15/05/10 15:21:42 INFO mapreduce.Job: Running job: job_1431251282365_0002
15/05/10 15:21:54 INFO mapreduce.Job: Job job_1431251282365_0002 running in uber mode : false
15/05/10 15:21:54 INFO mapreduce.Job: map 0% reduce 0%
15/05/10 15:22:01 INFO mapreduce.Job: map 100% reduce 0%
15/05/10 15:22:09 INFO mapreduce.Job: map 100% reduce 100%
15/05/10 15:22:10 INFO mapreduce.Job: Job job_1431251282365_0002 completed successfully
… … …
Elasticsearch Hadoop Counters
    Bulk Retries=0
    Bulk Retries Total Time(ms)=0
    Bulk Total=1
    Bulk Total Time(ms)=48
    Bytes Accepted=9655
    Bytes Received=4000
    Bytes Retried=0
    Bytes Sent=9655
    Documents Accepted=232
    Documents Received=0
    Documents Retried=0
    Documents Sent=232
    Network Retries=0
    Network Total Time(ms)=84
    Node Retries=0
    Scroll Total=0
    Scroll Total Time(ms)=0
Tip
Downloading the example code
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
We just executed our first Hadoop MapReduce job that imports data to Elasticsearch using ES-Hadoop. This MapReduce job simply outputs the count of each word in the Mapper phase, and the Reducer calculates the sum of all the counts for each word. We will dig into greater details of how exactly this WordCount program is developed in the next chapter. The console output of the job displays useful log information that indicates the progress of the job execution. It also displays the ES-Hadoop counters, which provide some handy information about the amount of data and the number of documents being sent and received, the number of retries, the time taken, and so on. If you have used the sample.txt file provided in the source zip, you will be able to see that the job found 232 unique words, and all of them are pushed as Elasticsearch documents. In the next section, we will examine these documents with the Elasticsearch Head and Marvel plugins that we already installed in Elasticsearch. Note that you can also track the status of your ES-Hadoop MapReduce jobs, just like any other Hadoop job, in the job tracker. In our setup, you can access the job tracker at http://localhost:8088/cluster.
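Before we look at the real Mapper and Reducer classes in the next chapter, the core word-counting logic can be sketched in plain Java, without a Hadoop cluster. This is only an illustration of what the two phases compute; the class and method names below are invented for this sketch and are not from the book's source code, which additionally wires the output into Elasticsearch via EsOutputFormat.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the WordCount logic (names are hypothetical).
public class WordCountSketch {

    // "Map" phase: emit a count for every word found in one input line.
    static Map<String, Integer> mapLine(String line) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // "Reduce" phase: sum the per-line counts for each word.
    static Map<String, Integer> reduce(Iterable<Map<String, Integer>> partials) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, n) -> totals.merge(word, n, Integer::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Integer> m1 = mapLine("to be or not to be");
        Map<String, Integer> m2 = mapLine("to see or not to see");
        Map<String, Integer> totals = reduce(List.of(m1, m2));
        System.out.println(totals.get("to"));  // 4
        System.out.println(totals.get("be"));  // 2
        System.out.println(totals.size());     // 5 unique words
    }
}
```

In the actual job, each unique word with its total count becomes one document in the eshadoop/wordcount resource, which is why the counters above report 232 documents sent for the 232 unique words in sample.txt.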