- Elasticsearch for Hadoop
- Vishal Shukla
Running the WordCount example
Now that our ES-Hadoop environment is tested and running, we are all set to run our first WordCount example. In the Hadoop world, WordCount has taken the place of the classic HelloWorld program, hasn't it?
Getting the examples and building the job JAR file
You can download the examples in the book from https://github.com/vishalbrevitaz/eshadoop/tree/master/ch01. Once you have got the source code, you can build the JAR file for this chapter using the steps mentioned in the readme file in the source code zip. The build process should generate a ch01-0.0.1-job.jar file under the <SOURCE_CODE_BASE_DIR>/ch01/target directory.
Importing the test file to HDFS
For our WordCount example, you can use any text file of your choice. To explain the example, we will use the sample.txt file that is part of the source zip. Perform the following steps:
- First, let's create a nice directory structure in HDFS to manage our input files with the following commands:
$ hadoop fs -mkdir /input
$ hadoop fs -mkdir /input/ch01
- Next, upload the sample.txt file to HDFS at the desired location by using the following command:
$ hadoop fs -put data/ch01/sample.txt /input/ch01/sample.txt
- Now, verify that the file is successfully imported to HDFS by using the following command:
$ hadoop fs -ls /input/ch01
Finally, when you execute the preceding command, it should show an output similar to the following:
Found 1 items
-rw-r--r--   1 eshadoop supergroup       2803 2015-05-10 15:18 /input/ch01/sample.txt
Running our first job
We have the job JAR file ready, and the sample file is imported to HDFS. Point your terminal to the <SOURCE_CODE_BASE_DIR>/ch01/target directory and run the following command:
$ hadoop jar ch01-0.0.1-job.jar /input/ch01/sample.txt
Now you'll get the following output:
15/05/10 15:21:33 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/05/10 15:21:34 WARN mr.EsOutputFormat: Speculative execution enabled for reducer - consider disabling it to prevent data corruption
15/05/10 15:21:34 INFO util.Version: Elasticsearch Hadoop v2.0.2 [ca81ff6732]
15/05/10 15:21:34 INFO mr.EsOutputFormat: Writing to [eshadoop/wordcount]
15/05/10 15:21:35 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/05/10 15:21:41 INFO input.FileInputFormat: Total input paths to process : 1
15/05/10 15:21:42 INFO mapreduce.JobSubmitter: number of splits:1
15/05/10 15:21:42 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1431251282365_0002
15/05/10 15:21:42 INFO impl.YarnClientImpl: Submitted application application_1431251282365_0002
15/05/10 15:21:42 INFO mapreduce.Job: The url to track the job: http://eshadoop:8088/proxy/application_1431251282365_0002/
15/05/10 15:21:42 INFO mapreduce.Job: Running job: job_1431251282365_0002
15/05/10 15:21:54 INFO mapreduce.Job: Job job_1431251282365_0002 running in uber mode : false
15/05/10 15:21:54 INFO mapreduce.Job: map 0% reduce 0%
15/05/10 15:22:01 INFO mapreduce.Job: map 100% reduce 0%
15/05/10 15:22:09 INFO mapreduce.Job: map 100% reduce 100%
15/05/10 15:22:10 INFO mapreduce.Job: Job job_1431251282365_0002 completed successfully
… … …
Elasticsearch Hadoop Counters
    Bulk Retries=0
    Bulk Retries Total Time(ms)=0
    Bulk Total=1
    Bulk Total Time(ms)=48
    Bytes Accepted=9655
    Bytes Received=4000
    Bytes Retried=0
    Bytes Sent=9655
    Documents Accepted=232
    Documents Received=0
    Documents Retried=0
    Documents Sent=232
    Network Retries=0
    Network Total Time(ms)=84
    Node Retries=0
    Scroll Total=0
    Scroll Total Time(ms)=0
Tip
Downloading the example code
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
We just executed our first Hadoop MapReduce job that imports data to Elasticsearch using ES-Hadoop. This MapReduce job simply outputs the count of each word in the Mapper phase, and the Reducer calculates the sum of all the counts for each word. We will dig into greater details of how exactly this WordCount program is developed in the next chapter. The console output of the job displays useful log information that indicates the progress of the job execution. It also displays the ES-Hadoop counters, which provide some handy information about the amount of data and the number of documents being sent and received, the number of retries, the time taken, and so on. If you have used the sample.txt file provided in the source zip, you will be able to see that the job found 232 unique words, and all of them are pushed as Elasticsearch documents. In the next section, we will examine these documents with the Elasticsearch Head and Marvel plugins that we already installed in Elasticsearch. Note that you can also track the status of your ES-Hadoop MapReduce jobs, just like any other Hadoop job, in the job tracker. In our setup, you can access the job tracker at http://localhost:8088/cluster.
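Before we look at the real Mapper and Reducer classes in the next chapter, the core word-counting logic can be sketched in plain Java, without a Hadoop cluster. This is only an illustration of what the two phases compute; the class and method names below are invented for this sketch and are not from the book's source code, which additionally wires the output into Elasticsearch via EsOutputFormat.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the WordCount logic (names are hypothetical).
public class WordCountSketch {

    // "Map" phase: emit a count for every word found in one input line.
    static Map<String, Integer> mapLine(String line) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // "Reduce" phase: sum the per-line counts for each word.
    static Map<String, Integer> reduce(Iterable<Map<String, Integer>> partials) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, n) -> totals.merge(word, n, Integer::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Integer> m1 = mapLine("to be or not to be");
        Map<String, Integer> m2 = mapLine("to see or not to see");
        Map<String, Integer> totals = reduce(List.of(m1, m2));
        System.out.println(totals.get("to"));  // 4
        System.out.println(totals.get("be"));  // 2
        System.out.println(totals.size());     // 5 unique words
    }
}
```

In the actual job, each unique word with its total count becomes one document in the eshadoop/wordcount resource, which is why the counters above report 232 documents sent for the 232 unique words in sample.txt.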