
Executing the MapReduce program in a Hadoop cluster

In the previous recipe, we took a look at how to write a MapReduce program for a page view counter. In this recipe, we will explore how to execute it in a Hadoop cluster.

Getting ready

To perform this recipe, you should already have a running Hadoop cluster as well as an IDE such as Eclipse.

How to do it...

To execute the program, we first need to create a JAR file of it. JAR stands for Java Archive, a package that contains the compiled class files. To create a JAR file in Eclipse, we need to perform the following steps:

  1. Right-click on the project where you've written your MapReduce program, and then click on Export.
  2. Select Java->JAR file and click on the Next button. Browse to the path where you wish to export the JAR file, provide a proper name for it, and click on Finish to complete the creation of the JAR file.
  3. Now, copy this file to the Hadoop cluster. If your Hadoop cluster is running on an AWS EC2 instance, you can use the following command to copy the JAR file:
    scp -i mykey.pem logAnalyzer.jar ubuntu@ec2-52-27-157-247.us-west-2.compute.amazonaws.com:/home/ubuntu
    
  4. If you don't already have your input log files in HDFS, use the following commands to copy them:
    hadoop fs -mkdir /logs
    hadoop fs -put web.log /logs
    
  5. Now, it's time to execute the MapReduce program. Use the following command to start the execution:
    hadoop jar logAnalyzer.jar com.demo.PageViewCounter /logs /pageview_output
    
  6. This will start the MapReduce execution on your cluster. If everything goes well, you should be able to see the output in the /pageview_output folder in HDFS. Here, logAnalyzer.jar is the JAR file we created through Eclipse, /logs is the HDFS folder containing our input data, and /pageview_output is the folder that will be created to hold the results. It is also important to provide the fully qualified name of the driver class, including its package name; a minimal sketch of such a driver follows this list.
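
For reference, here is a minimal sketch of what such a driver class might look like. The mapper and reducer class names (PageViewMapper and PageViewReducer) are assumptions based on the previous recipe, so adjust them to match your own code:

package com.demo;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageViewCounter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Page View Counter");
        job.setJarByClass(PageViewCounter.class);
        // PageViewMapper and PageViewReducer are assumed names
        // from the previous recipe
        job.setMapperClass(PageViewMapper.class);
        job.setCombinerClass(PageViewReducer.class); // local aggregation
        job.setReducerClass(PageViewReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output paths come from the command line
        // (/logs /pageview_output in this recipe)
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}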

How it works...

Once the job is submitted, it first creates an Application Client and an ApplicationMaster in the Hadoop cluster. Map tasks are then started on the nodes of the cluster where the input data blocks are present. Once the mapper phase is complete, the map output is locally aggregated by a combiner, and the combined data is shuffled across the nodes in the cluster. Reducers cannot start until all the mappers have finished; the output from the reducers is written to the specified folder in HDFS.
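
On a YARN-based cluster, you can monitor the running application from the command line. The application ID placeholder below should be replaced with the ID printed when the job was submitted, and yarn logs works once the application has finished (with log aggregation enabled):

yarn application -list
yarn logs -applicationId <application_id>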

Note

The output folder you specify must not already exist in HDFS. If the folder is already present, the program will fail with an error.
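
If you want to rerun the job, remove the existing output folder first. The following command deletes it recursively from HDFS:

hadoop fs -rm -r /pageview_output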

When all the tasks of the application have finished, you can take a look at the output in HDFS. The following are the commands to do this:

hadoop fs -ls /pageview_output
hadoop fs -cat /pageview_output/part-r-00000

This way, you can write similar programs for the following:

  • The most frequent referral sites (hint: use the referrer group from the matcher)
  • The number of client errors (with an HTTP status of 4XX); a mapper sketch for this one follows the list
  • The number of server errors (with an HTTP status of 5XX)
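
As an illustration, here is a sketch of a mapper for the client error counter. The parsing is an assumption (a simple whitespace split of a common log format line, with the status code as the ninth field); adapt it to the regex matcher used in the previous recipe:

package com.demo;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ClientErrorMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text errorKey = new Text("4XX");

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumption: common log format, with the HTTP status
        // as the ninth whitespace-separated field
        String[] fields = value.toString().split(" ");
        if (fields.length > 8 && fields[8].startsWith("4")) {
            context.write(errorKey, ONE);
        }
    }
}

The same job setup as the page view counter can be reused; only the mapper (and the key it emits) changes.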