
  • Hadoop Beginner's Guide
  • Garry Turkington
  • 266 words
  • 2021-07-29 16:51:40

Time for action – WordCount the easy way

Let's revisit WordCount, but this time use some of these predefined map and reduce implementations:

  1. Create a new WordCountPredefined.java file containing the following code:
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
    import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

    public class WordCountPredefined
    {
        // args[0] is the input path, args[1] is the output path
        public static void main(String[] args) throws Exception
        {
            Configuration conf = new Configuration();
            // On Hadoop 2 and later this constructor is deprecated in favor of
            // Job.getInstance(conf, "word count1")
            Job job = new Job(conf, "word count1");
            job.setJarByClass(WordCountPredefined.class);
            // Predefined mapper: emits a (token, 1) pair for each token in a line
            job.setMapperClass(TokenCounterMapper.class);
            // Predefined reducer: sums the integer values received for each key
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // Block until the job finishes; exit with 0 on success, 1 on failure
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
  2. Now compile, create the JAR file, and run it as before.
  3. If you want to reuse the same output location, don't forget to delete the output directory before running the job, for example with hadoop fs -rmr output.

What just happened?

Given the ubiquity of WordCount as an example in the MapReduce world, it's perhaps not entirely surprising that there are predefined Mapper and Reducer implementations that together realize the entire WordCount solution. The TokenCounterMapper class simply breaks each input line into a series of (token, 1) pairs, and the IntSumReducer class produces the final count by summing the values emitted for each key.
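
The data flow just described can be sketched in plain Java without any Hadoop dependency. This is a simulation for illustration only, not the source of TokenCounterMapper or IntSumReducer: the class name WordCountSketch, its method names, and the whitespace-based tokenization are assumptions made for this example, and the run method stands in for the grouping that the framework performs between the map and reduce phases.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java simulation of the WordCount data flow; this is not Hadoop code.
public class WordCountSketch {

    // Map step: split one line on whitespace and emit a (token, 1) pair per token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<String, Integer>(tokens.nextToken(), 1));
        }
        return pairs;
    }

    // Reduce step: add up all the values received for a single key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Stand-in for the framework: map every line, group values by key, reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            counts.put(group.getKey(), reduce(group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {cat=2, ran=1, sat=1, the=2}
        System.out.println(run(Arrays.asList("the cat sat", "the cat ran")));
    }
}
```

Note that the reduce step adds the values themselves rather than counting how many arrived. With every value equal to 1 the two are the same, but summing also gives the right answer when partial counts have already been combined upstream.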

There are two important things to appreciate here:

  • Though WordCount was doubtless an inspiration for these implementations, they are in no way specific to it and are widely applicable
  • Reusable mapper and reducer implementations are worth keeping in mind, especially since the best starting point for a new MapReduce job implementation is often an existing one
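
To make the first point concrete, here is a small plain-Java sketch in which one summing reducer serves two unrelated mappers. It is an illustration under stated assumptions rather than Hadoop code: ReusableReducerSketch, sumByKey, mapAll, wordMapper, and lengthMapper are all hypothetical names invented for this example. In a real job, the analogous move would be to keep IntSumReducer and supply a mapper of your own.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Plain-Java illustration of reusing one reducer with different mappers; not Hadoop code.
public class ReusableReducerSketch {

    // Reusable reduce logic: sum every value emitted for a key, whatever the key type.
    static <K extends Comparable<K>> Map<K, Integer> sumByKey(List<Map.Entry<K, Integer>> pairs) {
        Map<K, Integer> totals = new TreeMap<>();
        for (Map.Entry<K, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    // Apply a mapper to every input line and collect the emitted pairs.
    static <K> List<Map.Entry<K, Integer>> mapAll(
            List<String> lines, Function<String, List<Map.Entry<K, Integer>>> mapper) {
        List<Map.Entry<K, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(mapper.apply(line));
        }
        return pairs;
    }

    // Hypothetical mapper 1: emit (word, 1) for each whitespace-separated word.
    static List<Map.Entry<String, Integer>> wordMapper(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<String, Integer>(word, 1));
            }
        }
        return pairs;
    }

    // Hypothetical mapper 2: emit (word length, 1) for each whitespace-separated word.
    static List<Map.Entry<Integer, Integer>> lengthMapper(String line) {
        List<Map.Entry<Integer, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<Integer, Integer>(word.length(), 1));
            }
        }
        return pairs;
    }

    // Two different "jobs" that differ only in the mapper they plug in.
    static Map<String, Integer> countWords(List<String> lines) {
        return sumByKey(mapAll(lines, ReusableReducerSketch::wordMapper));
    }

    static Map<Integer, Integer> countLengths(List<String> lines) {
        return sumByKey(mapAll(lines, ReusableReducerSketch::lengthMapper));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or not", "to see");
        System.out.println(countWords(lines));   // prints {be=1, not=1, or=1, see=1, to=2}
        System.out.println(countLengths(lines)); // prints {2=4, 3=2}
    }
}
```

Only the mapper changes between the two jobs. The reducer neither knows nor cares whether its keys are words or word lengths, which is what makes such an implementation reusable.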