  • Hadoop Beginner's Guide
  • Garry Turkington
  • 507 words
  • 2021-07-29 16:51:40

Time for action – fixing WordCount to work with a combiner

Let's make the necessary modifications to WordCount to correctly use a combiner.

Copy WordCount2.java to a new file called WordCount3.java and change the reduce method as follows:

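public void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException
{
    // Add up the values so that partial totals emitted by a
    // combiner are aggregated correctly by the reducer.
    int total = 0;
    for (IntWritable val : values)
    {
        total += val.get();
    }
    // Emit the word together with its overall total.
    context.write(key, new IntWritable(total));
}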

Remember to also change the class name to WordCount3 and then compile, create the JAR file, and run the job as before.

What just happened?

The output is now as expected. Any map-side invocations of the combiner perform successfully, and the reducer correctly produces the overall output value.
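To see why summing behaves well here, the following is a minimal plain-Java sketch, not Hadoop code. The class name SumCombinerSketch, the sum helper, and the sample values are all hypothetical. It assumes two map tasks have emitted (word, 1) pairs for the same key and compares the reducer's result with and without a map-side combine step.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java simulation of the data flow; no Hadoop classes are involved.
class SumCombinerSketch {

    // Same logic as the corrected reduce method: add up the values.
    static int sum(List<Integer> values) {
        int total = 0;
        for (int val : values) {
            total += val;
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical values emitted for one word by two map tasks.
        List<Integer> mapTask1 = Arrays.asList(1, 1, 1);
        List<Integer> mapTask2 = Arrays.asList(1, 1);

        // No combiner: the reducer receives all five values.
        int direct = sum(Arrays.asList(1, 1, 1, 1, 1));

        // With a combiner: each map task pre-sums its own output,
        // and the reducer sums the partial totals.
        int combined = sum(Arrays.asList(sum(mapTask1), sum(mapTask2)));

        System.out.println(direct + " " + combined); // prints "5 5"
    }
}
```

Because a sum of partial sums equals the overall sum, the same logic can serve as both combiner and reducer.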

Tip

Would this have worked if the original reducer were used as the combiner and the new reduce implementation as the reducer? The answer is no, though our test example would not have demonstrated it. Because the combiner may be invoked multiple times on the map output data, the same errors would arise in the map output if the data set were large enough; they did not occur here only because the input was so small. Fundamentally, the original reducer was incorrect, but this wasn't immediately obvious; watch out for such subtle logic flaws. This sort of issue can be really hard to debug, as the code will work reliably on a development box with a subset of the data set and then fail on the much larger operational cluster. Craft your combiner classes carefully and never rely on testing that processes only a small sample of the data.
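The following plain-Java sketch illustrates this failure mode without using Hadoop. It rests on a loud assumption: that the original reducer counted its input values instead of adding them up, which is not shown in this section. The class CountingCombinerSketch, its method names, and the two-part split of the map output are hypothetical simplifications of how a combiner can be applied more than once.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java simulation; no Hadoop classes are involved.
class CountingCombinerSketch {

    // ASSUMPTION: stands in for the original reducer. It counts how many
    // values it receives and ignores what the values actually are.
    static int count(List<Integer> values) {
        int total = 0;
        for (int ignored : values) {
            total++;
        }
        return total;
    }

    // Stands in for the new reduce implementation: add up the values.
    static int sum(List<Integer> values) {
        int total = 0;
        for (int val : values) {
            total += val;
        }
        return total;
    }

    // Small input: the combiner runs once over the whole map output.
    static int singlePass(List<Integer> mapOutput) {
        return sum(Arrays.asList(count(mapOutput)));
    }

    // Larger input: the map output is combined in two parts, and the
    // combiner then runs again over the two partial results.
    static int twoPasses(List<Integer> part1, List<Integer> part2) {
        int merged = count(Arrays.asList(count(part1), count(part2)));
        return sum(Arrays.asList(merged));
    }

    public static void main(String[] args) {
        // Five occurrences of one word, each emitted as the value 1.
        System.out.println(singlePass(Arrays.asList(1, 1, 1, 1, 1)));   // prints 5
        System.out.println(twoPasses(Arrays.asList(1, 1, 1),
                                     Arrays.asList(1, 1)));             // prints 2
    }
}
```

With a single combiner pass the result happens to be right, which is why a small test would pass. On the second pass the counting combiner treats the partial totals 3 and 2 as two occurrences, so the final answer drops from 5 to 2.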

Reuse is your friend

In the previous section, we took the existing job class file and made changes to it. This is a small example of a very common Hadoop development workflow: use an existing job file as the starting point for a new one. Even if the actual mapper and reducer logic is very different, starting from a working job is often a timesaver, as it reminds you of all the required elements of the mapper, reducer, and driver implementations.

Pop quiz – MapReduce mechanics

Q1. What do you always have to specify for a MapReduce job?

  1. The classes for the mapper and reducer.
  2. The classes for the mapper, reducer, and combiner.
  3. The classes for the mapper, reducer, partitioner, and combiner.
  4. None; all classes have default implementations.

Q2. How many times will a combiner be executed?

  1. At least once.
  2. Zero or one times.
  3. Zero, one, or many times.
  4. It's configurable.

Q3. You have a mapper that for each key produces an integer value and the following set of reduce operations:

  • Reducer A: outputs the sum of the set of integer values.
  • Reducer B: outputs the maximum of the set of values.
  • Reducer C: outputs the mean of the set of values.
  • Reducer D: outputs the difference between the largest and smallest values in the set.

Which of these reduce operations could safely be used as a combiner?

  1. All of them.
  2. A and B.
  3. A, B, and D.
  4. C and D.
  5. None of them.