Title: Big Data Analytics with Hadoop 3
Author: Sridhar Alla
Updated: 2021-06-25 21:26:15

Shuffle and sort

Once the mappers have finished processing the input data (essentially, splitting the data and generating key/value pairs), their output has to be distributed across the cluster so that the reduce tasks can start. A reduce task therefore begins with the shuffle and sort step: it takes the output files written by all of the mappers and their partitioners and downloads them to the local machine on which the reducer task is running. These individual data pieces are then sorted by key into one larger list of key/value pairs. The purpose of this sort is to group equivalent keys together, so that their values can be iterated over easily in the reduce task. The framework handles all of this automatically, while still allowing custom code to control how the keys are sorted and grouped.
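To make the data flow concrete, the following is a minimal sketch of shuffle and sort as a plain Python simulation. It is not Hadoop code and does not use any Hadoop API: the function names `partition` and `shuffle_and_sort`, the sample word-count pairs, and the use of CRC32 as a stand-in for a hash-based partitioner are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def partition(key, num_reducers):
    # Hypothetical stand-in for a hash-based partitioner: CRC32 is used
    # only so that the assignment is deterministic across runs.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(mapper_outputs, num_reducers):
    # Shuffle: every reducer gathers the pairs belonging to its partition
    # from the output of every mapper.
    fetched = defaultdict(list)
    for output in mapper_outputs:
        for key, value in output:
            fetched[partition(key, num_reducers)].append((key, value))

    # Sort: each reducer orders its pairs by key, then groups equal keys
    # so their values can be iterated over together.
    grouped = {}
    for reducer_id in range(num_reducers):
        pairs = sorted(fetched[reducer_id], key=itemgetter(0))
        grouped[reducer_id] = [
            (key, [value for _, value in group])
            for key, group in groupby(pairs, key=itemgetter(0))
        ]
    return grouped


# Illustrative output of two mappers in a word-count job (made-up data).
mapper_outputs = [
    [("hadoop", 1), ("spark", 1), ("hadoop", 1)],
    [("spark", 1), ("hive", 1)],
]

result = shuffle_and_sort(mapper_outputs, num_reducers=2)
for reducer_id, groups in result.items():
    print(reducer_id, groups)
```

In this sketch every key lands on exactly one reducer, and each reducer sees its keys in sorted order with all values for a key collected in one list, which is the shape of input a reduce function iterates over. The sketch uses Python's default string ordering and exact key equality for grouping; it does not model custom sort or grouping logic.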
