MapReduce functionality

All MapReduce programs/modules operate in two phases, as follows:

  • Map phase: This is the first phase. In the map phase, a set of data is converted into another set of data, where individual elements are broken into tuples (key-value pairs).
  • Reduce phase: This is the second phase. It takes the output of the map phase as input and merges those data tuples into a smaller set of tuples.
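The two phases can be sketched in a few lines of plain Python. This is only an illustration that runs in a single process, not Hadoop code; the function names `map_phase` and `reduce_phase` and the word-count task are assumptions chosen for the example.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: break each input line into (word, 1) tuples."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: merge tuples that share a key into one tuple per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Hypothetical input: two short lines of text.
lines = ["the quick fox", "the lazy dog"]
counts = reduce_phase(map_phase(lines))
print(counts)
```

Six input tuples go into the reduce step and five come out, because the two `("the", 1)` tuples are merged into one.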

A JobTracker divides a given problem into multiple map tasks. These tasks are distributed across the network to a number of slave nodes for parallel processing. These slave nodes are referred to as TaskTrackers. Generally, map tasks run on the cluster nodes where the data to be processed resides. If that node is already heavily loaded, another node that is close to the data will be chosen. Let's examine the work process of MapReduce, as shown in the following diagram:

Figure 2.3: Illustration of how MapReduce works

The preceding diagram shows a brief overview of how the MapReduce algorithm works. There are different phases involved. Assuming that there is a problem that needs to be solved by the MapReduce program, the program should execute in the order shown in the figure. Let's inspect each phase in detail, as follows:

  • Input phase: In the input phase, a record reader interprets each record in an input file and sends the parsed data to the mapper, in the form of key-value pairs. This is the first step in the MapReduce module. 
  • Mapper: A mapper is a user-defined program module that takes a series of key-value pairs and processes each of them, generating processed key-value pairs as the output.
  • Intermediate keys: The mapper consumes the input key-value pairs and outputs processed key-value pairs. The pairs generated by the mappers are referred to as intermediate key-value pairs, and their keys are the intermediate keys.
  • Combiner: A combiner is a local reducer that groups similar data from the mapper into identifiable sets. This is an optional phase that may or may not be present in any particular MapReduce subroutine.
  • Shuffle and sort: The shuffling and sorting phase consumes the output of the mapper phase as its input. There is usually a large amount of intermediate data to be moved from all of the map nodes to all of the reduce nodes in the shuffle phase. The shuffle phase reads this data from the mappers' disks, rather than from their main memories, and transfers it from the local map nodes to the reduce nodes through the network. The intermediate output is sorted by key, so that all pairs with the same key are grouped together.
  • Reducer: The reducer consumes the grouped key-value data as input and executes a reducer function on each key and its group of values. The reducer function emits zero or more key-value pairs as output. This output is passed to the final step of the MapReduce module.
  • Output phase: There is an output formatter that translates the final key-value pairs from the reducer function and writes them into a file, using a record writer. The output file contains the final output of the subroutine. 
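The phases listed above can be traced end to end with a small single-process sketch in Python. It is not Hadoop code and moves nothing over a network or disk; every function name here (`record_reader`, `mapper`, `combiner`, `shuffle_and_sort`, `reducer`, `output_format`, `run_job`) is a hypothetical stand-in for the corresponding phase, and the word-count task and the two input splits are assumptions made for illustration.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter

def record_reader(text):
    """Input phase: turn raw text into (line_offset, line) key-value pairs."""
    offset = 0
    for line in text.splitlines():
        yield (offset, line)
        offset += len(line) + 1  # +1 for the newline character

def mapper(key, value):
    """Mapper: emit an intermediate (word, 1) pair for every word."""
    for word in value.split():
        yield (word.lower(), 1)

def combiner(pairs):
    """Combiner (optional): pre-aggregate one mapper's output locally."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())

def shuffle_and_sort(mapper_outputs):
    """Shuffle and sort: gather all pairs, sort by key, group equal keys."""
    merged = sorted((p for out in mapper_outputs for p in out), key=itemgetter(0))
    for key, group in groupby(merged, key=itemgetter(0)):
        yield (key, [value for _, value in group])

def reducer(key, values):
    """Reducer: collapse the grouped values for one key into a total."""
    yield (key, sum(values))

def output_format(pairs):
    """Output phase: format the final pairs as tab-separated lines."""
    return "\n".join(f"{key}\t{value}" for key, value in pairs)

def run_job(splits):
    """Run every phase in order; each split stands in for one map task."""
    mapper_outputs = []
    for split in splits:
        intermediate = [p for k, v in record_reader(split) for p in mapper(k, v)]
        mapper_outputs.append(combiner(intermediate))
    final = [p for k, vs in shuffle_and_sort(mapper_outputs) for p in reducer(k, vs)]
    return output_format(final)

# Hypothetical input: two splits, as if stored on two different nodes.
splits = ["the quick fox\nthe fox", "the lazy dog"]
result = run_job(splits)
print(result)
```

Because the combiner runs once per split, the first split sends a single `("the", 2)` pair into the shuffle instead of two `("the", 1)` pairs, which is the reduction in transferred intermediate data that the combiner phase is meant to provide.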