官术网_书友最值得收藏!

Partitioner

The partitioner takes the intermediate key/value pairs from the mapper (or combiner if it is being used) and splits them up into shards, one shard per reducer.

Each map function output is allocated to a particular reducer by the application's partition function for sharding purposes. The partition function, is given the key and the number of reducers and returns the index of the desired reducer.

A typical default is to hash the key and use the hash value to module the number of reducers:

partitionId = hash(key) % R, where R is number of Reducers

It is important to pick a partition function that gives an approximately uniform distribution of data per shard for load-balancing purposes; otherwise the MapReduce operation can be held up waiting for slow reducers to finish (that is, the reducers assigned the larger shares of the skewed data).

Between the map and reduce stages, the data is shuffled (parallel sorted and then exchanged between nodes) in order to move the data from the map node that produced them to the shard in which they will be reduced. The shuffle can sometimes take longer than the computation time, depending on network bandwidth, CPU speeds, data produced, and time taken by map and reduce computations.

By default, the partitioner computes the hash code of each object, which is typically an md5 checksum. Then, it randomly distributes the keyspace evenly over the reducers, but still ensures that keys with the same values in different mappers end up at the same reducer. The default behavior of the partitioner can be customized with operations such as sorting. The partitioned data is written to the local filesystem for each map task and waits to be pulled by its corresponding reducer.

主站蜘蛛池模板: 湘西| 共和县| 原平市| 衢州市| 鹿泉市| 永城市| 扶风县| 宁城县| 平和县| 望城县| 安乡县| 永修县| 分宜县| 镶黄旗| 马边| 拉孜县| 渭南市| 衡阳市| 长宁县| 黄冈市| 安庆市| 托里县| 江永县| 县级市| 图木舒克市| 页游| 织金县| 瑞安市| 柯坪县| 郓城县| 东乌珠穆沁旗| 宁城县| 日照市| 张家口市| 保山市| 黄浦区| 合山市| 剑阁县| 垦利县| 南和县| 永泰县|