官术网_书友最值得收藏!

MapReduce

Massive parallel processing of large datasets is a complex process. MapReduce simplifies this by providing a design pattern that instructs algorithms to be expressed in map and reduce phases. Map can be used to perform simple transformations on data, and reduce is used to group data together and perform aggregations.

By chaining together a number of map and reduce phases, sophisticated algorithms can be achieved. The shared nothing architecture of MapReduce prohibits communication between map tasks of the same phase or reduces tasks of the same phase. Communication that's required happens at the end of each phase.

The simplicity of this model allows Hadoop to translate each phase, depending on the amount of data that needs to be processed into tens or even hundreds of tasks being executed in parallel, thus achieving scalable performance.

Internally, the map and reduce tasks follow a simplistic data representation. Everything is a key or a value. A map task receives key-value pairs and applies basic transformations emitting new key-value pairs. Data is then partitioned and different partitions are transmitted to different reduce tasks. A reduce task also receives key-value pairs, groups them based on the key, and applies basic transformation to those groups.

A MapReduce example

To illustrate how MapReduce works, let's look at an example of a log file of total size 1 GB with the following format:

INFO      MyApp  - Entering application.
WARNING   com.foo.Bar - Timeout accessing DB - Retrying
ERROR     com.foo.Bar  - Did it again!
INFO      MyApp  - Exiting application

Tip

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Once this file is stored in HDFS, it is split into eight 128 MB blocks and distributed in multiple Hadoop nodes. In order to build a MapReduce job to count the amount of INFO, WARNING, and ERROR log lines in the file, we need to think in terms of map and reduce phases.

In one map phase, we can read local blocks of the file and map each line to a key and a value. We can use the log level as the key and the number 1 as the value. After it is completed, data is partitioned based on the key and transmitted to the reduce tasks.

MapReduce guarantees that the input to every reducer is sorted by key. Shuffle is the process of sorting and copying the output of the map tasks to the reducers to be used as input. By setting the value to 1 on the map phase, we can easily calculate the total in the reduce phase. Reducers receive input sorted by key, aggregate counters, and store results.

In the following diagram, every green block represents an INFO message, every yellow block a WARNING message, and every red block an ERROR message:

Implementing the preceding MapReduce algorithm in Java requires the following three classes:

  • A Map class to map lines into <key,value> pairs; for example, <"INFO",1>
  • A Reduce class to aggregate counters
  • A Job configuration class to define input and output types for all <key,value> pairs and the input and output files
主站蜘蛛池模板: 溧水县| 肇州县| 天津市| 昌图县| 定西市| 蓬莱市| 乌拉特前旗| 永修县| 陆川县| 保德县| 韶山市| 富平县| 常州市| 永兴县| 嘉峪关市| 无棣县| 布拖县| 大宁县| 公主岭市| 聂荣县| 巴彦县| 朝阳县| 鲜城| 体育| 共和县| 德昌县| 博兴县| 仙游县| 海阳市| 阿荣旗| 日土县| 施甸县| 禹城市| 临泉县| 米林县| 宁化县| 油尖旺区| 甘谷县| 讷河市| 克拉玛依市| 教育|