- Practical Big Data Analytics
- Nataraj Dasgupta
- 2021-07-02 19:26:27
Block size and number of mappers and reducers
An important consideration in the MapReduce process is the HDFS block size, that is, the size of the chunks into which files are split. A MapReduce job that needs to access a certain file performs the map operation on each block of that file. For example, a 512 MB file with a 128 MB block size is stored as four blocks. A MapReduce operation on that file will therefore require at least four map tasks, one for each of the four blocks.
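The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration: the function name `num_blocks` is made up here, and it assumes one map task per block, which matches the description above but ignores any custom input split settings a real job might use.

```python
MB = 1024 * 1024  # bytes in one megabyte


def num_blocks(file_size_bytes: int, block_size_bytes: int) -> int:
    """Return how many blocks are needed to store a file.

    Assumption for this sketch: one map task per block, so the result
    is also the minimum number of map tasks for the file.
    """
    if block_size_bytes <= 0:
        raise ValueError("block size must be positive")
    # Integer ceiling division: a partially filled last block still counts.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


if __name__ == "__main__":
    # The example from the text: a 512 MB file and a 128 MB block size.
    print(num_blocks(512 * MB, 128 * MB))  # 4
    # One extra megabyte spills into a fifth block.
    print(num_blocks(513 * MB, 128 * MB))  # 5
```

Note that the last block of a file is usually only partly filled, which is why the calculation rounds up.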
If the file were very large, however, and required, say, 10,000 blocks to store, we would need 10,000 map operations. With only 10 servers, we would have to send 1,000 map operations to each server. This can be sub-optimal, because each map task carries its own disk I/O and resource allocation costs, and those costs add up to a high penalty.
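To make the load on each server concrete, the following sketch works through the same numbers. The function names and the `concurrent_maps_per_server` parameter are hypothetical: the text does not say how many map tasks a server runs at once, so the value 8 below is purely an assumed figure for illustration.

```python
def maps_per_server(total_maps: int, servers: int) -> int:
    """Map tasks each server must run if the work is spread evenly (rounded up)."""
    return (total_maps + servers - 1) // servers


def waves_per_server(total_maps: int, servers: int,
                     concurrent_maps_per_server: int) -> int:
    """Number of sequential rounds ("waves") of map tasks on each server.

    concurrent_maps_per_server is an assumed capacity, not a value
    taken from the text or from any Hadoop default.
    """
    per_server = maps_per_server(total_maps, servers)
    return (per_server + concurrent_maps_per_server - 1) // concurrent_maps_per_server


if __name__ == "__main__":
    # The example from the text: 10,000 blocks on 10 servers.
    print(maps_per_server(10_000, 10))      # 1000
    # Assuming each server can run 8 map tasks at a time.
    print(waves_per_server(10_000, 10, 8))  # 125
```

Every one of those rounds pays the per-task startup and I/O cost again, which is the penalty the paragraph above describes.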
The number of reducers required is summarized very elegantly on the Hadoop Wiki (https://wiki.apache.org/hadoop/HowManyMapsAndReduces):
The ideal reducers should be the optimal value that gets them closest to:
* A multiple of the block size
* A task time between 5 and 15 minutes
* Creates the fewest files possible
Anything other than that means there is a good chance your reducers are less than great. There is a tremendous tendency for users to use a REALLY high value ("More parallelism means faster!") or a REALLY low value ("I don't want to blow my namespace quota!"). Both are equally dangerous, resulting in one or more of:
* Terrible performance on the next phase of the workflow
* Terrible performance due to the shuffle
* Terrible overall performance because you've overloaded the namenode with objects that are ultimately useless
* Destroying disk IO for no really sane reason
* Lots of network transfers due to dealing with crazy amounts of CFIF/MFIF work
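As a rough illustration of the first and third criteria quoted above, the sketch below picks a reducer count so that each reducer writes roughly a whole number of blocks. It is not a Hadoop formula: `suggest_reducers`, `blocks_per_file`, and the 64 GB output estimate are all assumptions made for this example, and the task-time criterion is left out because it can only be checked by measuring a real job.

```python
MB = 1024 * 1024
GB = 1024 * MB


def suggest_reducers(estimated_output_bytes: int,
                     block_size_bytes: int,
                     blocks_per_file: int = 1) -> int:
    """Suggest a reducer count so each output file spans about blocks_per_file blocks.

    Hypothetical heuristic for illustration only. A larger blocks_per_file
    means fewer, larger output files (fewer objects for the namenode);
    a smaller value means more parallelism.
    """
    target_file_bytes = block_size_bytes * blocks_per_file
    # Round up so no reducer has to write more than the target size,
    # and always use at least one reducer.
    reducers = (estimated_output_bytes + target_file_bytes - 1) // target_file_bytes
    return max(1, reducers)


if __name__ == "__main__":
    # Assumed figures: 64 GB of reducer output, 128 MB blocks, 4 blocks per file.
    print(suggest_reducers(64 * GB, 128 * MB, blocks_per_file=4))  # 128
```

A count chosen this way is only a starting point. If the resulting reduce tasks run far outside the 5 to 15 minute range, the value would need adjusting.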