  • Mastering Hadoop
  • Sandeep Karanth
  • 226 words
  • 2021-08-06 19:53:00

The RecordReader class

Unlike InputSplit, the RecordReader class presents a record view of the data to the Map task. A RecordReader works within each InputSplit and generates records from the data in the form of key-value pairs. The InputSplit boundary is a guideline for the RecordReader and is not enforced. At one extreme, a custom RecordReader class can be written to read an entire file (though this is not encouraged). Most often, a RecordReader has to read from the subsequent InputSplit to present a complete record to the Map task. This happens when a record straddles an InputSplit boundary.
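The boundary handling described above can be illustrated with a small simulation. This is a minimal Python sketch, not Hadoop's Java API: `read_split` is a hypothetical helper, and it assumes newline-delimited records and the convention commonly used by line-oriented readers, where a reader skips the first line of its split unless the split starts at offset 0, and reads past the end of its split to finish the last record it started.

```python
def read_split(data: bytes, start: int, length: int):
    """Yield (offset, record) pairs owned by the split [start, start + length).

    Hypothetical helper: `data` stands in for the whole file, and records
    are assumed to be newline-delimited.
    """
    end = start + length
    pos = start
    if start != 0:
        # Skip the first line of the split. It is either the tail of a record
        # that began in the previous split, or a record that begins exactly at
        # the boundary; in both cases the previous split's reader emits it.
        nl = data.find(b"\n", pos)
        pos = len(data) if nl == -1 else nl + 1
    # A record belongs to this split if it begins at or before `end`.
    # The read may run past `end` to reach the record's terminating newline.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        stop = len(data) if nl == -1 else nl
        yield pos, data[pos:stop]
        pos = stop + 1


# Illustrative data: the record RECORD5 occupies bytes 12-18,
# so a split boundary at byte 15 cuts through it.
data = b"R1\nR2\nR3\nR4\nRECORD5\nR6\n"
boundary = 15

first = list(read_split(data, 0, boundary))
second = list(read_split(data, boundary, len(data) - boundary))

for name, records in (("split 1", first), ("split 2", second)):
    print(name, [record.decode() for _, record in records])
```

In this sketch the reader for the first split emits RECORD5 in full even though part of it lies beyond the boundary, and the reader for the second split skips the partial record, so every record is delivered exactly once.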

Bytes from the subsequent InputSplit are read through an FSDataInputStream object. Although this read does not respect data locality, it usually gathers only a few bytes from the next split, so the performance overhead is not significant. When records are very large, however, the byte transfers across nodes can become significant and affect performance.
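A rough calculation shows why record size matters here. The sketch below is illustrative arithmetic only: `bytes_past_split` is a hypothetical helper, the 128 MiB block size is an assumed value, and splits are assumed to align with block boundaries.

```python
def bytes_past_split(record_start: int, record_length: int, split_end: int) -> int:
    """Bytes of a record that lie beyond split_end (0 if the record fits)."""
    return max(0, record_start + record_length - split_end)


MIB = 1024 * 1024
BLOCK = 128 * MIB  # assumed block size; splits assumed to align with blocks

# A 200-byte record that starts 50 bytes before the boundary.
small = bytes_past_split(BLOCK - 50, 200, BLOCK)

# A 64 MiB record that starts 1 MiB before the boundary.
large = bytes_past_split(BLOCK - MIB, 64 * MIB, BLOCK)

print(small)          # 150 bytes come from the next split
print(large // MIB)   # 63 MiB come from the next split
```

Under these assumptions, the small record costs a negligible read from the next split, while the large record pulls most of its bytes from a block that may reside on another node.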

In the following diagram, a file with two HDFS blocks has the record R5 spanning both blocks. It is assumed that the minimum split size is less than the block size. In this case, the RecordReader gathers the complete record by reading the remaining bytes from the next block of data.

File with two blocks and record R5 spanning blocks
