
Working with big data

What happens when the dataset in question is so vast that it cannot fit into the memory of a single computer and must be distributed across a number of nodes in a large computing cluster? Can't we just rewrite some R code, for example, and extend it to run on more than a single node? If only things were that simple! There are many reasons why scaling an algorithm to more machines is difficult. Imagine a simple example of a file containing a list of names:

B
D
X
A
D
A

We would like to compute the number of occurrences of each individual name in the file. If the file fits on a single machine, you can easily compute the counts using a combination of the Unix tools sort and uniq:

bash> sort file | uniq -c

The output is as follows:

2 A
1 B
2 D
1 X

However, if the file is huge and distributed over multiple machines, a slightly different computation strategy is needed: for example, compute the number of occurrences of individual names for each part of the file that fits into memory, and then merge the partial results together. Hence, even a simple task such as counting name occurrences becomes more complicated in a distributed environment. The sketch below illustrates this count-and-merge idea.
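
The following is a minimal sketch of the count-and-merge strategy in plain Scala; no cluster is involved, and the object name, method names, and the two hand-made partitions are purely illustrative. Each partition is counted independently, and the partial maps are then merged by summing counts per name, which is essentially what distributed frameworks automate at scale:

// A sketch only: the two partitions stand in for parts of a file
// that would live on different machines in a real cluster.
object DistributedNameCount {
  // Count occurrences of each name within a single partition.
  def countPartition(names: Seq[String]): Map[String, Int] =
    names.groupBy(identity).map { case (name, occurrences) => name -> occurrences.size }

  // Merge two partial count maps by summing the counts per name.
  def mergeCounts(left: Map[String, Int], right: Map[String, Int]): Map[String, Int] =
    (left.keySet ++ right.keySet).map { name =>
      name -> (left.getOrElse(name, 0) + right.getOrElse(name, 0))
    }.toMap

  def main(args: Array[String]): Unit = {
    // Pretend the file is split into two parts held by different machines.
    val partitions = Seq(Seq("B", "D", "X"), Seq("A", "D", "A"))

    // Each machine counts its own part...
    val partialCounts = partitions.map(countPartition)

    // ...and the partial results are merged into the final answer.
    val totals = partialCounts.reduce(mergeCounts)

    // Prints: 2 A, 1 B, 2 D, 1 X (one per line), matching the sort | uniq -c output.
    totals.toSeq.sortBy(_._1).foreach { case (name, count) => println(s"$count $name") }
  }
}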
