Batch, real-time, and stream processing

Batch processing handles data in bulk: a job reads its input, processes it, and writes the results to its output. Apache Hadoop is the best-known open source implementation of batch processing, and it is a distributed system built on the MapReduce paradigm. The data is stored in a shared, distributed filesystem called the Hadoop Distributed File System (HDFS) and is divided into splits, which are the logical units of data for MapReduce processing. To process these splits, each map task reads a split, passes every key/value pair in it to a map function, and writes the results to intermediate files. After the map phase is completed, the reduce tasks read the intermediate files and pass their contents to the reduce function. Finally, each reduce task writes its results to the final output files. The advantages of the MapReduce model include simpler distributed programming, near-linear speedup, good scalability, and fault tolerance. The disadvantage of this batch processing model is that it handles recursive or iterative jobs poorly, because each iteration has to run as a separate job that rereads its input from disk. In addition, the batch behavior is explicit: the map phase must finish over all of the input before the reduce function can run, which makes MapReduce unsuitable for online and stream processing use cases.
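The map, intermediate, and reduce steps described above can be sketched in a few lines of plain Python. This is only an illustrative single-process model, not Hadoop's API: the names `map_fn`, `reduce_fn`, and `run_job` are hypothetical, the splits are small in-memory lists, and a dictionary stands in for the intermediate files.

```python
from collections import defaultdict


def map_fn(record):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def reduce_fn(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def run_job(splits):
    """Run a toy word count over a list of splits (lists of records)."""
    # Map phase: every split is processed independently. The dictionary
    # is an assumed stand-in for the intermediate files on disk.
    intermediate = defaultdict(list)
    for split in splits:
        for record in split:
            for key, value in map_fn(record):
                intermediate[key].append(value)

    # Reduce phase: runs only after every split has been mapped, which
    # is the batch behavior the text describes.
    return dict(reduce_fn(key, values) for key, values in intermediate.items())


# Two hypothetical splits of a small input file.
splits = [
    ["hive on hadoop", "hadoop stores data"],
    ["data in hdfs"],
]
result = run_job(splits)
print(result)
```

Because the reduce step needs the complete list of values for each key, no final count can be produced until the last split has been mapped. In a real cluster the map calls would run in parallel on different nodes, which this sketch does not attempt to model.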

Real-time processing means processing data and getting the result almost immediately. For real-time ad hoc queries over big data, this concept was first implemented by Google in Dremel. Dremel uses a novel columnar storage format for nested structures, with fast indexing, and scalable aggregation algorithms that compute query results in parallel rather than as a sequence of batch jobs. These two techniques are the defining characteristics of real-time processing and are used by similar implementations, such as Cloudera Impala, Facebook Presto, Apache Drill, and Hive on Tez, which came out of the Stinger initiative and its goal of a 100x performance improvement for Apache Hive. In-memory computing offers another route to real-time processing. Memory provides very high bandwidth, more than 10 gigabytes/second compared to about 200 megabytes/second for hard disks. Its latency is also far lower: nanoseconds rather than the milliseconds of a hard disk. As the price of RAM keeps falling, in-memory computing becomes more affordable as a real-time solution. Apache Spark is a popular open source implementation of in-memory computing. Spark integrates easily with Hadoop, and its resilient distributed datasets (RDDs) can be generated from data sources such as HDFS and HBase and cached for efficient reuse.
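To put the bandwidth figures above in perspective, the following sketch estimates how long one sequential pass over a dataset would take at each rate. The 1 TB dataset size and the `scan_seconds` helper are assumptions made for illustration, decimal units are used throughout, and the calculation ignores latency, parallelism, and caching.

```python
def scan_seconds(dataset_bytes, bandwidth_bytes_per_sec):
    """Time for one sequential pass over the data at a fixed bandwidth."""
    return dataset_bytes / bandwidth_bytes_per_sec


DATASET_BYTES = 10**12        # assumed dataset size: 1 TB
DISK_BANDWIDTH = 200 * 10**6  # 200 megabytes/second, from the text
RAM_BANDWIDTH = 10 * 10**9    # 10 gigabytes/second, from the text

disk_time = scan_seconds(DATASET_BYTES, DISK_BANDWIDTH)
ram_time = scan_seconds(DATASET_BYTES, RAM_BANDWIDTH)
speedup = disk_time / ram_time

print(f"disk: {disk_time:.0f} s, memory: {ram_time:.0f} s, ratio: {speedup:.0f}x")
```

Under these assumptions a single disk needs 5,000 seconds (over 80 minutes) for the pass, while memory needs 100 seconds, a 50x difference from bandwidth alone.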

Stream processing means continuously processing and acting on live data as it arrives. In stream processing, there are two popular frameworks: Storm (https://storm.apache.org/) from Twitter and S4 (http://incubator.apache.org/s4/) from Yahoo!. Both frameworks run on the Java Virtual Machine (JVM) and both process keyed streams. In terms of the programming model, a Storm program is a topology of spouts (stream sources) and bolts (processing steps) that the developer wires together, whereas an S4 program is defined as a graph of Processing Elements (PEs), which are small subprograms, and S4 instantiates one PE per key. In short, Storm gives you the basic tools to build a framework, while S4 gives you a well-defined framework.
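The idea of instantiating one PE per key can be sketched in plain Python. This is not the S4 or Storm API (both are JVM frameworks); `CountingPE`, `KeyedDispatcher`, and `event_stream` are hypothetical names, and a short generator stands in for a live stream.

```python
class CountingPE:
    """A small subprogram holding the running state for exactly one key."""

    def __init__(self, key):
        self.key = key
        self.count = 0

    def on_event(self, value):
        # Update the state and emit a result immediately, without
        # waiting for the stream to end.
        self.count += value
        return self.key, self.count


class KeyedDispatcher:
    """Route each event to the PE for its key, creating PEs on demand."""

    def __init__(self, pe_factory):
        self.pe_factory = pe_factory
        self.instances = {}

    def dispatch(self, key, value):
        pe = self.instances.get(key)
        if pe is None:
            # First event seen for this key: instantiate a dedicated PE.
            pe = self.instances[key] = self.pe_factory(key)
        return pe.on_event(value)


def event_stream():
    """Assumed stand-in for a live, unbounded stream of keyed events."""
    yield from [("click", 1), ("view", 1), ("click", 1)]


dispatcher = KeyedDispatcher(CountingPE)
updates = [dispatcher.dispatch(key, value) for key, value in event_stream()]
print(updates)
```

Unlike the batch model, every incoming event produces an updated result right away, and the state for each key lives in its own PE instance.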
