官术网_书友最值得收藏!

MapReduce abstractions

This simple MapReduce example requires more than 50 lines of Java code (mostly because of infrastructure and boilerplate code). In SQL, a similar implementation would just require the following:

SELECT level, count(*) FROM table GROUP BY level

Hive is a technology originating from Facebook that translates SQL commands, such as the preceding one, into sets of map and reduce phases. SQL offers convenient ubiquity, and it is known by almost everyone.

However, SQL is declarative and expresses the logic of a computation without describing its control flow. So, there are use cases that will be unusual to implement in SQL, and some problems are too complex to be expressed in relational algebra. For example, SQL handles joins naturally, but it has no built-in mechanism for splitting data into streams and applying different operations to each substream.

Pig is a technology originating from Yahoo that offers a relational data-flow language. It is procedural, supports splits, and provides useful operators for joining and grouping data. Code can be inserted anywhere in the data flow and is appealing because it is easy to read and learn.

However, Pig is a purpose-built language; it excels at simple data flows, but it is inefficient for implementing non-trivial algorithms.

In Pig, the same example can be implemented as follows:

LogLine    = load 'file.logs' as (level, message);
LevelGroup = group LogLine by level;
Result     = foreach LevelGroup generate group, COUNT(LogLine);
store Result into 'Results.txt';

Both Pig and Hive support extra functionality through loadable user-defined functions (UDF) implemented in Java classes.

Cascading is implemented in Java and designed to be expressive and extensible. It is based on the design pattern of pipelines that many other technologies follow. The pipeline is inspired from the original chain of responsibility design pattern and allows ordered lists of actions to be executed. It provides a Java-based API for data-processing flows.

Developers with functional programming backgrounds quickly introduced new domain specific languages that leverage its capabilities. Scalding, Cascalog, and PyCascading are popular implementations on top of Cascading, which are implemented in programming languages such as Scala, Clojure, and Python.

主站蜘蛛池模板: 定南县| 江源县| 正阳县| 宜章县| 古田县| 大安市| 吉首市| 朔州市| 莱西市| 洱源县| 古交市| 贵港市| 马龙县| 治多县| 泰州市| 龙泉市| 丽江市| 察雅县| 咸阳市| 册亨县| 天镇县| 沙田区| 体育| 尼勒克县| 手机| 威信县| 呈贡县| 洮南市| 四平市| 上杭县| 利津县| 潼关县| 灵石县| 孟津县| 佛坪县| 海城市| 彩票| 洛川县| 塔河县| 开平市| 上栗县|