官术网_书友最值得收藏!

MapReduce in the mongo shell

One of the most interesting features, which has been underappreciated and not widely supported throughout MongoDB history, is the ability to write MapReduce natively using the shell.

MapReduce is a data processing method for getting aggregation results from large sets of data. The main advantage is that it is inherently parallelizable as evidenced by frameworks such as Hadoop.

MapReduce is really useful when used to implement a data pipeline. Multiple MapReduce commands can be chained to produce different results. An example would be aggregating data by different reporting periods (hour, day, week, month, year) where we use the output of each more granular reporting period to produce a less granular report.

A simple example of MapReduce in our examples would be as follows, given that our input books collection is as follows:

> db.books.find()
{ "_id" : ObjectId("592149c4aabac953a3a1e31e"), "isbn" : "101", "name" : "Mastering MongoDB", "price" : 30 }
{ "_id" : ObjectId("59214bc1aabac954263b24e0"), "isbn" : "102", "name" : "MongoDB in 7 years", "price" : 50 }
{ "_id" : ObjectId("59214bc1aabac954263b24e1"), "isbn" : "103", "name" : "MongoDB for experts", "price" : 40 }

And our map and reduce functions are defined as follows:

> var mapper = function() {
emit(this.id, 1);
};

In this mapper, we simply output a key of the id of each document with a value of 1:

> var reducer = function(id, count) {
return Array.sum(count);
};

In the reducer, we sum across all values (where each one has a value of 1):

> db.books.mapReduce(mapper, reducer, { out:"books_count" });
{
"result" : "books_count",
"timeMillis" : 16613,
"counts" : {
"input" : 3,
"emit" : 3,
"reduce" : 1,
"output" : 1
},
"ok" : 1
}
> db.books_count.find()
{ "_id" : null, "value" : 3 }
>

Our final output is a document with no ID, since we didn't output any value for id, and a value of 6, since there are six documents in the input dataset.

Using MapReduce, MongoDB will apply map to each input document, emitting key-value pairs at the end of the map phase. Then each reducer will get  key-value pairs with the same key as input, processing all multiple values. The reducer's output will be a single key-value pair for each key.

Optionally, we can use a finalize function to further process the results of the mapper and reducer. MapReduce functions use JavaScript and run within the mongod process. MapReduce can output inline as a single document, subject to the 16 MB document size limit, or as multiple documents in an output collection. Input and output collections can be sharded.

主站蜘蛛池模板: 和田县| 贵州省| 吉木萨尔县| 邵阳县| 遂昌县| 河津市| 合山市| 福州市| 四子王旗| 孟州市| 正阳县| 家居| 嵊泗县| 贵定县| 北安市| 新兴县| 德江县| 班戈县| 土默特左旗| 曲水县| 东辽县| 南丰县| 措美县| 当雄县| 曲靖市| 易门县| 成安县| 沁水县| 汝南县| 平阳县| 滨海县| 南宁市| 临高县| 庄浪县| 绵阳市| 湖口县| 长春市| 昔阳县| 德江县| 如东县| 清原|