官术网_书友最值得收藏!

.reduce(...) action

The reduce(f) action aggregates the elements of an RDD by f. The f function should be commutative and associative so that it can be computed correctly in parallel. Look at the following code:

# Calculate the total delays of flights
# between SEA (origin) and SFO (dest),
# convert delays column to int
# and summarize
flights\
.filter(lambda c: c[3] == 'SEA' and c[4] == 'SFO')\
.map(lambda c: int(c[1]))\
.reduce(lambda x, y: x + y)

This will produce the following result:

# Output
22293

We need to make an important note here, however. When using reduce(), the reducer function needs to be associative and commutative; that is, a change in the order of elements and operands does not change the result.

Associativity rule: (6 + 3) + 4 = 6 + (3 + 4)
Commutative rule:  6 + 3 + 4 = 4 + 3 + 6

Error can occur if you ignore the aforementioned rules.

As an example, see the following RDD (with one partition only!):

data_reduce = sc.parallelize([1, 2, .5, .1, 5, .2], 1)

Reducing data to pide the current result by the subsequent one, we would expect a value of 10:

works = data_reduce.reduce(lambda x, y: x / y)

Partitioning the data into three partitions will produce an incorrect result:

data_reduce = sc.parallelize([1, 2, .5, .1, 5, .2], 3) data_reduce.reduce(lambda x, y: x / y)

It will produce 0.004.

主站蜘蛛池模板: 白朗县| 二连浩特市| 南充市| 海伦市| 秭归县| 安徽省| 晋中市| 旅游| 玉门市| 建瓯市| 阳新县| 建瓯市| 石家庄市| 贵德县| 淳安县| 房产| 竹山县| 西乌| 安岳县| 延吉市| 望江县| 久治县| 嘉荫县| 海口市| 西峡县| 台东县| 额尔古纳市| 深圳市| 海南省| 东兴市| 尼玛县| 灵石县| 盘锦市| 冷水江市| 洱源县| 朝阳区| 图木舒克市| 洮南市| 苍南县| 嘉黎县| 德安县|