書名： PySpark Cookbook
作者名： Denny Lee Tomasz Drabas
本章字?jǐn)?shù)： 104字
更新時(shí)間： 2021-06-18 19:06:39

.reduceByKey(...) transformation

The reduceByKey(f) transformation reduces the elements of the RDD using f by the key. The f function should be commutative and associative so that it can be computed correctly in parallel.

Look at the following code snippet:

# Determine delays by originating city
# - remove header row via zipWithIndex() 
#   and map() 
(
    flights
    .zipWithIndex()
    .filter(lambda (row, idx): idx > 0)
    .map(lambda (row, idx): row)
    .map(lambda c: (c[3], int(c[1])))
    .reduceByKey(lambda x, y: x + y)
    .take(5)
)

This will generate the following output:

# Output
[(u'JFK', 387929),  
 (u'MIA', 169373),  
 (u'LIH', -646),  
 (u'LIT', 34489),  
 (u'RDM', 3445)]

官术网_书友最值得收藏!

PySpark Cookbook

.reduceByKey(...) transformation