.zipWithIndex() transformation

The zipWithIndex() transformation pairs (zips) each element of the RDD with its zero-based index, producing (element, index) tuples. This is handy when you want to remove the header row (first row) of a file.
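To make the shape of the result concrete, here is a minimal plain-Python sketch of the same idea. It is only a local analogue, not Spark itself: the helper name zip_with_index and the sample_rows list are hypothetical, and a real RDD computes its indices across partitions rather than over a single list.

```python
def zip_with_index(rows):
    """Local stand-in for RDD.zipWithIndex(): pair each element with its 0-based position."""
    # enumerate yields (idx, row); swap the order to match Spark's (row, idx) layout
    return [(row, idx) for idx, row in enumerate(rows)]


# Hypothetical sample data, mirroring the (City, IATA) pairs used in this section
sample_rows = [("City", "IATA"), ("Abbotsford", "YXX"), ("Aberdeen", "ABR")]

indexed = zip_with_index(sample_rows)
for pair in indexed:
    print(pair)
```

Each element keeps its original value and gains its position as the second item of the tuple, which is what makes the header row (index 0) easy to identify.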

Look at the following code snippet:

# View each row within RDD + the index 
# i.e. output is in form ([row], idx)
ac = airports.map(lambda c: (c[0], c[3]))
ac.zipWithIndex().take(5)

This produces the following result:

# Output
[((u'City', u'IATA'), 0),
((u'Abbotsford', u'YXX'), 1),
((u'Aberdeen', u'ABR'), 2),
((u'Abilene', u'ABI'), 3),
((u'Akron', u'CAK'), 4)]

To remove the header from your data, you can use the following code:

# Using zipWithIndex to skip header row
# - filter out row 0
# - extract only row info
(
    ac
    .zipWithIndex()
    .filter(lambda pair: pair[1] > 0)  # pair is (row, idx); drop idx 0
    .map(lambda pair: pair[0])         # keep only the row
    .take(5)
)

The preceding code skips the header, as shown in the following output:

# Output
[(u'Abbotsford', u'YXX'),
(u'Aberdeen', u'ABR'),
(u'Abilene', u'ABI'),
(u'Akron', u'CAK'),
(u'Alamosa', u'ALS')]
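The filter-then-map pattern can also be tried without a cluster. The following is a rough local sketch using Python built-ins; the function drop_header and the sample_rows data are hypothetical stand-ins for the RDD pipeline, not part of the Spark API. It indexes into each (row, idx) pair instead of unpacking the tuple in the lambda's parameter list, a form that works in both Python 2 and Python 3.

```python
def drop_header(rows):
    """Remove the first row by indexing, filtering on the index, then discarding it."""
    # Step 1: attach a 0-based index to every row, as zipWithIndex() would
    indexed = [(row, idx) for idx, row in enumerate(rows)]
    # Step 2: keep only pairs whose index is greater than 0 (everything but the header)
    kept = filter(lambda pair: pair[1] > 0, indexed)
    # Step 3: strip the index again so only the row data remains
    return list(map(lambda pair: pair[0], kept))


# Hypothetical sample data with a header row in first position
sample_rows = [("City", "IATA"), ("Abbotsford", "YXX"), ("Aberdeen", "ABR")]

without_header = drop_header(sample_rows)
print(without_header)
```

Filtering on the index rather than on the header's content avoids accidentally dropping a data row that happens to equal the header text.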