- PySpark Cookbook
- Denny Lee Tomasz Drabas
- 2021-06-18 19:06:38
.zipWithIndex() transformation
The zipWithIndex() transformation pairs (or zips) each element of the RDD with its index, starting at 0. This is handy when you want to remove the header row (the first row) of a file.
Look at the following code snippet:
# View each row within RDD + the index
# i.e. output is in form ([row], idx)
ac = airports.map(lambda c: (c[0], c[3]))
ac.zipWithIndex().take(5)
This generates the following result:
# Output
[((u'City', u'IATA'), 0),
((u'Abbotsford', u'YXX'), 1),
((u'Aberdeen', u'ABR'), 2),
((u'Abilene', u'ABI'), 3),
((u'Akron', u'CAK'), 4)]
To remove the header from your data, you can use the following code:
# Using zipWithIndex to skip header row
# - filter out row 0
# - extract only row info
# (index into the pair: tuple-unpacking lambdas fail in Python 3)
(
    ac
    .zipWithIndex()
    .filter(lambda pair: pair[1] > 0)
    .map(lambda pair: pair[0])
    .take(5)
)
The preceding code skips the header, as shown here:
# Output
[(u'Abbotsford', u'YXX'),
(u'Aberdeen', u'ABR'),
(u'Abilene', u'ABI'),
(u'Akron', u'CAK'),
(u'Alamosa', u'ALS')]
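The skip-the-header logic can also be sketched in plain Python, without a Spark session, to make each step visible. This is only an analogy under stated assumptions: the rows list is a hypothetical stand-in for the ac RDD, and enumerate() plays the role of zipWithIndex(). Note that enumerate() yields (idx, row), whereas zipWithIndex() yields (row, idx), so the sketch swaps the order to match.

```python
# Hypothetical sample rows standing in for the `ac` RDD (header comes first)
rows = [
    ("City", "IATA"),
    ("Abbotsford", "YXX"),
    ("Aberdeen", "ABR"),
]

# Mimic zipWithIndex(): pair each row with its position as (row, idx)
indexed = [(row, idx) for idx, row in enumerate(rows)]

# Drop the header (index 0), then keep only the row part of each pair
no_header = [row for row, idx in indexed if idx > 0]

print(no_header)
```

On a real RDD the same two steps are expressed with .filter() and .map(), as in the recipe above; the list comprehensions here only illustrate what those steps do to each (row, idx) pair.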