Book: PySpark Cookbook | Authors: Denny Lee, Tomasz Drabas | Updated: 2021-06-18 19:06:40
Getting ready
This recipe reads a tab-delimited (or comma-delimited) file, so make sure you have a text (or CSV) file available. For your convenience, you can download the airport-codes-na.txt and departuredelays.csv files from http://bit.ly/2nroHbh. Ensure that your local Spark cluster can access these files (for example, ~/data/flights/airport-codes-na.txt).
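Before starting Spark, it can help to confirm that both files are where the recipe expects them. The following is a minimal sketch, assuming the files were saved under ~/data/flights as in the path quoted above; `missing_files` is a hypothetical helper, not part of PySpark, and it only checks the machine it runs on, not every worker in a cluster.

```python
import os

# Assumed location, taken from the path quoted in the recipe text
DATA_DIR = os.path.expanduser("~/data/flights")
EXPECTED_FILES = ["airport-codes-na.txt", "departuredelays.csv"]


def missing_files(data_dir, names):
    """Return the names that are not regular files inside data_dir."""
    return [n for n in names if not os.path.isfile(os.path.join(data_dir, n))]


missing = missing_files(DATA_DIR, EXPECTED_FILES)
if missing:
    print("Missing from", DATA_DIR + ":", ", ".join(missing))
else:
    print("Both data files found in", DATA_DIR)
```

If anything is reported as missing, download it again or adjust `DATA_DIR` to wherever the files were actually saved.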
If you are running Databricks, the same file is already included in the /databricks-datasets folder; the command is:
myRDD = sc.textFile('/databricks-datasets/flights/airport-codes-na.txt').map(lambda line: line.split("\t"))
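To see what the `.map(lambda line: line.split("\t"))` step does to each line, the same split can be tried in plain Python without a Spark cluster. This is only an illustrative sketch: the two sample lines are typed in by hand in the layout of airport-codes-na.txt, `split_line` is a hypothetical name, and the built-in `map` stands in for the RDD's `map`.

```python
# Two sample lines in the tab-delimited layout of airport-codes-na.txt
sample_lines = [
    "City\tState\tCountry\tIATA",
    "Abbotsford\tBC\tCanada\tYXX",
]


def split_line(line):
    """Split one tab-delimited line into a list of fields."""
    return line.split("\t")


# The built-in map applies the function to every line, as RDD.map does per element
rows = list(map(split_line, sample_lines))
print(rows)
# [['City', 'State', 'Country', 'IATA'], ['Abbotsford', 'BC', 'Canada', 'YXX']]
```

Each line becomes a list of string fields, which is the shape of the elements in the resulting RDD.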
Many of the transformations in the next section will use the RDDs airports or flights; let's set them up by using the following code snippet:
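import os

# Set up the RDD: airports
# Expand ~ explicitly; Spark does not resolve it to the home directory
airports = (
    sc
    .textFile(os.path.expanduser('~/data/flights/airport-codes-na.txt'))
    .map(lambda element: element.split("\t"))
)
airports.take(5)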
# Output
[[u'City', u'State', u'Country', u'IATA'],
[u'Abbotsford', u'BC', u'Canada', u'YXX'],
[u'Aberdeen', u'SD', u'USA', u'ABR'],
[u'Abilene', u'TX', u'USA', u'ABI'],
[u'Akron', u'OH', u'USA', u'CAK']]
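import os

# Set up the RDD: flights
# Expand ~ explicitly; Spark does not resolve it to the home directory
flights = (
    sc
    .textFile(os.path.expanduser('~/data/flights/departuredelays.csv'), minPartitions=8)
    .map(lambda line: line.split(","))
)
flights.take(5)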
# Output
[[u'date', u'delay', u'distance', u'origin', u'destination'],
[u'01011245', u'6', u'602', u'ABE', u'ATL'],
[u'01020600', u'-8', u'369', u'ABE', u'DTW'],
[u'01021245', u'-2', u'602', u'ABE', u'ATL'],
[u'01020605', u'-4', u'602', u'ABE', u'ATL']]
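Two things stand out in the output above: the first element is the header row, and every field is still a string. The sketch below shows, in plain Python on rows copied from that output, one way to separate the header and cast the numeric fields. `parse_flight` is a hypothetical helper introduced here for illustration, and keeping `date` as a string is an assumption made to preserve its leading zero.

```python
# First rows of departuredelays.csv, copied from the flights.take(5) output above
rows = [
    ['date', 'delay', 'distance', 'origin', 'destination'],
    ['01011245', '6', '602', 'ABE', 'ATL'],
    ['01020600', '-8', '369', 'ABE', 'DTW'],
]

# The first row holds the column names; the rest are data records
header, records = rows[0], rows[1:]


def parse_flight(fields):
    """Cast delay and distance to int; keep the other fields as strings."""
    date, delay, distance, origin, destination = fields
    return (date, int(delay), int(distance), origin, destination)


parsed = [parse_flight(r) for r in records]
print(header)
print(parsed)
```

On the RDD itself, the same per-row function could be passed to `map` once the header row has been filtered out.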