- PySpark Cookbook
- Denny Lee, Tomasz Drabas
How to do it...
Once you start the PySpark shell from the bash terminal (or run the same query within a Jupyter notebook), execute the following query:
myRDD = (
    sc
    .textFile(
        '~/data/flights/airport-codes-na.txt',
        minPartitions=4,
        use_unicode=True,
    )
    .map(lambda element: element.split("\t"))
)
If you are running on Databricks, the same file is already included in the /databricks-datasets folder; the command is:
myRDD = sc.textFile('/databricks-datasets/flights/airport-codes-na.txt').map(lambda element: element.split("\t"))
Now run the following action to retrieve the first five rows:
myRDD.take(5)
The resulting output is:
Out[22]: [[u'City', u'State', u'Country', u'IATA'], [u'Abbotsford', u'BC', u'Canada', u'YXX'], [u'Aberdeen', u'SD', u'USA', u'ABR'], [u'Abilene', u'TX', u'USA', u'ABI'], [u'Akron', u'OH', u'USA', u'CAK']]
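The transformation in the map step simply splits each tab-delimited line into a list of fields. A minimal plain-Python sketch of that per-line logic (the sample lines below are taken from the output above; no Spark required):

```python
# Sample tab-delimited lines, as they appear in airport-codes-na.txt
lines = [
    "City\tState\tCountry\tIATA",
    "Abbotsford\tBC\tCanada\tYXX",
    "Aberdeen\tSD\tUSA\tABR",
]

# The same function passed to .map() in the recipe
parse = lambda element: element.split("\t")

parsed = [parse(line) for line in lines]
print(parsed[0])  # the header row: ['City', 'State', 'Country', 'IATA']
print(parsed[1])  # ['Abbotsford', 'BC', 'Canada', 'YXX']
```

Note that the header row is part of the RDD's data; if you want records only, you can filter it out, for example with `myRDD.filter(lambda row: row[0] != 'City')`.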
Diving in a little deeper, let's determine the number of rows in this RDD. Note that more information on RDD actions such as count() is included in subsequent recipes:
myRDD.count()
# Output
# Out[37]: 527
Also, let's find out the number of partitions that support this RDD:
myRDD.getNumPartitions()
# Output
# Out[33]: 4
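With 527 rows spread across 4 partitions, each partition holds roughly count / numPartitions rows. The sketch below is only an illustration of that arithmetic: Spark actually splits a text file by byte ranges rather than row counts, so real partition sizes will vary:

```python
count, num_partitions = 527, 4

# Rough per-partition row counts if rows were divided as evenly as possible
base, remainder = divmod(count, num_partitions)
sizes = [base + (1 if i < remainder else 0) for i in range(num_partitions)]
print(sizes)  # [132, 132, 132, 131]
```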