
How to do it...

Once you start the PySpark shell from the bash terminal (or run the same query within a Jupyter notebook), execute the following query:

myRDD = (
    sc
    .textFile(
        '~/data/flights/airport-codes-na.txt',
        minPartitions=4,
        use_unicode=True
    )
    .map(lambda element: element.split("\t"))
)
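The map step splits each tab-delimited line into a list of fields. To see what that transformation does, here is a minimal plain-Python sketch (no Spark needed); the sample lines are drawn from the airport-codes output shown below:

```python
# Plain-Python sketch of the map step: each tab-delimited line
# becomes a list of fields, exactly as element.split("\t") does
# per RDD element.
sample_lines = [
    "City\tState\tCountry\tIATA",
    "Abbotsford\tBC\tCanada\tYXX",
]

rows = [line.split("\t") for line in sample_lines]
print(rows[0])  # ['City', 'State', 'Country', 'IATA']
print(rows[1])  # ['Abbotsford', 'BC', 'Canada', 'YXX']
```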

If you are running on Databricks, the same file is already included in the /databricks-datasets folder; the command is:

myRDD = sc.textFile('/databricks-datasets/flights/airport-codes-na.txt').map(lambda element: element.split("\t"))

When running the query:

myRDD.take(5)

The resulting output is:

Out[22]:  [[u'City', u'State', u'Country', u'IATA'], [u'Abbotsford', u'BC', u'Canada', u'YXX'], [u'Aberdeen', u'SD', u'USA', u'ABR'], [u'Abilene', u'TX', u'USA', u'ABI'], [u'Akron', u'OH', u'USA', u'CAK']]
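Note that the first element is the header row. If you want data rows only, you can filter it out; with an RDD this would be a `filter()` transformation (for example, `myRDD.filter(lambda row: row != header)`). A minimal plain-Python sketch of the same logic, using the rows shown above:

```python
# Sketch of dropping the header row; the RDD equivalent is a
# filter() transformation that excludes the header element.
rows = [
    ['City', 'State', 'Country', 'IATA'],
    ['Abbotsford', 'BC', 'Canada', 'YXX'],
    ['Aberdeen', 'SD', 'USA', 'ABR'],
]

header = rows[0]
data = [row for row in rows if row != header]
print(data[0])  # ['Abbotsford', 'BC', 'Canada', 'YXX']
```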

Diving in a little deeper, let's determine the number of rows in this RDD. Note that more information on RDD actions such as count() is included in subsequent recipes:

myRDD.count()

# Output
# Out[37]: 527

Also, let's find out the number of partitions that support this RDD:

myRDD.getNumPartitions()

# Output
# Out[33]: 4
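The four partitions come from the minPartitions=4 hint passed to textFile(): Spark splits the input so tasks can process the partitions in parallel. As a rough illustration only (Spark actually splits text files by byte ranges, not by row counts, so real partition sizes will differ), here is how 527 rows would divide into four roughly equal chunks:

```python
# Rough illustration of partitioning: divide 527 rows into 4 chunks.
# Spark's real split is byte-based; this only conveys the idea of
# spreading the data across partitions.
num_rows = 527
num_partitions = 4

base, remainder = divmod(num_rows, num_partitions)
sizes = [base + (1 if i < remainder else 0) for i in range(num_partitions)]
print(sizes)       # [132, 132, 132, 131]
print(sum(sizes))  # 527
```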