
  • PySpark Cookbook
  • Denny Lee Tomasz Drabas

.textFile(...) method

To read the file, we are using SparkContext's textFile() method via this command:

(
    sc
    .textFile(
        '~/data/flights/airport-codes-na.txt',
        minPartitions=4,
        use_unicode=True
    )
)

Only the first parameter is required: the location of the text file, here ~/data/flights/airport-codes-na.txt. There are two optional parameters as well:

  • minPartitions: Indicates the minimum number of partitions that make up the RDD. The Spark engine can often determine the best number of partitions based on the file size, but you may want to change the number of partitions for performance reasons and, hence, the ability to specify the minimum number.
  • use_unicode: Set this parameter to True (the default) to decode each line as Unicode; set it to False to read raw byte strings, which can be faster if you do not need Unicode handling.

Note that if you were to execute this statement without a subsequent map() function, the resulting RDD would not split on the tab delimiter; it would simply be an RDD of strings, one per line:

myRDD = sc.textFile('~/data/flights/airport-codes-na.txt')
myRDD.take(5)

# Out[35]: [u'City\tState\tCountry\tIATA', u'Abbotsford\tBC\tCanada\tYXX', u'Aberdeen\tSD\tUSA\tABR', u'Abilene\tTX\tUSA\tABI', u'Akron\tOH\tUSA\tCAK']
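Since the book's data file may not be at hand, here is a minimal, Spark-free sketch of what the subsequent map() step does: splitting each tab-delimited line into a list of fields. The sample lines are copied from the take(5) output above; in PySpark itself this would be myRDD.map(lambda line: line.split('\t')).

```python
# Sample lines mirroring the take(5) output shown above.
lines = [
    u'City\tState\tCountry\tIATA',
    u'Abbotsford\tBC\tCanada\tYXX',
    u'Aberdeen\tSD\tUSA\tABR',
]

# Equivalent of myRDD.map(lambda line: line.split('\t')),
# applied eagerly to a plain Python list instead of an RDD.
rows = [line.split('\t') for line in lines]

print(rows[0])  # ['City', 'State', 'Country', 'IATA']
print(rows[1])  # ['Abbotsford', 'BC', 'Canada', 'YXX']
```

The first element is the header row, which is why recipes that follow typically filter it out before further processing.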