
  • Learning Apache Spark 2
  • Muhammad Asif Abbasi

Commonly supported file systems

Until now we have mostly focused on the functional aspects of Spark and have therefore kept the discussion of supported filesystems to a minimum. You may have seen a couple of examples using HDFS, but the primary focus has been on local file systems. In production environments, however, it is extremely rare to work with a local filesystem; chances are you will be working with distributed file systems such as HDFS or Amazon S3.

Working with HDFS

Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable filesystem written in Java for the Hadoop framework. HDFS provides the ability to store large amounts of data across commodity hardware, and many companies are already storing massive amounts of data on HDFS by moving it off their traditional database systems and building data lakes on Hadoop. Spark allows you to read data from HDFS in much the same way you would read from a typical filesystem; the only difference is that you point to the NameNode host and the HDFS port.

If you are running Spark on YARN inside a Hadoop cluster, you might not even need to specify the NameNode and port, as any path you pass will default to HDFS.
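For example, assuming the cluster's default filesystem is HDFS, a scheme-less path resolves straight to HDFS (a minimal sketch, reusing the sample file location from the example below):

    // On YARN, a path without a scheme is resolved against the
    // cluster's default filesystem, typically HDFS:
    val data = sc.textFile("/spark/samples/productsales.csv")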

Most of the methods that we have seen previously can be used with HDFS. The path to be specified for HDFS is as follows:

hdfs://master:port/filepath

As an example, we have the following settings for our Hadoop cluster:

  • NameNode: hadoopmaster.packtpub.com
  • HDFS Port: 8020
  • File Location: /spark/samples/productsales.csv

The path that you need to specify would be as follows:

hdfs://hadoopmaster.packtpub.com:8020/spark/samples/productsales.csv
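Reading the file then looks exactly like reading from a local filesystem; only the path changes. A minimal sketch using the settings above:

    // Load the sample CSV from HDFS and trigger an action to verify access
    val sales = sc.textFile(
      "hdfs://hadoopmaster.packtpub.com:8020/spark/samples/productsales.csv")
    println(sales.count())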

Working with Amazon S3

S3 stands for Simple Storage Service, an online storage service provided by Amazon Web Services. As of 2013, Amazon S3 was reported to store more than 2 trillion objects. The core design principles of S3 are scalability, high availability, low latency, and low cost. Notable users of S3 include Netflix, Reddit, Dropbox, Mojang (the creators of Minecraft), Tumblr, and Pinterest.

S3 provides excellent speed when your cluster runs inside Amazon EC2, but performance can be a nightmare if you are accessing large amounts of data over the public Internet. Accessing S3 data is relatively straightforward: you pass a path starting with s3n:// to Spark's file input methods.

However, before reading from S3, you do need to supply your AWS credentials: either set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables, set the corresponding Hadoop configuration properties, or pass the keys as part of your path:

  • Configuring the parameters:
      sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "myaccessKeyID")
      sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "mySecretAccessKey")
      val data = sc.textFile("s3n://bucket/fileLocation")
  • Passing the Access Key ID and Secret Key as part of the path:
      val data = sc.textFile("s3n://MyAccessKeyID:MySecretKey@svr/fileloc")

Having looked at the most common file systems, let's turn our attention to Spark's ability to interact with common databases and structured sources. We have already highlighted Spark's ability to fetch data from CSV and TSV files and load it into DataFrames; however, it is about time we discussed Spark's interaction with databases, a topic covered in much more detail in Chapter 4, Spark SQL.
