
  • Learning Apache Spark 2
  • Muhammad Asif Abbasi

Commonly supported file systems

Until now we have mostly focused on the functional aspects of Spark and have therefore kept the discussion of supported filesystems to a minimum. You may have seen a couple of examples using HDFS, but the primary focus has been on local file systems. In production environments, however, it is extremely rare to work with a local filesystem; chances are you will be working with distributed file systems such as HDFS or Amazon S3.

Working with HDFS

Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable filesystem written in Java for the Hadoop framework. HDFS provides the ability to store large amounts of data across commodity hardware, and many companies are already storing massive amounts of data on HDFS by moving it off their traditional database systems and building data lakes on Hadoop. Spark allows you to read data from HDFS in much the same way you would read from a typical filesystem; the only difference is that you point to the NameNode host and the HDFS port.

If you are running Spark on YARN inside a Hadoop cluster, you might not even need to specify the NameNode and port, as any path you pass will default to HDFS.
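For example, assuming the cluster's default filesystem is HDFS, a scheme-less path resolves straight to HDFS (a minimal sketch, reusing the sample file location from the example below):

    // On YARN, a path without a scheme is resolved against the
    // cluster's default filesystem, typically HDFS:
    val data = sc.textFile("/spark/samples/productsales.csv")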

Most of the methods that we have seen previously can be used with HDFS. The path to be specified for HDFS is as follows:

hdfs://master:port/filepath

As an example, we have the following settings for our Hadoop cluster:

  • NameNode: hadoopmaster.packtpub.com
  • HDFS Port: 8020
  • File Location: /spark/samples/productsales.csv

The path that you need to specify would be as follows:

hdfs://hadoopmaster.packtpub.com:8020/spark/samples/productsales.csv
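Reading the file then looks exactly like reading from a local filesystem; only the path changes. A minimal sketch using the settings above:

    // Load the sample CSV from HDFS and trigger an action to verify access
    val sales = sc.textFile(
      "hdfs://hadoopmaster.packtpub.com:8020/spark/samples/productsales.csv")
    println(sales.count())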

Working with Amazon S3

S3 stands for Simple Storage Service, an online storage service provided by Amazon Web Services. As of 2013, Amazon S3 was reported to store more than 2 trillion objects. The core design principles of S3 are scalability, high availability, low latency, and low cost. Notable users of S3 include Netflix, Reddit, Dropbox, Mojang (the creators of Minecraft), Tumblr, and Pinterest.

S3 provides excellent speed when your cluster runs inside Amazon EC2, but performance can be a nightmare if you are accessing large amounts of data over the public Internet. Accessing S3 data is relatively straightforward: you pass a path starting with s3n:// to Spark's file input methods.

However, before reading from S3, you do need to supply your AWS credentials: either set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables, set the corresponding Hadoop configuration properties, or pass the keys as part of your path:

  • Configuring the parameters:
      sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "myaccessKeyID")
      sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "mySecretAccessKey")
      val data = sc.textFile("s3n://bucket/fileLocation")
  • Passing the Access Key ID and Secret Key as part of the path:
      val data = sc.textFile("s3n://MyAccessKeyID:MySecretKey@svr/fileloc")

Having looked at the most common file systems, let's turn our attention to Spark's ability to interact with common databases and structured sources. We have already highlighted Spark's ability to fetch data from CSV and TSV files and load it into DataFrames; however, it is about time we discussed Spark's interaction with databases, a topic covered in much more detail in Chapter 4, Spark SQL.
