Learning Apache Spark 2
Muhammad Asif Abbasi
Commonly supported file systems
Until now we have mostly focused on the functional aspects of Spark and have therefore steered clear of a discussion of the filesystems it supports. You might have seen a couple of examples around HDFS, but the primary focus has been on local file systems. In production environments, however, it is extremely rare to work against a local filesystem; chances are you will be working with distributed storage such as HDFS or Amazon S3.
Working with HDFS
Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable filesystem written in Java for the Hadoop framework. HDFS provides the ability to store large amounts of data across commodity hardware, and companies are already storing massive amounts of data on HDFS by moving it off their traditional database systems and creating data lakes on Hadoop. Spark allows you to read data from HDFS in much the same way as you would read from a typical filesystem; the only difference is that you point to the NameNode and the HDFS port.
If you are running Spark on YARN inside a Hadoop cluster, you might not even need to mention the NameNode and HDFS port, as any path you pass will default to HDFS.
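For instance, a minimal sketch, assuming a SparkContext named sc is already available (for example in spark-shell), that the cluster's fs.defaultFS points at HDFS, and using the sample file path from the example further below:
// An unqualified path resolves against fs.defaultFS, which on a Hadoop
// cluster typically points at the HDFS NameNode, so no hdfs:// prefix is needed.
val salesFromDefaultFs = sc.textFile("/spark/samples/productsales.csv")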
Most of the methods that we have seen previously can be used with HDFS. The path to be specified for HDFS is as follows:
hdfs://master:port/filepath
As an example, we have the following settings for our Hadoop cluster:
- NameNode: hadoopmaster.packtpub.com
- HDFS Port: 8020
- File Location: /spark/samples/productsales.csv
The path that you need to specify would be as follows:
hdfs://hadoopmaster.packtpub.com:8020/spark/samples/productsales.csv
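Putting this together, here is a minimal sketch, assuming a SparkContext named sc is already available (for example in spark-shell) and the cluster settings listed above; the variable name productSales is just for illustration:
val productSales = sc.textFile("hdfs://hadoopmaster.packtpub.com:8020/spark/samples/productsales.csv")
println(productSales.count())   // number of lines read from HDFS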
Working with Amazon S3
S3 stands for Simple Storage Service, an online storage service provided by Amazon Web Services. As of 2013, Amazon S3 was reported to store more than 2 trillion objects. The core principles of S3 include scalability, high availability, low latency, and low pricing. Notable users of S3 include Netflix, Reddit, Dropbox, Mojang (creators of Minecraft), Tumblr, and Pinterest.
S3 provides amazing speed when your cluster is inside Amazon EC2, but performance can be a nightmare if you are accessing large amounts of data over the public Internet. Accessing S3 data is relatively straightforward: you pass a path starting with s3n:// to Spark's file input methods.
However, before reading from S3, you need to supply your AWS credentials: set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables, configure the corresponding Hadoop properties, or pass the keys as part of your path:
- Configuring the parameters:
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "myaccessKeyID")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "mySecretAccessKey")
val data = sc.textFile("s3n://bucket/fileLocation")
- Passing the Access Key Id and Secret Key:
val data = sc.textFile("s3n://MyAccessKeyID:MySecretKey@svr/fileloc")
Having looked at the most common file systems, let's turn our attention to Spark's ability to interact with common databases and structured sources. We have already highlighted Spark's ability to fetch data from CSV and TSV files and load it into DataFrames. However, it is about time we discussed Spark's ability to interact with databases, which will be covered in much more detail in Chapter 4, Spark SQL.