
  • Spark Cookbook
  • Rishi Yadav

Introduction

Spark provides a unified runtime for big data. HDFS, Hadoop's filesystem, is the most commonly used storage platform for Spark because it provides cost-effective storage for unstructured and semi-structured data on commodity hardware. Spark is not limited to HDFS, however, and can work with any Hadoop-supported storage.

Hadoop-supported storage means any storage format that can work with Hadoop's InputFormat and OutputFormat interfaces. InputFormat is responsible for creating InputSplits from the input data and dividing them further into records, while OutputFormat is responsible for writing to storage.
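As a rough illustration of how an InputFormat plugs into Spark, the following sketch reads a text file through Hadoop's TextInputFormat using SparkContext.newAPIHadoopFile. The HDFS path is a hypothetical placeholder; the general pattern applies to any new-API InputFormat.

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    object InputFormatSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("InputFormatSketch").setMaster("local[*]"))

        // TextInputFormat creates InputSplits from the file and emits
        // (byte offset, line) records.
        val records = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](
          "hdfs://localhost:9000/user/hduser/words") // hypothetical path

        // Hadoop Writable objects are reused, so copy the value out as a String.
        val lines = records.map { case (_, text) => text.toString }
        lines.take(5).foreach(println)

        sc.stop()
      }
    }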

We will start with writing to the local filesystem and then move on to loading data from HDFS. In the Loading data from HDFS recipe, we will cover the most common file format: regular text files. In the next recipe, we will cover how to use any InputFormat interface to load data in Spark. We will also explore loading data stored in Amazon S3, a leading cloud storage platform.
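The unifying idea behind these recipes is that the same textFile call works across storage systems; only the URI scheme changes. The sketch below assumes placeholder paths and an S3 bucket name, and it presumes the appropriate credentials and connectors are configured on the cluster.

    import org.apache.spark.{SparkConf, SparkContext}

    object LoadTextFiles {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("LoadTextFiles").setMaster("local[*]"))

        // Local filesystem: the file must be accessible on every worker node.
        val fromLocal = sc.textFile("file:///home/hduser/words")

        // HDFS: the NameNode host and port come from the cluster configuration.
        val fromHdfs = sc.textFile("hdfs://localhost:9000/user/hduser/words")

        // Amazon S3: requires AWS credentials configured for the s3a connector.
        val fromS3 = sc.textFile("s3a://my-bucket/words")

        println(s"Lines in HDFS file: ${fromHdfs.count()}")
        sc.stop()
      }
    }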

We will explore loading data from Apache Cassandra, which is a NoSQL database. Finally, we will explore loading data from a relational database.
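The Cassandra recipe relies on the DataStax spark-cassandra-connector; for the relational case, a minimal sketch using the DataFrame JDBC reader (available in Spark 2.x and later, not necessarily the exact API used in the recipe) looks like the following. The database URL, table name, and credentials are hypothetical, and the JDBC driver jar must be on the classpath.

    import java.util.Properties
    import org.apache.spark.sql.SparkSession

    object LoadFromJdbc {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("LoadFromJdbc")
          .master("local[*]")
          .getOrCreate()

        // Connection properties are placeholders for illustration only.
        val props = new Properties()
        props.setProperty("user", "hduser")
        props.setProperty("password", "secret")
        props.setProperty("driver", "com.mysql.jdbc.Driver")

        // Load the whole table as a DataFrame over JDBC.
        val people = spark.read.jdbc(
          "jdbc:mysql://localhost:3306/test", // hypothetical database URL
          "person",                           // hypothetical table name
          props)

        people.show()
        spark.stop()
      }
    }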
