
The SparkSession--your gateway to structured data processing

The SparkSession is the starting point for working with structured data in Apache Spark. It replaces the SQLContext used in previous versions of Apache Spark. It is created on top of the Spark context and provides the means to load and save data files of different types using DataFrames and Datasets, and to manipulate structured data with SQL, among other things. It can be used for the following functions, each of which is demonstrated in the sketch after this list:

  • Executing SQL via the sql method
  • Registering user-defined functions via the udf method
  • Caching
  • Creating DataFrames
  • Creating Datasets
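
The following is a minimal sketch of these functions in action, assuming Spark 2.x; the application name, view name, and UDF name (SparkSessionExample, items, shout) are illustrative choices, not prescribed by the API:

import org.apache.spark.sql.SparkSession

// Create (or reuse) a SparkSession via the builder API
val spark = SparkSession.builder
  .appName("SparkSessionExample")   // illustrative application name
  .master("local[*]")               // local mode, for demonstration only
  .getOrCreate()

// Creating a DataFrame and registering it as a temporary view
val items = spark.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "label")
items.createOrReplaceTempView("items")

// Executing SQL via the sql method
spark.sql("SELECT label FROM items WHERE id > 1").show()

// Registering a user-defined function via the udf method
spark.udf.register("shout", (s: String) => s.toUpperCase)
spark.sql("SELECT shout(label) FROM items").show()

// Caching the DataFrame for later reuse
items.cache()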
The examples in this chapter are written in Scala as we prefer the language, but you can develop in Python, R, and Java as well. As stated previously, the SparkSession is created from the Spark context.

Using the SparkSession allows you to implicitly convert RDDs into DataFrames or Datasets. For instance, you can convert an RDD into a DataFrame or a Dataset by calling the toDF or toDS method:

// In the Spark shell, spark (the SparkSession) and sc (the SparkContext) are predefined
import spark.implicits._

val rdd = sc.parallelize(List(1, 2, 3))
val df = rdd.toDF   // DataFrame with a single column named "value"
val ds = rdd.toDS   // strongly typed Dataset[Int]

As you can see, this is very simple, as the corresponding methods appear to be available on the RDD object itself.

We are making use of Scala implicits here because the RDD API wasn't designed with DataFrames or Datasets in mind and is therefore lacking the toDF and toDS methods. However, by importing the respective implicits, this behavior is added on the fly. If you want to learn more about Scala implicits, the official Scala documentation is a good starting point.
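
To see what the implicits add beyond primitive types, here is a short sketch; it assumes a Spark shell session where spark and sc are predefined, and the Person case class and column names are illustrative:

import spark.implicits._

// Illustrative case class: gives toDS a schema and a concrete element type
case class Person(name: String, age: Int)

val peopleRdd = sc.parallelize(Seq(Person("Alice", 29), Person("Bob", 35)))

// toDS produces a strongly typed Dataset[Person]
val people = peopleRdd.toDS
people.filter(_.age > 30).show()

// For tuple RDDs, toDF accepts explicit column names
val labeled = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("id", "label")
labeled.printSchema()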

Next, we will examine some of the supported file formats available to import and save data.
