
Processing the Parquet files

Apache Parquet is another columnar data format used by many tools in the Hadoop ecosystem, such as Hive, Pig, and Impala. It improves performance through efficient compression, its columnar layout, and its encoding routines. The Parquet processing example is very similar to the earlier JSON Scala code: the DataFrame is created and then saved in Parquet format via the write method:

df.write.parquet("hdfs://localhost:9000/tmp/test.parquet")
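
For context, the snippet above assumes a DataFrame called df already exists. Here is a minimal, self-contained sketch; the session name spark, the application name, and the JSON input path are illustrative assumptions, not part of the original example:

import org.apache.spark.sql.SparkSession

// Build (or reuse) a SparkSession, the entry point for DataFrame I/O.
val spark = SparkSession.builder()
  .appName("parquet-write-example")  // hypothetical application name
  .getOrCreate()

// Create a DataFrame, here by reading the JSON file from the earlier
// example (an assumed path), then write it out in Parquet format.
val df = spark.read.json("hdfs://localhost:9000/tmp/test.json")
df.write.parquet("hdfs://localhost:9000/tmp/test.parquet")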

This results in an HDFS directory containing the data split across eight Parquet part files.
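
To confirm the round trip, the directory can be read back as a single DataFrame. A short sketch follows, reusing the assumed spark session from above; Spark treats the whole directory as one Parquet dataset:

// Read every part file in the directory back as one DataFrame.
val parquetDF = spark.read.parquet("hdfs://localhost:9000/tmp/test.parquet")
parquetDF.show(5)  // display the first five rows to verify the contents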

For more information about the available SparkContext and SparkSession methods, check the API documentation for the org.apache.spark.SparkContext and org.apache.spark.sql.SparkSession classes in the Apache Spark API reference at http://spark.apache.org/docs/latest/api/scala/index.html.

In the next section, we will examine Apache Spark DataFrames. They were introduced in Spark 1.3 and became first-class citizens in Apache Spark 1.5 and 1.6.
