Processing the Parquet files
Apache Parquet is another columnar data format used by many tools in the Hadoop ecosystem, such as Hive, Pig, and Impala. It increases performance through efficient compression, a columnar layout, and encoding routines. The Parquet processing example is very similar to the JSON Scala code. The DataFrame is created and then saved in Parquet format using the write method with a parquet type:
df.write.parquet("hdfs://localhost:9000/tmp/test.parquet")
This results in an HDFS directory containing eight Parquet part files.
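Putting the steps together, a minimal sketch of the end-to-end flow might look like the following. The JSON source path and the application name are illustrative assumptions; the DataFrame creation simply mirrors the earlier JSON example:

import org.apache.spark.sql.SparkSession

// Obtain (or create) a SparkSession; in spark-shell this is already provided as `spark`
val spark = SparkSession.builder().appName("ParquetExample").getOrCreate()

// Build a DataFrame from a JSON file, as in the preceding example
// (the source path here is an assumption for illustration)
val df = spark.read.json("hdfs://localhost:9000/tmp/test.json")

// Save the DataFrame in Parquet format; Spark writes a directory of part files
df.write.parquet("hdfs://localhost:9000/tmp/test.parquet")

// Read the Parquet directory back into a new DataFrame and inspect a few rows
val parquetDf = spark.read.parquet("hdfs://localhost:9000/tmp/test.parquet")
parquetDf.show(5)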
For more information about the available SparkContext and SparkSession methods, check the API documentation for the org.apache.spark.SparkContext and org.apache.spark.sql.SparkSession classes in the Apache Spark API reference at http://spark.apache.org/docs/latest/api/scala/index.html.
In the next section, we will examine Apache Spark DataFrames. They were introduced in Spark 1.3 and became first-class citizens in Apache Spark 1.5 and 1.6.