
Parquet

Apache Parquet is a columnar storage format specifically designed for the Hadoop ecosystem. Traditional row-based storage formats are optimized to work with one record at a time, meaning they can be slow for certain types of workload. Instead, Parquet serializes and stores data by column, thus allowing for optimization of storage, compression, predicate processing, and bulk sequential access across large datasets - exactly the type of workload suited to Spark!

As Parquet implements per-column data compression, it is particularly well suited to CSV data, especially when fields have low cardinality, and file sizes can see huge reductions when compared to Avro:

+--------------------------+--------------+ 
|                 File Type|  Size (bytes)| 
+--------------------------+--------------+ 
|20160101020000.gkg.csv    |      20326266| 
|20160101020000.gkg.avro   |      13557119| 
|20160101020000.gkg.parquet|       6567110| 
|20160101020000.gkg.csv.bz2|       4028862| 
+--------------------------+--------------+ 
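If you want to reproduce this kind of comparison yourself, one approach is to write the same DataFrame out in each format and compare the resulting directory sizes via the Hadoop filesystem API. The following is only a rough sketch: the gkgDF DataFrame and the /tmp output paths are illustrative placeholders, and a flat schema is assumed since CSV cannot hold nested columns.

// Write the same data in each format (gkgDF and the paths are placeholders)
gkgDF.write.format("csv").save("/tmp/gkg_csv")
gkgDF.write.format("com.databricks.spark.avro").save("/tmp/gkg_avro")
gkgDF.write.parquet("/tmp/gkg_parquet")

// Compare the on-disk size of each output directory
import org.apache.hadoop.fs.{FileSystem, Path}
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
Seq("/tmp/gkg_csv", "/tmp/gkg_avro", "/tmp/gkg_parquet").foreach { dir =>
  println(s"$dir -> ${fs.getContentSummary(new Path(dir)).getLength} bytes")
}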

Parquet also integrates with Avro natively. Parquet takes an Avro in-memory representation of the data and maps it to its internal data types. It then serializes the data to disk using the Parquet columnar file format.

We have seen how to apply Avro to the model; now we can take the next step and use this Avro model to persist data to disk via the Parquet format. Again, we will show the recommended method and then some lower-level code for demonstration purposes. First, the recommended method:

val gdeltAvroDF = spark 
    .read
    .format("com.databricks.spark.avro")
    .load("/path/to/avro/output")

gdeltAvroDF.write.parquet("/path/to/parquet/output")
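
Once written, the Parquet output can be read straight back into a DataFrame. Because the format is columnar, selecting a subset of columns and filtering means that only the required columns are scanned, and the predicate is a candidate for pushdown into the Parquet reader. The column names below are purely illustrative, not part of the model shown so far:

val gdeltParquetDF = spark
    .read
    .parquet("/path/to/parquet/output")

// Only the referenced columns are read from disk; the filter can be
// pushed down to the Parquet reader (column names are illustrative)
gdeltParquetDF
    .select("gkgRecordId", "publishDate")
    .filter("publishDate > 20160101000000")
    .show()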

Now for the detail behind how Avro and Parquet relate to each other:

import java.io.File

import org.apache.avro.file.DataFileReader
import org.apache.avro.generic.{GenericDatumReader, IndexedRecord}
import org.apache.hadoop.fs.Path
// org.apache.parquet is the package for Parquet 1.7+; older releases used parquet.avro
import org.apache.parquet.avro.AvroParquetWriter

val inputFile = new File("/path/to/avro/output")
val outputFile = new Path("/path/to/parquet/output")

// Use the Avro schema generated for the GKG Specification class
val schema = Specification.getClassSchema
val reader = new GenericDatumReader[IndexedRecord](schema)
val avroFileReader = DataFileReader.openReader(inputFile, reader)

// Write each Avro record back out using the Parquet columnar format
val parquetWriter =
    new AvroParquetWriter[IndexedRecord](outputFile, schema)

while (avroFileReader.hasNext) {
    parquetWriter.write(avroFileReader.next())
}

avroFileReader.close()
parquetWriter.close()

As before, the lower-level code is quite verbose, although it does give some insight into the various steps required. You can find the full code in our repository.
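
For completeness, the reverse direction, reading the Parquet files back into Avro IndexedRecord objects, can be sketched with the companion AvroParquetReader class. This is not part of the repository code above, and it uses the same deprecated-style constructor as the writer:

import org.apache.parquet.avro.AvroParquetReader

val parquetReader =
    new AvroParquetReader[IndexedRecord](outputFile)

// read() returns null once the file is exhausted
var record = parquetReader.read()
while (record != null) {
  // process the Avro record rebuilt from the Parquet columns
  println(record)
  record = parquetReader.read()
}

parquetReader.close()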

We now have a great model for storing and retrieving our GKG data, one that uses Avro and Parquet and can easily be implemented using DataFrames.
