
Title: Mastering Spark for Data Science
Authors: Andrew Morgan, Antoine Amend, David George, Matthew Hallett
Updated: 2021-07-09 18:49:34

Summary

In this chapter, we have seen why datasets should always be thoroughly understood before too much exploration work is undertaken. We have discussed the details of structured data and dimensional modeling, particularly with respect to how this applies to the GDELT dataset, and have expanded the GKG model to show its underlying complexity.

We have explained the difference between traditional ETL and the newer schema-on-read ELT techniques, and have touched upon some of the issues that data engineers face regarding data storage, compression, and data formats, specifically the advantages and implementations of Avro and Parquet. We have also demonstrated that there are several ways to explore data using the various Spark APIs, including examples of how to use SQL in the Spark shell.

To conclude this chapter, note that the code in our repository pulls everything together and is a full model for reading in raw GKG files (if you require some data, use the Apache NiFi GDELT data ingest pipeline from Chapter 1, Data Acquisition).

In the next chapter, we will dive deeper into the GKG model by examining the techniques used to explore and analyze data at scale. We will see how to develop and enrich our GKG data model using SQL, and investigate how Apache Zeppelin notebooks can provide a richer data science experience.
