
  • Mastering Spark for Data Science
  • Andrew Morgan, Antoine Amend, David George, Matthew Hallett
  • 2021-07-09 18:49:34

Summary

In this chapter, we have seen why datasets should always be thoroughly understood before too much exploration work is undertaken. We have discussed the details of structured data and dimensional modeling, particularly with respect to how this applies to the GDELT dataset, and have expanded the GKG model to show its underlying complexity.

We have explained the difference between traditional ETL and the newer schema-on-read ELT techniques, and have touched upon some of the issues that data engineers face regarding data storage, compression, and data formats, specifically the advantages and implementations of Avro and Parquet. We have also demonstrated that there are several ways to explore data using the various Spark APIs, including examples of how to use SQL on the Spark shell.

We can conclude this chapter by mentioning that the code in our repository pulls everything together and is a full model for reading in raw GKG files (use the Apache NiFi GDELT data ingest pipeline from Chapter 1, Data Acquisition, if you require some data).

In the next chapter, we will dive deeper into the GKG model by examining the techniques used to explore and analyze data at scale. We will see how to develop and enrich our GKG data model using SQL, and investigate how Apache Zeppelin notebooks can provide a richer data science experience.
