
  • Mastering Spark for Data Science
  • Andrew Morgan Antoine Amend David George Matthew Hallett
  • 174 words
  • 2021-07-09 18:49:32

Chapter 2. Data Acquisition

One of the most important tasks for a data scientist is loading data into the data science platform. This chapter explains how to replace uncontrolled, ad hoc processes with a general data ingestion pipeline in Spark that serves as a reusable component across many input data feeds. We walk through a configuration and demonstrate how it delivers vital feed management information under a variety of running conditions.

Readers will learn how to construct a content register and use it to track all input loaded into the system and to deliver metrics on ingestion pipelines, so that these flows can run reliably as an automated, lights-out process.

In this chapter, we will cover the following topics:

  • Introduction to the Global Database of Events, Language, and Tone (GDELT) dataset
  • Data pipelines
  • Universal ingestion framework
  • Real-time monitoring for new data
  • Receiving streaming data via Kafka
  • Registering new content and vaulting for tracking purposes
  • Visualization of content metrics in Kibana to monitor ingestion processes and data health