Title: Mastering Spark for Data Science
Authors: Andrew Morgan, Antoine Amend, David George, Matthew Hallett
Updated: 2021-07-09 18:49:32

Chapter 2. Data Acquisition

One of a data scientist's most important tasks is loading data into the data science platform. This chapter explains how to replace uncontrolled, ad hoc processes with a general data ingestion pipeline in Spark that serves as a reusable component across many input data feeds. We walk through its configuration and demonstrate how it delivers vital feed management information under a variety of running conditions.

Readers will learn how to construct a content register and use it to track all input loaded to the system and to deliver metrics on ingestion pipelines, so that these flows can be reliably run as an automated, lights-out process.

In this chapter, we will cover the following topics:

Introducing the Global Database of Events, Language, and Tone (GDELT) dataset
Data pipelines
Universal ingestion framework
Real-time monitoring for new data
Receiving streaming data via Kafka
Registering new content and vaulting for tracking purposes
Visualization of content metrics in Kibana to monitor ingestion processes and data health
