- Mastering Spark for Data Science
- Andrew Morgan Antoine Amend David George Matthew Hallett
- 2021-07-09 18:49:32
Chapter 2. Data Acquisition
One of a data scientist's most important tasks is loading data into the data science platform. This chapter explains how to replace uncontrolled, ad hoc processes with a general data ingestion pipeline in Spark that serves as a reusable component across many input data feeds. We walk through a configuration and demonstrate how it delivers vital feed management information under a variety of running conditions.
Readers will learn how to construct a content register and use it to track all input loaded into the system and to deliver metrics on ingestion pipelines, so that these flows can run reliably as an automated, lights-out process.
In this chapter, we will cover the following topics:
- Introducing the Global Database of Events, Language, and Tone (GDELT) dataset
- Data pipelines
- Universal ingestion framework
- Real-time monitoring for new data
- Receiving streaming data via Kafka
- Registering new content and vaulting for tracking purposes
- Visualization of content metrics in Kibana to monitor ingestion processes and data health