Introduction

In the previous chapter, we discussed the layers of a data-driven system and explained the important storage requirements for each layer. The storage containers in the data layers of AI solutions serve one main purpose: to build and train models that can run in a production environment. In this chapter, we will discuss how to transfer data between the layers in a pipeline so that the data is ready to train a model and then to produce an actual forecast (called the execution or scoring of the model).

In an Artificial Intelligence (AI) system, data is continuously updated. Once data enters the system via an upload, an application programming interface (API), or a data stream, it has to be stored securely, and it typically goes through a few extract, transform, and load (ETL) steps. In systems that handle streaming data, the incoming data has to be directed into a stable and usable data pipeline. Data transformations have to be managed, scheduled, and orchestrated. Further, the lineage of the data has to be stored so that the origins of a data point in a report or application can be traced back. This chapter explains the data preparation (sometimes called pre-processing) mechanisms that ensure raw data can be used for machine learning by data scientists. This is important since raw data is rarely in a form that models can use directly. We will elaborate on the architecture and technology as explained by the layered model in Chapter 1, Data Storage Fundamentals. To start with, let's dive into the details of ETL.
