ETL

ETL stands for Extract, Transform, and Load. In traditional data warehousing systems, the entire data pipeline consists of multiple ETL steps that follow one another to bring the data from the source to the target (usually a report on a dashboard). Let's explore this in more detail:

E: Data is extracted from a source. This can be a file, a database, or a direct call to an API or web service. Once read (for example, with a query), the data is kept in memory, ready to be transformed. For example, a daily export file from a source system that produces client orders is read every day at 01:00.

T: The data that was captured in memory during the extraction phase (or stored during the loading phase, in the case of ELT) is transformed into a target dataset using calculations, aggregations, and/or filters. For example, the customer order data is cleaned, enriched, and narrowed down per region.

L: The data that was transformed is loaded (stored) into a data store.

This completes an ETL step. ELT uses the same three operations in a different order: all the extracted data is first stored in the data store and only transformed later.
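The three phases above can be sketched in a few lines of Python. This is a minimal illustration rather than a production pipeline: the inline `RAW_EXPORT` string stands in for the daily export file, the column names and the `orders_clean` table are hypothetical, and an in-memory SQLite database plays the role of the data store.

```python
import csv
import io
import sqlite3

# Hypothetical daily export; in practice this would be a file read at 01:00.
RAW_EXPORT = """order_id,customer,region,amount
1,Alice,EU,120.50
2,Bob,US,80.00
3,Carol,EU,
4,Dave,EU,40.25
"""


def extract(source):
    """E: read the raw rows from the source into memory."""
    return list(csv.DictReader(source))


def transform(rows, region):
    """T: drop rows without an amount, keep one region, convert types."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # cleaning: skip incomplete orders
        if row["region"] != region:
            continue  # filtering: narrow down to one region
        cleaned.append(
            (int(row["order_id"]), row["customer"], float(row["amount"]))
        )
    return cleaned


def load(rows, conn):
    """L: store the transformed rows in the target data store."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders_clean "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders_clean VALUES (?, ?, ?)", rows)
    conn.commit()


# One complete ETL step: extract, then transform, then load.
conn = sqlite3.connect(":memory:")
raw_rows = extract(io.StringIO(RAW_EXPORT))
load(transform(raw_rows, region="EU"), conn)

print(conn.execute("SELECT COUNT(*), SUM(amount) FROM orders_clean").fetchone())
```

Keeping each phase in its own function makes it possible to test the transformation logic without touching the source file or the data store.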

The following figure is an example of a full data pipeline, from a source system to a trained model:

Figure 3.1: An example of a typical ETL data pipeline

In modern systems such as data lakes, the ETL chain is often replaced by ELT. Rather than having, say, five ETL steps in which data is gradually refined from a raw format into a queryable form, all the data is loaded into one large data store. Then, a series of transformations that are mostly virtual (not stored on disk) runs directly on top of the stored data to produce a similar outcome for analytics. This can save storage space, because intermediate results are not written to disk, and it can improve performance, since modern (cloud-based) storage systems are capable of handling massive amounts of data. The data pipeline becomes somewhat simpler, although the various T (Transform) steps still have to be managed as separate pieces of software:

Figure 3.2: An example of an ELT data pipeline
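The idea of a virtual transformation can be illustrated with a database view, which stores a query rather than its result. The following sketch assumes an in-memory SQLite database as a stand-in for the large data store; the `raw_orders` table, the `revenue_per_region` view, and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical raw orders, landed exactly as they arrive (amounts as text).
RAW_ORDERS = [
    (1, "Alice", "EU", "120.50"),
    (2, "Bob", "US", "80.00"),
    (3, "Carol", "EU", None),
    (4, "Dave", "EU", "40.25"),
]

conn = sqlite3.connect(":memory:")

# E + L: store the extracted data untouched in the data store.
conn.execute(
    "CREATE TABLE raw_orders "
    "(order_id INTEGER, customer TEXT, region TEXT, amount TEXT)"
)
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", RAW_ORDERS)

# T: a view is a saved query; its result is computed when it is read,
# so no transformed copy of the data is stored.
conn.execute(
    """
    CREATE VIEW revenue_per_region AS
    SELECT region,
           SUM(CAST(amount AS REAL)) AS revenue,
           COUNT(*) AS order_count
    FROM raw_orders
    WHERE amount IS NOT NULL
    GROUP BY region
    """
)

rows = conn.execute(
    "SELECT region, revenue, order_count FROM revenue_per_region ORDER BY region"
).fetchall()
print(rows)
```

Because the view is evaluated on read, new rows added to `raw_orders` show up in `revenue_per_region` without rerunning a load step. The view definition itself is still code that has to be versioned and tested like any other transformation.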

In the remainder of this chapter, we will look in detail at the ETL and ELT steps. Use the text and exercises to form a good understanding of the possibilities for preparing your data. Remember that there is no silver bullet; every use case will have specific needs when it comes to data processing and storage. There are many tools and techniques that can be used to get data from A to B; pick the ones that suit your company best, and whatever you pick, never forget the best practices of software development, such as version control, test-driven development, clean code, documentation, and common sense.