
The ETL process

The early stages of the big data processing field evolved over several decades under the name of data mining, and only later adopted the popular name of big data.

One of the best outcomes of these disciplines is the specification of the Extract, Transform, Load (ETL) process.

This process starts with a mix of many data sources from business systems, moves to a system that transforms the data into a readable state, and finishes by generating a data mart with highly structured and documented data types.

To apply this concept, we will combine the elements of this process with the final outcome of a structured dataset, which in its final form includes an additional label column (in the case of supervised learning problems).

This process is depicted in the following diagram: 

Depiction of the ETL process, from raw data to a useful dataset

The diagram illustrates the first stages of the data pipeline. It starts with all of the organization's data, whether commercial transactions, raw values from IoT devices, or information from other valuable data sources, which commonly differ widely in type and composition. The ETL process is in charge of gathering the raw information from these sources using different software filters, applying the necessary transforms to arrange the data in a useful manner, and finally presenting the data in tabular format (we can think of this as a single database table with a last feature or result column, or a big CSV file with consolidated data). The following processes can then use the final result with almost no concern for the many quirks of data formatting, because the data has been standardized into a very clear table structure.
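The three stages described above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library, not a production pipeline: the two in-memory sources, their field names, the values, and the `LABELS` mapping are all hypothetical and stand in for real business systems.

```python
import csv
import io
import json

# Hypothetical raw sources; names, fields, and values are made up for illustration.
TRANSACTIONS_CSV = """customer_id,amount
1,120.50
2,35.00
1,19.99
"""
SENSOR_JSON = '[{"customer": 1, "temp_c": 21.5}, {"customer": 2, "temp_c": 30.1}]'

# Hypothetical target values, only needed for supervised learning problems.
LABELS = {1: "retained", 2: "churned"}

COLUMNS = ["customer_id", "total_amount", "temp_c", "label"]


def extract():
    """Extract: read each raw source into plain Python records."""
    transactions = list(csv.DictReader(io.StringIO(TRANSACTIONS_CSV)))
    readings = json.loads(SENSOR_JSON)
    return transactions, readings


def transform(transactions, readings):
    """Transform: cast types and consolidate both sources into one row per customer."""
    totals = {}
    for row in transactions:
        cid = int(row["customer_id"])
        totals[cid] = totals.get(cid, 0.0) + float(row["amount"])
    temps = {r["customer"]: r["temp_c"] for r in readings}
    return [
        {
            "customer_id": cid,
            "total_amount": round(totals[cid], 2),
            "temp_c": temps.get(cid),
            "label": LABELS.get(cid),  # label column goes last
        }
        for cid in sorted(totals)
    ]


def load(rows):
    """Load: write the consolidated table as CSV text with a fixed column order."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


rows = transform(*extract())
dataset = load(rows)
print(dataset)
```

Keeping each stage in its own function means a later step only ever sees the consolidated table, so the formatting quirks of the individual sources stay confined to `extract` and `transform`.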
