Learning Apache Spark 2
Muhammad Asif Abbasi
What is ETL?
ETL stands for Extraction, Transformation, and Loading. The term has been around for decades and represents the industry-standard process of moving and transforming data to build the pipelines that deliver BI and analytics. ETL processes are also widely used in data migration and master data management initiatives. Since the focus of this book is Spark, we will only touch lightly upon the subject of ETL and will not go into more detail.
Extraction
Extraction is the first part of the ETL process representing the extraction of data from source systems. This is often one of the most important parts of the ETL process, and it sets the stage for further downstream processing. There are a few major things to consider during an extraction process:
- The source system type (RDBMS, NoSQL, flat files, Twitter/Facebook streams)
- The file formats (CSV, JSON, XML, Parquet, Sequence, Object files)
- The frequency of the extract (daily, hourly, every second)
- The size of the extract
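To make these considerations concrete, here is a minimal sketch of how the choice of file format surfaces in Spark code. The file paths are hypothetical and assume the extracts already sit on a filesystem Spark can read.

```scala
import org.apache.spark.sql.SparkSession

// Rough sketch: each extract format has its own reader, and only Parquet
// carries its schema with it. All paths are hypothetical.
object FormatSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("format-sketch")
      .master("local[*]")                // local mode purely for illustration
      .getOrCreate()

    // CSV extract: header handling and schema inference are explicit choices
    val csvDf = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/extracts/customers.csv")

    // JSON extract: Spark infers the schema from the documents
    val jsonDf = spark.read.json("/data/extracts/customers.json")

    // Parquet extract: the schema travels with the file
    val parquetDf = spark.read.parquet("/data/extracts/customers.parquet")

    Seq(csvDf, jsonDf, parquetDf).foreach(_.printSchema())
    spark.stop()
  }
}
```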
Loading
Once the data is extracted, the next logical step is to load it into the relevant framework for processing. The objective of loading the data into a framework/tool before transformation is to let the transformations happen on the system that is best suited to, and most performant for, such processing. For example, suppose you extract data from a system for which Spark does not have a connector, say an Ingres database, and save it as a text file. You may then need to apply a few transformations before the data is usable. You have two options here: either perform the transformations directly on the extracted file, or first load the data into a framework such as Spark for processing. The benefit of the latter approach is that an MPP framework such as Spark will be much more performant than doing the same processing on the filesystem.
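The following is a minimal sketch of that latter approach, assuming the Ingres extract was dumped as a pipe-delimited text file at a hypothetical path. The raw lines are loaded into an RDD and cached so subsequent processing runs inside Spark rather than against the raw file.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch of the "load" step: pull the extracted text file into Spark
// and cache it for the transformation phase. The path is a placeholder.
object LoadExtract {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("load-extract").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val rawLines = sc.textFile("/data/extracts/ingres_orders.txt") // hypothetical extract file
    rawLines.cache()                                               // keep in memory for later steps
    println(s"Loaded ${rawLines.count()} lines")

    sc.stop()
  }
}
```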
Transformation
Once the data is available inside the framework, you can then apply the relevant transformations. Since the core abstraction within Spark is the RDD, we have already seen the transformations available on RDDs.
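As a brief illustration, the sketch below applies a handful of ordinary RDD transformations to the loaded extract. The pipe delimiter, field positions, and filtering rule are assumptions made purely for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch of RDD transformations over the loaded extract.
// Field layout and delimiter are hypothetical.
object TransformSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("transform-sketch").setMaster("local[*]"))

    val rawLines = sc.textFile("/data/extracts/ingres_orders.txt") // hypothetical extract

    val orders = rawLines
      .map(_.split('|'))                              // split each line into fields
      .filter(_.length >= 4)                          // drop malformed rows
      .map(fields => (fields(0), fields(3).toDouble)) // keep (orderId, amount)
      .filter { case (_, amount) => amount > 0.0 }    // discard non-positive amounts

    orders.take(5).foreach(println)
    sc.stop()
  }
}
```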
Spark provides connectors to certain systems, which essentially combine the process of extraction and loading into a single activity, as the data is streamed directly from the source system into Spark. In many cases, however, given the huge variety of source systems available, Spark will not provide such a connector, which means you will have to extract the data using the tools made available by the particular system or by third-party tools.
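Where a connector does exist, extraction and loading collapse into a single read. The sketch below uses Spark's generic JDBC data source against a placeholder connection string; the URL, table name, and credentials are illustrative, and the appropriate JDBC driver is assumed to be on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch of a connector-based read: Spark pulls rows straight from the
// source system over JDBC, so there is no intermediate extract file.
object JdbcExtractLoad {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("jdbc-sketch")
      .master("local[*]")
      .getOrCreate()

    val ordersDf = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/sales") // placeholder connection string
      .option("dbtable", "public.orders")                   // placeholder table
      .option("user", "spark_reader")                       // placeholder credentials
      .option("password", "********")
      .load()

    ordersDf.printSchema()
    spark.stop()
  }
}
```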