- Artificial Intelligence for Big Data
- Anand Deshpande Manish Kumar
- 2021-06-25 21:57:06
Batch processing
Traditionally, the data processing pipeline within data warehousing systems consisted of extracting, transforming, and loading the data (ETL) for analysis and action. With the new paradigm of file-based distributed computing, the sequence has shifted: the data is now extracted and loaded first, and then transformed repeatedly for analysis (ELTTT).
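The shift from ETL to ELT can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `region`/`amount` records, the in-memory `raw_store` list (standing in for files on a distributed filesystem), and all function names are hypothetical and do not come from any particular framework.

```python
import json
from collections import defaultdict


def extract(source_rows):
    """Extract: pull records from the source system unchanged."""
    return list(source_rows)


def load(records, raw_store):
    """Load: append raw records as JSON lines, before any transformation."""
    raw_store.extend(json.dumps(record) for record in records)


def total_by_region(raw_store):
    """Transform #1: aggregate amounts per region from the raw data."""
    totals = defaultdict(float)
    for line in raw_store:
        record = json.loads(line)
        totals[record["region"]] += record["amount"]
    return dict(totals)


def large_orders(raw_store, threshold):
    """Transform #2: a different view over the same raw data."""
    return [r for r in map(json.loads, raw_store) if r["amount"] >= threshold]


# Hypothetical source data, for illustration only.
source = [
    {"region": "north", "amount": 120.0},
    {"region": "south", "amount": 40.0},
    {"region": "north", "amount": 15.5},
]

raw_store = []                    # stands in for files on a distributed filesystem
load(extract(source), raw_store)  # E and L happen once

# T can be repeated as new questions arise, without extracting again.
print(total_by_region(raw_store))
print(large_orders(raw_store, threshold=100.0))
```

Because the raw records are stored untransformed, each new report is just another transform over the same loaded data, which is the point of the repeated "T" in ELTTT.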
In batch processing, the data is collected from various sources in staging areas, then loaded and transformed at defined frequencies and on defined schedules. In most batch processing use cases, there is no critical need to process the data in real time or near real time. As an example, the monthly report on student attendance is generated by a (batch) process at the end of each calendar month. This process extracts the data from the source systems, loads it, and transforms it for various views and reports. One of the most popular batch processing frameworks is Apache Hadoop. It is a highly scalable, distributed/parallel processing framework. The primary building block of Hadoop is the Hadoop Distributed File System.
As the name suggests, this is a wrapper filesystem which stores the data (structured, unstructured, or semi-structured) in a distributed manner on data nodes within Hadoop. Instead of moving the data to the computation, Hadoop sends the processing logic to the nodes where the data resides. Once an individual node has performed its computation, the master process consolidates the results. In this paradigm of data-compute localization, Hadoop relies heavily on intermediate I/O operations on hard disk drives. As a result, Hadoop can process extremely large volumes of data reliably, at the cost of processing time. This framework is very suitable for extracting value from Big Data in batch mode.
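The idea of sending the computation to the data and consolidating partial results can be sketched with the monthly attendance report mentioned earlier. This is plain Python, not the Hadoop API: the `DATA_NODES` dictionary, the record layout, and the function names are all assumptions made for illustration, and the intermediate disk I/O that Hadoop performs is not modeled.

```python
from collections import Counter

# Hypothetical data nodes, each holding one block of attendance records
# as (student, date, present) tuples.
DATA_NODES = {
    "node-1": [("alice", "2021-05-03", True), ("bob", "2021-05-03", False)],
    "node-2": [("alice", "2021-05-04", True), ("bob", "2021-05-04", True)],
    "node-3": [("alice", "2021-06-01", True)],
}


def count_present(block, month):
    """Runs where the block lives: days present per student for one month."""
    counts = Counter()
    for student, date, present in block:
        if present and date.startswith(month):
            counts[student] += 1
    return counts


def run_batch(nodes, task, month):
    """Master process: ship the task to every node, then consolidate."""
    # The task travels to each block; the blocks themselves are never merged.
    partials = [task(block, month) for block in nodes.values()]
    report = Counter()
    for partial in partials:
        report.update(partial)  # Counter.update adds the partial counts
    return dict(report)


# Batch run for May 2021, as it might be scheduled at month end.
print(run_batch(DATA_NODES, count_present, "2021-05"))
```

Each node returns only a small per-student count, so the master consolidates summaries rather than raw records. In this sketch the nodes are visited sequentially; a real cluster would run the per-node tasks in parallel.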