- Mastering Spark for Data Science
- Andrew Morgan, Antoine Amend, David George, Matthew Hallett
Data technologies
When Hadoop first appeared, the name referred to the combination of HDFS and the MapReduce processing paradigm, following the outline of the original Google paper (http://research.google.com/archive/mapreduce.html). Since then, a plethora of technologies has emerged to complement Hadoop, and with the development of Apache YARN we now see other processing paradigms, such as Spark, emerge.
Hadoop is now often used as a colloquialism for the entire big data software stack, so it is prudent at this point to define the scope of that stack for this book. The typical data architecture, with a selection of the technologies we will visit throughout the book, is detailed as follows:

The relationships between these technologies form a dense topic, as there are complex interdependencies; for example, a Spark application may depend on GeoMesa, which depends on Accumulo, which in turn depends on ZooKeeper and HDFS! To manage these relationships, there are platforms available, such as Cloudera CDH or Hortonworks HDP (http://hortonworks.com/products/sandbox/), which provide consolidated user interfaces and centralized configuration. The choice of platform is left to the reader; however, it is not recommended to install a few of the technologies independently and then move to a managed platform, as the version conflicts encountered will be very complex to resolve. It is therefore usually easier to start with a clean machine and decide upfront which direction to take.
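To make that dependency chain concrete, here is a minimal sketch of an Accumulo client that touches each layer of the stack in turn; the instance name, ZooKeeper hosts, and credentials are hypothetical placeholders, not values from the book.

```scala
import org.apache.accumulo.core.client.ZooKeeperInstance
import org.apache.accumulo.core.client.security.tokens.PasswordToken
import scala.collection.JavaConverters._

// A minimal sketch (not the book's code) of the dependency chain described
// above: an Accumulo client first contacts ZooKeeper to locate the instance,
// whose tables are in turn persisted on HDFS. Instance name, hosts, and
// credentials below are hypothetical placeholders.
object AccumuloChainSketch {
  def main(args: Array[String]): Unit = {
    // ZooKeeper is the entry point: Accumulo registers its instance there.
    val instance  = new ZooKeeperInstance("accumulo", "zk1:2181,zk2:2181")
    val connector = instance.getConnector("root", new PasswordToken("secret"))

    // Listing tables exercises the whole chain:
    // client -> ZooKeeper -> Accumulo tablet servers -> HDFS-backed storage.
    connector.tableOperations().list().asScala.foreach(println)
  }
}
```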
All of the software we use in this book is platform-agnostic and therefore fits into the general architecture described earlier. It can be installed independently, and it is relatively straightforward to use in single-server or multi-server environments without a managed product.
The role of Apache Spark
In many ways, Apache Spark is the glue that holds these components together, and it increasingly represents the hub of the software stack. It integrates with a wide variety of components, but none of them are hard-wired; indeed, even the underlying storage mechanism can be swapped out. Combining this flexibility with the ability to leverage different processing frameworks means that the original Hadoop technologies effectively become components rather than an imposing framework. The logical diagram of our architecture appears as follows:

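As a brief illustration of that swappable storage, the same Spark read call can target different backends simply by changing the URI scheme; the paths and hostnames below are hypothetical placeholders.

```scala
import org.apache.spark.sql.SparkSession

// A minimal sketch of Spark's storage-agnostic design: the same read call
// works against different backends just by changing the URI scheme.
// All paths and hostnames are hypothetical placeholders.
object StorageSwapSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("storage-swap")
      .master("local[*]") // swap for a YARN or standalone master on a cluster
      .getOrCreate()

    // Local filesystem...
    val local = spark.read.textFile("file:///tmp/events.txt")
    // ...or HDFS -- the application code is otherwise unchanged.
    val hdfs = spark.read.textFile("hdfs://namenode:8020/data/events.txt")
    // (With the right connector on the classpath, "s3a://bucket/key" works too.)

    println(s"local lines: ${local.count()}")
    spark.stop()
  }
}
```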
As Spark has gained momentum and wide-scale industry acceptance, many of the original Hadoop implementations of various components have been refactored for Spark. Thus, to add further complexity to the picture, there are often several ways to programmatically leverage any particular component, not least imperative and declarative versions, depending on whether an API has been ported from the original Hadoop Java implementation. We have attempted to remain as true as possible to the Spark ethos throughout the remaining chapters, as the sketch below illustrates.
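The following sketch contrasts the two styles just mentioned: an imperative RDD word count next to its declarative DataFrame equivalent. The input path and session settings are illustrative assumptions, not code from the book.

```scala
import org.apache.spark.sql.SparkSession

// A minimal sketch contrasting the two styles: an imperative RDD word count
// versus its declarative DataFrame equivalent. The input path below is a
// hypothetical placeholder.
object TwoStylesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("two-styles").master("local[*]").getOrCreate()
    import spark.implicits._

    // Imperative: each transformation over the RDD is spelled out by hand.
    val rddCounts = spark.sparkContext
      .textFile("file:///tmp/input.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Declarative: describe the result and let the Catalyst optimizer
    // plan the execution.
    val dfCounts = spark.read.textFile("file:///tmp/input.txt")
      .flatMap(_.split("\\s+"))
      .groupBy($"value")
      .count()

    rddCounts.take(5).foreach(println)
    dfCounts.show(5)
    spark.stop()
  }
}
```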