Distributed computing using Apache Hadoop

We are surrounded by devices such as smart refrigerators, smart watches, phones, tablets, laptops, airport kiosks, and ATMs dispensing cash, and with their help we are now able to do things that were unimaginable just a few years ago. We are so used to applications such as Instagram, Snapchat, Gmail, Facebook, Twitter, and Pinterest that it is next to impossible to go a day without access to them. Today, cloud computing has introduced us to the following concepts:

  • Infrastructure as a Service (IaaS)
  • Platform as a Service (PaaS)
  • Software as a Service (SaaS)

Behind the scenes is the world of highly scalable distributed computing, which makes it possible to store and process petabytes (PB) of data:

  • 1 EB = 1024 PB (50 million Blu-ray movies)
  • 1 PB = 1024 TB (50,000 Blu-ray movies)
  • 1 TB = 1024 GB (50 Blu-ray movies)

The average size of a Blu-ray disc for one movie is ~20 GB.
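As a quick sanity check on the figures above, at ~20 GB per disc:

  • 1 TB ≈ 1,024 GB ÷ 20 GB ≈ 50 discs
  • 1 PB = 1,024 TB ≈ 1,024 × 50 ≈ 51,200 discs, or roughly 50,000
  • 1 EB = 1,024 PB ≈ 1,024 × 51,200 ≈ 52 million discs, or roughly 50 million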

The paradigm of distributed computing is not genuinely new; it has been pursued in some shape or form for decades, primarily at research facilities as well as by a few commercial product companies. Massively parallel processing (MPP) was in use decades ago in areas such as oceanography, earthquake monitoring, and space exploration. Several companies, such as Teradata, also implemented MPP platforms and provided commercial products and applications.

Eventually, tech companies such as Google and Amazon, among others, pushed the niche area of scalable distributed computing to a new stage of evolution, which eventually led to the creation of Apache Spark at the University of California, Berkeley. Google published papers on MapReduce and the Google File System (GFS), which brought the principles of distributed computing within everyone's reach. Of course, due credit needs to be given to Doug Cutting, who made this possible by implementing the concepts described in the Google white papers and introducing the world to Hadoop. The Apache Hadoop framework is an open source software framework written in Java. The two main areas the framework provides are storage and processing. For storage, Apache Hadoop uses the Hadoop Distributed File System (HDFS), which is based on the GFS paper published in October 2003. For processing or computing, the framework depends on MapReduce, which is based on the Google paper on MapReduce published in December 2004. The MapReduce framework has since evolved from V1 (based on the JobTracker and TaskTracker) to V2 (based on YARN).
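To make the storage-plus-processing split concrete, here is a minimal sketch of the classic word-count job written against the Hadoop MapReduce (V2) API; the class name and the input/output paths are illustrative, not taken from the text. The map phase tokenizes each input line into (word, 1) pairs, and the reduce phase sums the counts per word, with HDFS holding both the input and the output:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts received for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A job like this is typically packaged as a JAR and submitted to the cluster (YARN in MapReduce V2) with something like hadoop jar wordcount.jar WordCount /user/input /user/output, where both paths are placeholders for directories on HDFS.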
