
Introduction to Large-Scale Machine Learning and Spark

"Information is the oil of the 21st century, and analytics is the combustion engine."


                                                                                                         --Peter Sondergaard, Gartner Research

It is estimated that by 2018, companies will spend $114 billion on big data-related projects, an increase of roughly 300% compared to 2013 (https://www.capgemini-consulting.com/resource-file-access/resource/pdf/big_data_pov_03-02-15.pdf). Much of this increase in expenditure is due to how much data is being created and how much better we are able to store it by leveraging distributed filesystems such as the Hadoop Distributed File System (HDFS).

However, collecting the data is only half the battle. The other half involves extracting and transforming the data and loading it into a computation system, which leverages the power of modern computers to apply various mathematical methods in order to learn more about the data and its patterns, and to extract useful information for making relevant decisions. The entire data workflow has been boosted in the last few years, not only by increasing computation power and easily accessible, scalable cloud services (for example, Amazon Web Services, Microsoft Azure, and Heroku), but also by a number of tools and libraries that help to manage, control, and scale infrastructure and to build applications. This growth in computation power also makes it possible to process larger amounts of data and to apply algorithms that were previously impractical. Finally, various computationally expensive statistical and machine learning algorithms have started to help extract nuggets of information from data.

One of the first well-adopted big data technologies was Hadoop, which performs MapReduce computations by saving intermediate results on disk. However, it still lacks proper big data tools for information extraction. Nevertheless, Hadoop was just the beginning. With the growing size of machine memory, new in-memory computation frameworks appeared, and they also started to provide basic support for data analysis and modeling: for example, SystemML or Spark ML for Spark, and FlinkML for Flink. These frameworks represent only the tip of the iceberg. There is a lot more in the big data ecosystem, and it is constantly evolving, since the volume of data keeps growing and demands new big data algorithms and processing methods. For example, the Internet of Things (IoT) represents a new domain that produces huge amounts of streaming data from various sources (for example, home security systems, Amazon Echo, or vital-sign sensors). It brings not only unlimited potential to mine useful information from data, but also a demand for new kinds of data processing and modeling methods.

In this chapter, however, we will start from the beginning and explain the following topics:

  • Basic working tasks of data scientists
  • Aspects of big data computation in a distributed environment
  • The big data ecosystem
  • Spark and its machine learning support