官术网_书友最值得收藏!

Benefits of using Spark ML as compared to existing libraries

AMQ Lab at Berkley Evaluated Spark, and RDDs were evaluated through a series of experiments on Amazon EC2 as well as benchmarks of user applications.

  • Algorithms used: Logistical Regression and k-means
  • Use case: First iteration, multiple iterations.

All the tests used m1.xlarge EC2 nodes with 4 cores and 15 GB of RAM. HDFS was for storage with 256 MB blocks. Refer to the following graph:

The preceding graph shows the comparison between the performance of Hadoop and Spark for the first and subsequent iteration for Logistical Regression:

The preceding graph shows the comparison between the performance of Hadoop and Spark for the first and subsequent iteration for K Means clustering algorithm.

The overall results show the following:

  • Spark outperforms Hadoop by up to 20 times in iterative machine learning and graph applications. The speedup comes from avoiding I/O and deserialization costs by storing data in memory as Java objects.
  • The applications written perform and scale well. Spark can speed up an analytics report that was running on Hadoop by 40 times.
  • When nodes fail, Spark can recover quickly by rebuilding only the lost RDD partitions.
  • Spark was be used to query a 1-TB dataset interactively with latencies of 5-7 seconds.

Spark versus Hadoop for a SORT Benchmark--In 2014, the Databricks team participated in a SORT benchmark test (http://sortbenchmark.org/). This was done on a 100-TB dataset. Hadoop was running in a dedicated data center and a Spark cluster of over 200 nodes was run on EC2. Spark was run on HDFS distributed storage.

Spark was 3 times faster than Hadoop and used 10 times fewer machines. Refer to the following graph:

主站蜘蛛池模板: 广水市| 兴山县| 青田县| 呼玛县| 隆回县| 株洲市| 中山市| 平乐县| 衡东县| 丁青县| 绿春县| 呼图壁县| 股票| 万宁市| 阳山县| 如皋市| 衡南县| 林芝县| 额尔古纳市| 上饶市| 静乐县| 邵阳市| 门源| 万全县| 安陆市| 昌黎县| 浦城县| 石嘴山市| 湖南省| 安岳县| 谷城县| 辽阳市| 东港市| 加查县| 昭通市| 甘泉县| 和龙市| 井研县| 威宁| 衡山县| 晋城|