Authors: Rajdeep Dua, Manpreet Singh Ghotra, Nick Pentreath
Updated: 2021-07-09 21:07:44
Benefits of using Spark ML as compared to existing libraries
The AMPLab at Berkeley evaluated Spark and RDDs through a series of experiments on Amazon EC2, as well as benchmarks of user applications.
Algorithms used: logistic regression and k-means
Use cases: first iteration and subsequent iterations
All the tests used m1.xlarge EC2 nodes with 4 cores and 15 GB of RAM. HDFS was used for storage, with 256 MB blocks. Refer to the following graph:
The preceding graph compares the performance of Hadoop and Spark on the first and subsequent iterations of logistic regression.
The preceding graph compares the performance of Hadoop and Spark on the first and subsequent iterations of the k-means clustering algorithm.
The overall results show the following:
Spark outperforms Hadoop by up to 20 times in iterative machine learning and graph applications. The speedup comes from avoiding I/O and deserialization costs by storing data in memory as Java objects.
Applications written by users perform and scale well. In one case, Spark sped up an analytics report that had been running on Hadoop by 40 times.
When nodes fail, Spark can recover quickly by rebuilding only the lost RDD partitions.
Spark can be used to query a 1-TB dataset interactively with latencies of 5-7 seconds.
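The speedup on later iterations comes from parsing the input once and reusing the in-memory objects. The following is a minimal sketch of that idea in plain Python, not Spark code: the dataset `RAW_LINES`, the `parse` counter, and the two training functions are hypothetical names introduced only for illustration. It counts how many records are deserialized when an iterative logistic regression re-parses its input on every pass, compared with parsing once and keeping the result in memory.

```python
import math

# Hypothetical toy dataset: each line is "label,x1,x2", as it might sit in a text file.
RAW_LINES = ["1,2.0,1.0", "0,-1.5,-0.5", "1,1.0,2.5", "0,-2.0,-1.0"]

parse_calls = 0  # counts how many records get deserialized


def parse(line):
    """Deserialize one text record into (label, features)."""
    global parse_calls
    parse_calls += 1
    fields = line.split(",")
    return float(fields[0]), [float(v) for v in fields[1:]]


def gradient_step(points, weights, lr=0.1):
    """One batch gradient-descent step for logistic regression."""
    grad = [0.0] * len(weights)
    for label, features in points:
        z = sum(w * x for w, x in zip(weights, features))
        pred = 1.0 / (1.0 + math.exp(-z))
        for i, x in enumerate(features):
            grad[i] += (pred - label) * x
    return [w - lr * g for w, g in zip(weights, grad)]


def train_reparsing(lines, iterations):
    """Re-read and re-parse the input on every iteration (disk-based pattern)."""
    weights = [0.0, 0.0]
    for _ in range(iterations):
        points = [parse(line) for line in lines]
        weights = gradient_step(points, weights)
    return weights


def train_cached(lines, iterations):
    """Parse once, keep the objects in memory, and reuse them on every iteration."""
    points = [parse(line) for line in lines]
    weights = [0.0, 0.0]
    for _ in range(iterations):
        weights = gradient_step(points, weights)
    return weights


def count_parses(train_fn, iterations=10):
    """Return (weights, number of parse calls) for one training run."""
    global parse_calls
    parse_calls = 0
    weights = train_fn(RAW_LINES, iterations)
    return weights, parse_calls
```

Calling `count_parses(train_reparsing)` and `count_parses(train_cached)` yields the same weights, but the first run deserializes 40 records (4 records over 10 iterations) while the second deserializes only 4. In this toy setting the first iteration costs the same either way and only the later iterations get cheaper, which is the pattern the first-iteration versus subsequent-iteration comparison above describes.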
Spark versus Hadoop for a sort benchmark: In 2014, the Databricks team participated in a sort benchmark test (http://sortbenchmark.org/). This was done on a 100-TB dataset. Hadoop was running in a dedicated data center, and a Spark cluster of over 200 nodes was run on EC2. Spark was run on HDFS distributed storage.
Spark was 3 times faster than Hadoop and used 10 times fewer machines. Refer to the following graph: