
Set operations in Spark

For those of you who come from the database world and have now ventured into big data, you're probably wondering how to apply set operations to Spark datasets. You might have realized that an RDD can represent any sort of data, but it does not necessarily represent set-based data. The typical set operations in the database world include the following, and we'll see how some of these apply to Spark. However, it is important to remember that while Spark offers ways to mimic these operations, it doesn't allow you to apply conditions to them, which is common in SQL:

  • Distinct: Returns a non-duplicated set of elements from the dataset
  • Intersection: Returns only those elements that are present in both datasets
  • Union: Returns the elements of both datasets, including duplicates
  • Subtract: Returns the elements of one dataset after removing all matching elements found in the second dataset
  • Cartesian: Returns the Cartesian product of both datasets
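Before turning to the Spark APIs, it may help to see the same semantics with plain Python collections. This is only a sketch of what each operation produces; Spark applies these transformations to distributed RDDs, not in-memory lists:

```python
from itertools import product

a = ["A", "B", "B", "C"]
b = ["B", "C", "D"]

# Distinct: drop duplicates (Spark makes no ordering guarantee either)
distinct_a = sorted(set(a))

# Intersection: elements present in both datasets
both = sorted(set(a) & set(b))

# Union: Spark's union() keeps duplicates, like list concatenation
combined = a + b

# Subtract: keep only the elements of a that do not appear in b
remaining = sorted(set(a) - set(b))

# Cartesian: every possible pairing of elements from the two datasets
pairs = list(product(["x", "y"], [1, 2]))
```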

Distinct()

During data management and analytics, working on a distinct non-duplicated set of data is often critical. Spark offers the ability to extract distinct values from a dataset using the available transformation operations. Let's look at the ways you can collect distinct elements in Scala, Python, and Java.

Example 2.10: Distinct in Scala:

val movieList = sc.parallelize(List("A Nous Liberte","Airplane","The Apartment","The Apartment"))
movieList.distinct().collect()

Example 2.11: Distinct in Python:

movieList = sc.parallelize(["A Nous Liberte","Airplane","The Apartment","The Apartment"])
movieList.distinct().collect()

Example 2.12: Distinct in Java:

JavaRDD<String> movieList = sc.parallelize(Arrays.asList("A Nous Liberte","Airplane","The Apartment","The Apartment"));
movieList.distinct().collect();

Intersection()

Intersection is similar to an inner join operation, with the caveat that it doesn't allow a joining criterion. Intersection looks at elements from both RDDs and returns only those elements that are available in both datasets. For example, you might have candidates grouped by skillset:

java_skills = "Tom Mahoney", "Alicia Whitekar", "Paul Jones", "Rodney Marsh"
db_skills = "James Kent", "Paul Jones", "Tom Mahoney", "Adam Waugh"
java_and_db_skills = java_skills.intersection(db_skills)

Figure 2.20: Intersection operation

Let's look at examples of intersection of two datasets in Scala, Python, and Java.

Example 2.13: Intersection in Scala:

val java_skills = sc.parallelize(List("Tom Mahoney","Alicia Whitekar","Paul Jones","Rodney Marsh"))
val db_skills = sc.parallelize(List("James Kent","Paul Jones","Tom Mahoney","Adam Waugh"))
java_skills.intersection(db_skills).collect()

Example 2.14: Intersection in Python:

java_skills = sc.parallelize(["Tom Mahoney","Alicia Whitekar","Paul Jones","Rodney Marsh"])
db_skills = sc.parallelize(["James Kent","Paul Jones","Tom Mahoney","Adam Waugh"])
java_skills.intersection(db_skills).collect()

Example 2.15: Intersection in Java:

JavaRDD<String> javaSkills= sc.parallelize(Arrays.asList("Tom Mahoney","Alicia Whitekar","Paul Jones","Rodney Marsh")); 
JavaRDD<String> dbSkills= sc.parallelize(Arrays.asList("James Kent","Paul Jones","Tom Mahoney","Adam Waugh")); 
javaSkills.intersection(dbSkills).collect(); 

Union()

Union is basically an aggregation of both datasets. If any data elements are present in both datasets, they will be duplicated in the result. Looking at the data from the previous examples, people like Tom Mahoney and Paul Jones have both Java and DB skills, so a union of the two datasets will contain two entries for each of them. We'll only look at a Scala example in this case.

Example 2.16: Union in Scala:

val java_skills = sc.parallelize(List("Tom Mahoney","Alicia Whitekar","Paul Jones","Rodney Marsh"))
val db_skills = sc.parallelize(List("James Kent","Paul Jones","Tom Mahoney","Adam Waugh"))
java_skills.union(db_skills).collect()
// Result: Tom Mahoney, Alicia Whitekar, Paul Jones, Rodney Marsh, James Kent, Paul Jones, Tom Mahoney, Adam Waugh

Figure 2.21: Union operation

Subtract()

Subtraction, as the name indicates, removes the elements of one dataset from the other. It is very useful in ETL operations for identifying new data arriving on successive days, making sure you isolate the new data items before doing the integration. Let's take a quick look at the Scala example and view the results of the operation.

Example 2.17: The subtract() in Scala:

val java_skills = sc.parallelize(List("Tom Mahoney","Alicia Whitekar","Paul Jones","Rodney Marsh"))
val db_skills = sc.parallelize(List("James Kent","Paul Jones","Tom Mahoney","Adam Waugh"))
java_skills.subtract(db_skills).collect()

Results: Alicia Whitekar, Rodney Marsh.
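The ETL use case mentioned above, finding the records that are new in today's load compared to yesterday's, can be sketched in plain Python. The record names here are hypothetical; in Spark you would call subtract() on the two RDDs in the same way:

```python
# Hypothetical daily loads: yesterday's records and today's records
yesterday = ["order-001", "order-002", "order-003"]
today = ["order-002", "order-003", "order-004", "order-005"]

# subtract() semantics: keep today's records that were not seen yesterday
new_records = [rec for rec in today if rec not in set(yesterday)]
```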

Cartesian()

Cartesian simulates the cross join from an SQL system, and basically gives you all the possible combinations between the elements of the two datasets. For example, you might have the 12 months of the year and a total of five years, and you want to look at all the possible month-year combinations where you need to perform a particular operation. Here's how you would generate the data using the cartesian() transformation in Spark:

Figure 2.22: Cartesian product in Scala
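The month-by-year pairing described above can be illustrated with plain Python's itertools.product, which produces the same pairs cartesian() would. The years chosen are a hypothetical five-year range; in Spark you would call months.cartesian(years) on two RDDs:

```python
from itertools import product

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
years = [2019, 2020, 2021, 2022, 2023]  # a hypothetical five-year range

# cartesian() pairs every month with every year
month_year_pairs = list(product(months, years))
len(month_year_pairs)  # 12 * 5 = 60 combinations
```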

Tip

For brevity, we are not going to give the examples for Python and Java here, but you can find them at this book's GitHub page.
