
Set operations in Spark

For those of you who come from the database world and have now ventured into big data, you're probably wondering how to apply set operations to Spark datasets. You may have realized that an RDD can represent any sort of data, but it does not necessarily represent set-based data. The typical set operations in the database world include the following, and we'll see how some of them apply to Spark. It is important to remember, however, that while Spark offers ways to mimic these operations, it doesn't allow you to apply conditions to them, as is common in SQL operations:

  • Distinct: The distinct operation returns a de-duplicated set of elements from the dataset
  • Intersection: The intersection operation returns only those elements that are available in both datasets
  • Union: A union operation returns the elements of both datasets
  • Subtract: A subtract operation returns the elements of one dataset after taking away all the matching elements from the second dataset
  • Cartesian: A Cartesian operation returns the Cartesian product of both datasets

Distinct()

In data management and analytics, working on a distinct, de-duplicated set of data is often critical. Spark offers the ability to extract distinct values from a dataset using the available transformation operations. Let's look at the ways you can collect distinct elements in Scala, Python, and Java.

Example 2.10: Distinct in Scala:

val movieList = sc.parallelize(List("A Nous Liberte","Airplane","The Apartment","The Apartment"))
movieList.distinct().collect()

Example 2.11: Distinct in Python:

movieList = sc.parallelize(["A Nous Liberte","Airplane","The Apartment","The Apartment"])
movieList.distinct().collect()

Example 2.12: Distinct in Java:

JavaRDD<String> movieList = sc.parallelize(Arrays.asList("A Nous Liberte","Airplane","The Apartment","The Apartment"));
movieList.distinct().collect();

Intersection()

Intersection is similar to an inner join operation, with the caveat that it doesn't allow joining criteria. Intersection looks at the elements of both RDDs and returns only the elements that are available in both datasets. For example, you might have two lists of candidates based on their skill sets:

java_skills = "Tom Mahoney", "Alicia Whitekar", "Paul Jones", "Rodney Marsh"
db_skills = "James Kent", "Paul Jones", "Tom Mahoney", "Adam Waugh"
java_and_db_skills = java_skills.intersection(db_skills)

Figure 2.20: Intersection operation

Let's look at examples of intersection of two datasets in Scala, Python, and Java.

Example 2.13: Intersection in Scala:

val java_skills = sc.parallelize(List("Tom Mahoney","Alicia Whitekar","Paul Jones","Rodney Marsh"))
val db_skills = sc.parallelize(List("James Kent","Paul Jones","Tom Mahoney","Adam Waugh"))
java_skills.intersection(db_skills).collect()

Example 2.14: Intersection in Python:

java_skills = sc.parallelize(["Tom Mahoney","Alicia Whitekar","Paul Jones","Rodney Marsh"])
db_skills = sc.parallelize(["James Kent","Paul Jones","Tom Mahoney","Adam Waugh"])
java_skills.intersection(db_skills).collect()

Example 2.15: Intersection in Java:

JavaRDD<String> javaSkills= sc.parallelize(Arrays.asList("Tom Mahoney","Alicia Whitekar","Paul Jones","Rodney Marsh")); 
JavaRDD<String> dbSkills= sc.parallelize(Arrays.asList("James Kent","Paul Jones","Tom Mahoney","Adam Waugh")); 
javaSkills.intersection(dbSkills).collect(); 

Union()

Union is basically an aggregation of both datasets. If some elements appear in both datasets, they will be duplicated in the result. Looking at the data from the previous examples, people like Tom Mahoney and Paul Jones have both Java and DB skills, so a union of the two datasets will contain two entries for each of them. We'll only look at a Scala example in this case.

Example 2.16: Union in Scala:

val java_skills = sc.parallelize(List("Tom Mahoney","Alicia Whitekar","Paul Jones","Rodney Marsh"))
val db_skills = sc.parallelize(List("James Kent","Paul Jones","Tom Mahoney","Adam Waugh"))
java_skills.union(db_skills).collect()
//The result would be: Tom Mahoney, Alicia Whitekar, Paul Jones, Rodney Marsh, James Kent, Paul Jones, Tom Mahoney, Adam Waugh.

Figure 2.21: Union operation
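Note that, unlike the SQL UNION operator, Spark's union() behaves like UNION ALL and does not remove duplicates. If you need a de-duplicated union, a common pattern, shown here as a sketch on the RDDs above, is to chain distinct() after union():

java_skills.union(db_skills).distinct().collect()
//The result would be: Tom Mahoney, Alicia Whitekar, Paul Jones, Rodney Marsh, James Kent, Adam Waugh.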

Subtract()

Subtraction, as the name indicates, removes the elements of one dataset from the other. It is very useful in ETL operations for identifying new data arriving on successive days, making sure you pick out the new data items before doing the integration. Let's take a quick look at the Scala example and the results of the operation.

Example 2.17: subtract() in Scala:

val java_skills = sc.parallelize(List("Tom Mahoney","Alicia Whitekar","Paul Jones","Rodney Marsh"))
val db_skills = sc.parallelize(List("James Kent","Paul Jones","Tom Mahoney","Adam Waugh"))
java_skills.subtract(db_skills).collect()

Results: Alicia Whitekar, Rodney Marsh.
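To connect this back to the ETL use case, here is a minimal sketch, assuming two hypothetical RDDs, yesterdays_data and todays_data, representing successive daily feeds:

//Hypothetical daily feeds; the names and records are purely illustrative
val yesterdays_data = sc.parallelize(List("record1","record2","record3"))
val todays_data = sc.parallelize(List("record2","record3","record4"))
//Only the records that did not appear yesterday remain
todays_data.subtract(yesterdays_data).collect()
//The result would be: record4.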

Cartesian()

Cartesian simulates the cross join of an SQL system, and basically gives you all the possible combinations of the elements of the two datasets. For example, you might have the 12 months of the year and a total of five years, and you want to look at all the possible (month, year) dates on which you need to perform a particular operation. Here's how you would generate the data using the cartesian() transformation in Spark:

Figure 2.22: Cartesian product in Scala
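A minimal Scala sketch along the lines of Figure 2.22, assuming the months are represented as strings and the years as integers (the variable names are illustrative):

val months = sc.parallelize(List("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"))
val years = sc.parallelize(List(2012,2013,2014,2015,2016))
//cartesian() pairs every month with every year, giving 12 x 5 = 60 combinations
val monthsAndYears = months.cartesian(years)
monthsAndYears.collect()
//The result would include pairs such as (Jan,2012), (Jan,2013), ... (Dec,2016).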

Tip

For brevity, we are not going to give the examples for Python and Java here, but you can find them at this book's GitHub page.
