Set Operations

import findspark
findspark.init()  # locate the local Spark installation so pyspark can be imported

import pyspark
sc = pyspark.SparkContext()

Spark also provides functionality similar to the native Python set operations.
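
For reference, here are the native Python set operations this section mirrors (plain Python, nothing Spark-specific assumed), using the same letters as the RDDs built below:

a = set('ADGJMPSVY')
b = set('AEIMQUY')

print(a | b)  # union -- deduplicated by definition on sets
print(a & b)  # intersection
print(a - b)  # difference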

Union

everyThree = [chr(x+65) for x in range(26)][::3]  # every third capital letter
everyThree = sc.parallelize(everyThree)
everyThree.collect()
['A', 'D', 'G', 'J', 'M', 'P', 'S', 'V', 'Y']
everyFour = [chr(x+65) for x in range(26)][::4]  # every fourth capital letter
everyFour = sc.parallelize(everyFour)
everyFour.collect()
['A', 'E', 'I', 'M', 'Q', 'U', 'Y']

Note that, unlike Python's set union, RDD union() does not remove duplicate entries.

print(everyThree.union(everyFour).collect())
['A', 'D', 'G', 'J', 'M', 'P', 'S', 'V', 'Y', 'A', 'E', 'I', 'M', 'Q', 'U', 'Y']
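
If you want Python-set-style semantics, one option is to chain distinct() onto the union (a small sketch using the RDDs above; the order collect() returns is not guaranteed):

# union keeps duplicates, so dedupe explicitly afterwards --
# this adds a shuffle, unlike union itself
print(everyThree.union(everyFour).distinct().collect())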

Intersection and Subtract

vowelsAlways = ['A', 'E', 'I', 'O', 'U']
vowelsSometimes = vowelsAlways[:] + ['Y']
vowelsAlways = sc.parallelize(vowelsAlways)
vowelsSometimes = sc.parallelize(vowelsSometimes)
print(vowelsAlways.collect())
print(vowelsSometimes.collect())
['A', 'E', 'I', 'O', 'U']
['A', 'E', 'I', 'O', 'U', 'Y']
vowelsSometimes.intersection(vowelsAlways).collect()
['O', 'I', 'A', 'U', 'E']
vowelsSometimes.subtract(vowelsAlways).collect()
['Y']
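
Also worth noting: subtract is not symmetric. Flipping the operands asks which always-vowels are missing from vowelsSometimes, and there are none:

# subtract keeps elements of the left RDD that do not appear in the right
vowelsAlways.subtract(vowelsSometimes).collect()
# []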

Set Operations Return More RDDs

You can get pretty complicated with your set operations. It’s all just RDDs.

wordSoup = everyFour.union(everyThree)
sansVowels = wordSoup.subtract(vowelsAlways)
print(sansVowels.collect())
['J', 'Q', 'S', 'D', 'M', 'M', 'Y', 'Y', 'G', 'P', 'V']
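
Because sansVowels is itself an RDD, you can keep chaining; for example, a quick sketch that dedupes the repeated M and Y entries and sorts the result for display:

print(sansVowels.distinct().sortBy(lambda letter: letter).collect())
# ['D', 'G', 'J', 'M', 'P', 'Q', 'S', 'V', 'Y']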

Distinct

distinct() removes duplicate elements from an RDD. Note that parallelizing a string produces an RDD of individual characters, so the count below is the sentence's 25 unique letters plus the space.

text = 'the quick brown fox jumped over the lazy dog'
text = sc.parallelize(text)
text.distinct().count()
26
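
To deduplicate words instead of characters, split the string before parallelizing (a sketch; the variable name sentence is just for illustration):

sentence = 'the quick brown fox jumped over the lazy dog'
words = sc.parallelize(sentence.split())
words.distinct().count()
# 8 -- of the 9 words, only 'the' repeats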

Performance Note

Neat as this is, recall that Spark processes data one partition at a time.

That’s to say that if your data is split across N partitions, the per-partition work is cheap, but operations like intersection, subtract, and distinct then have to shuffle records between partitions so that potentially-matching elements land in the same place to be compared.

This scales… poorly, haha
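
One way to see this (a sketch; the exact lineage text varies by Spark version) is toDebugString, which shows the shuffle that intersection introduces but union avoids:

# union merely concatenates the two sets of partitions -- no shuffle
print(everyThree.union(everyFour).toDebugString().decode())

# intersection repartitions by key so matching elements can meet,
# so a shuffle boundary appears in its lineage
print(everyThree.intersection(everyFour).toDebugString().decode())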