Set Operations
# findspark locates a local Spark installation and puts pyspark on the path
import findspark
findspark.init()

import pyspark
sc = pyspark.SparkContext()
Spark also provides functionality similar to the native Python set
operations.
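For reference, here's roughly what the sections below do, but with plain Python set objects and no Spark involved (a minimal comparison sketch, not from the original):

# native Python sets, for comparison with the RDD methods below
a = {'A', 'D', 'G'}
b = {'A', 'E', 'I'}
print(a | b)   # union -- note that Python's set union DOES dedupe
print(a & b)   # intersection
print(a - b)   # difference, analogous to RDD.subtract()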
Union
everyThree = [chr(x+65) for x in range(26)][::3]
everyThree = sc.parallelize(everyThree)
everyThree.collect()
['A', 'D', 'G', 'J', 'M', 'P', 'S', 'V', 'Y']
everyFour = [chr(x+65) for x in range(26)][::4]
everyFour = sc.parallelize(everyFour)
everyFour.collect()
['A', 'E', 'I', 'M', 'Q', 'U', 'Y']
Note that, unlike Python's set union, RDD union does not remove duplicate entries:
print(everyThree.union(everyFour).collect())
['A', 'D', 'G', 'J', 'M', 'P', 'S', 'V', 'Y', 'A', 'E', 'I', 'M', 'Q', 'U', 'Y']
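If you want true set semantics, you can chain a .distinct() onto the result (the output order after the shuffle isn't guaranteed):

# dedupe the union to get true set semantics
print(everyThree.union(everyFour).distinct().collect())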
Intersection and Subtract
vowelsAlways = ['A', 'E', 'I', 'O', 'U']
vowelsSometimes = vowelsAlways[:] + ['Y']
vowelsAlways = sc.parallelize(vowelsAlways)
vowelsSometimes = sc.parallelize(vowelsSometimes)
print(vowelsAlways.collect())
print(vowelsSometimes.collect())
['A', 'E', 'I', 'O', 'U']
['A', 'E', 'I', 'O', 'U', 'Y']
vowelsSometimes.intersection(vowelsAlways).collect()
['O', 'I', 'A', 'U', 'E']
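Unlike union, intersection does dedupe, but because it shuffles records across partitions the output order isn't preserved. Sort after collecting if you need deterministic output:

sorted(vowelsSometimes.intersection(vowelsAlways).collect())
['A', 'E', 'I', 'O', 'U']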
vowelsSometimes.subtract(vowelsAlways).collect()
['Y']
Set Operations Return More RDDs
You can get pretty complicated with your set operations. It’s all just RDDs.
wordSoup = everyFour.union(everyThree)
sansVowels = wordSoup.subtract(vowelsAlways)
print(sansVowels.collect())
['J', 'Q', 'S', 'D', 'M', 'M', 'Y', 'Y', 'G', 'P', 'V']
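Notice the duplicated 'M's and 'Y's that union carried along. One more .distinct() on the chain would dedupe those as well:

# each qualifying consonant exactly once (order not guaranteed)
print(sansVowels.distinct().collect())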
Distinct
text = 'the quick brown fox jumped over the lazy dog'
text = sc.parallelize(text)
text.distinct().count()
26
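Two things are happening here: parallelize-ing a string gives you an RDD of individual characters, and the 26 is 25 distinct letters plus the space character ("jumped" has no 's', so this variant of the pangram misses a letter). If you wanted distinct words instead of characters, split first (a quick variation, not from the original):

# split into words before parallelizing to count distinct words
words = sc.parallelize('the quick brown fox jumped over the lazy dog'.split())
words.distinct().count()
8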
Performance Note
Neat as this is, recall that Spark computes cheaply only within the scope of each partition.
That's to say that if your data is split over N partitions, the per-partition work is cheap, but intersection, subtract, and distinct all have to shuffle records between partitions so that equal values land together before they can be compared. That shuffle cost grows with both the data size and the partition count.
This scales… poorly, haha
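You can see a bit of this from the partition counts: union just concatenates the two partition lists (no shuffle), while the shuffling operations accept a numPartitions argument that controls how wide the shuffle is. A quick sketch; the exact counts you see depend on your local defaults:

# union concatenates partitions, no shuffle happens
print(everyThree.getNumPartitions(), everyFour.getNumPartitions())
print(everyThree.union(everyFour).getNumPartitions())

# distinct and subtract shuffle; numPartitions sets the shuffle width
print(everyThree.union(everyFour).distinct(numPartitions=4).getNumPartitions())
print(wordSoup.subtract(vowelsAlways, numPartitions=4).getNumPartitions())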