Basic Functional Programming Functions
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext()
Spark has plenty of analogues to the native Python functional programming methods. However, as discussed in Transformations and Actions, nothing gets evaluated up front: calls are merely strung together into a chain of RDDs, and only when we call some sort of Action does any actual computation happen.
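For example, here is a minimal sketch (reusing the sc defined above): building the transformation returns immediately and does no work, and only the Action at the end forces Spark to evaluate anything.
lazy = sc.parallelize(range(5)).map(lambda x: x * 2)  # transformation only: nothing has run yet
lazy.collect()  # collect is an Action, so the computation finally happens
[0, 2, 4, 6, 8]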
Let’s consider a simple dataset:
rdd = sc.parallelize([1, 2, 3, 4, 5])
Familiar fns
Map
mapped = rdd.map(lambda x: chr(x+64))
mapped.collect()
['A', 'B', 'C', 'D', 'E']
Filter
filtered = rdd.filter(lambda x: x != 1)
filtered.collect()
[2, 3, 4, 5]
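For comparison, here is an illustrative sketch (not from the original notebook) of the same two results in plain Python, using the built-in map and filter:
data = [1, 2, 3, 4, 5]
list(map(lambda x: chr(x + 64), data))  # same result as rdd.map(...).collect()
['A', 'B', 'C', 'D', 'E']
list(filter(lambda x: x != 1, data))  # same result as rdd.filter(...).collect()
[2, 3, 4, 5]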
Reduce
The reduce function is an Action, and thus gets evaluated immediately.
from operator import add
added = rdd.reduce(add)
added
15
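The native-Python counterpart here is functools.reduce; note that, unlike map and filter above, Spark's reduce hands back a plain Python value rather than another RDD. A small comparative sketch:
from functools import reduce
reduce(add, [1, 2, 3, 4, 5])  # same result as rdd.reduce(add)
15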
New fns
FlatMap
flatMap combines typical mapping with behavior similar to the itertools.chain function.
text = sc.textFile('../data/zen.txt')
Calling map with split on our text file returns a list of lists, one sublist per line.
print(text.map(lambda x: x.split()).collect()[:3])
[['Beautiful', 'is', 'better', 'than', 'ugly.'], ['Explicit', 'is', 'better', 'than', 'implicit.'], ['Simple', 'is', 'better', 'than', 'complex.']]
However, using flatMap, the many sublists are consolidated into one flat list.
print(text.flatMap(lambda x: x.split()).collect()[:30])
['Beautiful', 'is', 'better', 'than', 'ugly.', 'Explicit', 'is', 'better', 'than', 'implicit.', 'Simple', 'is', 'better', 'than', 'complex.', 'Complex', 'is', 'better', 'than', 'complicated.', 'Flat', 'is', 'better', 'than', 'nested.', 'Sparse', 'is', 'better', 'than', 'dense.']
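As a stand-alone sketch of that itertools.chain analogy (using a small in-memory list rather than the text file), the two approaches below produce the same flattened output:
from itertools import chain
lines = ['Beautiful is better than ugly.', 'Explicit is better than implicit.']
list(chain.from_iterable(line.split() for line in lines))
['Beautiful', 'is', 'better', 'than', 'ugly.', 'Explicit', 'is', 'better', 'than', 'implicit.']
sc.parallelize(lines).flatMap(lambda x: x.split()).collect()
['Beautiful', 'is', 'better', 'than', 'ugly.', 'Explicit', 'is', 'better', 'than', 'implicit.']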
Fold
Similar to the reduce function, fold consolidates many records into one value. The difference here, though, is that we can specify a starting value from which the reduction begins.
For instance:
rdd = sc.parallelize(range(10))
rdd.reduce(add)
45
Here we specify that we want to add the numbers 0 through 9 to the starting value 99.
rdd.fold(99, add)
540
This might seem unintuitive. However, fold works at the partition level. Peeking behind the curtain with the glom function, we can see that our original dataset is split across 4 different partitions:
rdd.glom().collect()
[[0, 1], [2, 3, 4], [5, 6], [7, 8, 9]]
Therefore, the fold step happens within each of these 4 partitions, each seeded with the starting value 99, and the partition results are then folded together with the starting value applied once more. This works out to:
$99 \times (4 + 1) + \sum\limits_{i=0}^{9}{i} = 540$
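Because the starting value is applied once per partition plus once more when the partition results are merged, the outcome of fold changes with the number of partitions. A minimal sketch, passing an explicit partition count to parallelize:
rdd2 = sc.parallelize(range(10), 2)  # force exactly 2 partitions
rdd2.glom().collect()
[[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
rdd2.fold(99, add)  # 99*(2 + 1) + 45
342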