import findspark
findspark.init()

import pyspark
sc = pyspark.SparkContext()
In an example borrowed wholesale from Learning Spark, we’re going to build a simple spam classifier that decides whether an email is spam or legitimate. This post is less about the approach and more about examining the building blocks of the MLlib pipeline. But first…
Each line represents a separate email. The dataset came pre-sorted by spam/not-spam; we’re going to split each email into words, then do some fancy Spark pre-processing.
spam = sc.textFile('../data/spam.txt')
Dear sir, I am a Prince in a far kingdom you have not heard of. I want to send you money via wire transfer so please ...
Get Viagra real cheap! Send money right away to ...
Oh my gosh you can be really strong too with these drugs found in the rainforest. Get them cheap right now ...
YOUR COMPUTER HAS BEEN INFECTED! YOU MUST RESET YOUR PASSWORD. Reply to this email with your password and SSN ...
THIS IS NOT A SCAM! Send money and get access to awesome stuff really cheap and never have to ...
# ostensibly 'good spam'
ham = sc.textFile('../data/ham.txt')
Dear Spark Learner, Thanks so much for attending the Spark Summit 2014! Check out videos of talks from the summit at ...
Hi Mom, Apologies for being late about emailing and forgetting to send you the package. I hope you and bro have been ...
Wow, hey Fred, just heard about the Spark petabyte sort. I think we need to take time to try it out immediately ...
Hi Spark user list, This is my first question to this list, so thanks in advance for your help! I tried running ...
Thanks Tom for your email. I need to refer you to Alice for this one. I haven't yet figured out that part either ...
Good job yesterday! I was attending your talk, and really enjoyed it. I want to try out GraphX ...
Summit demo got whoops from audience! Had to let you know. --Joe
MLlib works at the
TF, Term Frequency
The HashingTF object we’re about to use deserves a whole post in itself, but for now we’ll hand-wave and say that it takes a sequence of strings and converts it to a fixed-length numeric vector of term counts.
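To make the hand-wave a little more concrete, here is a minimal pure-Python sketch of the hashing trick that HashingTF relies on. It is not Spark's implementation: the function name hashing_tf is made up for illustration, and zlib.crc32 is an assumed stand-in hash, chosen only because it gives the same result on every run.

```python
import zlib

def hashing_tf(words, num_features=100):
    """Map a list of words to a fixed-length list of term counts."""
    vector = [0.0] * num_features
    for word in words:
        # Assumption: crc32 stands in for whatever hash Spark uses internally.
        bucket = zlib.crc32(word.encode("utf-8")) % num_features
        vector[bucket] += 1.0
    return vector

# The vector length is fixed up front; the counts add up to the number of words.
vec = hashing_tf("send money send money now".split(" "), num_features=10)
print(len(vec), sum(vec))
```

Note that no vocabulary is stored anywhere: the position of a word in the vector is determined entirely by its hash.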
from pyspark.mllib.feature import HashingTF
We could get away with making an arbitrarily large feature space, but let’s first look at how many distinct words there are in the dataset.
uniqueWords = (spam.union(ham)
                   .flatMap(lambda x: x.split())
                   .distinct())
uniqueWords.count()
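For intuition, the flatMap / distinct / count chain computes roughly what a set comprehension does in plain Python. The sketch below runs on two made-up lines (hypothetical, not taken from the dataset) rather than on an RDD.

```python
# Two made-up emails standing in for the lines of the spam and ham files.
lines = ["send money now", "send help now please"]

# flatMap(split) gives one flat stream of words, distinct() drops repeats,
# and count() is the size of what is left.
unique_words = {word for line in lines for word in line.split()}
print(len(unique_words))  # 5
```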
And then truncate down to 100 features for roundness.
tf = HashingTF(numFeatures=100)
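Truncating to 100 features has a cost worth seeing once: whenever there are more distinct words than buckets, some words must share a bucket. The sketch below counts those collisions under the same assumptions as any toy version of the hashing trick: bucket_of and count_collisions are hypothetical helpers, zlib.crc32 is a stand-in hash rather than Spark's, and the 150 generated words are placeholders for a real vocabulary.

```python
import zlib

def bucket_of(word, num_features):
    # Assumption: crc32 is an illustrative hash, not the one Spark uses.
    return zlib.crc32(word.encode("utf-8")) % num_features

def count_collisions(words, num_features):
    """Number of distinct words that end up sharing a bucket with another word."""
    distinct = set(words)
    occupied = {bucket_of(w, num_features) for w in distinct}
    return len(distinct) - len(occupied)

# 150 placeholder words into 100 buckets: at least 50 must collide.
words = ["word%d" % i for i in range(150)]
print(count_collisions(words, 100))
```

Collisions blur distinct words together, which is the trade-off accepted in exchange for a small, fixed-size feature vector.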
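Next, run every email through tf. HashingTF never learns a vocabulary, so tf is not updated by the values it sees: transform simply hashes each word into one of the 100 buckets and counts how many words land in each.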
spamFeatures = spam.map(lambda email: tf.transform(email.split(' ')))
hamFeatures = ham.map(lambda email: tf.transform(email.split(' ')))
And this is just standard modeling fare, creating a y for each X, but now with the LabeledPoint class that MLlib expects.
from pyspark.mllib.regression import LabeledPoint
positiveExamples = spamFeatures.map(lambda features: LabeledPoint(1, features))
negativeExamples = hamFeatures.map(lambda features: LabeledPoint(0, features))
trainingData = positiveExamples.union(negativeExamples)
Cache the data with trainingData.cache(), because we’ll be doing a ton of iterating over it when we train the model.
UnionRDD at union at <unknown>:0
If that looks like sklearn, that’s on purpose.
from pyspark.mllib.classification import LogisticRegressionWithSGD
model = LogisticRegressionWithSGD.train(trainingData)
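To show where all that iterating comes from, here is a small pure-Python sketch of logistic regression trained by stochastic gradient descent. It is a teaching toy under stated assumptions, not what LogisticRegressionWithSGD does internally: the function names, the step size, the iteration count, the lack of an intercept, and the four-row dataset are all made up for illustration.

```python
import math
import random

def sigmoid(z):
    # Clamp z so math.exp cannot overflow on extreme scores.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_sgd(examples, num_features, iterations=100, step=0.5, seed=0):
    """examples: list of (label, features) pairs with label in {0, 1}."""
    rng = random.Random(seed)
    weights = [0.0] * num_features
    for _ in range(iterations):
        # Every iteration scans the whole dataset again; this repeated
        # scanning is why caching the training data pays off.
        order = list(examples)
        rng.shuffle(order)
        for label, features in order:
            score = sum(w * x for w, x in zip(weights, features))
            error = sigmoid(score) - label
            for i, x in enumerate(features):
                weights[i] -= step * error * x
    return weights

def predict(weights, features):
    score = sum(w * x for w, x in zip(weights, features))
    return 1 if sigmoid(score) >= 0.5 else 0

# Toy data: feature 0 only fires for label 1, feature 1 only for label 0.
toy = [(1, [1.0, 0.0]), (1, [2.0, 0.0]), (0, [0.0, 1.0]), (0, [0.0, 2.0])]
weights = train_logistic_sgd(toy, num_features=2)
print(predict(weights, [1.0, 0.0]), predict(weights, [0.0, 1.0]))  # 1 0
```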
Finally, we’ll make some fake email strings to see how accurate our classifier proves to be.
posTest = tf.transform("Congrats, you won! Send me money!".split(' '))
negTest = tf.transform("Hey, yeah. I think you're right about this one.".split(' '))
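The step from a feature vector to a verdict is a call to the model's predict method. A small helper can bundle the split, transform, and predict steps; classify is a hypothetical name, and the sketch assumes that tf and model are the HashingTF and trained model objects built earlier in the post.

```python
def classify(model, tf, text):
    """Return the model's prediction for a raw email string."""
    # Assumes tf.transform accepts a list of words and model.predict
    # accepts the resulting feature vector, as in the code above.
    return model.predict(tf.transform(text.split(' ')))

# Intended use, given the objects defined in the post:
# classify(model, tf, "Congrats, you won! Send me money!")
```

With the labels used above, a return value of 1 means spam and 0 means ham.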