Simple Spam Classfication in MLlib

import findspark

import pyspark
sc = pyspark.SparkContext()

Borrowed, wholesale, from Learning Spark, we’re going to do a simple spam classifier to determine if an email is authentic or not. This post is less about the approach, and more about examining the building blocks of the MLlib pipeline. But first…

The Data

Each line represents a separate email. The dataset came pre-sorted by spam/not-spam, we’re going split over words then do some fancy Spark pre-processing.

spam = sc.textFile('../data/spam.txt')
!type ..\data\spam.txt
Dear sir, I am a Prince in a far kingdom you have not heard of.  I want to send you money via wire transfer so please ...
Get Viagra real cheap!  Send money right away to ...
Oh my gosh you can be really strong too with these drugs found in the rainforest. Get them cheap right now ...
YOUR COMPUTER HAS BEEN INFECTED!  YOU MUST RESET YOUR PASSWORD.  Reply to this email with your password and SSN ...
THIS IS NOT A SCAM!  Send money and get access to awesome stuff really cheap and never have to ...
# ostensibly 'good spam'
ham = sc.textFile('../data/ham.txt')
!type ..\data\ham.txt
Dear Spark Learner, Thanks so much for attending the Spark Summit 2014!  Check out videos of talks from the summit at ...
Hi Mom, Apologies for being late about emailing and forgetting to send you the package.  I hope you and bro have been ...
Wow, hey Fred, just heard about the Spark petabyte sort.  I think we need to take time to try it out immediately ...
Hi Spark user list, This is my first question to this list, so thanks in advance for your help!  I tried running ...
Thanks Tom for your email.  I need to refer you to Alice for this one.  I haven't yet figured out that part either ...
Good job yesterday!  I was attending your talk, and really enjoyed it.  I want to try out GraphX ...
Summit demo got whoops from audience!  Had to let you know. --Joe

MLlib Tools

MLlib works at the RDD level.

TF, Term Frequency

The HashingTF object we’re about to use deserves a whole post in itself, but for now we’ll hand-wave and say that it provides a way to take strings and convert them to a numeric representation.

from pyspark.mllib.feature import HashingTF

We could get away with making an arbitrarily-large feature space, but let’s look at how many distinct words there are in the dataset.

uniqueWords = (spam.union(ham).flatMap(lambda x: x.split())


And then truncate down to 100 features for roundness.

tf = HashingTF(numFeatures=100)

And continue to update tf based on what values it sees in spam and ham

spamFeatures = email: tf.transform(email.split(' ')))
hamFeatures = email: tf.transform(email.split(' ')))

Label Datasets

And this just standard modeling fare, creating a y for each X, but now with the map function.

from pyspark.mllib.regression import LabeledPoint
positiveExamples = features: LabeledPoint(1, features))
negativeExamples = features: LabeledPoint(0, features))
trainingData = positiveExamples.union(negativeExamples)

And we cache the data, because we’ll be doing a ton of iterating when we do our Machine Learning routine.

UnionRDD[430] at union at <unknown>:0

Train Model

If that’s looks like sklearn, that’s on purpose.

from pyspark.mllib.classification import LogisticRegressionWithSGD
model = LogisticRegressionWithSGD.train(trainingData)

Evaluate Model

Finally, we’ll make some fake email strings to see how accurate our classifier proves to be.

posTest = tf.transform("Congrats, you won! Send me money!".split(' '))
negTest = tf.transform("Hey, yeah. I think you're right about this one.".split(' '))