Simple Spam Classification in MLlib
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext()
Borrowed, wholesale, from Learning Spark, this example builds a simple spam classifier to determine whether an email is authentic or not. This post is less about the approach and more about examining the building blocks of the MLlib pipeline. But first…
The Data
Each line represents a separate email. The dataset came pre-sorted into spam/not-spam; we're going to split each message into words, then do some fancy Spark pre-processing.
spam = sc.textFile('../data/spam.txt')
!type ..\data\spam.txt
Dear sir, I am a Prince in a far kingdom you have not heard of. I want to send you money via wire transfer so please ...
Get Viagra real cheap! Send money right away to ...
Oh my gosh you can be really strong too with these drugs found in the rainforest. Get them cheap right now ...
YOUR COMPUTER HAS BEEN INFECTED! YOU MUST RESET YOUR PASSWORD. Reply to this email with your password and SSN ...
THIS IS NOT A SCAM! Send money and get access to awesome stuff really cheap and never have to ...
# ostensibly 'good spam'; 'ham' is the customary name for legitimate email
ham = sc.textFile('../data/ham.txt')
!type ..\data\ham.txt
Dear Spark Learner, Thanks so much for attending the Spark Summit 2014! Check out videos of talks from the summit at ...
Hi Mom, Apologies for being late about emailing and forgetting to send you the package. I hope you and bro have been ...
Wow, hey Fred, just heard about the Spark petabyte sort. I think we need to take time to try it out immediately ...
Hi Spark user list, This is my first question to this list, so thanks in advance for your help! I tried running ...
Thanks Tom for your email. I need to refer you to Alice for this one. I haven't yet figured out that part either ...
Good job yesterday! I was attending your talk, and really enjoyed it. I want to try out GraphX ...
Summit demo got whoops from audience! Had to let you know. --Joe
MLlib Tools
MLlib works at the RDD level.
TF, Term Frequency
The HashingTF object we're about to use deserves a whole post in itself, but for now we'll hand-wave and say that it provides a way to take strings and convert them to a fixed-length numeric representation: each word is hashed into one of numFeatures buckets, and the resulting vector counts how many words landed in each bucket.
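As a rough mental model, here's a toy version of the hashing trick. It leans on Python's built-in hash as a stand-in (MLlib's HashingTF uses its own hash function and returns a sparse vector), so treat it as a sketch of the idea, not the real implementation.
# Toy hashing trick: each word hashes into one of num_features buckets,
# and the vector counts how many words landed in each bucket.
# (Python randomizes str hashes per process, so positions vary run to run.)
def toy_hashing_tf(words, num_features=10):
    vec = [0] * num_features
    for word in words:
        vec[hash(word) % num_features] += 1
    return vec

toy_hashing_tf("send money send money now".split())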
from pyspark.mllib.feature import HashingTF
We could get away with making an arbitrarily-large feature space, but let’s look at how many distinct words there are in the dataset.
uniqueWords = (spam.union(ham)
                   .flatMap(lambda x: x.split())
                   .distinct())
uniqueWords.count()
154
And then truncate down to 100 features for roundness, accepting a few hash collisions in exchange for a smaller feature space.
tf = HashingTF(numFeatures=100)
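Squeezing 154 distinct words into 100 buckets guarantees collisions by the pigeonhole principle. If you're curious how crowded things get, HashingTF.indexOf reports the bucket a given word hashes to, so we can count the buckets holding more than one word; consider this an optional side-quest.
# Count hash buckets shared by more than one distinct word. Collisions
# blur word identities, the price of a fixed-size feature space.
bucketSizes = (uniqueWords.map(lambda w: (tf.indexOf(w), 1))
                          .reduceByKey(lambda a, b: a + b))
bucketSizes.filter(lambda kv: kv[1] > 1).count()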
And then use tf to transform each email in spam and ham into its feature vector. (Note that tf is stateless: there's no fitting step; each email is hashed independently.)
spamFeatures = spam.map(lambda email: tf.transform(email.split(' ')))
hamFeatures = ham.map(lambda email: tf.transform(email.split(' ')))
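If you want to see what came out, peek at a single element. It's a SparseVector of bucket counts; the exact indices depend on the hash, so yours may differ.
# One email's worth of hashed term counts.
spamFeatures.first()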
Label Datasets
And this is just standard modeling fare, creating a y for each X, but now done with the map function.
from pyspark.mllib.regression import LabeledPoint
positiveExamples = spamFeatures.map(lambda features: LabeledPoint(1, features))
negativeExamples = hamFeatures.map(lambda features: LabeledPoint(0, features))
trainingData = positiveExamples.union(negativeExamples)
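Peeking at one element shows the pairing: a label (1.0 for spam) bundled with its feature vector.
# Each element is a LabeledPoint: (label, features).
trainingData.first()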
And we cache the data, because we'll be doing a ton of iterating when we run our Machine Learning routine.
trainingData.cache()
UnionRDD[430] at union at <unknown>:0
Train Model
If that’s looks like sklearn
, that’s on purpose.
from pyspark.mllib.classification import LogisticRegressionWithSGD
model = LogisticRegressionWithSGD.train(trainingData)
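As a quick sanity check, we can score the model on its own training data, following the usual MLlib docs pattern. This is an optimistic number since it isn't a held-out set, but it's a decent smoke test.
# Accuracy on the training set itself: optimistic, but catches gross failures.
labelsAndPreds = trainingData.map(lambda p: (p.label, model.predict(p.features)))
(labelsAndPreds.filter(lambda lp: lp[0] == lp[1]).count()
 / float(trainingData.count()))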
Evaluate Model
Finally, we’ll make some fake email strings to see how accurate our classifier proves to be.
posTest = tf.transform("Congrats, you won! Send me money!".split(' '))
negTest = tf.transform("Hey, yeah. I think you're right about this one.".split(' '))
model.predict(posTest)
1
model.predict(negTest)
0
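Both calls return hard 0/1 labels because the model carries a default 0.5 decision threshold. Clearing it makes predict return the raw probability instead, which is handy for eyeballing borderline emails; setThreshold puts the classifier behavior back.
# With the threshold cleared, predict returns P(spam) rather than a hard label.
model.clearThreshold()
model.predict(posTest)   # now a probability in [0, 1]
model.setThreshold(0.5)  # restore 0/1 classification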