Simple Stats Functions
One of the first things I found myself missing after going from pandas to PySpark was the ability to quickly hop into a dataset and get acquainted with it.
And while the suite of functionality doesn’t perfectly carry over, it’s worth noting that some of the more useful light-EDA methods have PySpark equivalents.
Data
Revisiting the dataset from SQL Basics
import findspark
findspark.init()
import pyspark
# a local SparkContext wrapped in a SparkSession, used for everything below
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
calls = spark.read.json('../data/callDetails.json')
calls.show(10)
+----------+-------------------+--------------+----------+--------------+--------------+
|CallCharge| DateTime| Dest| Origin|OriginatingNum|TerminatingNum|
+----------+-------------------+--------------+----------+--------------+--------------+
| 549|02/11/2016 01:51:41| Birmingham| London| 797308107| 797131221|
| 2645|05/02/2016 01:26:54| London|Manchester| 777121117| 777440392|
| 1233|01/12/2016 21:12:54| Manchester| Victoria| 797009202| 784243404|
| 2651|07/11/2016 01:07:34| Victoria|Twickenham| 777557705| 798420467|
| 3162|02/11/2016 22:22:26| Scotland| Leeds| 785434022| 779086250|
| 2246|05/01/2016 20:12:35|Virginia Water| Bradford| 779716202| 795137353|
| 571|04/12/2016 23:53:52| Ascot| Yorkshire| 775490102| 775019605|
| 3291|06/11/2016 20:31:49| Bracknell|Birmingham| 787581376| 797043387|
| 2270|03/12/2016 12:15:17| Bradford| Coventary| 789231956| 787649491|
| 3420|06/02/2016 20:57:44| Yorkshire| Wales| 785969980| 789993090|
+----------+-------------------+--------------+----------+--------------+--------------+
only showing top 10 rows
Descriptive Statistics
From the get-go, we’re probably interested in knowing what kind of values each column holds and how they’re distributed.
Describe
If we want to know what kind of data we’re looking at, dtypes lists each column along with its type.
calls.dtypes
[('CallCharge', 'bigint'),
('DateTime', 'string'),
('Dest', 'string'),
('Origin', 'string'),
('OriginatingNum', 'bigint'),
('TerminatingNum', 'bigint')]
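Because dtypes is just a list of (name, type) tuples, it’s also easy to filter when we want to point the numeric summaries below at only the numeric fields. A quick sketch (numeric_cols is just a name I’m using here):
numeric_cols = [name for name, dtype in calls.dtypes
                if dtype == 'bigint']
numeric_cols
['CallCharge', 'OriginatingNum', 'TerminatingNum']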
describe, meanwhile, gives us a sense of how each field is distributed. Notice that the string columns (thankfully) don’t have a mean or stddev, but min/max, amusingly, return an alphabetical sort of the values.
calls.describe().show()
+-------+------------------+-------------------+---------+---------+-----------------+-----------------+
|summary| CallCharge| DateTime| Dest| Origin| OriginatingNum| TerminatingNum|
+-------+------------------+-------------------+---------+---------+-----------------+-----------------+
| count| 100| 100| 100| 100| 100| 100|
| mean| 1878.52| null| null| null| 7.8435580264E8| 7.8687342871E8|
| stddev|1091.9465010760011| null| null| null|7903816.763526337|8173751.769309956|
| min| 8|01/02/2016 03:07:33| Ascot| Ascot| 774188291| 774001818|
| max| 3596|09/12/2016 22:26:41|Yorkshire|Yorkshire| 799950372| 799779480|
+-------+------------------+-------------------+---------+---------+-----------------+-----------------+
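If we only care about a column or two, describe also accepts column names directly, which keeps the output narrow; something like:
calls.describe('CallCharge').show()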
Quantiles
We can use the approxQuantile function to get an even better idea of the dispersion of our data.
However, because all of the data is distributed across several nodes, this is an approximation, as exhaustively sorting every record involves a good deal of shuffling. We can explore the trade-off between performance and accuracy by playing with the relativeError argument.
We can also look for multiple quantiles at once by passing floats between 0 and 1 to probabilities.
calls.approxQuantile(col='CallCharge',
probabilities=[.68, .95, .997],
relativeError=0.01)
[2551.0, 3405.0, 3596.0]
calls.approxQuantile(col='CallCharge',
probabilities=[.68, .95, .997],
relativeError=0.1)
[2385.0, 3596.0, 3596.0]
calls.approxQuantile(col='CallCharge',
probabilities=[.68, .95, .997],
relativeError=0.5)
[3596.0, 3596.0, 3596.0]
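Pushing the other direction, a relativeError of 0 asks for exact quantiles, at the price of the exhaustive sort described above:
# relativeError=0.0 computes exact quantiles, the expensive path on big data
calls.approxQuantile(col='CallCharge',
                     probabilities=[.68, .95, .997],
                     relativeError=0.0)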
Correlation
PySpark
doesn’t have a handy “correlate everything with everything” function like pandas
does– that would be an absurd amount of shuffling.
However, passing two attributes to corr
is pretty performant and useful for spot-checks.
calls.corr('OriginatingNum', 'TerminatingNum')
0.06796709282363027
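If we did want a quick-and-dirty look at every pairwise correlation among a few numeric columns, we could loop over the pairs ourselves, keeping in mind that each corr call kicks off its own job. A rough sketch, reusing the numeric_cols list from above:
from itertools import combinations

# spot-check each pair of numeric columns one corr() call at a time
for left, right in combinations(numeric_cols, 2):
    print(left, right, calls.corr(left, right))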
Cross-Tabulation
We’ll often look to crosstab when we’re trying to figure out the co-occurrence of different values across two features. I’ve scaled down the data here because, as you can see, generating a row and column for each pair quickly creates a huge sparse matrix.
calls.limit(10).crosstab('Dest', 'Origin').show()
+--------------+----------+--------+---------+-----+------+----------+----------+--------+-----+---------+
| Dest_Origin|Birmingham|Bradford|Coventary|Leeds|London|Manchester|Twickenham|Victoria|Wales|Yorkshire|
+--------------+----------+--------+---------+-----+------+----------+----------+--------+-----+---------+
| London| 0| 0| 0| 0| 0| 1| 0| 0| 0| 0|
| Bracknell| 1| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| Victoria| 0| 0| 0| 0| 0| 0| 1| 0| 0| 0|
|Virginia Water| 0| 1| 0| 0| 0| 0| 0| 0| 0| 0|
| Manchester| 0| 0| 0| 0| 0| 0| 0| 1| 0| 0|
| Scotland| 0| 0| 0| 1| 0| 0| 0| 0| 0| 0|
| Yorkshire| 0| 0| 0| 0| 0| 0| 0| 0| 1| 0|
| Birmingham| 0| 0| 0| 0| 1| 0| 0| 0| 0| 0|
| Ascot| 0| 0| 0| 0| 0| 0| 0| 0| 0| 1|
| Bradford| 0| 0| 1| 0| 0| 0| 0| 0| 0| 0|
+--------------+----------+--------+---------+-----+------+----------+----------+--------+-----+---------+
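When the pairing gets too wide to read, the same co-occurrence counts are available in long format via a groupBy, which avoids materializing the sparse matrix; something like:
# one row per (Dest, Origin) pair instead of a row-by-column grid
calls.groupBy('Dest', 'Origin').count().orderBy('count', ascending=False).show(10)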