Simple Stats Functions

One of the first things I found myself missing after moving from pandas to PySpark was the ability to quickly hop in and get acquainted with my data.

And while the suite of functionality doesn’t perfectly carry over, it’s worth noting that some of the more useful light-EDA methods have PySpark equivalents.

Data

Revisiting the dataset from SQL Basics

import findspark
findspark.init()

import pyspark

# spin up a local SparkContext and wrap it in a SparkSession
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)

# load the call detail records
calls = spark.read.json('../data/callDetails.json')

calls.show(10)
+----------+-------------------+--------------+----------+--------------+--------------+
|CallCharge|           DateTime|          Dest|    Origin|OriginatingNum|TerminatingNum|
+----------+-------------------+--------------+----------+--------------+--------------+
|       549|02/11/2016 01:51:41|    Birmingham|    London|     797308107|     797131221|
|      2645|05/02/2016 01:26:54|        London|Manchester|     777121117|     777440392|
|      1233|01/12/2016 21:12:54|    Manchester|  Victoria|     797009202|     784243404|
|      2651|07/11/2016 01:07:34|      Victoria|Twickenham|     777557705|     798420467|
|      3162|02/11/2016 22:22:26|      Scotland|     Leeds|     785434022|     779086250|
|      2246|05/01/2016 20:12:35|Virginia Water|  Bradford|     779716202|     795137353|
|       571|04/12/2016 23:53:52|         Ascot| Yorkshire|     775490102|     775019605|
|      3291|06/11/2016 20:31:49|     Bracknell|Birmingham|     787581376|     797043387|
|      2270|03/12/2016 12:15:17|      Bradford| Coventary|     789231956|     787649491|
|      3420|06/02/2016 20:57:44|     Yorkshire|     Wales|     785969980|     789993090|
+----------+-------------------+--------------+----------+--------------+--------------+
only showing top 10 rows

Descriptive Statistics

From the get-go, we’re probably interested in knowing what kind of data we’re working with and how it’s distributed.

Describe

If we want to know what kind of data we’re looking at, dtypes gives us each column’s name and type.

calls.dtypes
[('CallCharge', 'bigint'),
 ('DateTime', 'string'),
 ('Dest', 'string'),
 ('Origin', 'string'),
 ('OriginatingNum', 'bigint'),
 ('TerminatingNum', 'bigint')]
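If we’d rather see the schema as a tree, printSchema renders the same information in a nested layout, which pays off once structs and arrays enter the picture. For this flat dataset it should print something like:

calls.printSchema()
root
 |-- CallCharge: long (nullable = true)
 |-- DateTime: string (nullable = true)
 |-- Dest: string (nullable = true)
 |-- Origin: string (nullable = true)
 |-- OriginatingNum: long (nullable = true)
 |-- TerminatingNum: long (nullable = true)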

And how each field is distributed.

Notice that the string columns (thankfully) don’t have a mean or stddev, but their min/max, amusingly, return the first and last values of an alphabetical sort.

calls.describe().show()
+-------+------------------+-------------------+---------+---------+-----------------+-----------------+
|summary|        CallCharge|           DateTime|     Dest|   Origin|   OriginatingNum|   TerminatingNum|
+-------+------------------+-------------------+---------+---------+-----------------+-----------------+
|  count|               100|                100|      100|      100|              100|              100|
|   mean|           1878.52|               null|     null|     null|   7.8435580264E8|   7.8687342871E8|
| stddev|1091.9465010760011|               null|     null|     null|7903816.763526337|8173751.769309956|
|    min|                 8|01/02/2016 03:07:33|    Ascot|    Ascot|        774188291|        774001818|
|    max|              3596|09/12/2016 22:26:41|Yorkshire|Yorkshire|        799950372|        799779480|
+-------+------------------+-------------------+---------+---------+-----------------+-----------------+
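describe also accepts specific column names, which is worth remembering on wider tables where only a couple of numeric fields matter. Restricting it to CallCharge, for instance, reproduces just that slice of the table above:

calls.describe('CallCharge').show()
+-------+------------------+
|summary|        CallCharge|
+-------+------------------+
|  count|               100|
|   mean|           1878.52|
| stddev|1091.9465010760011|
|    min|                 8|
|    max|              3596|
+-------+------------------+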

Quantiles

We can use the approxQuantile function to get an even better idea of the dispersion of our data.

However, because all of the data is distributed across several nodes, this is an approximation, as exhaustively sorting every record involves a good deal of shuffling. We can explore the trade-off between performance and accuracy by playing with the relativeError argument.

We can also ask for several quantiles at once by passing a list of floats between 0 and 1 as probabilities.

calls.approxQuantile(col='CallCharge',
                     probabilities=[.68, .95, .997],
                     relativeError=0.01)
[2551.0, 3405.0, 3596.0]
calls.approxQuantile(col='CallCharge',
                     probabilities=[.68, .95, .997],
                     relativeError=0.1)
[2385.0, 3596.0, 3596.0]
calls.approxQuantile(col='CallCharge',
                     probabilities=[.68, .95, .997],
                     relativeError=0.5)
[3596.0, 3596.0, 3596.0]
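And if we truly need exact values, passing relativeError=0.0 computes the quantiles outright, triggering the full sort and shuffle described above, so it’s best saved for data that can comfortably stand a full pass:

calls.approxQuantile(col='CallCharge',
                     probabilities=[.68, .95, .997],
                     relativeError=0.0)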

Correlation

PySpark doesn’t have a handy “correlate everything with everything” function like pandas does; that would be an absurd amount of shuffling.

However, passing two attributes to corr is pretty performant and useful for spot-checks.

calls.corr('OriginatingNum', 'TerminatingNum')
0.06796709282363027
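If we do want a small correlation matrix, one sketch, assuming we only care about the numeric columns here, is to loop over the pairs ourselves:

from itertools import combinations

numeric_cols = ['CallCharge', 'OriginatingNum', 'TerminatingNum']

# one corr() call per pair; fine for a handful of columns,
# but each pair kicks off its own Spark job
for col_a, col_b in combinations(numeric_cols, 2):
    print(col_a, col_b, calls.corr(col_a, col_b))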

Cross-Tabulation

We’ll often look to crosstab when we’re trying to figure out how often values of two different features occur together.

I’ve scaled down the data here because, as you can see, generating a row and column for every value pair quickly creates a huge, sparse matrix.

calls.limit(10).crosstab('Dest', 'Origin').show()
+--------------+----------+--------+---------+-----+------+----------+----------+--------+-----+---------+
|   Dest_Origin|Birmingham|Bradford|Coventary|Leeds|London|Manchester|Twickenham|Victoria|Wales|Yorkshire|
+--------------+----------+--------+---------+-----+------+----------+----------+--------+-----+---------+
|        London|         0|       0|        0|    0|     0|         1|         0|       0|    0|        0|
|     Bracknell|         1|       0|        0|    0|     0|         0|         0|       0|    0|        0|
|      Victoria|         0|       0|        0|    0|     0|         0|         1|       0|    0|        0|
|Virginia Water|         0|       1|        0|    0|     0|         0|         0|       0|    0|        0|
|    Manchester|         0|       0|        0|    0|     0|         0|         0|       1|    0|        0|
|      Scotland|         0|       0|        0|    1|     0|         0|         0|       0|    0|        0|
|     Yorkshire|         0|       0|        0|    0|     0|         0|         0|       0|    1|        0|
|    Birmingham|         0|       0|        0|    0|     1|         0|         0|       0|    0|        0|
|         Ascot|         0|       0|        0|    0|     0|         0|         0|       0|    0|        1|
|      Bradford|         0|       0|        1|    0|     0|         0|         0|       0|    0|        0|
+--------------+----------+--------+---------+-----+------+----------+----------+--------+-----+---------+
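When the co-occurrence is this sparse, a leaner sketch is to count only the pairs that actually show up rather than materializing the whole grid; with the ten rows above, each surviving Dest/Origin pair would come back with a count of 1:

calls.limit(10).groupBy('Dest', 'Origin').count().show()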