Column Value Counts

import findspark
findspark.init()  # locate the local Spark install and add it to sys.path

import pyspark
sc = pyspark.SparkContext()

spark = pyspark.sql.SparkSession(sc)
from sklearn.datasets import load_iris

import pandas as pd
data = load_iris()['data']

df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd'])
df.head()
a b c d
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2

pandas has a really useful method for counting how many times each value shows up in a given column.

df['d'].value_counts().head()
0.2    28
1.3    13
1.5    12
1.8    12
1.4     8
Name: d, dtype: int64

Unfortunately, when we port that same DataFrame over to a Spark DataFrame,

sparkDf = spark.createDataFrame(df)
sparkDf.show(5)
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
|5.1|3.5|1.4|0.2|
|4.9|3.0|1.4|0.2|
|4.7|3.2|1.3|0.2|
|4.6|3.1|1.5|0.2|
|5.0|3.6|1.4|0.2|
+---+---+---+---+
only showing top 5 rows

the value_counts method isn’t available.

[x for x in dir(sparkDf.select('d')) if 'value' in x]
[]
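
And if you try to call it anyway, you just get an AttributeError. A quick sketch (the exact wording of the message can vary a bit between PySpark versions):

try:
    sparkDf.select('d').value_counts()
except AttributeError as e:
    # e.g. 'DataFrame' object has no attribute 'value_counts'
    print(e)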

Making Our Own

value_counts is basically just:

  • Grouping like records together
  • Counting the size of each group

sparkDf.groupby('d').count().show(5)
+---+-----+
|  d|count|
+---+-----+
|2.4|    3|
|0.2|   28|
|1.4|    8|
|1.7|    2|
|2.3|    8|
+---+-----+
only showing top 5 rows

However, unlike pandas, the result doesn't come sorted out of the box for our convenience. So, after a quick revision:

(sparkDf.groupby('d').count()
        .orderBy('count', ascending=False).show(5))
+---+-----+
|  d|count|
+---+-----+
|0.2|   28|
|1.3|   13|
|1.8|   12|
|1.5|   12|
|1.4|    8|
+---+-----+
only showing top 5 rows

We can package this for later use!

def value_counts(df, colName):
    """Mimic pandas value_counts: group by colName, count each group,
    and sort the group sizes in descending order."""
    return (df.groupby(colName).count()
              .orderBy('count', ascending=False))

value_counts(sparkDf, 'd').show(5)
+---+-----+
|  d|count|
+---+-----+
|0.2|   28|
|1.3|   13|
|1.5|   12|
|1.8|   12|
|1.4|    8|
+---+-----+
only showing top 5 rows
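
That matches the pandas output up top, give or take the tie between 1.5 and 1.8 (12 apiece), which can land in either order since we only sort on the count. And if you want the result back on the driver for a quick side-by-side check, here's a small sketch using toPandas() -- fine for a tiny aggregated result like this, but something to avoid on anything big:

# Pull the small, already-aggregated result back into pandas on the driver
value_counts(sparkDf, 'd').toPandas().head()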