Column Value Counts
# findspark locates the local Spark installation so that pyspark can be imported
import findspark
findspark.init()
import pyspark
# spin up a SparkContext and wrap it in a SparkSession
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
from sklearn.datasets import load_iris
import pandas as pd
data = load_iris()['data']
df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd'])
df.head()
|   | a   | b   | c   | d   |
|---|-----|-----|-----|-----|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
pandas has a really useful method, value_counts, for counting how many times each distinct value appears in a given column.
df['d'].value_counts().head()
0.2 28
1.3 13
1.5 12
1.8 12
1.4 8
Name: d, dtype: int64
Unfortunately, after porting that same DataFrame over to a Spark DataFrame,
sparkDf = spark.createDataFrame(df)
sparkDf.show(5)
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
|5.1|3.5|1.4|0.2|
|4.9|3.0|1.4|0.2|
|4.7|3.2|1.3|0.2|
|4.6|3.1|1.5|0.2|
|5.0|3.6|1.4|0.2|
+---+---+---+---+
only showing top 5 rows
the value_counts method isn't available:
[x for x in dir(sparkDf.select('d')) if 'value' in x]
[]
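If the column is small enough to pull back to the driver, one stopgap (not part of the original walkthrough, and only sensible for small data) is to hand that single column back to pandas and use its value_counts there:
# Workaround sketch: collect just column 'd' to the driver as a pandas Series,
# then lean on pandas' own value_counts(). Assumes the data fits in memory.
sparkDf.select('d').toPandas()['d'].value_counts().head()
That defeats the purpose of Spark on anything big, though, so let's build a native version.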
Making Our Own
value_counts was basically just:
- Grouping like records together
- Counting the size of the groups
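As a quick sanity check (this isn't part of the original pandas call), writing those two steps out longhand in pandas reproduces the value_counts ordering:
# Same two steps in pandas: group by value, take each group's size,
# then sort the sizes in descending order
df.groupby('d').size().sort_values(ascending=False).head()
In Spark, the same two steps translate to a groupby followed by count: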
from pyspark.sql.functions import count
sparkDf.groupby('d').count().show(5)
+---+-----+
| d|count|
+---+-----+
|2.4| 3|
|0.2| 28|
|1.4| 8|
|1.7| 2|
|2.3| 8|
+---+-----+
only showing top 5 rows
However, unlike pandas, this doesn't come sorted out of the box for your convenience. So after a quick revision:
(sparkDf.groupby('d').count()
.orderBy('count', ascending=False).show(5))
+---+-----+
| d|count|
+---+-----+
|0.2| 28|
|1.3| 13|
|1.8| 12|
|1.5| 12|
|1.4| 8|
+---+-----+
only showing top 5 rows
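As an aside, the same ordering can also be written with a descending column expression from pyspark.sql.functions instead of the ascending flag; either form should produce the same result:
from pyspark.sql.functions import desc
# equivalent ordering, expressed as a column expression
sparkDf.groupby('d').count().orderBy(desc('count')).show(5)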
We can package this for later use!
def value_counts(df, colName):
    '''Spark analog of pandas value_counts: count rows per value of colName, sorted descending.'''
    return (df.groupby(colName).count()
              .orderBy('count', ascending=False))
value_counts(sparkDf, 'd').show(5)
+---+-----+
| d|count|
+---+-----+
|0.2| 28|
|1.3| 13|
|1.5| 12|
|1.8| 12|
|1.4| 8|
+---+-----+
only showing top 5 rows
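Finally, a quick spot-check (assuming the data is small enough to collect back to the driver): the Spark counts should line up exactly with what pandas gave us at the top.
# compare the Spark result against pandas' value_counts; should print True
spark_counts = {row['d']: row['count'] for row in value_counts(sparkDf, 'd').collect()}
pandas_counts = df['d'].value_counts().to_dict()
print(spark_counts == pandas_counts)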