Working with NULL Data
Missing data is a routine part of any Data Scientist’s day-to-day. It’s so fundamental, in fact, that moving over to PySpark
can feel a bit jarring because it’s not quite as immediately intuitive as other tools.
However, if you keep in mind that, because of how everything is stored and partitioned, PySpark only handles NULL values at the row level, things click a bit more easily.
Some Spotty Data
I went through the iris dataset and randomly injected a bunch of NULL
values. Let’s load it up, deliberately forcing each column to be read in as a float except the last one (more on this later).
import findspark
findspark.init()

import pyspark
from pyspark.sql.types import *

# spin up a SparkContext and wrap it in a SparkSession
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)

# force the first four columns to floats, leave the last one a string
schema = StructType([
    StructField('a', FloatType(), True),
    StructField('b', FloatType(), True),
    StructField('c', FloatType(), True),
    StructField('d', FloatType(), True),
    StructField('e', StringType(), True),
])

filepath = '../data/somenulls.csv'
df = spark.read.csv(filepath, schema=schema,
                    header=True)
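A quick df.printSchema() is a handy way to confirm the casting took (output omitted here):
df.printSchema()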
df.show()
+----+----+----+----+----+
| a| b| c| d| e|
+----+----+----+----+----+
| 5.1| 3.5| 1.4| 0.2|null|
| 4.9| 3.0| 1.4| 0.2|null|
| 4.7|null| 1.3| 0.2|null|
| 4.6| 3.1| 1.5| 0.2| 0|
|null| 3.6| 1.4| 0.2| 0|
| 5.4| 3.9| 1.7| 0.4| 0|
| 4.6| 3.4| 1.4|null|null|
| 5.0| 3.4| 1.5| 0.2|null|
| 4.4|null|null| 0.2| 0|
| 4.9|null| 1.5| 0.1| 0|
| 5.4|null| 1.5| 0.2| 0|
| 4.8|null| 1.6| 0.2| 0|
| 4.8| 3.0| 1.4|null| 0|
| 4.3|null| 1.1|null| 0|
| 5.8| 4.0| 1.2| 0.2| 0|
|null| 4.4| 1.5| 0.4| 0|
| 5.4| 3.9| 1.3| 0.4| 0|
| 5.1| 3.5|null|null| 0|
| 5.7| 3.8| 1.7| 0.3| 0|
| 5.1| 3.8| 1.5| 0.3| 0|
+----+----+----+----+----+
only showing top 20 rows
We’ve got 150 records.
df.count()
150
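Before tossing anything, it's worth knowing where the holes actually are. A quick, unofficial sketch that counts the NULLs per column:
from pyspark.sql import functions as F

# count of NULL values in each column
df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c)
           for c in df.columns]).show()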
Tossing Out Rows
The dropna
function looks, row-wise, for NULL values and removes the offending rows from our DataFrame, based on whatever strategy we want to employ. We can:
Drop rows that have any NULL values with how='any'
df.dropna(how='any').count()
68
Drop rows that are all NULL with how='all'
# only one row had all NULLs
df.dropna(how='all').count()
149
Or set a threshold for the minimum number of non-NULL values allowed in a row
df.dropna(thresh=4).count()
126
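One thing to note: when thresh is supplied, it takes precedence over how. With 5 columns, asking for 5 non-NULL values should land you back at the how='any' count from above; a quick sanity check:
# thresh=5 on a 5-column DataFrame means "no NULLs allowed" --
# this count should match the how='any' result above
df.dropna(thresh=5).count()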
We can also narrow our filtering criteria to particular columns, since NULL values in some columns may matter more than in others.
Here, we’ll only drop rows if they have NULL values in the last 2 columns.
df.dropna(how='any', subset=['d', 'e']).count()
109
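If your criteria ever outgrow what dropna offers, the same subset-style drop can be spelled out by hand with filter and isNotNull. A rough equivalent of the call above:
from pyspark.sql import functions as F

# keep only rows where both 'd' and 'e' are populated --
# should give the same count as the dropna(subset=...) call
df.filter(F.col('d').isNotNull() & F.col('e').isNotNull()).count()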
Filling Missing Values
PySpark
doesn’t come with a bunch of fancy data imputation methods, batteries included; that’s more in the realm of the MLlib stack. But for the quick and dirty, let’s start simple and look at the first 4 columns, which all share the same float
datatype
floats = df.drop('e')
floats.show(5)
+----+----+---+---+
| a| b| c| d|
+----+----+---+---+
| 5.1| 3.5|1.4|0.2|
| 4.9| 3.0|1.4|0.2|
| 4.7|null|1.3|0.2|
| 4.6| 3.1|1.5|0.2|
|null| 3.6|1.4|0.2|
+----+----+---+---+
only showing top 5 rows
If we want to replace our missing values, we simply use the fillna
function, passing in a number
floats.fillna(-999).show(5)
+------+------+---+---+
| a| b| c| d|
+------+------+---+---+
| 5.1| 3.5|1.4|0.2|
| 4.9| 3.0|1.4|0.2|
| 4.7|-999.0|1.3|0.2|
| 4.6| 3.1|1.5|0.2|
|-999.0| 3.6|1.4|0.2|
+------+------+---+---+
only showing top 5 rows
and like the dropna
method, we can specify which column(s) to pay attention to
floats.fillna(-999, subset=['b']).show(5)
+----+------+---+---+
| a| b| c| d|
+----+------+---+---+
| 5.1| 3.5|1.4|0.2|
| 4.9| 3.0|1.4|0.2|
| 4.7|-999.0|1.3|0.2|
| 4.6| 3.1|1.5|0.2|
|null| 3.6|1.4|0.2|
+----+------+---+---+
only showing top 5 rows
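If a constant fill feels too blunt, the MLlib side of the house mentioned earlier does offer a basic mean/median Imputer. A minimal sketch, assuming Spark 2.2+; the output column names here are just made up for illustration:
from pyspark.ml.feature import Imputer

# replace each float column's NULLs with that column's mean
imputer = Imputer(strategy='mean',
                  inputCols=['a', 'b', 'c', 'd'],
                  outputCols=['a_filled', 'b_filled', 'c_filled', 'd_filled'])
imputer.fit(floats).transform(floats).show(5)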
Back in plain fillna territory, the same trick works for categorical/string columns by passing in a string
strings = df.select('e')
strings.show(5)
+----+
| e|
+----+
|null|
|null|
|null|
| 0|
| 0|
+----+
only showing top 5 rows
strings.fillna('unknown').show(5)
+-------+
| e|
+-------+
|unknown|
|unknown|
|unknown|
| 0|
| 0|
+-------+
only showing top 5 rows
Where this gets tricky is when we try to do both at the same time.
PySpark will intuit that a number in fillna
corresponds to the numeric columns
df.fillna(-99).show(5)
+-----+-----+---+---+----+
| a| b| c| d| e|
+-----+-----+---+---+----+
| 5.1| 3.5|1.4|0.2|null|
| 4.9| 3.0|1.4|0.2|null|
| 4.7|-99.0|1.3|0.2|null|
| 4.6| 3.1|1.5|0.2| 0|
|-99.0| 3.6|1.4|0.2| 0|
+-----+-----+---+---+----+
only showing top 5 rows
and that a string means the string columns
df.fillna('who knows?').show(5)
+----+----+---+---+----------+
| a| b| c| d| e|
+----+----+---+---+----------+
| 5.1| 3.5|1.4|0.2|who knows?|
| 4.9| 3.0|1.4|0.2|who knows?|
| 4.7|null|1.3|0.2|who knows?|
| 4.6| 3.1|1.5|0.2| 0|
|null| 3.6|1.4|0.2| 0|
+----+----+---+---+----------+
only showing top 5 rows
but takes great offense when you try to do both
df.fillna([-99, 'who knows']).show(5)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-17-68c95a523395> in <module>()
----> 1 df.fillna([-99, 'who knows']).show(5)
C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\dataframe.py in fillna(self, value, subset)
1512 """
1513 if not isinstance(value, (float, int, long, basestring, bool, dict)):
-> 1514 raise ValueError("value should be a float, int, long, string, bool or dict")
1515
1516 # Note that bool validates isinstance(int), but we don't want to
ValueError: value should be a float, int, long, string, bool or dict
or ignores you completely; the second argument to fillna is subset, so 'who knows' gets read as a (nonexistent) column name and nothing is filled
df.fillna(-99, 'who knows').show(5)
+----+----+---+---+----+
| a| b| c| d| e|
+----+----+---+---+----+
| 5.1| 3.5|1.4|0.2|null|
| 4.9| 3.0|1.4|0.2|null|
| 4.7|null|1.3|0.2|null|
| 4.6| 3.1|1.5|0.2| 0|
|null| 3.6|1.4|0.2| 0|
+----+----+---+---+----+
only showing top 5 rows
You can, however, chain them together!
(df.fillna(-99)
.fillna('unknown').show(5))
+-----+-----+---+---+-------+
| a| b| c| d| e|
+-----+-----+---+---+-------+
| 5.1| 3.5|1.4|0.2|unknown|
| 4.9| 3.0|1.4|0.2|unknown|
| 4.7|-99.0|1.3|0.2|unknown|
| 4.6| 3.1|1.5|0.2| 0|
|-99.0| 3.6|1.4|0.2| 0|
+-----+-----+---+---+-------+
only showing top 5 rows
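That ValueError was actually pointing at the cleaner option the whole time: fillna also accepts a dict mapping column names to fill values, which handles the mixed-type case in a single call:
# one pass, with a per-column fill value
df.fillna({'a': -99, 'b': -99, 'c': -99, 'd': -99, 'e': 'unknown'}).show(5)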
Removing Columns with Missing Data
This gets trickier when you’re working at the column level.