Working with NULL Data
Missing data is a routine part of any Data Scientist’s day-to-day. It’s so fundamental, in fact, that moving over to PySpark
can feel a bit jarring because it’s not quite as immediately intuitive as other tools.
However, if you keep in mind that, because of how everything is stored and partitioned, PySpark only handles NULL values at the row level, things click a bit more easily.
Some Spotty Data
I went through the iris dataset and randomly injected a bunch of NULL
values. Let’s load it up, deliberately forcing each column to be read in as a float except the last one (more on this later).
import findspark
findspark.init()

import pyspark
from pyspark.sql.types import *

# spin up a SparkContext and wrap it in a SparkSession
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)

# force the first four columns to floats, leave the last one a string
schema = StructType([
    StructField('a', FloatType(), True),
    StructField('b', FloatType(), True),
    StructField('c', FloatType(), True),
    StructField('d', FloatType(), True),
    StructField('e', StringType(), True),
])

filepath = '../data/somenulls.csv'
df = spark.read.csv(filepath, schema=schema,
                    header=True)
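A quick df.printSchema() is a handy way to confirm the casting took (output omitted here):
df.printSchema()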
df.show()
+----+----+----+----+----+
| a| b| c| d| e|
+----+----+----+----+----+
| 5.1| 3.5| 1.4| 0.2|null|
| 4.9| 3.0| 1.4| 0.2|null|
| 4.7|null| 1.3| 0.2|null|
| 4.6| 3.1| 1.5| 0.2| 0|
|null| 3.6| 1.4| 0.2| 0|
| 5.4| 3.9| 1.7| 0.4| 0|
| 4.6| 3.4| 1.4|null|null|
| 5.0| 3.4| 1.5| 0.2|null|
| 4.4|null|null| 0.2| 0|
| 4.9|null| 1.5| 0.1| 0|
| 5.4|null| 1.5| 0.2| 0|
| 4.8|null| 1.6| 0.2| 0|
| 4.8| 3.0| 1.4|null| 0|
| 4.3|null| 1.1|null| 0|
| 5.8| 4.0| 1.2| 0.2| 0|
|null| 4.4| 1.5| 0.4| 0|
| 5.4| 3.9| 1.3| 0.4| 0|
| 5.1| 3.5|null|null| 0|
| 5.7| 3.8| 1.7| 0.3| 0|
| 5.1| 3.8| 1.5| 0.3| 0|
+----+----+----+----+----+
only showing top 20 rows
We’ve got 150 records.
df.count()
150
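Before tossing anything, it's worth knowing where the holes actually are. A quick, unofficial sketch that counts the NULLs per column:
from pyspark.sql import functions as F

# count of NULL values in each column
df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c)
           for c in df.columns]).show()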
Tossing Out Rows
The dropna
function looks, row-wise, for NULL values and removes the offending rows from our DataFrame, based on whatever strategy we want to employ. We can:
Drop rows that have any NULL values with how='any'
df.dropna(how='any').count()
68
Drop rows that are all NULL with how='all'
# only one row had all NULLs
df.dropna(how='all').count()
149
Or set a threshold for the minimum number of non-NULL values allowed in a row
df.dropna(thresh=4).count()
126
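One thing to note: when thresh is supplied, it takes precedence over how. With 5 columns, asking for 5 non-NULL values should land you back at the how='any' count from above; a quick sanity check:
# thresh=5 on a 5-column DataFrame means "no NULLs allowed" --
# this count should match the how='any' result above
df.dropna(thresh=5).count()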
We can also narrow our filtering criteria to particular columns, since NULL values in some columns may matter more than in others.
Here, we’ll only drop rows if they have NULL values in the last 2 columns.
df.dropna(how='any', subset=['d', 'e']).count()
109
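If your criteria ever outgrow what dropna offers, the same subset-style drop can be spelled out by hand with filter and isNotNull. A rough equivalent of the call above:
from pyspark.sql import functions as F

# keep only rows where both 'd' and 'e' are populated --
# should give the same count as the dropna(subset=...) call
df.filter(F.col('d').isNotNull() & F.col('e').isNotNull()).count()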
Filling Missing Values
PySpark
doesn’t come with a bunch of fancy data imputation methods, batteries included; that’s more in the realm of the MLlib stack. But for the quick and dirty, let’s start simple and look at the first 4 columns, which all share the same float
datatype
floats = df.drop('e')
floats.show(5)
+----+----+---+---+
| a| b| c| d|
+----+----+---+---+
| 5.1| 3.5|1.4|0.2|
| 4.9| 3.0|1.4|0.2|
| 4.7|null|1.3|0.2|
| 4.6| 3.1|1.5|0.2|
|null| 3.6|1.4|0.2|
+----+----+---+---+
only showing top 5 rows
If we want to replace our missing values, we simply use the fillna
function, passing in a number
floats.fillna(-999).show(5)
+------+------+---+---+
| a| b| c| d|
+------+------+---+---+
| 5.1| 3.5|1.4|0.2|
| 4.9| 3.0|1.4|0.2|
| 4.7|-999.0|1.3|0.2|
| 4.6| 3.1|1.5|0.2|
|-999.0| 3.6|1.4|0.2|
+------+------+---+---+
only showing top 5 rows
and like the dropna
method, we can specify which column(s) to pay attention to
floats.fillna(-999, subset=['b']).show(5)
+----+------+---+---+
| a| b| c| d|
+----+------+---+---+
| 5.1| 3.5|1.4|0.2|
| 4.9| 3.0|1.4|0.2|
| 4.7|-999.0|1.3|0.2|
| 4.6| 3.1|1.5|0.2|
|null| 3.6|1.4|0.2|
+----+------+---+---+
only showing top 5 rows
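If a constant fill feels too blunt, the MLlib side of the house mentioned earlier does offer a basic mean/median Imputer. A minimal sketch, assuming Spark 2.2+; the output column names here are just made up for illustration:
from pyspark.ml.feature import Imputer

# replace each float column's NULLs with that column's mean
imputer = Imputer(strategy='mean',
                  inputCols=['a', 'b', 'c', 'd'],
                  outputCols=['a_filled', 'b_filled', 'c_filled', 'd_filled'])
imputer.fit(floats).transform(floats).show(5)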
Back in plain fillna territory, the same trick works for categorical/string columns by passing in a string
strings = df.select('e')
strings.show(5)
+----+
| e|
+----+
|null|
|null|
|null|
| 0|
| 0|
+----+
only showing top 5 rows
strings.fillna('unknown').show(5)
+-------+
| e|
+-------+
|unknown|
|unknown|
|unknown|
| 0|
| 0|
+-------+
only showing top 5 rows
Where this gets tricky is when we try to do both at the same time.
PySpark will intuit that a number in fillna
corresponds to the numeric columns
df.fillna(-99).show(5)
+-----+-----+---+---+----+
| a| b| c| d| e|
+-----+-----+---+---+----+
| 5.1| 3.5|1.4|0.2|null|
| 4.9| 3.0|1.4|0.2|null|
| 4.7|-99.0|1.3|0.2|null|
| 4.6| 3.1|1.5|0.2| 0|
|-99.0| 3.6|1.4|0.2| 0|
+-----+-----+---+---+----+
only showing top 5 rows
and that a string means the string columns
df.fillna('who knows?').show(5)
+----+----+---+---+----------+
| a| b| c| d| e|
+----+----+---+---+----------+
| 5.1| 3.5|1.4|0.2|who knows?|
| 4.9| 3.0|1.4|0.2|who knows?|
| 4.7|null|1.3|0.2|who knows?|
| 4.6| 3.1|1.5|0.2| 0|
|null| 3.6|1.4|0.2| 0|
+----+----+---+---+----------+
only showing top 5 rows
but takes great offense when you try to do both
df.fillna([-99, 'who knows']).show(5)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-17-68c95a523395> in <module>()
----> 1 df.fillna([-99, 'who knows']).show(5)
C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\dataframe.py in fillna(self, value, subset)
1512 """
1513 if not isinstance(value, (float, int, long, basestring, bool, dict)):
-> 1514 raise ValueError("value should be a float, int, long, string, bool or dict")
1515
1516 # Note that bool validates isinstance(int), but we don't want to
ValueError: value should be a float, int, long, string, bool or dict
or ignores you completely; the second argument to fillna is subset, so 'who knows' gets read as a (nonexistent) column name and nothing is filled
df.fillna(-99, 'who knows').show(5)
+----+----+---+---+----+
| a| b| c| d| e|
+----+----+---+---+----+
| 5.1| 3.5|1.4|0.2|null|
| 4.9| 3.0|1.4|0.2|null|
| 4.7|null|1.3|0.2|null|
| 4.6| 3.1|1.5|0.2| 0|
|null| 3.6|1.4|0.2| 0|
+----+----+---+---+----+
only showing top 5 rows
You can, however, chain them together!
(df.fillna(-99)
.fillna('unknown').show(5))
+-----+-----+---+---+-------+
| a| b| c| d| e|
+-----+-----+---+---+-------+
| 5.1| 3.5|1.4|0.2|unknown|
| 4.9| 3.0|1.4|0.2|unknown|
| 4.7|-99.0|1.3|0.2|unknown|
| 4.6| 3.1|1.5|0.2| 0|
|-99.0| 3.6|1.4|0.2| 0|
+-----+-----+---+---+-------+
only showing top 5 rows
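That ValueError was actually pointing at the cleaner option the whole time: fillna also accepts a dict mapping column names to fill values, which handles the mixed-type case in a single call:
# one pass, with a per-column fill value
df.fillna({'a': -99, 'b': -99, 'c': -99, 'd': -99, 'e': 'unknown'}).show(5)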
Removing Columns with Missing Data
This gets trickier when you’re working at the column level.