Conditionally Dropping Columns
filepath = '../data/movieData.csv'
What We’re Used To
Dropping columns of data in pandas is a pretty trivial task.
import pandas as pd
df = pd.read_csv(filepath)
df.head()
| | Rank | WeeklyGross | PctChangeWkGross | Theaters | DeltaTheaters | AvgRev | GrossToDate | Week | Thursday | name | year | Winner |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.0 | 967378 | NaN | 14.0 | NaN | 69098.0 | 967378 | 1 | 1990-11-18 | dances with wolves | 1990 | True |
| 1 | 9.0 | 3871641 | 300.0 | 14.0 | NaN | 276546.0 | 4839019 | 2 | 1990-11-25 | dances with wolves | 1990 | True |
| 2 | 3.0 | 12547813 | 224.0 | 1048.0 | 1034.0 | 11973.0 | 17386832 | 3 | 1990-12-02 | dances with wolves | 1990 | True |
| 3 | 4.0 | 9246632 | -26.3 | 1053.0 | 5.0 | 8781.0 | 26633464 | 4 | 1990-12-09 | dances with wolves | 1990 | True |
| 4 | 4.0 | 7272350 | -21.4 | 1051.0 | -2.0 | 6919.0 | 33905814 | 5 | 1990-12-16 | dances with wolves | 1990 | True |
We can either specify which columns we want to drop.
df.drop(['Rank', 'WeeklyGross'], axis=1).head()
| | PctChangeWkGross | Theaters | DeltaTheaters | AvgRev | GrossToDate | Week | Thursday | name | year | Winner |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 14.0 | NaN | 69098.0 | 967378 | 1 | 1990-11-18 | dances with wolves | 1990 | True |
| 1 | 300.0 | 14.0 | NaN | 276546.0 | 4839019 | 2 | 1990-11-25 | dances with wolves | 1990 | True |
| 2 | 224.0 | 1048.0 | 1034.0 | 11973.0 | 17386832 | 3 | 1990-12-02 | dances with wolves | 1990 | True |
| 3 | -26.3 | 1053.0 | 5.0 | 8781.0 | 26633464 | 4 | 1990-12-09 | dances with wolves | 1990 | True |
| 4 | -21.4 | 1051.0 | -2.0 | 6919.0 | 33905814 | 5 | 1990-12-16 | dances with wolves | 1990 | True |
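(Equivalently, recent pandas versions let you name the columns directly with the columns keyword instead of axis=1; same result:)

```python
# same as passing axis=1
df.drop(columns=['Rank', 'WeeklyGross']).head()
```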
Or write some condition to filter on and pipe it into the DataFrame selector.
Let’s imagine we only want columns that have a '.' in them.
dotCounts = df.apply(lambda x: x.map(str)
                              .str.contains(r'\.')).sum()
dotCounts
Rank 3836
WeeklyGross 0
PctChangeWkGross 3625
Theaters 3836
DeltaTheaters 3389
AvgRev 3836
GrossToDate 0
Week 0
Thursday 0
name 13
year 0
Winner 0
dtype: int64
colsWithDots = dotCounts[dotCounts != 0].index
print(colsWithDots)
Index(['Rank', 'PctChangeWkGross', 'Theaters', 'DeltaTheaters', 'AvgRev',
       'name'],
      dtype='object')
df[colsWithDots].head()
| | Rank | PctChangeWkGross | Theaters | DeltaTheaters | AvgRev | name |
|---|---|---|---|---|---|---|
| 0 | 17.0 | NaN | 14.0 | NaN | 69098.0 | dances with wolves |
| 1 | 9.0 | 300.0 | 14.0 | NaN | 276546.0 | dances with wolves |
| 2 | 3.0 | 224.0 | 1048.0 | 1034.0 | 11973.0 | dances with wolves |
| 3 | 4.0 | -26.3 | 1053.0 | 5.0 | 8781.0 | dances with wolves |
| 4 | 4.0 | -21.4 | 1051.0 | -2.0 | 6919.0 | dances with wolves |
Ez pz
df['name'][df['name'].str.contains(r'\.')]
759 l.a. confidential
760 l.a. confidential
761 l.a. confidential
762 l.a. confidential
763 l.a. confidential
764 l.a. confidential
765 l.a. confidential
766 l.a. confidential
767 l.a. confidential
768 l.a. confidential
769 l.a. confidential
770 l.a. confidential
771 l.a. confidential
Name: name, dtype: object
I was curious, too.
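And since this post is about dropping columns rather than keeping them, the same index works just as well in reverse; a quick sketch using the colsWithDots from above:

```python
# drop every column whose values contain a '.'
df.drop(columns=colsWithDots).head()
```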
Now Spark
Similarly, if we want to read this in as a Spark DataFrame, we’d do the following.
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
df = spark.read.csv(filepath, header=True)
df.show(5)
+----+-----------+----------------+--------+-------------+--------+-----------+----+----------+------------------+----+------+
|Rank|WeeklyGross|PctChangeWkGross|Theaters|DeltaTheaters| AvgRev|GrossToDate|Week| Thursday| name|year|Winner|
+----+-----------+----------------+--------+-------------+--------+-----------+----+----------+------------------+----+------+
|17.0| 967378| null| 14.0| null| 69098.0| 967378| 1|1990-11-18|dances with wolves|1990| True|
| 9.0| 3871641| 300.0| 14.0| null|276546.0| 4839019| 2|1990-11-25|dances with wolves|1990| True|
| 3.0| 12547813| 224.0| 1048.0| 1034.0| 11973.0| 17386832| 3|1990-12-02|dances with wolves|1990| True|
| 4.0| 9246632| -26.3| 1053.0| 5.0| 8781.0| 26633464| 4|1990-12-09|dances with wolves|1990| True|
| 4.0| 7272350| -21.4| 1051.0| -2.0| 6919.0| 33905814| 5|1990-12-16|dances with wolves|1990| True|
+----+-----------+----------------+--------+-------------+--------+-----------+----+----------+------------------+----+------+
only showing top 5 rows
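One thing worth noticing before we go on: we didn’t pass inferSchema=True, so every column comes back typed as a string. That’s convenient here, because the substring checks below then apply to the numeric columns much like the map(str) trick did in pandas. A quick way to confirm (output not shown):

```python
# with header=True and no inferSchema, every column is read as a string
df.printSchema()
```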
But trying to drop columns is a little involved.
We can specify which column names we want to keep.
df.select('Rank', 'PctChangeWkGross', 'Theaters',
          'DeltaTheaters', 'AvgRev', 'name').show(5)
+----+----------------+--------+-------------+--------+------------------+
|Rank|PctChangeWkGross|Theaters|DeltaTheaters| AvgRev| name|
+----+----------------+--------+-------------+--------+------------------+
|17.0| null| 14.0| null| 69098.0|dances with wolves|
| 9.0| 300.0| 14.0| null|276546.0|dances with wolves|
| 3.0| 224.0| 1048.0| 1034.0| 11973.0|dances with wolves|
| 4.0| -26.3| 1053.0| 5.0| 8781.0|dances with wolves|
| 4.0| -21.4| 1051.0| -2.0| 6919.0|dances with wolves|
+----+----------------+--------+-------------+--------+------------------+
only showing top 5 rows
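As an aside, Spark DataFrames do have a drop method, so if we already knew which columns were the offenders we could name them directly; a quick sketch (output not shown):

```python
# drop known columns by name; equivalent to selecting everything else
df.drop('Rank', 'WeeklyGross').show(5)
```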
But in general we won’t know the column names ahead of time, so we need to figure out how to get at them procedurally.
Building Column Conditions
We’re going to need some handy functions to facilitate this.
from pyspark.sql.functions import col, count, when
For simplicity, let’s focus on just the PctChangeWkGross column. We’ll scale up after.
So first, we’ll use the col function on a string to make an instance of the Column class.
pcwg = col('PctChangeWkGross')
So let’s check if each value contains a '.' character.
df.select(
pcwg.contains('.')
).count()
3845
Hmm, that doesn’t seem right. That’s the length of everything, because DataFrame.count() counts rows, and contains() produces a value (True, False, or null) for every single row.
df.count()
3845
So we also need to bring in the count column function.
df.select(
count(pcwg.contains('.'))
).show()
+------------------------------------+
|count(contains(PctChangeWkGross, .))|
+------------------------------------+
| 3625|
+------------------------------------+
We shed about 220 records. That’s more like it. Let’s see what happens with a column that we know doesn’t have any dots.
Thursday is all yyyy-mm-dd.
df.select(
count(col('Thursday').contains('.'))
).show()
+----------------------------+
|count(contains(Thursday, .))|
+----------------------------+
| 3845|
+----------------------------+
Still wrong. It’s counting all of the False values too. For the last step, we need a way to keep those out of the count.
This is where we’ll use the when function. The top 5 values of Thursday look like this.
df.select('Thursday').show(5)
+----------+
| Thursday|
+----------+
|1990-11-18|
|1990-11-25|
|1990-12-02|
|1990-12-09|
|1990-12-16|
+----------+
only showing top 5 rows
when takes two arguments:
- A boolean Column condition, broadcast as a check against every row
- A value to return wherever the check evaluates to True
Anything that evaluates to False becomes NULL (we’ll see shortly how to supply a fallback value instead).
df.select(when(col('Thursday').contains('11'), 0)).show(5)
+-------------------------------------------+
|CASE WHEN contains(Thursday, 11) THEN 0 END|
+-------------------------------------------+
| 0|
| 0|
| null|
| null|
| null|
+-------------------------------------------+
only showing top 5 rows
And count will skip right over the NULLs.
df.select(count(when(col('Thursday').contains('11'), 0))).show(5)
+--------------------------------------------------+
|count(CASE WHEN contains(Thursday, 11) THEN 0 END)|
+--------------------------------------------------+
| 578|
+--------------------------------------------------+
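As promised above, if you do want a fallback value instead of NULL for the rows where the check fails, the Column that when builds can be chained with otherwise. A quick sketch (output not shown):

```python
# 0 where Thursday contains '11', 1 everywhere else (no NULLs produced)
df.select(when(col('Thursday').contains('11'), 0).otherwise(1)).show(5)
```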
At Scale
Now to do this for multiple columns, we need a clever list comprehension.
df.select(
[count(when(col(x).contains('.'), 0)) for x in df.columns]
).collect()
[Row(count(CASE WHEN contains(Rank, .) THEN 0 END)=3836, count(CASE WHEN contains(WeeklyGross, .) THEN 0 END)=0, count(CASE WHEN contains(PctChangeWkGross, .) THEN 0 END)=3625, count(CASE WHEN contains(Theaters, .) THEN 0 END)=3836, count(CASE WHEN contains(DeltaTheaters, .) THEN 0 END)=3389, count(CASE WHEN contains(AvgRev, .) THEN 0 END)=3836, count(CASE WHEN contains(GrossToDate, .) THEN 0 END)=0, count(CASE WHEN contains(Week, .) THEN 0 END)=0, count(CASE WHEN contains(Thursday, .) THEN 0 END)=0, count(CASE WHEN contains(name, .) THEN 0 END)=13, count(CASE WHEN contains(year, .) THEN 0 END)=0, count(CASE WHEN contains(Winner, .) THEN 0 END)=0)]
But that looks crazy gross. alias to the rescue.
dotCounts = df.select(
[count(when(col(x).contains('.'), 0)).alias(x) for x in df.columns]
)
dotCounts.show(5)
+----+-----------+----------------+--------+-------------+------+-----------+----+--------+----+----+------+
|Rank|WeeklyGross|PctChangeWkGross|Theaters|DeltaTheaters|AvgRev|GrossToDate|Week|Thursday|name|year|Winner|
+----+-----------+----------------+--------+-------------+------+-----------+----+--------+----+----+------+
|3836| 0| 3625| 3836| 3389| 3836| 0| 0| 0| 13| 0| 0|
+----+-----------+----------------+--------+-------------+------+-----------+----+--------+----+----+------+
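A small aside before the last step: the comprehension below calls first() once per column, which fires off a separate Spark action each time. Since dotCounts is a single row anyway, you could also pull it back once and work with an ordinary dict; a quick sketch, equivalent to what follows:

```python
# collect the one row of counts, then filter locally
counts = dotCounts.first().asDict()
colsWithoutDots = [c for c, n in counts.items() if n == 0]
```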
Let’s bring it home with some list comprehension that keeps only the columns without any dots.
colsWithoutDots = [c for c in dotCounts.columns
                   if dotCounts[[c]].first()[c] == 0]
df.select(colsWithoutDots).show(5)
+-----------+-----------+----+----------+----+------+
|WeeklyGross|GrossToDate|Week| Thursday|year|Winner|
+-----------+-----------+----+----------+----+------+
| 967378| 967378| 1|1990-11-18|1990| True|
| 3871641| 4839019| 2|1990-11-25|1990| True|
| 12547813| 17386832| 3|1990-12-02|1990| True|
| 9246632| 26633464| 4|1990-12-09|1990| True|
| 7272350| 33905814| 5|1990-12-16|1990| True|
+-----------+-----------+----+----------+----+------+
only showing top 5 rows
Dropping Columns with NULL Data
The same general approach applies to a more practical task: dropping columns with NULL values.
from pyspark.sql.functions import isnan, isnull
nullCounts = df.select([count(when(isnan(c)|isnull(c), c)).alias(c) for c in df.columns])
nullCounts.show()
+----+-----------+----------------+--------+-------------+------+-----------+----+--------+----+----+------+
|Rank|WeeklyGross|PctChangeWkGross|Theaters|DeltaTheaters|AvgRev|GrossToDate|Week|Thursday|name|year|Winner|
+----+-----------+----------------+--------+-------------+------+-----------+----+--------+----+----+------+
| 9| 0| 220| 9| 456| 9| 0| 0| 0| 0| 0| 0|
+----+-----------+----------------+--------+-------------+------+-----------+----+--------+----+----+------+
nonNullCols = [c for c in nullCounts.columns if nullCounts[[c]].first()[c] == 0]
df.select(nonNullCols).show(5)
+-----------+-----------+----+----------+------------------+----+------+
|WeeklyGross|GrossToDate|Week| Thursday| name|year|Winner|
+-----------+-----------+----+----------+------------------+----+------+
| 967378| 967378| 1|1990-11-18|dances with wolves|1990| True|
| 3871641| 4839019| 2|1990-11-25|dances with wolves|1990| True|
| 12547813| 17386832| 3|1990-12-02|dances with wolves|1990| True|
| 9246632| 26633464| 4|1990-12-09|dances with wolves|1990| True|
| 7272350| 33905814| 5|1990-12-16|dances with wolves|1990| True|
+-----------+-----------+----+----------+------------------+----+------+
only showing top 5 rows
This is probably handy enough to package into a function.
def drop_null_cols(frame, threshold):
    '''
    Drop columns from a `pyspark.sql.DataFrame` that
    have `threshold` or more NULL values
    '''
    nullCounts = frame.select([count(when(isnan(c)|isnull(c), c))
                               .alias(c) for c in frame.columns])
    nonNullCols = [c for c in nullCounts.columns
                   if nullCounts[[c]].first()[c] < threshold]
    return frame.select(nonNullCols)

drop_null_cols(df, 100).show(5)
+----+-----------+--------+--------+-----------+----+----------+------------------+----+------+
|Rank|WeeklyGross|Theaters| AvgRev|GrossToDate|Week| Thursday| name|year|Winner|
+----+-----------+--------+--------+-----------+----+----------+------------------+----+------+
|17.0| 967378| 14.0| 69098.0| 967378| 1|1990-11-18|dances with wolves|1990| True|
| 9.0| 3871641| 14.0|276546.0| 4839019| 2|1990-11-25|dances with wolves|1990| True|
| 3.0| 12547813| 1048.0| 11973.0| 17386832| 3|1990-12-02|dances with wolves|1990| True|
| 4.0| 9246632| 1053.0| 8781.0| 26633464| 4|1990-12-09|dances with wolves|1990| True|
| 4.0| 7272350| 1051.0| 6919.0| 33905814| 5|1990-12-16|dances with wolves|1990| True|
+----+-----------+--------+--------+-----------+----+----------+------------------+----+------+
only showing top 5 rows
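And since the NULL check and the dot check are really the same pattern, it’s tempting to generalize one step further. A sketch of a hypothetical helper that accepts any Column predicate (not something built into Spark, just an illustration of the same idea):

```python
def drop_cols_by_count(frame, predicate, threshold):
    '''
    Hypothetical generalization of drop_null_cols: drop columns where
    `predicate(col)` holds for `threshold` or more rows
    '''
    counts = frame.select([count(when(predicate(col(c)), c)).alias(c)
                           for c in frame.columns])
    keep = [c for c in counts.columns if counts.first()[c] < threshold]
    return frame.select(keep)

# e.g. drop any column where at least one value contains a '.'
# drop_cols_by_count(df, lambda c: c.contains('.'), 1).show(5)
```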