GroupBy (as best I understand it)

Table of Contents

Data - Our Dummy Data

Overview - The Basics - Grain - GroupBy Properties

Using It - Apply - Transform - Filter

Misc - Grouper Object - Matplotlib - Gotchas - Resources

Our Dummy Data

For the purposes of demonstration, we’re going to borrow the dataset used in this post. It’s basically some generic sales record data with account numbers, client names, prices, and timestamps.

import pandas as pd

dataUrl = r"https://github.com/chris1610/pbpython/blob/master/data/sample-salesv3.xlsx?raw=True"
df = pd.read_excel(dataUrl, parse_dates=['date'])
df.head()
account number name sku quantity unit price ext price date
0 740150 Barton LLC B1-20000 39 86.69 3380.91 2014-01-01 07:21:51
1 714466 Trantow-Barrows S2-77896 -1 63.16 -63.16 2014-01-01 10:00:47
2 218895 Kulas Inc B1-69924 23 90.70 2086.10 2014-01-01 13:24:58
3 307599 Kassulke, Ondricka and Metz S1-65481 41 21.05 863.05 2014-01-01 15:05:22
4 412290 Jerde-Hilpert S2-34077 6 83.21 499.26 2014-01-01 23:26:55

How many records do we have?

len(df)
1500

In terms of categorical data, it looks like most of our grain will be determined by account number, name, and whatever sku is.

df.nunique()
account number      20
name                20
sku                 30
quantity            50
unit price        1392
ext price         1491
date              1500
dtype: int64

And our date range spans

print(df['date'].min(), df['date'].max())
2014-01-01 07:21:51 2014-12-31 12:48:35


The Basics

Split, Apply, Combine

At the heart of it, any time you use GroupBy, it's because you're trying to do some combination of the following:

  • Splitting the data based on some criteria
  • Applying a function to each group independently
  • Combining the results into a data structure

In SQL, this is achieved in one shot with the following syntax

SELECT col1, col2, AVG(col3)
FROM table
GROUP BY col1, col2

However, this quickly becomes cumbersome when exploring the data, as each new field you add requires rerunning the query.

By splitting this process up between the split and apply steps, we can stage the groups in memory and flexibly leverage the GroupBy object’s powerful methods, which generally do one of two things:

Aggregation: computing one or more summary statistics about each group, e.g. per-group:

  • sums
  • means
  • counts

This collapses the data to the grain that the data is grouped at.

Transformation: performing some group-specific computation, e.g.:

  • Standardizing data within a group
  • Filling missing data within groups using a value derived from the group

This returns a like-indexed dataset, so you wind up with as many records as you started with.
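To make the difference concrete, here's a quick sketch against the sales data we just loaded. Aggregating collapses the 1,500 rows down to one per client, while transforming keeps every row:

# aggregation: one value per group
df.groupby('name')['ext price'].sum().shape              # (20,)

# transformation: one value per original row
df.groupby('name')['ext price'].transform('sum').shape   # (1500,)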


Grain

Under the Hood

Creating a GroupBy object is pretty straight-forward. You give pandas some data and you tell it what to group by.

This creates a DataFrameGroupBy object, which is a subclass of the NDFrameGroupBy class, which is in turn a subclass of the GroupBy class.

gb = df.groupby(by='name')
gb
<pandas.core.groupby.DataFrameGroupBy object at 0x000000000AA43048>

Behind the scenes, all that’s happening is that it’s taking the index of the DataFrame

print(df.index)
RangeIndex(start=0, stop=1500, step=1)

And making a simple dict by getting all of the unique values in your group definition as keys

df['name'].unique()
array(['Barton LLC', 'Trantow-Barrows', 'Kulas Inc',
       'Kassulke, Ondricka and Metz', 'Jerde-Hilpert', 'Koepp Ltd',
       'Fritsch, Russel and Anderson', 'Kiehn-Spinka', 'Keeling LLC',
       'Frami, Hills and Schmidt', 'Stokes LLC', 'Kuhn-Gusikowski',
       'Herman LLC', 'White-Trantow', 'Sanford and Sons', 'Pollich LLC',
       'Will LLC', 'Cronin, Oberbrunner and Spencer',
       'Halvorson, Crona and Champlin', 'Purdy-Kunde'], dtype=object)

The corresponding values are just the indexes in the DataFrame that match these keys.
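Peeking at one of those entries bears this out; each value is just a slice of the DataFrame's index (a quick look, using the gb object created above):

gb.groups['Kulas Inc'][:5]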

This means that creating new GroupBy objects is both quick and cheap (your underlying data is only saved once!). It also makes it very easy to inspect the data by group.

gb.get_group('Kulas Inc').head()
account number name sku quantity unit price ext price date
2 218895 Kulas Inc B1-69924 23 90.70 2086.10 2014-01-01 13:24:58
6 218895 Kulas Inc B1-65551 2 31.10 62.20 2014-01-02 10:57:23
33 218895 Kulas Inc S1-06532 3 22.36 67.08 2014-01-09 23:58:27
36 218895 Kulas Inc S2-34077 16 73.04 1168.64 2014-01-10 12:07:30
43 218895 Kulas Inc B1-50809 43 47.21 2030.03 2014-01-12 01:54:37

After you groupby, every row of the original dataset is going to be accounted for.

sum([len(gb.get_group(x)) for x in gb.groups.keys()])
1500

Grouping

And so with this intuition in mind, you should think of the grouping criteria at the index-level.

For example:

Values in the Data

This comes in two flavors:

1) The raw data

print(len(df.groupby(by=df['name'])))
20

2) Or for convenience, the column name itself

print(len(df.groupby(by='name')))
20

To get even more granular than one field, pass the by argument in as a list of values

print(len(df.groupby(by=['name', 'sku'])))
print(len(df.groupby(by=[df['name'], df['sku']])))
544
544

Numpy Arrays the same length as the axis

Similarly, you can write your own logic as long as the output has the same index as the original data.

Consider a scenario where you define a specific group of clients and want to compare their performance against the rest of the population.

ourGroup = ['Kassulke, Ondricka and Metz', 'Jerde-Hilpert', 'Koepp Ltd']
tf = df['name'].isin(ourGroup)
tf.value_counts()
False    1265
True      235
Name: name, dtype: int64

This returns a boolean Series (one True/False value per row) that, of course, has the same index

all(tf.index == df.index)
True

Using this is as easy as

gb = df.groupby(tf)
gb.groups.keys()
dict_keys([False, True])

As you can see, this partitions our data into two groups

print(len(gb.get_group(False)), len(gb.get_group(True)))
1265 235

These match our value_counts above perfectly.
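And with the two partitions staged, comparing our group's performance against the rest of the population is a one-liner (a sketch; any aggregation you like would do):

# average and total order value: our group (True) vs. everyone else (False)
gb['ext price'].agg(['mean', 'sum'])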

Functions/Maps applied to the axes themselves

Assume that we have a real simple dummy dataset

import numpy as np

abc = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                           'foo', 'bar', 'foo', 'foo'],
                       'B' : ['one', 'one', 'two', 'three',
                              'two', 'two', 'one', 'three'],
                       'C' : np.random.randn(8),
                       'D' : np.random.randn(8)})
abc
A B C D
0 foo one 1.265959 0.555853
1 bar one -0.544206 0.358620
2 foo two 1.000534 -0.835887
3 bar three 1.872789 0.444664
4 foo two 1.684510 0.558262
5 bar two 0.970853 0.880734
6 foo one 1.366091 0.760195
7 foo three 1.638540 0.592039

And we want to group the columns by whether their names are vowels or consonants

# function implementation
def get_letter_type(letter):
    if letter.lower() in 'aeiou':
        return 'vowel'
    else:
        return 'consonant'

# dict implementation
letter_mapping = {'A': 'vowel', 'B':'consonant', 
                  'C': 'consonant', 'D': 'consonant'}
# axis=1 uses the column headers
grouped1 = abc.groupby(get_letter_type, axis=1)
grouped2 = abc.groupby(letter_mapping, axis=1)

print(grouped1.groups.values())
print(grouped2.groups.values())
dict_values([Index(['B', 'C', 'D'], dtype='object'), Index(['A'], dtype='object')])
dict_values([Index(['B', 'C', 'D'], dtype='object'), Index(['A'], dtype='object')])
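From there, these column-wise groups aggregate just like any other GroupBy. A tiny sketch, sticking to the numeric columns so the sum is well-defined:

# C and D both map to 'consonant', so their values get summed together per row
abc[['C', 'D']].groupby(get_letter_type, axis=1).sum().head()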


GroupBy Properties

Before we get into the Apply and Transform operations discussed above, here's a quick look at some of the features available the moment you instantiate a GroupBy object.

gb = df.groupby('name')
gb
<pandas.core.groupby.DataFrameGroupBy object at 0x000000000AAEC9E8>

Count of records in each group

gb.size()
name
Barton LLC                         82
Cronin, Oberbrunner and Spencer    67
Frami, Hills and Schmidt           72
Fritsch, Russel and Anderson       81
Halvorson, Crona and Champlin      58
Herman LLC                         62
Jerde-Hilpert                      89
Kassulke, Ondricka and Metz        64
Keeling LLC                        74
Kiehn-Spinka                       79
Koepp Ltd                          82
Kuhn-Gusikowski                    73
Kulas Inc                          94
Pollich LLC                        73
Purdy-Kunde                        53
Sanford and Sons                   71
Stokes LLC                         72
Trantow-Barrows                    94
White-Trantow                      86
Will LLC                           74
dtype: int64

As mentioned earlier, on the backend, a GroupBy object is essentially a dict mapping each group key to the indexes that belong to it.

Therefore, inspecting what groups exist is simple.

print(gb.groups.keys())
dict_keys(['Barton LLC', 'Cronin, Oberbrunner and Spencer', 'Frami, Hills and Schmidt', 'Fritsch, Russel and Anderson', 'Halvorson, Crona and Champlin', 'Herman LLC', 'Jerde-Hilpert', 'Kassulke, Ondricka and Metz', 'Keeling LLC', 'Kiehn-Spinka', 'Koepp Ltd', 'Kuhn-Gusikowski', 'Kulas Inc', 'Pollich LLC', 'Purdy-Kunde', 'Sanford and Sons', 'Stokes LLC', 'Trantow-Barrows', 'White-Trantow', 'Will LLC'])

as is the number of distinct groups that we have

print(len(gb.groups))
20

You can use the keys to either get the indexes that roll up to a group

gb.groups['Barton LLC']
Int64Index([   0,   85,   91,   96,   99,  105,  117,  125,  158,  182,  183,
             195,  202,  226,  237,  268,  306,  310,  397,  420,  423,  427,
             430,  456,  459,  482,  505,  516,  524,  530,  540,  605,  622,
             635,  636,  646,  650,  652,  666,  667,  691,  718,  733,  740,
             830,  853,  854,  895,  906,  907,  915,  920,  930,  963,  985,
             989, 1008, 1015, 1017, 1019, 1031, 1071, 1077, 1078, 1081, 1094,
            1108, 1132, 1147, 1149, 1158, 1175, 1208, 1218, 1257, 1321, 1330,
            1346, 1350, 1414, 1422, 1473],
           dtype='int64')

or the relevant subset of the base DataFrame

gb.get_group('Barton LLC').head()
account number name sku quantity unit price ext price date
0 740150 Barton LLC B1-20000 39 86.69 3380.91 2014-01-01 07:21:51
85 740150 Barton LLC B1-50809 8 19.60 156.80 2014-01-20 01:48:47
91 740150 Barton LLC B1-53102 1 68.06 68.06 2014-01-20 13:27:52
96 740150 Barton LLC S2-16558 2 90.91 181.82 2014-01-21 21:21:01
99 740150 Barton LLC B1-86481 20 30.41 608.20 2014-01-22 16:33:51

Lastly, using the dict-like properties, we can cleanly iterate through GroupBy objects

from itertools import islice

# only looking through the first two groups
# to save space
for name, group in islice(gb, 2):
    print(name)
    print(type(group))
    print(group.head())
    print('\n'*2)
Barton LLC
<class 'pandas.core.frame.DataFrame'>
    account number        name       sku  quantity  unit price  ext price  \
0           740150  Barton LLC  B1-20000        39       86.69    3380.91   
85          740150  Barton LLC  B1-50809         8       19.60     156.80   
91          740150  Barton LLC  B1-53102         1       68.06      68.06   
96          740150  Barton LLC  S2-16558         2       90.91     181.82   
99          740150  Barton LLC  B1-86481        20       30.41     608.20   

                  date  
0  2014-01-01 07:21:51  
85 2014-01-20 01:48:47  
91 2014-01-20 13:27:52  
96 2014-01-21 21:21:01  
99 2014-01-22 16:33:51  



Cronin, Oberbrunner and Spencer
<class 'pandas.core.frame.DataFrame'>
     account number                             name       sku  quantity  \
51           257198  Cronin, Oberbrunner and Spencer  B1-05914         8   
110          257198  Cronin, Oberbrunner and Spencer  S2-16558        41   
148          257198  Cronin, Oberbrunner and Spencer  S1-30248        23   
168          257198  Cronin, Oberbrunner and Spencer  S1-82801        46   
179          257198  Cronin, Oberbrunner and Spencer  B1-38851        47   

     unit price  ext price                date  
51        23.05     184.40 2014-01-14 01:57:35  
110       23.35     957.35 2014-01-25 23:53:42  
148       13.42     308.66 2014-02-04 19:52:57  
168       25.66    1180.36 2014-02-09 05:16:26  
179       42.82    2012.54 2014-02-11 15:33:17  

Apply

You want to use apply when you mean to consolidate all of your data down to the grain of your groups.

Example: Max - Min per Client group

Say you wanted to know, per client, what the spread of their unit price was. You could do this one of two ways:

1) Making a general function that expects a Series, grouping, then specifying which column will be the series

def spread_series(arr):
    return arr.max() - arr.min()

df.groupby('name')['unit price'].apply(spread_series).head()
name
Barton LLC                         89.24
Cronin, Oberbrunner and Spencer    86.59
Frami, Hills and Schmidt           83.04
Fritsch, Russel and Anderson       88.96
Halvorson, Crona and Champlin      87.95
Name: unit price, dtype: float64

2) The same idea, but this time the function expects a DataFrame and selects the relevant Series inside the function

def spread_df(arr):
    return arr['unit price'].max() - arr['unit price'].min()

df.groupby('name').apply(spread_df).head()
name
Barton LLC                         89.24
Cronin, Oberbrunner and Spencer    86.59
Frami, Hills and Schmidt           83.04
Fritsch, Russel and Anderson       88.96
Halvorson, Crona and Champlin      87.95
dtype: float64

.agg

A hidden third option is to use the swanky agg function, which will iterate through columns and try to apply your function

df.groupby('name').agg(spread_series).head()
account number quantity unit price ext price date
name
Barton LLC 0 50 89.24 4609.41 356 days 23:54:35
Cronin, Oberbrunner and Spencer 0 46 86.59 4103.63 349 days 12:26:02
Frami, Hills and Schmidt 0 47 83.04 4221.14 360 days 14:50:24
Fritsch, Russel and Anderson 0 48 88.96 3918.36 360 days 18:30:36
Halvorson, Crona and Champlin 0 47 87.95 4082.02 345 days 18:32:02

You can also pass it a list of Series-friendly functions and it will apply all of them

df.groupby('name').agg([spread_series, np.mean, sum]).head()
account number quantity unit price ext price
spread_series mean sum spread_series mean sum spread_series mean sum spread_series mean sum
name
Barton LLC 0 740150 60692300 50 24.890244 2041 89.24 53.769024 4409.06 4609.41 1334.615854 109438.50
Cronin, Oberbrunner and Spencer 0 257198 17232266 46 24.970149 1673 86.59 49.805821 3336.99 4103.63 1339.321642 89734.55
Frami, Hills and Schmidt 0 786968 56661696 47 26.430556 1903 83.04 54.756806 3942.49 4221.14 1438.466528 103569.59
Fritsch, Russel and Anderson 0 737550 59741550 48 26.074074 2112 88.96 53.708765 4350.41 3918.36 1385.366790 112214.71
Halvorson, Crona and Champlin 0 604255 35046790 47 22.137931 1284 87.95 55.946897 3244.92 4082.02 1206.971724 70004.36

Finally, you can pass agg a dict that specifies which functions to call per column.

The usefulness of this functionality in report automation should be pretty straight-forward

def diff_days(arr):
    return (arr.max() - arr.min()).days

result = df.groupby('name').agg({'account number':'mean', 'quantity':'sum', 
                                 'unit price':'mean', 'ext price':'mean',
                                 'date':diff_days})

# Label your data accordingly
result.columns = ['Account Number', 'Total Orders', 'Avg Unit Price', 
                  'Avg Ext Price', 'Days a Client']
result.index.name = 'Client'

result
Account Number Total Orders Avg Unit Price Avg Ext Price Days a Client
Client
Barton LLC 740150 2041 53.769024 1334.615854 356
Cronin, Oberbrunner and Spencer 257198 1673 49.805821 1339.321642 349
Frami, Hills and Schmidt 786968 1903 54.756806 1438.466528 360
Fritsch, Russel and Anderson 737550 2112 53.708765 1385.366790 360
Halvorson, Crona and Champlin 604255 1284 55.946897 1206.971724 345
Herman LLC 141962 1538 52.566935 1336.532258 353
Jerde-Hilpert 412290 1999 52.084719 1265.072247 353
Kassulke, Ondricka and Metz 307599 1647 51.043125 1350.797969 357
Keeling LLC 688981 1806 57.076081 1363.977027 356
Kiehn-Spinka 146832 1756 55.561013 1260.870506 360
Koepp Ltd 729833 1790 54.389756 1264.152927 357
Kuhn-Gusikowski 672390 1665 55.833836 1247.866849 355
Kulas Inc 218895 2265 59.661596 1461.191064 355
Pollich LLC 642753 1707 56.533151 1196.536712 356
Purdy-Kunde 163416 1450 50.340943 1469.777547 335
Sanford and Sons 527099 1704 58.341549 1391.872958 348
Stokes LLC 239344 1766 51.545278 1271.332222 360
Trantow-Barrows 714466 2271 56.180106 1312.567872 362
White-Trantow 424914 2258 58.613140 1579.558023 354
Will LLC 383080 1828 58.632973 1411.318919 354

Transform

Transform is used when you want to maintain the structure of your current DataFrame, but generate new data at the group-level.

Note: The functions that you write need to be written to process Series objects.

Ex: Largest order per Client

# apparently 'ext price' has been the order total this whole time
df['largest order'] = df.groupby('name')['ext price'].transform(lambda x: x.max())
df.head()
account number name sku quantity unit price ext price date largest order
0 740150 Barton LLC B1-20000 39 86.69 3380.91 2014-01-01 07:21:51 4543.96
1 714466 Trantow-Barrows S2-77896 -1 63.16 -63.16 2014-01-01 10:00:47 4444.30
2 218895 Kulas Inc B1-69924 23 90.70 2086.10 2014-01-01 13:24:58 4590.81
3 307599 Kassulke, Ondricka and Metz S1-65481 41 21.05 863.05 2014-01-01 15:05:22 4418.47
4 412290 Jerde-Hilpert S2-34077 6 83.21 499.26 2014-01-01 23:26:55 3791.62

Filling Missing Data

test = df.head(50).copy()
test.head()
account number name sku quantity unit price ext price date largest order
0 740150 Barton LLC B1-20000 39 86.69 3380.91 2014-01-01 07:21:51 4543.96
1 714466 Trantow-Barrows S2-77896 -1 63.16 -63.16 2014-01-01 10:00:47 4444.30
2 218895 Kulas Inc B1-69924 23 90.70 2086.10 2014-01-01 13:24:58 4590.81
3 307599 Kassulke, Ondricka and Metz S1-65481 41 21.05 863.05 2014-01-01 15:05:22 4418.47
4 412290 Jerde-Hilpert S2-34077 6 83.21 499.26 2014-01-01 23:26:55 3791.62
test.loc[[0, 2], 'unit price'] = np.nan
test.head()
account number name sku quantity unit price ext price date largest order
0 740150 Barton LLC B1-20000 39 NaN 3380.91 2014-01-01 07:21:51 4543.96
1 714466 Trantow-Barrows S2-77896 -1 63.16 -63.16 2014-01-01 10:00:47 4444.30
2 218895 Kulas Inc B1-69924 23 NaN 2086.10 2014-01-01 13:24:58 4590.81
3 307599 Kassulke, Ondricka and Metz S1-65481 41 21.05 863.05 2014-01-01 15:05:22 4418.47
4 412290 Jerde-Hilpert S2-34077 6 83.21 499.26 2014-01-01 23:26:55 3791.62
def fill_with_mean(arr):
    # returning a scalar broadcasts the group's mean to every row in the group
    return arr.mean()

test.groupby('name').transform(fill_with_mean).head()
account number quantity unit price ext price largest order
0 740150 39.000000 NaN 3380.910000 4543.96
1 714466 17.285714 57.258571 1064.988571 4444.30
2 218895 14.666667 53.544000 918.010000 4590.81
3 307599 31.000000 22.176667 743.643333 4418.47
4 412290 17.000000 52.440000 681.120000 3791.62
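Of course, that broadcast each group's mean into every cell, not just the empty ones. To fill only the gaps while leaving everything else alone, a sketch along these lines is the more common pattern:

# only the NaN cells get replaced, each with its own group's mean
test.groupby('name')['unit price'].transform(lambda x: x.fillna(x.mean())).head()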

Outlier Detection

Let’s stuff a ton of normally-distributed dummy data into a DataFrame

data = np.random.normal(size=(100000, 5))
dummy = pd.DataFrame(data, columns=['a', 'b', 'c', 'd', 'e'])

And assign them to arbitrary groups (counting off by 3s)

from itertools import cycle

countoff = cycle([1, 2, 3])
mapping = dict(zip(dummy.index, countoff))
dummy.groupby(mapping).size()
1    33334
2    33333
3    33333
dtype: int64

A good rule of thumb for flagging outliers is to find the first and third quartiles of the data, $Q_1$ and $Q_3$, and then fish for observations, $x$, that are either:

$x < Q_1 - 1.5 * (Q_3 - Q_1)$

or

$x > Q_3 + 1.5 * (Q_3 - Q_1)$

Easy enough!

def find_outliers(arr):
    '''Given a pd.Series of numeric type'''
    q3 = arr.quantile(.75)
    q1 = arr.quantile(.25)
    
    return ((arr < (q1 - (1.5*(q3-q1)))) | (arr > (q3 + (1.5*(q3-q1)))))

outliersByCell = dummy.groupby(mapping).transform(find_outliers)
pd.Series(outliersByCell.values.ravel()).value_counts()
False    496642
True       3358
dtype: int64

Hey, so some cells are outliers! But what rows contained outlier data?

Well we can be cheeky and use the property that items of bool type are represented as True=1 and False=0 on the backend and sum them. Anything greater than 0 has an outlier somewhere.

outliersByRow = outliersByCell.sum(axis=1) > 0
print(outliersByRow.value_counts())
outliersByCell[outliersByRow].head()
False    96684
True      3316
dtype: int64
a b c d e
51 False False False False True
99 False False False True False
145 False False True False False
148 False False False False True
151 False False True False False

So if we’re doing some rudimentary modelling, this will likely be one of our earlier steps.
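For instance, dropping every row that contains at least one flagged cell is a one-liner from here (a quick sketch):

# keep only the rows where no cell was flagged as an outlier
clean = dummy[~outliersByRow]
len(clean)   # 96684, matching the False count above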

Often, though, we want to hold back entire groups of data that might throw off our model, which transitions nicely into…

Filter

Anyone comfortable with pandas will likely be used to using Numpy masks to trim down their datasets. However, the filter functionality allows us to be even more specific in our exclusions.

Ex: Filtering Out Low-Participation Clients

If we ever found ourselves wanting to filter out clients that had, arbitrarily, 64 or fewer orders, we'd start by sorting on order counts to see who's on the chopping block

df.groupby('account number').size().sort_values().head()
account number
163416    53
604255    58
141962    62
307599    64
257198    67
dtype: int64

Filtering them out is as easy as writing a filtering function (at the DataFrame level) that returns True for the groups you want to keep

def no_small_potatoes(x):
    return len(x) > 64

results = df.groupby('account number').filter(no_small_potatoes)['account number']
results.isin([163416, 604255, 141962]).value_counts()
False    1263
Name: account number, dtype: int64

Ex: Clients with High Incidence of Issues

Our dataset had negative quantity and ext price values, which seemed to indicate returns/refunds.

df.groupby('account number')['quantity'].apply(lambda x: sum(x < 0)) \
                                        .sort_values(ascending=False)
account number
729833    4
218895    4
424914    3
714466    3
527099    3
688981    2
740150    2
383080    1
146832    1
239344    1
307599    1
141962    1
412290    1
672390    1
604255    0
257198    0
642753    0
163416    0
737550    0
786968    0
Name: quantity, dtype: int64

A few clients had 0 or 1 returns. This is ideal, but suppose we want to zero in on clients with 3 or more returns.

def lots_of_returns(x):
    return sum(x['quantity'] < 0) >= 3

results = df.groupby('account number').filter(lots_of_returns)
print(results['account number'].unique())
results.head()
[714466 218895 729833 424914 527099]
account number name sku quantity unit price ext price date largest order
1 714466 Trantow-Barrows S2-77896 -1 63.16 -63.16 2014-01-01 10:00:47 4444.30
2 218895 Kulas Inc B1-69924 23 90.70 2086.10 2014-01-01 13:24:58 4590.81
5 714466 Trantow-Barrows S2-77896 17 87.63 1489.71 2014-01-02 10:07:15 4444.30
6 218895 Kulas Inc B1-65551 2 31.10 62.20 2014-01-02 10:57:23 4590.81
7 729833 Koepp Ltd S1-30248 8 33.25 266.00 2014-01-03 06:32:11 4708.41


The Grouper Object

The pd.Grouper object largely exists to make groupby operations with datetime values less of a headache.

df.head()
account number name sku quantity unit price ext price date largest order
0 740150 Barton LLC B1-20000 39 86.69 3380.91 2014-01-01 07:21:51 4543.96
1 714466 Trantow-Barrows S2-77896 -1 63.16 -63.16 2014-01-01 10:00:47 4444.30
2 218895 Kulas Inc B1-69924 23 90.70 2086.10 2014-01-01 13:24:58 4590.81
3 307599 Kassulke, Ondricka and Metz S1-65481 41 21.05 863.05 2014-01-01 15:05:22 4418.47
4 412290 Jerde-Hilpert S2-34077 6 83.21 499.26 2014-01-01 23:26:55 3791.62

The old way

# resample only works on an index, not a column
df.set_index('date').resample('M')['ext price'].agg(['sum', 'count'])
sum count
date
2014-01-31 185361.66 134
2014-02-28 146211.62 108
2014-03-31 203921.38 142
2014-04-30 174574.11 134
2014-05-31 165418.55 132
2014-06-30 174089.33 128
2014-07-31 191662.11 128
2014-08-31 153778.59 117
2014-09-30 168443.17 118
2014-10-31 171495.32 126
2014-11-30 119961.22 114
2014-12-31 163867.26 119
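(As an aside, depending on your pandas version, resample can also take an on= argument pointing at a datetime column, which skips the set_index step entirely; a sketch, assuming a new enough version:)

# same monthly rollup, no index shuffling required
df.resample('M', on='date')['ext price'].agg(['sum', 'count'])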

When you introduce regular groupby operations, the syntax gets cumbersome.

df.set_index('date').groupby('name')['ext price'].resample('M').sum().head(20)
name                             date      
Barton LLC                       2014-01-31     6177.57
                                 2014-02-28    12218.03
                                 2014-03-31     3513.53
                                 2014-04-30    11474.20
                                 2014-05-31    10220.17
                                 2014-06-30    10463.73
                                 2014-07-31     6750.48
                                 2014-08-31    17541.46
                                 2014-09-30    14053.61
                                 2014-10-31     9351.68
                                 2014-11-30     4901.14
                                 2014-12-31     2772.90
Cronin, Oberbrunner and Spencer  2014-01-31     1141.75
                                 2014-02-28    13976.26
                                 2014-03-31    11691.62
                                 2014-04-30     3685.44
                                 2014-05-31     6760.11
                                 2014-06-30     5379.67
                                 2014-07-31     6020.30
                                 2014-08-31     5399.58
Name: ext price, dtype: float64

Instead

We can do all of the group definition in one shot with Grouper

df.groupby(['name', pd.Grouper(key='date', freq='M')])['ext price'].sum().head(20)
name                             date      
Barton LLC                       2014-01-31     6177.57
                                 2014-02-28    12218.03
                                 2014-03-31     3513.53
                                 2014-04-30    11474.20
                                 2014-05-31    10220.17
                                 2014-06-30    10463.73
                                 2014-07-31     6750.48
                                 2014-08-31    17541.46
                                 2014-09-30    14053.61
                                 2014-10-31     9351.68
                                 2014-11-30     4901.14
                                 2014-12-31     2772.90
Cronin, Oberbrunner and Spencer  2014-01-31     1141.75
                                 2014-02-28    13976.26
                                 2014-03-31    11691.62
                                 2014-04-30     3685.44
                                 2014-05-31     6760.11
                                 2014-06-30     5379.67
                                 2014-07-31     6020.30
                                 2014-08-31     5399.58
Name: ext price, dtype: float64

Making this an annual aggregation is as easy as changing the freq argument

df.groupby(['name', pd.Grouper(key='date', freq='A-DEC')])['ext price'].sum().head(20)
name                             date      
Barton LLC                       2014-12-31    109438.50
Cronin, Oberbrunner and Spencer  2014-12-31     89734.55
Frami, Hills and Schmidt         2014-12-31    103569.59
Fritsch, Russel and Anderson     2014-12-31    112214.71
Halvorson, Crona and Champlin    2014-12-31     70004.36
Herman LLC                       2014-12-31     82865.00
Jerde-Hilpert                    2014-12-31    112591.43
Kassulke, Ondricka and Metz      2014-12-31     86451.07
Keeling LLC                      2014-12-31    100934.30
Kiehn-Spinka                     2014-12-31     99608.77
Koepp Ltd                        2014-12-31    103660.54
Kuhn-Gusikowski                  2014-12-31     91094.28
Kulas Inc                        2014-12-31    137351.96
Pollich LLC                      2014-12-31     87347.18
Purdy-Kunde                      2014-12-31     77898.21
Sanford and Sons                 2014-12-31     98822.98
Stokes LLC                       2014-12-31     91535.92
Trantow-Barrows                  2014-12-31    123381.38
White-Trantow                    2014-12-31    135841.99
Will LLC                         2014-12-31    104437.60
Name: ext price, dtype: float64

Matplotlib

GroupBy objects also play pretty well with matplotlib… with some finagling

%pylab inline
Populating the interactive namespace from numpy and matplotlib


C:\Users\nhounshell\AppData\Local\Continuum\Anaconda3\lib\site-packages\IPython\core\magics\pylab.py:160: UserWarning: pylab import has clobbered these variables: ['test']
`%matplotlib` prevents importing * from pylab and numpy
  "\n`%matplotlib` prevents importing * from pylab and numpy"
df.head()
account number name sku quantity unit price ext price date largest order
0 740150 Barton LLC B1-20000 39 86.69 3380.91 2014-01-01 07:21:51 4543.96
1 714466 Trantow-Barrows S2-77896 -1 63.16 -63.16 2014-01-01 10:00:47 4444.30
2 218895 Kulas Inc B1-69924 23 90.70 2086.10 2014-01-01 13:24:58 4590.81
3 307599 Kassulke, Ondricka and Metz S1-65481 41 21.05 863.05 2014-01-01 15:05:22 4418.47
4 412290 Jerde-Hilpert S2-34077 6 83.21 499.26 2014-01-01 23:26:55 3791.62

What if we wanted to inspect the count of issues vs the count of orders by month?

fig, ax = plt.subplots(figsize=(12, 8))

monthlyClip = pd.Grouper(key='date', freq='M')

problemOrder = df['quantity'] < 0
problemsPerMonth = df.groupby([problemOrder, monthlyClip]).size()[True]
problemsPerMonth.plot(ax=ax, c='b', label='Issues')
ax.legend(loc=2)

ordersPerMonth = df.groupby(monthlyClip).size()
ax2 = ax.twinx()
ax2.plot(ordersPerMonth, c='orange', label='Orders')
ax2.legend(loc=0)
<matplotlib.legend.Legend at 0xdbca400>

[png: monthly count of order issues vs. total orders]

Or if we wanted to see if there was a discernible difference in the price/quantity behavior of our 5 biggest spenders?

fiveBiggestSpenders = (df.groupby(['name'])['unit price']
                          .sum()
                          .sort_values(ascending=False)
                          .index[:5])
gb = df.groupby(df['name'].isin(fiveBiggestSpenders))

fig, ax = plt.subplots(figsize=(12, 8))


for name, group in gb:
    ax.scatter(group['unit price'], group['quantity'], alpha=.5, label=name)
    
plt.legend()
<matplotlib.legend.Legend at 0xdfad0f0>

[png: unit price vs. quantity, five biggest spenders vs. everyone else]

Or wanted to examine our cumulative earnings per Client over the course of the year?

gb = df.groupby('name')

fig, ax = plt.subplots(figsize=(12, 8))

for name, group in gb:
    ax.plot(group['unit price'].cumsum(), label=name)

ax.legend()
<matplotlib.legend.Legend at 0xdd99828>

[png: cumulative totals per client over the year]


Gotchas

Multi-Index

Suppose that we wanted to know what day of the week each client did business with us on.

This is, as discussed above:

1) Creating a Series with the same index as the DataFrame representing the day of the week
2) Grouping by that and the client's name

dow = df['date'].dt.dayofweek

df.groupby(['name', dow]).size().head(20)
name                             date
Barton LLC                       0        8
                                 1       12
                                 2       15
                                 3        8
                                 4       20
                                 5       10
                                 6        9
Cronin, Oberbrunner and Spencer  0       12
                                 1       12
                                 2        7
                                 3        8
                                 4       11
                                 5       10
                                 6        7
Frami, Hills and Schmidt         0       11
                                 1        8
                                 2       10
                                 3        9
                                 4       12
                                 5        7
dtype: int64

But this is damn gross. Furthermore, we want the values in the date column to be the column headers.

What we want is the unstack function. Let’s try that again.

df.groupby(['name', dow]).size().unstack()
date 0 1 2 3 4 5 6
name
Barton LLC 8 12 15 8 20 10 9
Cronin, Oberbrunner and Spencer 12 12 7 8 11 10 7
Frami, Hills and Schmidt 11 8 10 9 12 7 15
Fritsch, Russel and Anderson 8 12 14 12 11 15 9
Halvorson, Crona and Champlin 11 8 5 6 12 5 11
Herman LLC 11 8 7 8 8 9 11
Jerde-Hilpert 12 22 16 12 8 8 11
Kassulke, Ondricka and Metz 8 9 10 10 8 8 11
Keeling LLC 15 12 10 11 8 10 8
Kiehn-Spinka 10 8 9 17 13 14 8
Koepp Ltd 11 7 16 9 18 11 10
Kuhn-Gusikowski 16 9 7 7 17 7 10
Kulas Inc 13 21 15 13 12 12 8
Pollich LLC 7 11 7 10 16 8 14
Purdy-Kunde 9 8 8 7 6 5 10
Sanford and Sons 10 8 11 7 10 16 9
Stokes LLC 2 11 10 11 18 7 13
Trantow-Barrows 16 13 9 12 21 11 12
White-Trantow 10 17 12 13 12 14 8
Will LLC 14 13 12 10 9 10 6

Much better. Similarly, we could have made an unwieldy wide table instead by either reversing the arguments in the groupby

df.groupby([dow, 'name']).size().unstack()
name Barton LLC Cronin, Oberbrunner and Spencer Frami, Hills and Schmidt Fritsch, Russel and Anderson Halvorson, Crona and Champlin Herman LLC Jerde-Hilpert Kassulke, Ondricka and Metz Keeling LLC Kiehn-Spinka Koepp Ltd Kuhn-Gusikowski Kulas Inc Pollich LLC Purdy-Kunde Sanford and Sons Stokes LLC Trantow-Barrows White-Trantow Will LLC
date
0 8 12 11 8 11 11 12 8 15 10 11 16 13 7 9 10 2 16 10 14
1 12 12 8 12 8 8 22 9 12 8 7 9 21 11 8 8 11 13 17 13
2 15 7 10 14 5 7 16 10 10 9 16 7 15 7 8 11 10 9 12 12
3 8 8 9 12 6 8 12 10 11 17 9 7 13 10 7 7 11 12 13 10
4 20 11 12 11 12 8 8 8 8 13 18 17 12 16 6 10 18 21 12 9
5 10 10 7 15 5 9 8 8 10 14 11 7 12 8 5 16 7 11 14 10
6 9 7 15 9 11 11 11 11 8 8 10 10 8 14 10 9 13 12 8 6

Or by specifying that we want level=0 instead of the default last level in unstack

df.groupby(['name', dow]).size().unstack(level=0)
name Barton LLC Cronin, Oberbrunner and Spencer Frami, Hills and Schmidt Fritsch, Russel and Anderson Halvorson, Crona and Champlin Herman LLC Jerde-Hilpert Kassulke, Ondricka and Metz Keeling LLC Kiehn-Spinka Koepp Ltd Kuhn-Gusikowski Kulas Inc Pollich LLC Purdy-Kunde Sanford and Sons Stokes LLC Trantow-Barrows White-Trantow Will LLC
date
0 8 12 11 8 11 11 12 8 15 10 11 16 13 7 9 10 2 16 10 14
1 12 12 8 12 8 8 22 9 12 8 7 9 21 11 8 8 11 13 17 13
2 15 7 10 14 5 7 16 10 10 9 16 7 15 7 8 11 10 9 12 12
3 8 8 9 12 6 8 12 10 11 17 9 7 13 10 7 7 11 12 13 10
4 20 11 12 11 12 8 8 8 8 13 18 17 12 16 6 10 18 21 12 9
5 10 10 7 15 5 9 8 8 10 14 11 7 12 8 5 16 7 11 14 10
6 9 7 15 9 11 11 11 11 8 8 10 10 8 14 10 9 13 12 8 6

.apply, .map, and .applymap

These 3 functions are used to manipulate data in pandas objects, but until you’ve used them a bunch, it’s often confusing which one you want in order to solve a given problem.

A decent-enough rule of thumb I use is:

.apply applies a function along an axis of a DataFrame (column-by-column by default)

df.apply(max)
account number                 786968
name                         Will LLC
sku                          S2-83881
quantity                           49
unit price                      99.85
ext price                     4824.54
date              2014-12-31 12:48:35
largest order                 4824.54
dtype: object

.map applies a mapping function to each value in a Series

df['name'].map(len).head()
0    10
1    15
2     9
3    27
4    13
Name: name, dtype: int64

So when considering .applymap, it might help to think about how

apply (1) + map (2) = applymap

therefore, you can “add” their functionality as well:

.applymap applies a mapping to each value (2) of a given DataFrame (1)

df.applymap(type).head()
account number name sku quantity unit price ext price date largest order
0 <class 'int'> <class 'str'> <class 'str'> <class 'int'> <class 'float'> <class 'float'> <class 'pandas._libs.tslib.Timestamp'> <class 'float'>
1 <class 'int'> <class 'str'> <class 'str'> <class 'int'> <class 'float'> <class 'float'> <class 'pandas._libs.tslib.Timestamp'> <class 'float'>
2 <class 'int'> <class 'str'> <class 'str'> <class 'int'> <class 'float'> <class 'float'> <class 'pandas._libs.tslib.Timestamp'> <class 'float'>
3 <class 'int'> <class 'str'> <class 'str'> <class 'int'> <class 'float'> <class 'float'> <class 'pandas._libs.tslib.Timestamp'> <class 'float'>
4 <class 'int'> <class 'str'> <class 'str'> <class 'int'> <class 'float'> <class 'float'> <class 'pandas._libs.tslib.Timestamp'> <class 'float'>


Resources