GroupBy (as best I understand it)
Table of Contents
Data
- Our Dummy Data
Overview
- The Basics
- Grain
- GroupBy Object
Using It
- Apply
- Transform
- Filter
Misc
- Grouper Object
- Matplotlib
- Gotchas
- Resources
Our Dummy Data
For the purposes of demonstration, we’re going to borrow the dataset used in this post. It’s basically some generic sales record data with account numbers, client names, prices, and timestamps.
import pandas as pd
dataUrl = r"https://github.com/chris1610/pbpython/blob/master/data/sample-salesv3.xlsx?raw=True"
df = pd.read_excel(dataUrl, parse_dates=['date'])
df.head()
account number | name | sku | quantity | unit price | ext price | date | |
---|---|---|---|---|---|---|---|
0 | 740150 | Barton LLC | B1-20000 | 39 | 86.69 | 3380.91 | 2014-01-01 07:21:51 |
1 | 714466 | Trantow-Barrows | S2-77896 | -1 | 63.16 | -63.16 | 2014-01-01 10:00:47 |
2 | 218895 | Kulas Inc | B1-69924 | 23 | 90.70 | 2086.10 | 2014-01-01 13:24:58 |
3 | 307599 | Kassulke, Ondricka and Metz | S1-65481 | 41 | 21.05 | 863.05 | 2014-01-01 15:05:22 |
4 | 412290 | Jerde-Hilpert | S2-34077 | 6 | 83.21 | 499.26 | 2014-01-01 23:26:55 |
How many different records do we have?
len(df)
1500
In terms of categorical data, it looks like most of our grain will be determined by account number, name, and whatever sku is.
df.nunique()
account number 20
name 20
sku 30
quantity 50
unit price 1392
ext price 1491
date 1500
dtype: int64
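Since account number and name each have 20 distinct values, a quick sanity check (just a sketch) can confirm they line up one-to-one and describe the same grain; if they do, this should print 20:
# if account number and name are one-to-one, there should be
# exactly 20 distinct (account number, name) pairs
print(len(df[['account number', 'name']].drop_duplicates()))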
And our date range spans
print(df['date'].min(), df['date'].max())
2014-01-01 07:21:51 2014-12-31 12:48:35
The Basics
Split, Apply, Combine
At the heart of it, any time you use Group By, it’s because you’re trying to do one or more of the following:
- Splitting the data based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure
In SQL, this is achieved in one shot with the following syntax
SELECT col1, col2, AVG(col3)
FROM table
GROUP BY col1, col2
However, this quickly becomes cumbersome when exploring the data, as each new field you add requires rerunning the query.
By splitting this process up between the split and apply steps, we can stage the groups in memory and flexibly leverage the GroupBy object’s powerful methods, which generally do one of two things:
Aggregation: computing summary statistic(s) about each group, e.g. per-group:
- sums
- means
- counts
This collapses the data to the grain that the data is grouped at.
Transformation: perform some group-specific computations. e.g.:
- Standardizing data within a group
- Filling missing data within groups using a value derived from the group
This returns a like-indexed dataset, so you will wind up with as many records as your initial population.
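To make that distinction concrete, here’s a minimal sketch against the sales data loaded above: aggregation collapses to one row per client, while transformation stays at one row per order.
gb = df.groupby('name')
# aggregation: one value per group
per_client_total = gb['ext price'].sum()
print(len(per_client_total))    # 20 -- one row per client
# transformation: one value per original record, like-indexed with df
client_total_per_row = gb['ext price'].transform('sum')
print(len(client_total_per_row))    # 1500 -- same length as df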
Grain
Under the Hood
Creating a GroupBy object is pretty straight-forward. You give pandas some data and you tell it what to group by. This creates a DataFrameGroupBy object, which is a sub-class of the NDFrameGroupBy class, which is in turn a sub-class of the GroupBy class.
gb = df.groupby(by='name')
gb
<pandas.core.groupby.DataFrameGroupBy object at 0x000000000AA43048>
Behind the scenes, all that’s happening is that it’s taking the index of the DataFrame
print(df.index)
RangeIndex(start=0, stop=1500, step=1)
And making a simple dict by getting all of the unique values in your group definition as keys
df['name'].unique()
array(['Barton LLC', 'Trantow-Barrows', 'Kulas Inc',
'Kassulke, Ondricka and Metz', 'Jerde-Hilpert', 'Koepp Ltd',
'Fritsch, Russel and Anderson', 'Kiehn-Spinka', 'Keeling LLC',
'Frami, Hills and Schmidt', 'Stokes LLC', 'Kuhn-Gusikowski',
'Herman LLC', 'White-Trantow', 'Sanford and Sons', 'Pollich LLC',
'Will LLC', 'Cronin, Oberbrunner and Spencer',
'Halvorson, Crona and Champlin', 'Purdy-Kunde'], dtype=object)
The corresponding values are just the indexes in the DataFrame that match these keys.
This means that creating new GroupBy objects is both quick and cheap (your underlying data is only saved once!). It also makes it very easy to inspect the data by group.
gb.get_group('Kulas Inc').head()
account number | name | sku | quantity | unit price | ext price | date | |
---|---|---|---|---|---|---|---|
2 | 218895 | Kulas Inc | B1-69924 | 23 | 90.70 | 2086.10 | 2014-01-01 13:24:58 |
6 | 218895 | Kulas Inc | B1-65551 | 2 | 31.10 | 62.20 | 2014-01-02 10:57:23 |
33 | 218895 | Kulas Inc | S1-06532 | 3 | 22.36 | 67.08 | 2014-01-09 23:58:27 |
36 | 218895 | Kulas Inc | S2-34077 | 16 | 73.04 | 1168.64 | 2014-01-10 12:07:30 |
43 | 218895 | Kulas Inc | B1-50809 | 43 | 47.21 | 2030.03 | 2014-01-12 01:54:37 |
After you groupby, every row of the original dataset is going to be accounted for.
sum([len(gb.get_group(x)) for x in gb.groups.keys()])
1500
Grouping
And so with this intuition in mind, you should think of grouping criteria as operating at the index level.
For example:
Values in the Data
This comes in two flavors:
1) The raw data
print(len(df.groupby(by=df['name'])))
20
2) Or for convenience, the column name itself
print(len(df.groupby(by='name')))
20
To get even more granular than one field, pass the by argument in as a list of values
print(len(df.groupby(by=['name', 'sku'])))
print(len(df.groupby(by=[df['name'], df['sku']])))
544
544
Numpy Arrays the same length as the axis
Similarly, you can write your own logic as long as the output has the same index as the original data.
Consider a scenario where you define a specific group of clients and want to compare their performance against the rest of the population.
ourGroup = ['Kassulke, Ondricka and Metz', 'Jerde-Hilpert', 'Koepp Ltd']
tf = df['name'].isin(ourGroup)
tf.value_counts()
False 1265
True 235
Name: name, dtype: int64
This returns a boolean Series with a True/False value for each row that, of course, has the same index
all(tf.index == df.index)
True
Using this is as easy as
gb = df.groupby(tf)
gb.groups.keys()
dict_keys([False, True])
As you can see, this partitions our data into two groups
print(len(gb.get_group(False)), len(gb.get_group(True)))
1265 235
These counts match our value_counts above perfectly.
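And with the groups staged, comparing our hand-picked clients (True) against the rest of the population (False) is a one-liner; any aggregation would do, this sketch just uses the mean:
# average order value for our group vs everyone else
gb['ext price'].mean()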
Functions/Maps applied to the axes themselves
Assume that we have a real simple dummy dataset
import numpy as np
abc = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
'foo', 'bar', 'foo', 'foo'],
'B' : ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'C' : np.random.randn(8),
'D' : np.random.randn(8)})
abc
A | B | C | D | |
---|---|---|---|---|
0 | foo | one | 1.265959 | 0.555853 |
1 | bar | one | -0.544206 | 0.358620 |
2 | foo | two | 1.000534 | -0.835887 |
3 | bar | three | 1.872789 | 0.444664 |
4 | foo | two | 1.684510 | 0.558262 |
5 | bar | two | 0.970853 | 0.880734 |
6 | foo | one | 1.366091 | 0.760195 |
7 | foo | three | 1.638540 | 0.592039 |
And we want to group the columns by whether their names are vowels or consonants
# function implementation
def get_letter_type(letter):
if letter.lower() in 'aeiou':
return 'vowel'
else:
return 'consonant'
# dict implementation
letter_mapping = {'A': 'vowel', 'B':'consonant',
'C': 'consonant', 'D': 'consonant'}
# axis=1 uses the column headers
grouped1 = abc.groupby(get_letter_type, axis=1)
grouped2 = abc.groupby(letter_mapping, axis=1)
print(grouped1.groups.values())
print(grouped2.groups.values())
dict_values([Index(['B', 'C', 'D'], dtype='object'), Index(['A'], dtype='object')])
dict_values([Index(['B', 'C', 'D'], dtype='object'), Index(['A'], dtype='object')])
GroupBy Properties
Before we get into the Apply and Transform operations discussed above, here’s a quick look at some of the features available the moment a GroupBy object is instantiated.
gb = df.groupby('name')
gb
<pandas.core.groupby.DataFrameGroupBy object at 0x000000000AAEC9E8>
Count of records in each group
gb.size()
name
Barton LLC 82
Cronin, Oberbrunner and Spencer 67
Frami, Hills and Schmidt 72
Fritsch, Russel and Anderson 81
Halvorson, Crona and Champlin 58
Herman LLC 62
Jerde-Hilpert 89
Kassulke, Ondricka and Metz 64
Keeling LLC 74
Kiehn-Spinka 79
Koepp Ltd 82
Kuhn-Gusikowski 73
Kulas Inc 94
Pollich LLC 73
Purdy-Kunde 53
Sanford and Sons 71
Stokes LLC 72
Trantow-Barrows 94
White-Trantow 86
Will LLC 74
dtype: int64
As mentioned earlier, on the backend, all a GroupBy object really is is a dict with each group as a key and its row indexes as the value. Therefore, inspecting what groups exist is simple.
print(gb.groups.keys())
dict_keys(['Barton LLC', 'Cronin, Oberbrunner and Spencer', 'Frami, Hills and Schmidt', 'Fritsch, Russel and Anderson', 'Halvorson, Crona and Champlin', 'Herman LLC', 'Jerde-Hilpert', 'Kassulke, Ondricka and Metz', 'Keeling LLC', 'Kiehn-Spinka', 'Koepp Ltd', 'Kuhn-Gusikowski', 'Kulas Inc', 'Pollich LLC', 'Purdy-Kunde', 'Sanford and Sons', 'Stokes LLC', 'Trantow-Barrows', 'White-Trantow', 'Will LLC'])
as is the distinct number of groups that we have
print(len(gb.groups))
20
You can use the keys to either get the indexes that roll up to a group
gb.groups['Barton LLC']
Int64Index([ 0, 85, 91, 96, 99, 105, 117, 125, 158, 182, 183,
195, 202, 226, 237, 268, 306, 310, 397, 420, 423, 427,
430, 456, 459, 482, 505, 516, 524, 530, 540, 605, 622,
635, 636, 646, 650, 652, 666, 667, 691, 718, 733, 740,
830, 853, 854, 895, 906, 907, 915, 920, 930, 963, 985,
989, 1008, 1015, 1017, 1019, 1031, 1071, 1077, 1078, 1081, 1094,
1108, 1132, 1147, 1149, 1158, 1175, 1208, 1218, 1257, 1321, 1330,
1346, 1350, 1414, 1422, 1473],
dtype='int64')
or the relevant subset of the base DataFrame
gb.get_group('Barton LLC').head()
account number | name | sku | quantity | unit price | ext price | date | |
---|---|---|---|---|---|---|---|
0 | 740150 | Barton LLC | B1-20000 | 39 | 86.69 | 3380.91 | 2014-01-01 07:21:51 |
85 | 740150 | Barton LLC | B1-50809 | 8 | 19.60 | 156.80 | 2014-01-20 01:48:47 |
91 | 740150 | Barton LLC | B1-53102 | 1 | 68.06 | 68.06 | 2014-01-20 13:27:52 |
96 | 740150 | Barton LLC | S2-16558 | 2 | 90.91 | 181.82 | 2014-01-21 21:21:01 |
99 | 740150 | Barton LLC | B1-86481 | 20 | 30.41 | 608.20 | 2014-01-22 16:33:51 |
Lastly, using the dict-like properties, we can cleanly iterate through GroupBy objects
from itertools import islice
# only looking through the first two groups
# to save space
for name, group in islice(gb, 2):
print(name)
print(type(group))
print(group.head())
print('\n'*2)
Barton LLC
<class 'pandas.core.frame.DataFrame'>
account number name sku quantity unit price ext price \
0 740150 Barton LLC B1-20000 39 86.69 3380.91
85 740150 Barton LLC B1-50809 8 19.60 156.80
91 740150 Barton LLC B1-53102 1 68.06 68.06
96 740150 Barton LLC S2-16558 2 90.91 181.82
99 740150 Barton LLC B1-86481 20 30.41 608.20
date
0 2014-01-01 07:21:51
85 2014-01-20 01:48:47
91 2014-01-20 13:27:52
96 2014-01-21 21:21:01
99 2014-01-22 16:33:51
Cronin, Oberbrunner and Spencer
<class 'pandas.core.frame.DataFrame'>
account number name sku quantity \
51 257198 Cronin, Oberbrunner and Spencer B1-05914 8
110 257198 Cronin, Oberbrunner and Spencer S2-16558 41
148 257198 Cronin, Oberbrunner and Spencer S1-30248 23
168 257198 Cronin, Oberbrunner and Spencer S1-82801 46
179 257198 Cronin, Oberbrunner and Spencer B1-38851 47
unit price ext price date
51 23.05 184.40 2014-01-14 01:57:35
110 23.35 957.35 2014-01-25 23:53:42
148 13.42 308.66 2014-02-04 19:52:57
168 25.66 1180.36 2014-02-09 05:16:26
179 42.82 2012.54 2014-02-11 15:33:17
Apply
You want to use apply when you mean to consolidate all of your data down to the grain of your groups.
Example: Max - Min per Client group
Say you wanted to know, per client, what the spread of their unit price was. You could do this one of two ways:
1) Making a general function that expects a Series, grouping, then specifying which column will be the series
def spread_series(arr):
return arr.max() - arr.min()
df.groupby('name')['unit price'].apply(spread_series).head()
name
Barton LLC 89.24
Cronin, Oberbrunner and Spencer 86.59
Frami, Hills and Schmidt 83.04
Fritsch, Russel and Anderson 88.96
Halvorson, Crona and Champlin 87.95
Name: unit price, dtype: float64
2) Writing a similar function, but this time it expects a DataFrame and selects the relevant Series inside the function
def spread_df(arr):
return arr['unit price'].max() - arr['unit price'].min()
df.groupby('name').apply(spread_df).head()
name
Barton LLC 89.24
Cronin, Oberbrunner and Spencer 86.59
Frami, Hills and Schmidt 83.04
Fritsch, Russel and Anderson 88.96
Halvorson, Crona and Champlin 87.95
dtype: float64
.agg
A hidden third option is to use the swanky agg function that will iterate through columns and try to apply your function
df.groupby('name').agg(spread_series).head()
account number | quantity | unit price | ext price | date | |
---|---|---|---|---|---|
name | |||||
Barton LLC | 0 | 50 | 89.24 | 4609.41 | 356 days 23:54:35 |
Cronin, Oberbrunner and Spencer | 0 | 46 | 86.59 | 4103.63 | 349 days 12:26:02 |
Frami, Hills and Schmidt | 0 | 47 | 83.04 | 4221.14 | 360 days 14:50:24 |
Fritsch, Russel and Anderson | 0 | 48 | 88.96 | 3918.36 | 360 days 18:30:36 |
Halvorson, Crona and Champlin | 0 | 47 | 87.95 | 4082.02 | 345 days 18:32:02 |
You can also pass it a list of Series-friendly functions and it will apply all of them
df.groupby('name').agg([spread_series, np.mean, sum]).head()
account number | quantity | unit price | ext price | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
spread_series | mean | sum | spread_series | mean | sum | spread_series | mean | sum | spread_series | mean | sum | |
name | ||||||||||||
Barton LLC | 0 | 740150 | 60692300 | 50 | 24.890244 | 2041 | 89.24 | 53.769024 | 4409.06 | 4609.41 | 1334.615854 | 109438.50 |
Cronin, Oberbrunner and Spencer | 0 | 257198 | 17232266 | 46 | 24.970149 | 1673 | 86.59 | 49.805821 | 3336.99 | 4103.63 | 1339.321642 | 89734.55 |
Frami, Hills and Schmidt | 0 | 786968 | 56661696 | 47 | 26.430556 | 1903 | 83.04 | 54.756806 | 3942.49 | 4221.14 | 1438.466528 | 103569.59 |
Fritsch, Russel and Anderson | 0 | 737550 | 59741550 | 48 | 26.074074 | 2112 | 88.96 | 53.708765 | 4350.41 | 3918.36 | 1385.366790 | 112214.71 |
Halvorson, Crona and Champlin | 0 | 604255 | 35046790 | 47 | 22.137931 | 1284 | 87.95 | 55.946897 | 3244.92 | 4082.02 | 1206.971724 | 70004.36 |
Finally, you can pass agg a dict that specifies which functions to call per column.
The usefulness of this functionality in report automation should be pretty straight-forward.
def diff_days(arr):
return (arr.max() - arr.min()).days
result = df.groupby('name').agg({'account number':'mean', 'quantity':'sum',
'unit price':'mean', 'ext price':'mean',
'date':diff_days})
# Label your data accordingly
result.columns = ['Account Number', 'Total Orders', 'Avg Unit Price',
'Avg Ext Price', 'Days a Client']
result.index.name = 'Client'
result
Account Number | Total Orders | Avg Unit Price | Avg Ext Price | Days a Client | |
---|---|---|---|---|---|
Client | |||||
Barton LLC | 740150 | 2041 | 53.769024 | 1334.615854 | 356 |
Cronin, Oberbrunner and Spencer | 257198 | 1673 | 49.805821 | 1339.321642 | 349 |
Frami, Hills and Schmidt | 786968 | 1903 | 54.756806 | 1438.466528 | 360 |
Fritsch, Russel and Anderson | 737550 | 2112 | 53.708765 | 1385.366790 | 360 |
Halvorson, Crona and Champlin | 604255 | 1284 | 55.946897 | 1206.971724 | 345 |
Herman LLC | 141962 | 1538 | 52.566935 | 1336.532258 | 353 |
Jerde-Hilpert | 412290 | 1999 | 52.084719 | 1265.072247 | 353 |
Kassulke, Ondricka and Metz | 307599 | 1647 | 51.043125 | 1350.797969 | 357 |
Keeling LLC | 688981 | 1806 | 57.076081 | 1363.977027 | 356 |
Kiehn-Spinka | 146832 | 1756 | 55.561013 | 1260.870506 | 360 |
Koepp Ltd | 729833 | 1790 | 54.389756 | 1264.152927 | 357 |
Kuhn-Gusikowski | 672390 | 1665 | 55.833836 | 1247.866849 | 355 |
Kulas Inc | 218895 | 2265 | 59.661596 | 1461.191064 | 355 |
Pollich LLC | 642753 | 1707 | 56.533151 | 1196.536712 | 356 |
Purdy-Kunde | 163416 | 1450 | 50.340943 | 1469.777547 | 335 |
Sanford and Sons | 527099 | 1704 | 58.341549 | 1391.872958 | 348 |
Stokes LLC | 239344 | 1766 | 51.545278 | 1271.332222 | 360 |
Trantow-Barrows | 714466 | 2271 | 56.180106 | 1312.567872 | 362 |
White-Trantow | 424914 | 2258 | 58.613140 | 1579.558023 | 354 |
Will LLC | 383080 | 1828 | 58.632973 | 1411.318919 | 354 |
Transform
Transform is used when you want to maintain the structure of your current DataFrame, but generate new data at the group-level.
Note: The functions that you write need to be written to process Series objects.
Ex: Largest order per Client
# apparently 'ext price' has been the order total this whole time
df['largest order'] = df.groupby('name')['ext price'].transform(lambda x: x.max())
df.head()
account number | name | sku | quantity | unit price | ext price | date | largest order | |
---|---|---|---|---|---|---|---|---|
0 | 740150 | Barton LLC | B1-20000 | 39 | 86.69 | 3380.91 | 2014-01-01 07:21:51 | 4543.96 |
1 | 714466 | Trantow-Barrows | S2-77896 | -1 | 63.16 | -63.16 | 2014-01-01 10:00:47 | 4444.30 |
2 | 218895 | Kulas Inc | B1-69924 | 23 | 90.70 | 2086.10 | 2014-01-01 13:24:58 | 4590.81 |
3 | 307599 | Kassulke, Ondricka and Metz | S1-65481 | 41 | 21.05 | 863.05 | 2014-01-01 15:05:22 | 4418.47 |
4 | 412290 | Jerde-Hilpert | S2-34077 | 6 | 83.21 | 499.26 | 2014-01-01 23:26:55 | 3791.62 |
Filling Missing Data
test = df.head(50).copy()
test.head()
account number | name | sku | quantity | unit price | ext price | date | largest order | |
---|---|---|---|---|---|---|---|---|
0 | 740150 | Barton LLC | B1-20000 | 39 | 86.69 | 3380.91 | 2014-01-01 07:21:51 | 4543.96 |
1 | 714466 | Trantow-Barrows | S2-77896 | -1 | 63.16 | -63.16 | 2014-01-01 10:00:47 | 4444.30 |
2 | 218895 | Kulas Inc | B1-69924 | 23 | 90.70 | 2086.10 | 2014-01-01 13:24:58 | 4590.81 |
3 | 307599 | Kassulke, Ondricka and Metz | S1-65481 | 41 | 21.05 | 863.05 | 2014-01-01 15:05:22 | 4418.47 |
4 | 412290 | Jerde-Hilpert | S2-34077 | 6 | 83.21 | 499.26 | 2014-01-01 23:26:55 | 3791.62 |
test.loc[[0, 2], 'unit price'] = np.nan
test.head()
account number | name | sku | quantity | unit price | ext price | date | largest order | |
---|---|---|---|---|---|---|---|---|
0 | 740150 | Barton LLC | B1-20000 | 39 | NaN | 3380.91 | 2014-01-01 07:21:51 | 4543.96 |
1 | 714466 | Trantow-Barrows | S2-77896 | -1 | 63.16 | -63.16 | 2014-01-01 10:00:47 | 4444.30 |
2 | 218895 | Kulas Inc | B1-69924 | 23 | NaN | 2086.10 | 2014-01-01 13:24:58 | 4590.81 |
3 | 307599 | Kassulke, Ondricka and Metz | S1-65481 | 41 | 21.05 | 863.05 | 2014-01-01 15:05:22 | 4418.47 |
4 | 412290 | Jerde-Hilpert | S2-34077 | 6 | 83.21 | 499.26 | 2014-01-01 23:26:55 | 3791.62 |
def fill_with_mean(arr):
return arr.mean()
test.groupby('name').transform(fill_with_mean).head()
account number | quantity | unit price | ext price | largest order | |
---|---|---|---|---|---|
0 | 740150 | 39.000000 | NaN | 3380.910000 | 4543.96 |
1 | 714466 | 17.285714 | 57.258571 | 1064.988571 | 4444.30 |
2 | 218895 | 14.666667 | 53.544000 | 918.010000 | 4590.81 |
3 | 307599 | 31.000000 | 22.176667 | 743.643333 | 4418.47 |
4 | 412290 | 17.000000 | 52.440000 | 681.120000 | 3791.62 |
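Note that, as written, fill_with_mean replaces every value with its group’s mean. If the goal is to fill only the missing values and leave everything else untouched, a fillna inside the transform does the trick; here’s a sketch against the same test frame:
# fill only the NaNs in 'unit price' with that client's mean unit price;
# a group with no non-missing values (like Barton LLC in this 50-row sample)
# stays NaN
test['unit price'] = (test.groupby('name')['unit price']
                          .transform(lambda s: s.fillna(s.mean())))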
Outlier Detection
Let’s stuff a ton of normally-distributed dummy data into a DataFrame
data = np.random.normal(size=(100000, 5))
dummy = pd.DataFrame(data, columns=['a', 'b', 'c', 'd', 'e'])
And assign them to arbitrary groups (counting off in threes)
from itertools import cycle
countoff = cycle([1, 2, 3])
mapping = dict(zip(dummy.index, countoff))
dummy.groupby(mapping).size()
1 33334
2 33333
3 33333
dtype: int64
A good rule of thumb for an outlier is finding the quartiles of each column, $Q_1, Q_2, Q_3$, and then fishing for observations, $x$, that are either

$x < Q_1 - 1.5 \times (Q_3 - Q_1)$ or $x > Q_3 + 1.5 \times (Q_3 - Q_1)$

Easy enough!
def find_outliers(arr):
'''Given a pd.Series of numeric type'''
q3 = arr.quantile(.75)
q1 = arr.quantile(.25)
return ((arr < (q1 - (1.5*(q3-q1)))) | (arr > (q3 + (1.5*(q3-q1)))))
outliersByCell = dummy.groupby(mapping).transform(find_outliers)
pd.Series(outliersByCell.values.ravel()).value_counts()
False 496642
True 3358
dtype: int64
Hey, so some cells are outliers! But what rows contained outlier data?
Well we can be cheeky and use the property that items of bool type are represented as True=1 and False=0 on the backend and sum them. Anything greater than 0 has an outlier somewhere.
outliersByRow = outliersByCell.sum(axis=1) > 0
print(outliersByRow.value_counts())
outliersByCell[outliersByRow].head()
False 96684
True 3316
dtype: int64
a | b | c | d | e | |
---|---|---|---|---|---|
51 | False | False | False | False | True |
99 | False | False | False | True | False |
145 | False | False | True | False | False |
148 | False | False | False | False | True |
151 | False | False | True | False | False |
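From here, holding those rows back is just a boolean mask (a quick sketch):
# keep only the rows with no outlier cells
clean = dummy[~outliersByRow]
print(len(clean))    # 96684, per the value_counts above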
So if we’re doing some rudimentary modelling, this will likely be one of our earlier steps.
Next, we want to be able to hold back data that’s going to throw off our model. Which transitions nicely into…
Filter
Anyone comfortable with pandas will likely be used to using Numpy masks to trim down their datasets. However, the filter functionality allows us to be even more specific in our exclusions.
Ex: Filtering Out Low-Participation Clients
If we ever found ourselves wanting to filter out clients that had, arbitrarily, 64 or fewer orders, the candidates would be the first few entries of this sorted series
df.groupby('account number').size().sort_values().head()
account number
163416 53
604255 58
141962 62
307599 64
257198 67
dtype: int64
Filtering these out is as easy as writing a filtering function, at the DataFrame level, describing the things that you want to keep
def no_small_potatoes(x):
return len(x) > 64
results = df.groupby('account number').filter(no_small_potatoes)['account number']
results.isin([163416, 604255, 141962]).value_counts()
False 1263
Name: account number, dtype: int64
Ex: Clients with High Incidence of Issues
Our dataset had negative ext price values, which seemed to indicate returns/refunds.
df.groupby('account number')['quantity'].apply(lambda x: sum(x < 0)) \
.sort_values(ascending=False)
account number
729833 4
218895 4
424914 3
714466 3
527099 3
688981 2
740150 2
383080 1
146832 1
239344 1
307599 1
141962 1
412290 1
672390 1
604255 0
257198 0
642753 0
163416 0
737550 0
786968 0
Name: quantity, dtype: int64
Most clients had 0 or 1 returns. This is ideal, but suppose we want to zero in on clients with 3 or more.
def lots_of_returns(x):
return sum(x['quantity'] < 0) >= 3
results = df.groupby('account number').filter(lots_of_returns)
print(results['account number'].unique())
results.head()
[714466 218895 729833 424914 527099]
account number | name | sku | quantity | unit price | ext price | date | largest order | |
---|---|---|---|---|---|---|---|---|
1 | 714466 | Trantow-Barrows | S2-77896 | -1 | 63.16 | -63.16 | 2014-01-01 10:00:47 | 4444.30 |
2 | 218895 | Kulas Inc | B1-69924 | 23 | 90.70 | 2086.10 | 2014-01-01 13:24:58 | 4590.81 |
5 | 714466 | Trantow-Barrows | S2-77896 | 17 | 87.63 | 1489.71 | 2014-01-02 10:07:15 | 4444.30 |
6 | 218895 | Kulas Inc | B1-65551 | 2 | 31.10 | 62.20 | 2014-01-02 10:57:23 | 4590.81 |
7 | 729833 | Koepp Ltd | S1-30248 | 8 | 33.25 | 266.00 | 2014-01-03 06:32:11 | 4708.41 |
The Grouper Object
The pd.Grouper object largely exists to make groupby operations with datetime values less of a headache.
df.head()
account number | name | sku | quantity | unit price | ext price | date | largest order | |
---|---|---|---|---|---|---|---|---|
0 | 740150 | Barton LLC | B1-20000 | 39 | 86.69 | 3380.91 | 2014-01-01 07:21:51 | 4543.96 |
1 | 714466 | Trantow-Barrows | S2-77896 | -1 | 63.16 | -63.16 | 2014-01-01 10:00:47 | 4444.30 |
2 | 218895 | Kulas Inc | B1-69924 | 23 | 90.70 | 2086.10 | 2014-01-01 13:24:58 | 4590.81 |
3 | 307599 | Kassulke, Ondricka and Metz | S1-65481 | 41 | 21.05 | 863.05 | 2014-01-01 15:05:22 | 4418.47 |
4 | 412290 | Jerde-Hilpert | S2-34077 | 6 | 83.21 | 499.26 | 2014-01-01 23:26:55 | 3791.62 |
The old way
# resample only works on an index, not a column
df.set_index('date').resample('M')['ext price'].agg(['sum', 'count'])
sum | count | |
---|---|---|
date | ||
2014-01-31 | 185361.66 | 134 |
2014-02-28 | 146211.62 | 108 |
2014-03-31 | 203921.38 | 142 |
2014-04-30 | 174574.11 | 134 |
2014-05-31 | 165418.55 | 132 |
2014-06-30 | 174089.33 | 128 |
2014-07-31 | 191662.11 | 128 |
2014-08-31 | 153778.59 | 117 |
2014-09-30 | 168443.17 | 118 |
2014-10-31 | 171495.32 | 126 |
2014-11-30 | 119961.22 | 114 |
2014-12-31 | 163867.26 | 119 |
When you introduce regular groupby operations, the syntax gets cumbersome.
df.set_index('date').groupby('name')['ext price'].resample('M').sum().head(20)
name date
Barton LLC 2014-01-31 6177.57
2014-02-28 12218.03
2014-03-31 3513.53
2014-04-30 11474.20
2014-05-31 10220.17
2014-06-30 10463.73
2014-07-31 6750.48
2014-08-31 17541.46
2014-09-30 14053.61
2014-10-31 9351.68
2014-11-30 4901.14
2014-12-31 2772.90
Cronin, Oberbrunner and Spencer 2014-01-31 1141.75
2014-02-28 13976.26
2014-03-31 11691.62
2014-04-30 3685.44
2014-05-31 6760.11
2014-06-30 5379.67
2014-07-31 6020.30
2014-08-31 5399.58
Name: ext price, dtype: float64
Instead
We can do all of the group definition in one shot with Grouper
df.groupby(['name', pd.Grouper(key='date', freq='M')])['ext price'].sum().head(20)
name date
Barton LLC 2014-01-31 6177.57
2014-02-28 12218.03
2014-03-31 3513.53
2014-04-30 11474.20
2014-05-31 10220.17
2014-06-30 10463.73
2014-07-31 6750.48
2014-08-31 17541.46
2014-09-30 14053.61
2014-10-31 9351.68
2014-11-30 4901.14
2014-12-31 2772.90
Cronin, Oberbrunner and Spencer 2014-01-31 1141.75
2014-02-28 13976.26
2014-03-31 11691.62
2014-04-30 3685.44
2014-05-31 6760.11
2014-06-30 5379.67
2014-07-31 6020.30
2014-08-31 5399.58
Name: ext price, dtype: float64
Making this an annual aggregation is as easy as changing the freq argument
df.groupby(['name', pd.Grouper(key='date', freq='A-DEC')])['ext price'].sum().head(20)
name date
Barton LLC 2014-12-31 109438.50
Cronin, Oberbrunner and Spencer 2014-12-31 89734.55
Frami, Hills and Schmidt 2014-12-31 103569.59
Fritsch, Russel and Anderson 2014-12-31 112214.71
Halvorson, Crona and Champlin 2014-12-31 70004.36
Herman LLC 2014-12-31 82865.00
Jerde-Hilpert 2014-12-31 112591.43
Kassulke, Ondricka and Metz 2014-12-31 86451.07
Keeling LLC 2014-12-31 100934.30
Kiehn-Spinka 2014-12-31 99608.77
Koepp Ltd 2014-12-31 103660.54
Kuhn-Gusikowski 2014-12-31 91094.28
Kulas Inc 2014-12-31 137351.96
Pollich LLC 2014-12-31 87347.18
Purdy-Kunde 2014-12-31 77898.21
Sanford and Sons 2014-12-31 98822.98
Stokes LLC 2014-12-31 91535.92
Trantow-Barrows 2014-12-31 123381.38
White-Trantow 2014-12-31 135841.99
Will LLC 2014-12-31 104437.60
Name: ext price, dtype: float64
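And because pd.Grouper is just another grouping key, it composes with everything shown earlier, agg included. As a sketch (the quarterly frequency here is an arbitrary choice; any offset alias works):
# quarterly totals and order counts per client
(df.groupby(['name', pd.Grouper(key='date', freq='Q')])['ext price']
   .agg(['sum', 'count'])
   .head())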
Matplotlib
GroupBy objects also play pretty well with matplotlib… with some finagling
%pylab inline
Populating the interactive namespace from numpy and matplotlib
df.head()
account number | name | sku | quantity | unit price | ext price | date | largest order | |
---|---|---|---|---|---|---|---|---|
0 | 740150 | Barton LLC | B1-20000 | 39 | 86.69 | 3380.91 | 2014-01-01 07:21:51 | 4543.96 |
1 | 714466 | Trantow-Barrows | S2-77896 | -1 | 63.16 | -63.16 | 2014-01-01 10:00:47 | 4444.30 |
2 | 218895 | Kulas Inc | B1-69924 | 23 | 90.70 | 2086.10 | 2014-01-01 13:24:58 | 4590.81 |
3 | 307599 | Kassulke, Ondricka and Metz | S1-65481 | 41 | 21.05 | 863.05 | 2014-01-01 15:05:22 | 4418.47 |
4 | 412290 | Jerde-Hilpert | S2-34077 | 6 | 83.21 | 499.26 | 2014-01-01 23:26:55 | 3791.62 |
What if we wanted to inspect the count of issues vs the count of orders by month?
fig, ax = plt.subplots(figsize=(12, 8))
monthlyClip = pd.Grouper(key='date', freq='M')
problemOrder = df['quantity'] < 0
problemsPerMonth = df.groupby([problemOrder, monthlyClip]).size()[True]
problemsPerMonth.plot(ax=ax, c='b', label='Issues')
ax.legend(loc=2)
ordersPerMonth = df.groupby(monthlyClip).size()
ax2 = ax.twinx()
ax2.plot(ordersPerMonth, c='orange', label='Orders')
ax2.legend(loc=0)
<matplotlib.legend.Legend at 0xdbca400>
Or if we wanted to see if there was a discernible difference in the price/quantity behavior of our 5 biggest spenders?
fiveBiggestSpenders = (df.groupby(['name'])['unit price']
.sum()
.sort_values(ascending=False)
.index[:5])
gb = df.groupby(df['name'].isin(fiveBiggestSpenders))
fig, ax = plt.subplots(figsize=(12, 8))
for name, group in gb:
ax.scatter(group['unit price'], group['quantity'], alpha=.5, label=name)
plt.legend()
<matplotlib.legend.Legend at 0xdfad0f0>
Or wanted to examine our cumulative earnings per Client over the course of the year?
gb = df.groupby('name')
fig, ax = plt.subplots(figsize=(12, 8))
for name, group in gb:
ax.plot(group['unit price'].cumsum(), label=name)
ax.legend()
<matplotlib.legend.Legend at 0xdd99828>
Gotchas
Multi-Index
Suppose that we wanted to know what day of the week each client did business with us on.
This is, as discussed above:
1) Creating a Numpy array with the same index as the DataFrame, representing day of week
2) Grouping by that and the client's name
dow = df['date'].dt.dayofweek
df.groupby(['name', dow]).size().head(20)
name date
Barton LLC 0 8
1 12
2 15
3 8
4 20
5 10
6 9
Cronin, Oberbrunner and Spencer 0 12
1 12
2 7
3 8
4 11
5 10
6 7
Frami, Hills and Schmidt 0 11
1 8
2 10
3 9
4 12
5 7
dtype: int64
But this is damn gross. Furthermore, we want the values in the date column to be the column headers. What we want is the unstack function. Let’s try that again.
df.groupby(['name', dow]).size().unstack()
date | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|---|
name | |||||||
Barton LLC | 8 | 12 | 15 | 8 | 20 | 10 | 9 |
Cronin, Oberbrunner and Spencer | 12 | 12 | 7 | 8 | 11 | 10 | 7 |
Frami, Hills and Schmidt | 11 | 8 | 10 | 9 | 12 | 7 | 15 |
Fritsch, Russel and Anderson | 8 | 12 | 14 | 12 | 11 | 15 | 9 |
Halvorson, Crona and Champlin | 11 | 8 | 5 | 6 | 12 | 5 | 11 |
Herman LLC | 11 | 8 | 7 | 8 | 8 | 9 | 11 |
Jerde-Hilpert | 12 | 22 | 16 | 12 | 8 | 8 | 11 |
Kassulke, Ondricka and Metz | 8 | 9 | 10 | 10 | 8 | 8 | 11 |
Keeling LLC | 15 | 12 | 10 | 11 | 8 | 10 | 8 |
Kiehn-Spinka | 10 | 8 | 9 | 17 | 13 | 14 | 8 |
Koepp Ltd | 11 | 7 | 16 | 9 | 18 | 11 | 10 |
Kuhn-Gusikowski | 16 | 9 | 7 | 7 | 17 | 7 | 10 |
Kulas Inc | 13 | 21 | 15 | 13 | 12 | 12 | 8 |
Pollich LLC | 7 | 11 | 7 | 10 | 16 | 8 | 14 |
Purdy-Kunde | 9 | 8 | 8 | 7 | 6 | 5 | 10 |
Sanford and Sons | 10 | 8 | 11 | 7 | 10 | 16 | 9 |
Stokes LLC | 2 | 11 | 10 | 11 | 18 | 7 | 13 |
Trantow-Barrows | 16 | 13 | 9 | 12 | 21 | 11 | 12 |
White-Trantow | 10 | 17 | 12 | 13 | 12 | 14 | 8 |
Will LLC | 14 | 13 | 12 | 10 | 9 | 10 | 6 |
Much better. Similarly, we could have made an unwieldy wide table instead, either by reversing the arguments in the groupby
df.groupby([dow, 'name']).size().unstack()
name | Barton LLC | Cronin, Oberbrunner and Spencer | Frami, Hills and Schmidt | Fritsch, Russel and Anderson | Halvorson, Crona and Champlin | Herman LLC | Jerde-Hilpert | Kassulke, Ondricka and Metz | Keeling LLC | Kiehn-Spinka | Koepp Ltd | Kuhn-Gusikowski | Kulas Inc | Pollich LLC | Purdy-Kunde | Sanford and Sons | Stokes LLC | Trantow-Barrows | White-Trantow | Will LLC |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
date | ||||||||||||||||||||
0 | 8 | 12 | 11 | 8 | 11 | 11 | 12 | 8 | 15 | 10 | 11 | 16 | 13 | 7 | 9 | 10 | 2 | 16 | 10 | 14 |
1 | 12 | 12 | 8 | 12 | 8 | 8 | 22 | 9 | 12 | 8 | 7 | 9 | 21 | 11 | 8 | 8 | 11 | 13 | 17 | 13 |
2 | 15 | 7 | 10 | 14 | 5 | 7 | 16 | 10 | 10 | 9 | 16 | 7 | 15 | 7 | 8 | 11 | 10 | 9 | 12 | 12 |
3 | 8 | 8 | 9 | 12 | 6 | 8 | 12 | 10 | 11 | 17 | 9 | 7 | 13 | 10 | 7 | 7 | 11 | 12 | 13 | 10 |
4 | 20 | 11 | 12 | 11 | 12 | 8 | 8 | 8 | 8 | 13 | 18 | 17 | 12 | 16 | 6 | 10 | 18 | 21 | 12 | 9 |
5 | 10 | 10 | 7 | 15 | 5 | 9 | 8 | 8 | 10 | 14 | 11 | 7 | 12 | 8 | 5 | 16 | 7 | 11 | 14 | 10 |
6 | 9 | 7 | 15 | 9 | 11 | 11 | 11 | 11 | 8 | 8 | 10 | 10 | 8 | 14 | 10 | 9 | 13 | 12 | 8 | 6 |
Or by specifying that we want level=0 instead of the default 1 in unstack
df.groupby(['name', dow]).size().unstack(level=0)
name | Barton LLC | Cronin, Oberbrunner and Spencer | Frami, Hills and Schmidt | Fritsch, Russel and Anderson | Halvorson, Crona and Champlin | Herman LLC | Jerde-Hilpert | Kassulke, Ondricka and Metz | Keeling LLC | Kiehn-Spinka | Koepp Ltd | Kuhn-Gusikowski | Kulas Inc | Pollich LLC | Purdy-Kunde | Sanford and Sons | Stokes LLC | Trantow-Barrows | White-Trantow | Will LLC |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
date | ||||||||||||||||||||
0 | 8 | 12 | 11 | 8 | 11 | 11 | 12 | 8 | 15 | 10 | 11 | 16 | 13 | 7 | 9 | 10 | 2 | 16 | 10 | 14 |
1 | 12 | 12 | 8 | 12 | 8 | 8 | 22 | 9 | 12 | 8 | 7 | 9 | 21 | 11 | 8 | 8 | 11 | 13 | 17 | 13 |
2 | 15 | 7 | 10 | 14 | 5 | 7 | 16 | 10 | 10 | 9 | 16 | 7 | 15 | 7 | 8 | 11 | 10 | 9 | 12 | 12 |
3 | 8 | 8 | 9 | 12 | 6 | 8 | 12 | 10 | 11 | 17 | 9 | 7 | 13 | 10 | 7 | 7 | 11 | 12 | 13 | 10 |
4 | 20 | 11 | 12 | 11 | 12 | 8 | 8 | 8 | 8 | 13 | 18 | 17 | 12 | 16 | 6 | 10 | 18 | 21 | 12 | 9 |
5 | 10 | 10 | 7 | 15 | 5 | 9 | 8 | 8 | 10 | 14 | 11 | 7 | 12 | 8 | 5 | 16 | 7 | 11 | 14 | 10 |
6 | 9 | 7 | 15 | 9 | 11 | 11 | 11 | 11 | 8 | 8 | 10 | 10 | 8 | 14 | 10 | 9 | 13 | 12 | 8 | 6 |
.apply, .map, and .applymap
These 3 functions are used to manipulate data in pandas objects, but until you’ve used them a bunch, it’s often confusing which one you want in order to solve a given problem.
A decent-enough rule of thumb I use is:
.apply applies a function across a DataFrame
df.apply(max)
account number 786968
name Will LLC
sku S2-83881
quantity 49
unit price 99.85
ext price 4824.54
date 2014-12-31 12:48:35
largest order 4824.54
dtype: object
.map applies a mapping function to each value in a Series
df['name'].map(len).head()
0 10
1 15
2 9
3 27
4 13
Name: name, dtype: int64
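Worth noting: .map will also accept a dict or Series as the mapping, and anything not found in it comes back as NaN. A quick sketch (the relabeling here is made up purely for illustration):
# values missing from the dict become NaN
df['name'].map({'Kulas Inc': 'KULAS'}).head()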
So when considering .applymap, it might help to think about how
apply (1) + map (2) = applymap
therefore, you can “add” their functionality as well:
.applymap applies a mapping to each value (2) of a given DataFrame (1)
df.applymap(type).head()
account number | name | sku | quantity | unit price | ext price | date | largest order | |
---|---|---|---|---|---|---|---|---|
0 | <class 'int'> | <class 'str'> | <class 'str'> | <class 'int'> | <class 'float'> | <class 'float'> | <class 'pandas._libs.tslib.Timestamp'> | <class 'float'> |
1 | <class 'int'> | <class 'str'> | <class 'str'> | <class 'int'> | <class 'float'> | <class 'float'> | <class 'pandas._libs.tslib.Timestamp'> | <class 'float'> |
2 | <class 'int'> | <class 'str'> | <class 'str'> | <class 'int'> | <class 'float'> | <class 'float'> | <class 'pandas._libs.tslib.Timestamp'> | <class 'float'> |
3 | <class 'int'> | <class 'str'> | <class 'str'> | <class 'int'> | <class 'float'> | <class 'float'> | <class 'pandas._libs.tslib.Timestamp'> | <class 'float'> |
4 | <class 'int'> | <class 'str'> | <class 'str'> | <class 'int'> | <class 'float'> | <class 'float'> | <class 'pandas._libs.tslib.Timestamp'> | <class 'float'> |
Resources
- Group By: split-apply-combine from the Pandas docs
- Pandas Grouper and Agg Functions Explained by Practical Business Python
- Understanding the Transform Function in Pandas
- Chapter 9 of Wes McKinney’s Python for Data Analysis