Transforming Tags into Categorical Data

11 Oct 2020

I’ve encountered a few instances where I need to make clean, dummied data columns from a column that contains a list of attributes. This notebook will go one step further and show an example of generating one such list from a bunch of string fields, generated by concatenating arbitrarily-many <tag> objects together.

An Example

In the repo Building Machine Learning Powered Applications, the author has a slick chunk of code that takes a DataFrame containing a column with a bunch of tags (I’ve dropped everything else, for simplicity’s sake)

import pandas as pd

df = pd.read_csv('data/writers.csv')

df.head()

	Score	Tags
0	32	<resources><first-time-author>
1	20	<fiction><grammatical-person><third-person>
2	34	<publishing><novel><agent>
3	28	<plot><short-story><planning><brainstorming>
4	21	<fiction><genre><categories>

and does a bunch of pandas magic to transform it into neat columns of popular tags that they use for their modelling.

# Select our tags, represented as strings, and transform them into arrays of tags
tags = df["Tags"]
clean_tags = tags.str.split("><").apply(
    lambda x: [a.strip("<").strip(">") for a in x])

# Use pandas' get_dummies to get dummy values 
# select only tags that appear over 500 times
tag_columns = pd.get_dummies(clean_tags.apply(pd.Series).stack()).sum(level=0)
all_tags = tag_columns.astype(bool).sum(axis=0).sort_values(ascending=False)
top_tags = all_tags[all_tags > 500]
top_tag_columns = tag_columns[top_tags.index]

final = pd.concat([df, top_tag_columns], axis=1)
final.head()

	Score	Tags	fiction	novel	publishing
0	32	<resources><first-time-author>	0	0	0
1	20	<fiction><grammatical-person><third-person>	1	0	0
2	34	<publishing><novel><agent>	0	1	1
3	28	<plot><short-story><planning><brainstorming>	0	0	0
4	21	<fiction><genre><categories>	1	0	0

However, that dense chunk of code is doing a ton, so let’s break it down step by step

Less Magic

For starters, they turn the tag strings into lists of strings with a simple apply() call and some >< hunting

tags = df['Tags']

clean_tags = tags.str.split('><').apply(
    lambda x: [a.strip('<').strip('>') for a in x]
)

clean_tags.head()

0                  [resources, first-time-author]
1     [fiction, grammatical-person, third-person]
2                      [publishing, novel, agent]
3    [plot, short-story, planning, brainstorming]
4                    [fiction, genre, categories]
Name: Tags, dtype: object

Dummying

This next one is a doozy. Just remember that we’re trying to go from the list of tags above, to an identity matrix like so.

tag_columns = pd.get_dummies(clean_tags.apply(pd.Series).stack()).sum(level=0)

tag_columns.head()

	agent	...
0	0	...
1	0	...
2	1	...
3	0	...
4	0	...

5 rows × 330 columns

Step by step, we start by turning these lists into pd.Series objects

a = clean_tags.apply(pd.Series)
a.head()

	0	1	2	3	4
0	resources	first-time-author	NaN	NaN	NaN
1	fiction	grammatical-person	third-person	NaN	NaN
2	publishing	novel	agent	NaN	NaN
3	plot	short-story	planning	brainstorming	NaN
4	fiction	genre	categories	NaN	NaN

There are 5 columns, because that’s the most tags that are on any of our observations

clean_tags.apply(len).max()

We wanted it in the pd.Series format, so we could call the .stack() method, which tosses out all of the blank NaN records and organizes our records into one multi-index Series

b = a.stack()
b.head(10)

0  0             resources
   1     first-time-author
1  0               fiction
   1    grammatical-person
   2          third-person
2  0            publishing
   1                 novel
   2                 agent
3  0                  plot
   1           short-story
dtype: object

It’s not immediately intuitive why we want it in a Series until you look at the next step where we create the categorical columns from the data.

If we leverage this clean, stacked Series, we get something that looks pretty reasonable. Note the 330 columns.

c = pd.get_dummies(b)
c.head()

		...
0	0	...
0	1	...
1	0	...
	1	...
	2	...

5 rows × 330 columns

On the other hand, if we make that same call with a (the NaN-filled DataFrame), we get nearly four times as many columns as our last implementation.

bad = pd.get_dummies(a)
bad.head()

	0_academic-writing	0_accessibility	0_acronyms	0_agent	0_amazon	0_antagonist	0_apa	0_api-documentation	0_attention	0_audiobook	...	4_voice	4_voice-choice	4_wikipedia	4_word-choice	4_wordcount	4_world-building	4_writers-block	4_writing-groups	4_young-adult	4_young-author
0	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
1	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
2	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
3	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
4	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0

5 rows × 1261 columns

We don’t have to worry about pandas accidentally using the NULL data

[x for x in bad.columns if 'NaN' in x]

[]

However, a quick check shows that it is incorrectly placing importance on the column index that it found the data.

[x for x in bad.columns if 'academic' in x]

['0_academic-writing',
 '1_academic-writing',
 '2_academic-writing',
 '3_academic-writing']

Moving on, we’ve still got this MultiIndex that we’ve got no real use for.

c.head()

		...
0	0	...
0	1	...
1	0	...
	1	...
	2	...

5 rows × 330 columns

Hence the call to sum(level=0)

d = c.sum(level=0)
d.head()

	agent	...
0	0	...
1	0	...
2	1	...
3	0	...
4	0	...

5 rows × 330 columns

We had to specify level=0 because we wanted to ensure that our resulting DataFrame still had a row per row of data in our original DataFrame

len(df), len(d)

(7971, 7971)

Calling c.sum() without any arguments just does a naive sum down the columns

wrong_1 = c.sum()
wrong_1.head()

3-acts                7
academic-writing    277
accessibility         6
acronyms             15
action                3
dtype: int64

And so our data is now a Series, for as many columns we had

print(len(wrong_1))

And calling it at level=1 uses the second level of the MultiIndex to sum each tag by which order they appear.

Here, 3-acts appears as the second tag 3 times, the third once, fourth once, fifth 2 times.

wrong_2 = c.sum(level=1)
print(len(wrong_2))
wrong_2.head()

	3-acts	academic-writing	accessibility	acronyms	action	administration-guides	agent	alternative-history	amazon	ambiguity	...	word-choice	wordcount	world-building	writer	writers-block	writing-competitions	writing-groups	writing-instruments	young-adult	young-author
0	0.0	142.0	1.0	1.0	0.0	0.0	13.0	0.0	1.0	0.0	...	105.0	16.0	20.0	0.0	21.0	0.0	6.0	0.0	4.0	2.0
1	3.0	99.0	2.0	1.0	0.0	1.0	23.0	1.0	3.0	0.0	...	84.0	20.0	43.0	7.0	25.0	1.0	2.0	2.0	8.0	3.0
2	1.0	29.0	3.0	7.0	1.0	0.0	9.0	0.0	4.0	2.0	...	29.0	16.0	22.0	4.0	16.0	0.0	3.0	1.0	15.0	4.0
3	1.0	7.0	0.0	6.0	1.0	0.0	4.0	0.0	6.0	2.0	...	11.0	2.0	11.0	3.0	5.0	1.0	2.0	0.0	10.0	3.0
4	2.0	0.0	0.0	0.0	1.0	0.0	2.0	1.0	3.0	0.0	...	5.0	4.0	1.0	0.0	4.0	0.0	1.0	0.0	3.0	1.0

5 rows × 330 columns

But, now we’ve got the neat one-hot representation we were after.

d.head()

	agent	...
0	0	...
1	0	...
2	1	...
3	0	...
4	0	...

5 rows × 330 columns

Filtering

From here, the author casts the columns as bool (I don’t think that was necessary), does a simple sum (didn’t need to specify the axis here), and sorts the values from highest to lowest

all_tags = tag_columns.astype(bool).sum(axis=0).sort_values(ascending=False)
all_tags.head()

creative-writing    1351
fiction             1253
style                991
characters           609
technique            549
dtype: int64

Then they specify that they only want to use a Tag as a feature if it’s got more than 500 uses

top_tags = all_tags[all_tags > 500]
top_tags.head()

creative-writing    1351
fiction             1253
style                991
characters           609
technique            549
dtype: int64

This narrows our Tag count from 330 to 7

print(len(all_tags), len(top_tags))

330 7

Finally, they use top_tags.index to get a list of the column names for the tags that meet our criteria, and use that to filter down this intermediate tag_columns DataFrame

top_tag_columns = tag_columns[top_tags.index]
top_tag_columns.head()

	fiction	novel	publishing
0	0	0	0
1	1	0	0
2	0	1	1
3	0	0	0
4	1	0	0

Recombining

Last but not least, they use pd.concat() to staple this dummied tag dataset to our original DataFrame.

Why they used pd.concat() and not df.join() is beyond me, as they both do the trick here.

final = pd.concat([df, top_tag_columns], axis=1)
final.head()

	Score	Tags	fiction	novel	publishing
0	32	<resources><first-time-author>	0	0	0
1	20	<fiction><grammatical-person><third-person>	1	0	0
2	34	<publishing><novel><agent>	0	1	1
3	28	<plot><short-story><planning><brainstorming>	0	0	0
4	21	<fiction><genre><categories>	1	0	0

	agent	...
0	0	...
1	0	...
2	1	...
3	0	...
4	0	...

		...
0	0	...
0	1	...
1	0	...
	1	...
	2	...

	0_academic-writing	0_accessibility	0_acronyms	0_agent	0_amazon	0_antagonist	0_apa	0_api-documentation	0_attention	0_audiobook	...	4_voice	4_voice-choice	4_wikipedia	4_word-choice	4_wordcount	4_world-building	4_writers-block	4_writing-groups	4_young-adult	4_young-author
0	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
1	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
2	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
3	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
4	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0

		...
0	0	...
0	1	...
1	0	...
	1	...
	2	...

	agent	...
0	0	...
1	0	...
2	1	...
3	0	...
4	0	...

	agent	...
0	0	...
1	0	...
2	1	...
3	0	...
4	0	...

	agent	...
0	0	...
1	0	...
2	1	...
3	0	...
4	0	...

		...
0	0	...
0	1	...
1	0	...
	1	...
	2	...

	0_academic-writing	0_accessibility	0_acronyms	0_agent	0_amazon	0_antagonist	0_apa	0_api-documentation	0_attention	0_audiobook	...	4_voice	4_voice-choice	4_wikipedia	4_word-choice	4_wordcount	4_world-building	4_writers-block	4_writing-groups	4_young-adult	4_young-author
0	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
1	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
2	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
3	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
4	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0

		...
0	0	...
0	1	...
1	0	...
	1	...
	2	...

	agent	...
0	0	...
1	0	...
2	1	...
3	0	...
4	0	...

	agent	...
0	0	...
1	0	...
2	1	...
3	0	...
4	0	...

	agent	...
0	0	...
1	0	...
2	1	...
3	0	...
4	0	...

		...
0	0	...
0	1	...
1	0	...
	1	...
	2	...

	0_academic-writing	0_accessibility	0_acronyms	0_agent	0_amazon	0_antagonist	0_apa	0_api-documentation	0_attention	0_audiobook	...	4_voice	4_voice-choice	4_wikipedia	4_word-choice	4_wordcount	4_world-building	4_writers-block	4_writing-groups	4_young-adult	4_young-author
0	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
1	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
2	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
3	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
4	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0

		...
0	0	...
0	1	...
1	0	...
	1	...
	2	...

	agent	...
0	0	...
1	0	...
2	1	...
3	0	...
4	0	...

	agent	...
0	0	...
1	0	...
2	1	...
3	0	...
4	0	...