Transforming Tags into Categorical Data

I’ve encountered a few instances where I needed to make clean, dummied data columns from a column containing a list of attributes. This notebook goes one step further and shows an example of generating those lists from string fields, each built by concatenating arbitrarily-many <tag> tokens together.

An Example

In the repo Building Machine Learning Powered Applications, the author has a slick chunk of code that takes a DataFrame containing a column with a bunch of tags (I’ve dropped everything else, for simplicity’s sake)

import pandas as pd

df = pd.read_csv('data/writers.csv')

df.head()
Score Tags
0 32 <resources><first-time-author>
1 20 <fiction><grammatical-person><third-person>
2 34 <publishing><novel><agent>
3 28 <plot><short-story><planning><brainstorming>
4 21 <fiction><genre><categories>

and does a bunch of pandas magic to transform it into neat columns of popular tags that they use for their modelling.

# Select our tags, represented as strings, and transform them into arrays of tags
tags = df["Tags"]
clean_tags = tags.str.split("><").apply(
    lambda x: [a.strip("<").strip(">") for a in x])

# Use pandas' get_dummies to get dummy values 
# select only tags that appear over 500 times
tag_columns = pd.get_dummies(clean_tags.apply(pd.Series).stack()).sum(level=0)
all_tags = tag_columns.astype(bool).sum(axis=0).sort_values(ascending=False)
top_tags = all_tags[all_tags > 500]
top_tag_columns = tag_columns[top_tags.index]

final = pd.concat([df, top_tag_columns], axis=1)
final.head()
Score Tags creative-writing fiction style characters technique novel publishing
0 32 <resources><first-time-author> 0 0 0 0 0 0 0
1 20 <fiction><grammatical-person><third-person> 0 1 0 0 0 0 0
2 34 <publishing><novel><agent> 0 0 0 0 0 1 1
3 28 <plot><short-story><planning><brainstorming> 0 0 0 0 0 0 0
4 21 <fiction><genre><categories> 0 1 0 0 0 0 0

However, that dense chunk of code is doing a ton, so let’s break it down step by step

Less Magic

For starters, they turn the tag strings into lists of strings with a simple apply() call and some >< hunting

tags = df['Tags']

clean_tags = tags.str.split('><').apply(
    lambda x: [a.strip('<').strip('>') for a in x]
)

clean_tags.head()
0                  [resources, first-time-author]
1     [fiction, grammatical-person, third-person]
2                      [publishing, novel, agent]
3    [plot, short-story, planning, brainstorming]
4                    [fiction, genre, categories]
Name: Tags, dtype: object
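
As an aside, the same Series falls out of stripping the outer brackets before splitting, which skips the per-element lambda entirely. A sketch of the equivalent one-liner:

# strip the leading '<' and trailing '>', then split on the inner separator
clean_tags = tags.str.strip('<>').str.split('><')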

Dummying

This next one is a doozy. Just remember that we’re trying to go from the lists of tags above to a one-hot indicator matrix, like so.

tag_columns = pd.get_dummies(clean_tags.apply(pd.Series).stack()).sum(level=0)

tag_columns.head()
3-acts academic-writing accessibility acronyms action administration-guides agent alternative-history amazon ambiguity ... word-choice wordcount world-building writer writers-block writing-competitions writing-groups writing-instruments young-adult young-author
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 330 columns

Step by step, we start by turning these lists into pd.Series objects

a = clean_tags.apply(pd.Series)
a.head()
0 1 2 3 4
0 resources first-time-author NaN NaN NaN
1 fiction grammatical-person third-person NaN NaN
2 publishing novel agent NaN NaN
3 plot short-story planning brainstorming NaN
4 fiction genre categories NaN NaN

There are 5 columns because that’s the largest number of tags on any single observation

clean_tags.apply(len).max()
5
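
Incidentally, apply(pd.Series) spins up one Series object per row and gets slow on bigger datasets. Building the padded DataFrame straight from the underlying lists should produce the same shape in one shot; a sketch:

# construct the NaN-padded frame directly from the lists of tags
a = pd.DataFrame(clean_tags.tolist(), index=clean_tags.index)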

We wanted it as a DataFrame of pd.Series rows so that we could call the .stack() method, which tosses out all of the blank NaN records and organizes what’s left into one MultiIndex Series

b = a.stack()
b.head(10)
0  0             resources
   1     first-time-author
1  0               fiction
   1    grammatical-person
   2          third-person
2  0            publishing
   1                 novel
   2                 agent
3  0                  plot
   1           short-story
dtype: object
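
The outer level of that MultiIndex is the original row number, so selecting on it recovers a single observation’s tags

b.loc[2]
0    publishing
1         novel
2         agent
dtype: object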

It’s not immediately intuitive why we want this stacked shape until you look at the next step, where we create the categorical columns from the data.

If we leverage this clean, stacked Series, we get something that looks pretty reasonable. Note the 330 columns.

c = pd.get_dummies(b)
c.head()
3-acts academic-writing accessibility acronyms action administration-guides agent alternative-history amazon ambiguity ... word-choice wordcount world-building writer writers-block writing-competitions writing-groups writing-instruments young-adult young-author
0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
  1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
  1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
  2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 330 columns

On the other hand, if we make that same call with a (the NaN-filled DataFrame), we get nearly four times as many columns as our last implementation.

bad = pd.get_dummies(a)
bad.head()
0_academic-writing 0_accessibility 0_acronyms 0_agent 0_amazon 0_antagonist 0_apa 0_api-documentation 0_attention 0_audiobook ... 4_voice 4_voice-choice 4_wikipedia 4_word-choice 4_wordcount 4_world-building 4_writers-block 4_writing-groups 4_young-adult 4_young-author
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 1261 columns

We don’t have to worry about pandas accidentally encoding the NaN placeholders as dummy columns of their own

[x for x in bad.columns if 'NaN' in x]
[]
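
That’s by design: get_dummies() skips missing values unless told otherwise. If you ever did want an explicit indicator for missing entries, it takes a dummy_na flag

# opt in to a per-column NaN indicator (off by default)
pd.get_dummies(a, dummy_na=True)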

However, a quick check shows that it incorrectly treats the column position where it found each tag as meaningful, splitting a single tag across several positional columns.

[x for x in bad.columns if 'academic' in x]
['0_academic-writing',
 '1_academic-writing',
 '2_academic-writing',
 '3_academic-writing']

Moving on, we’ve still got this MultiIndex that we’ve got no real use for.

c.head()
3-acts academic-writing accessibility acronyms action administration-guides agent alternative-history amazon ambiguity ... word-choice wordcount world-building writer writers-block writing-competitions writing-groups writing-instruments young-adult young-author
0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
  1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
  1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
  2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 330 columns

Hence the call to sum(level=0)

d = c.sum(level=0)
d.head()
3-acts academic-writing accessibility acronyms action administration-guides agent alternative-history amazon ambiguity ... word-choice wordcount world-building writer writers-block writing-competitions writing-groups writing-instruments young-adult young-author
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 330 columns

We had to specify level=0 to ensure that our resulting DataFrame still had one row for every row of data in our original DataFrame

len(df), len(d)
(7971, 7971)
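
One version caveat: the level argument to sum() was deprecated in pandas 1.3 and later removed, so on a recent pandas the same operation is spelled as a groupby over the index level

# modern equivalent of c.sum(level=0)
d = c.groupby(level=0).sum()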

Calling c.sum() without any arguments just does a naive sum down the columns

wrong_1 = c.sum()
wrong_1.head()
3-acts                7
academic-writing    277
accessibility         6
acronyms             15
action                3
dtype: int64

And so our data has collapsed into a Series with one entry per tag column

print(len(wrong_1))
330

And calling sum(level=1) uses the second level of the MultiIndex instead, summing each tag by the position it appeared in.

Here, 3-acts appears as the second tag 3 times, the third once, the fourth once, and the fifth twice.

wrong_2 = c.sum(level=1)
print(len(wrong_2))
wrong_2.head()
5
3-acts academic-writing accessibility acronyms action administration-guides agent alternative-history amazon ambiguity ... word-choice wordcount world-building writer writers-block writing-competitions writing-groups writing-instruments young-adult young-author
0 0.0 142.0 1.0 1.0 0.0 0.0 13.0 0.0 1.0 0.0 ... 105.0 16.0 20.0 0.0 21.0 0.0 6.0 0.0 4.0 2.0
1 3.0 99.0 2.0 1.0 0.0 1.0 23.0 1.0 3.0 0.0 ... 84.0 20.0 43.0 7.0 25.0 1.0 2.0 2.0 8.0 3.0
2 1.0 29.0 3.0 7.0 1.0 0.0 9.0 0.0 4.0 2.0 ... 29.0 16.0 22.0 4.0 16.0 0.0 3.0 1.0 15.0 4.0
3 1.0 7.0 0.0 6.0 1.0 0.0 4.0 0.0 6.0 2.0 ... 11.0 2.0 11.0 3.0 5.0 1.0 2.0 0.0 10.0 3.0
4 2.0 0.0 0.0 0.0 1.0 0.0 2.0 1.0 3.0 0.0 ... 5.0 4.0 1.0 0.0 4.0 0.0 1.0 0.0 3.0 1.0

5 rows × 330 columns

But now we’ve got the neat one-hot representation we were after.

d.head()
3-acts academic-writing accessibility acronyms action administration-guides agent alternative-history amazon ambiguity ... word-choice wordcount world-building writer writers-block writing-competitions writing-groups writing-instruments young-adult young-author
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 330 columns
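
As a footnote, pandas’ string accessor can collapse this whole split-stack-dummy-sum dance into a single call. Assuming the tags are well-formed like ours, this sketch should yield the same 330-column frame:

# strip the outer brackets, then dummy directly on the '><' separator
tag_columns = df['Tags'].str.strip('<>').str.get_dummies(sep='><')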

Filtering

From here, the author casts the columns as bool (not strictly necessary here, as we’ll see below), does a simple sum (axis=0 is already the default, so specifying it is redundant), and sorts the values from highest to lowest

all_tags = tag_columns.astype(bool).sum(axis=0).sort_values(ascending=False)
all_tags.head()
creative-writing    1351
fiction             1253
style                991
characters           609
technique            549
dtype: int64
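
To be fair, the bool cast would matter if a tag could ever appear twice in the same post: the cast counts rows containing the tag at least once, while a raw sum counts total occurrences. A toy illustration with made-up counts

# hypothetical column where the first row contains the tag twice
toy = pd.DataFrame({'fiction': [2, 1, 0]})

toy['fiction'].sum()               # 3, total occurrences
toy['fiction'].astype(bool).sum()  # 2, rows containing the tag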

Then they specify that they only want to use a Tag as a feature if it’s got more than 500 uses

top_tags = all_tags[all_tags > 500]
top_tags.head()
creative-writing    1351
fiction             1253
style                991
characters           609
technique            549
dtype: int64

This narrows our Tag count from 330 to 7

print(len(all_tags), len(top_tags))
330 7

Finally, they use top_tags.index to get a list of the column names for the tags that meet our criteria, and use that to filter down this intermediate tag_columns DataFrame

top_tag_columns = tag_columns[top_tags.index]
top_tag_columns.head()
creative-writing fiction style characters technique novel publishing
0 0 0 0 0 0 0 0
1 0 1 0 0 0 0 0
2 0 0 0 0 0 1 1
3 0 0 0 0 0 0 0
4 0 1 0 0 0 0 0

Recombining

Last but not least, they use pd.concat() to staple this dummied tag dataset to our original DataFrame.

Why they used pd.concat() and not df.join() is beyond me, as they both do the trick here.

final = pd.concat([df, top_tag_columns], axis=1)
final.head()
Score Tags creative-writing fiction style characters technique novel publishing
0 32 <resources><first-time-author> 0 0 0 0 0 0 0
1 20 <fiction><grammatical-person><third-person> 0 1 0 0 0 0 0
2 34 <publishing><novel><agent> 0 0 0 0 0 1 1
3 28 <plot><short-story><planning><brainstorming> 0 0 0 0 0 0 0
4 21 <fiction><genre><categories> 0 1 0 0 0 0 0
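
For the record, the join() spelling is a drop-in replacement here, since both frames share the same RangeIndex

final = df.join(top_tag_columns)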