Transforming Tags into Categorical Data
I’ve encountered a few instances where I needed to build clean, dummied data columns from a column containing a list of attributes. This notebook goes one step further and shows an example of generating one such list from a bunch of string fields, each built by concatenating arbitrarily-many <tag> objects together.
An Example
In the repo Building Machine Learning Powered Applications, the author has a slick chunk of code that takes a DataFrame containing a column with a bunch of tags (I’ve dropped everything else, for simplicity’s sake)
import pandas as pd
df = pd.read_csv('data/writers.csv')
df.head()
|   | Score | Tags |
|---|---|---|
| 0 | 32 | <resources><first-time-author> |
| 1 | 20 | <fiction><grammatical-person><third-person> |
| 2 | 34 | <publishing><novel><agent> |
| 3 | 28 | <plot><short-story><planning><brainstorming> |
| 4 | 21 | <fiction><genre><categories> |
and does a bunch of pandas magic to transform it into neat columns of popular tags that they use for their modelling.
# Select our tags, represented as strings, and transform them into arrays of tags
tags = df["Tags"]
clean_tags = tags.str.split("><").apply(
    lambda x: [a.strip("<").strip(">") for a in x])
# Use pandas' get_dummies to get dummy values
# select only tags that appear over 500 times
tag_columns = pd.get_dummies(clean_tags.apply(pd.Series).stack()).sum(level=0)
all_tags = tag_columns.astype(bool).sum(axis=0).sort_values(ascending=False)
top_tags = all_tags[all_tags > 500]
top_tag_columns = tag_columns[top_tags.index]
final = pd.concat([df, top_tag_columns], axis=1)
final.head()
|   | Score | Tags | creative-writing | fiction | style | characters | technique | novel | publishing |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 32 | <resources><first-time-author> | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 20 | <fiction><grammatical-person><third-person> | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 34 | <publishing><novel><agent> | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 3 | 28 | <plot><short-story><planning><brainstorming> | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 21 | <fiction><genre><categories> | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
However, that dense chunk of code is doing a ton, so let’s break it down step by step
Less Magic
For starters, they turn the tag strings into lists of strings with a simple apply() call and some >< hunting
tags = df['Tags']
clean_tags = tags.str.split('><').apply(
    lambda x: [a.strip('<').strip('>') for a in x]
)
clean_tags.head()
0 [resources, first-time-author]
1 [fiction, grammatical-person, third-person]
2 [publishing, novel, agent]
3 [plot, short-story, planning, brainstorming]
4 [fiction, genre, categories]
Name: Tags, dtype: object
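As an aside, the same split-and-strip can be done in one pass with a regex that captures everything between angle brackets. A sketch, assuming no tag ever contains a < or >:

# One-pass alternative: capture each run of characters between angle brackets
clean_tags_re = tags.str.findall(r'<([^>]+)>')
clean_tags_re.head()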
Dummying
This next one is a doozy. Just remember that we’re trying to go from the lists of tags above to a one-hot indicator matrix, like so.
tag_columns = pd.get_dummies(clean_tags.apply(pd.Series).stack()).sum(level=0)
tag_columns.head()
|   | 3-acts | academic-writing | accessibility | acronyms | action | administration-guides | agent | alternative-history | amazon | ambiguity | ... | word-choice | wordcount | world-building | writer | writers-block | writing-competitions | writing-groups | writing-instruments | young-adult | young-author |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

5 rows × 330 columns
Step by step, we start by turning these lists into pd.Series objects
a = clean_tags.apply(pd.Series)
a.head()
|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | resources | first-time-author | NaN | NaN | NaN |
| 1 | fiction | grammatical-person | third-person | NaN | NaN |
| 2 | publishing | novel | agent | NaN | NaN |
| 3 | plot | short-story | planning | brainstorming | NaN |
| 4 | fiction | genre | categories | NaN | NaN |
There are 5 columns because that’s the most tags attached to any one of our observations
clean_tags.apply(len).max()
5
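Worth knowing: apply(pd.Series) is one of the slower ways to pull this off on a large dataset. Building the DataFrame directly from the underlying lists gives the same ragged shape (the missing slots come back as None rather than NaN, but pandas treats both as missing). A sketch, not the author’s code:

# Same ragged frame, built straight from the lists of tags
a_fast = pd.DataFrame(clean_tags.tolist(), index=clean_tags.index)
a_fast.head()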
We wanted it in the pd.Series format so we could call the .stack() method, which tosses out all of the blank NaN records and organizes our records into one multi-index Series
b = a.stack()
b.head(10)
0 0 resources
1 first-time-author
1 0 fiction
1 grammatical-person
2 third-person
2 0 publishing
1 novel
2 agent
3 0 plot
1 short-story
dtype: object
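As an aside, pandas 0.25 added explode(), which collapses the apply(pd.Series).stack() two-step into one call. It repeats the original row index instead of building a MultiIndex, but as we’ll see below, that row-level index is the only level we actually use. A sketch:

# explode() un-nests each list, repeating the row index once per tag
clean_tags.explode().head()
# 0            resources
# 0    first-time-author
# 1              fiction
# ...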
It’s not immediately intuitive why we want it as a Series until you look at the next step, where we create the categorical columns from the data.
If we leverage this clean, stacked Series, we get something that looks pretty reasonable. Note the 330 columns.
c = pd.get_dummies(b)
c.head()
|   |   | 3-acts | academic-writing | accessibility | acronyms | action | administration-guides | agent | alternative-history | amazon | ambiguity | ... | word-choice | wordcount | world-building | writer | writers-block | writing-competitions | writing-groups | writing-instruments | young-adult | young-author |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
|   | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
|   | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
|   | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

5 rows × 330 columns
On the other hand, if we make that same call with a (the NaN-filled DataFrame), we get nearly four times as many columns as our last implementation.
bad = pd.get_dummies(a)
bad.head()
|   | 0_academic-writing | 0_accessibility | 0_acronyms | 0_agent | 0_amazon | 0_antagonist | 0_apa | 0_api-documentation | 0_attention | 0_audiobook | ... | 4_voice | 4_voice-choice | 4_wikipedia | 4_word-choice | 4_wordcount | 4_world-building | 4_writers-block | 4_writing-groups | 4_young-adult | 4_young-author |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

5 rows × 1261 columns
We don’t have to worry about pandas accidentally encoding the NaN data
[x for x in bad.columns if 'NaN' in x]
[]
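That’s because get_dummies() silently drops missing values unless you opt in with the dummy_na flag:

# NaN only gets its own dummy columns when explicitly requested
with_na = pd.get_dummies(a, dummy_na=True)
len(bad.columns), len(with_na.columns)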
However, a quick check shows that it incorrectly attaches importance to the column position in which each tag was found.
[x for x in bad.columns if 'academic' in x]
['0_academic-writing',
'1_academic-writing',
'2_academic-writing',
'3_academic-writing']
Moving on, we’ve still got this MultiIndex that we’ve got no real use for.
c.head()
|   |   | 3-acts | academic-writing | accessibility | acronyms | action | administration-guides | agent | alternative-history | amazon | ambiguity | ... | word-choice | wordcount | world-building | writer | writers-block | writing-competitions | writing-groups | writing-instruments | young-adult | young-author |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
|   | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
|   | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
|   | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

5 rows × 330 columns
Hence the call to sum(level=0)
d = c.sum(level=0)
d.head()
|   | 3-acts | academic-writing | accessibility | acronyms | action | administration-guides | agent | alternative-history | amazon | ambiguity | ... | word-choice | wordcount | world-building | writer | writers-block | writing-competitions | writing-groups | writing-instruments | young-adult | young-author |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

5 rows × 330 columns
We had to specify level=0 because we wanted to ensure that our resulting DataFrame still had a row per row of data in our original DataFrame
len(df), len(d)
(7971, 7971)
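A caveat for anyone following along on a newer install: the level argument to sum() was deprecated in pandas 1.3 and removed in 2.0, so the modern spelling of this step is a groupby over the first index level:

# Modern equivalent of c.sum(level=0)
d_modern = c.groupby(level=0).sum()
d_modern.equals(d)
# True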
Calling c.sum() without any arguments just does a naive sum down the columns
wrong_1 = c.sum()
wrong_1.head()
3-acts 7
academic-writing 277
accessibility 6
acronyms 15
action 3
dtype: int64
And so our data is now a Series, with one entry for each of the columns we had
print(len(wrong_1))
330
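Incidentally, this “wrong” sum isn’t useless: it’s the overall frequency of each tag, which is essentially what the author computes later in the Filtering step (identical so long as no row repeats a tag):

# Total occurrences of each tag, highest first
wrong_1.sort_values(ascending=False).head()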
And calling it with level=1 uses the second level of the MultiIndex, summing each tag by the position in which it appears. Here, 3-acts shows up as the second tag three times, the third once, the fourth once, and the fifth twice.
wrong_2 = c.sum(level=1)
print(len(wrong_2))
wrong_2.head()
5
|   | 3-acts | academic-writing | accessibility | acronyms | action | administration-guides | agent | alternative-history | amazon | ambiguity | ... | word-choice | wordcount | world-building | writer | writers-block | writing-competitions | writing-groups | writing-instruments | young-adult | young-author |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 142.0 | 1.0 | 1.0 | 0.0 | 0.0 | 13.0 | 0.0 | 1.0 | 0.0 | ... | 105.0 | 16.0 | 20.0 | 0.0 | 21.0 | 0.0 | 6.0 | 0.0 | 4.0 | 2.0 |
| 1 | 3.0 | 99.0 | 2.0 | 1.0 | 0.0 | 1.0 | 23.0 | 1.0 | 3.0 | 0.0 | ... | 84.0 | 20.0 | 43.0 | 7.0 | 25.0 | 1.0 | 2.0 | 2.0 | 8.0 | 3.0 |
| 2 | 1.0 | 29.0 | 3.0 | 7.0 | 1.0 | 0.0 | 9.0 | 0.0 | 4.0 | 2.0 | ... | 29.0 | 16.0 | 22.0 | 4.0 | 16.0 | 0.0 | 3.0 | 1.0 | 15.0 | 4.0 |
| 3 | 1.0 | 7.0 | 0.0 | 6.0 | 1.0 | 0.0 | 4.0 | 0.0 | 6.0 | 2.0 | ... | 11.0 | 2.0 | 11.0 | 3.0 | 5.0 | 1.0 | 2.0 | 0.0 | 10.0 | 3.0 |
| 4 | 2.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 1.0 | 3.0 | 0.0 | ... | 5.0 | 4.0 | 1.0 | 0.0 | 4.0 | 0.0 | 1.0 | 0.0 | 3.0 | 1.0 |

5 rows × 330 columns
But now, back in d, we’ve got the neat one-hot representation we were after.
d.head()
|   | 3-acts | academic-writing | accessibility | acronyms | action | administration-guides | agent | alternative-history | amazon | ambiguity | ... | word-choice | wordcount | world-building | writer | writers-block | writing-competitions | writing-groups | writing-instruments | young-adult | young-author |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

5 rows × 330 columns
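One last aside on the dummying: pandas can also get from the raw strings to this same matrix in a single shot with Series.str.get_dummies(), treating >< as the separator. A sketch, assuming no tag contains >< and no row repeats a tag:

# One-liner: drop the outer brackets, then split-and-dummy on '><'
one_shot = df['Tags'].str.strip('<>').str.get_dummies(sep='><')
one_shot.shape  # should match tag_columns' (7971, 330)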
Filtering
From here, the author casts the columns as bool
(I don’t think that was necessary), does a simple sum (didn’t need to specify the axis here), and sorts the values from highest to lowest
all_tags = tag_columns.astype(bool).sum(axis=0).sort_values(ascending=False)
all_tags.head()
creative-writing 1351
fiction 1253
style 991
characters 609
technique 549
dtype: int64
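If these counts were all you needed, the explode() aside from earlier gets you there without building tag_columns at all:

# Tag frequencies straight from the lists of tags
clean_tags.explode().value_counts().head()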
Then they specify that they only want to use a Tag as a feature if it’s got more than 500 uses
top_tags = all_tags[all_tags > 500]
top_tags.head()
creative-writing 1351
fiction 1253
style 991
characters 609
technique 549
dtype: int64
This narrows our Tag count from 330 to 7
print(len(all_tags), len(top_tags))
330 7
Finally, they use top_tags.index to get a list of the column names for the tags that meet our criteria, and use that to filter down this intermediate tag_columns DataFrame
top_tag_columns = tag_columns[top_tags.index]
top_tag_columns.head()
|   | creative-writing | fiction | style | characters | technique | novel | publishing |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
Recombining
Last but not least, they use pd.concat() to staple this dummied tag dataset to our original DataFrame. Why they used pd.concat() and not df.join() is beyond me, as they both do the trick here.
final = pd.concat([df, top_tag_columns], axis=1)
final.head()
|   | Score | Tags | creative-writing | fiction | style | characters | technique | novel | publishing |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 32 | <resources><first-time-author> | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 20 | <fiction><grammatical-person><third-person> | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 34 | <publishing><novel><agent> | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 3 | 28 | <plot><short-story><planning><brainstorming> | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 21 | <fiction><genre><categories> | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
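For completeness, the join() spelling, which should produce an identical frame here since both operations align on the index:

# Equivalent join() version; left-joins the dummy columns on the index
final_join = df.join(top_tag_columns)
final_join.equals(final)
# True, since the indexes line up one-to-one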