That's What They Said
Intro
The Office has been one of my all-time favorite shows since I caught the "Michael Scott's Dunder Mifflin Scranton Meredith Palmer Memorial Celebrity Rabies Awareness Pro-Am Fun Run Race for the Cure" episode back in 2007. It's absurdly quotable, it's lousy with memorable characters, and it's got one of the best finales I've seen of any show.
Image('images/finale1.PNG')
So imagine my delight when I stumbled across a dataset cataloguing every single line of dialogue across its 9 seasons. I had to do a bit of data cleaning before I started tinkering around with it (GitHub link below), but ultimately it came out looking like this:
df = do_all_data_loading()
df.head()
Where each row represents a line of dialogue, labeled by when it happened in the script and who spoke it.
With the help of some simple text-parsing methods, we can better investigate this file for classic lines from the show.
line_search(df, 'bears. beets. battlestar galactica')
And filter down to the rows of data around them
bbbg = get_dialogue(df, season=3, episode=20, scenes=[1, 4])
bbbg
And print them for your captioning convenience
import textwrap

for idx, row in bbbg.iterrows():
    # the last two columns of each row are the line itself and who said it
    *_, quote, speaker = row
    print(speaker + ':')
    wrapper = textwrap.TextWrapper(initial_indent='\t',
                                   subsequent_indent='\t')
    print(wrapper.fill(quote))
YouTubeVideo(id='WaaANll8h18', width=500)
Awesome.
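The `line_search` and `get_dialogue` helpers aren't defined in this post. For a rough idea of what they might look like, here's a minimal sketch. Everything in it is an assumption rather than the actual implementation: the column names `season`, `episode`, `scene`, and `quote`, the choice to treat the search string as a case-insensitive regular expression, and the reading of `scenes=[1, 4]` as an inclusive range. The toy frame is made up so the block runs on its own.

```python
import pandas as pd


def line_search(df, pattern, text_col="quote"):
    """Return rows whose dialogue matches `pattern` (a regex), ignoring case."""
    mask = df[text_col].str.contains(pattern, case=False, regex=True, na=False)
    return df[mask]


def get_dialogue(df, season, episode, scenes):
    """Return every line of one episode between two scene numbers, inclusive."""
    first, last = scenes
    mask = (
        (df["season"] == season)
        & (df["episode"] == episode)
        & df["scene"].between(first, last)
    )
    return df[mask]


# Tiny stand-in frame (not the real dataset); column names are assumed.
toy = pd.DataFrame({
    "season": [3, 3, 3, 3],
    "episode": [20, 20, 20, 21],
    "scene": [1, 2, 5, 1],
    "quote": [
        "placeholder line one",
        "Bears. Beets. Battlestar Galactica.",
        "placeholder line two",
        "placeholder line three",
    ],
    "speaker": ["jim", "jim", "dwight", "pam"],
})

hits = line_search(toy, "bears. beets. battlestar galactica")
scene_rows = get_dialogue(toy, season=3, episode=20, scenes=[1, 4])
print(hits)
print(scene_rows)
```

Returning a filtered copy of the frame, rather than just the matching text, is what makes it possible to chain `['speaker'].value_counts()` onto a search result.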
Looking at Text
Words
Given that the show aired on NBC, it's a safe bet that hard searching for F-Bombs and other usual suspects will be fruitless. However, if we instead search for asterisks, as they would show up in the script, we can get to the bottom of who's got the foulest mouth at Dunder Mifflin.
# raw string, so the backslash reaches the search as a literal-asterisk escape
profanity = line_search(df, r'\*')
profanity['speaker'].value_counts()
By the by, I'd highly encourage looking these up; this show makes excellent use of the censor bleep, IMO.
profanity
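The backslash in that asterisk search matters if the search string is treated as a regular expression, which is an assumption here but would explain the escape. A quick sketch with the standard `re` module shows the difference; the censored line is made up for illustration.

```python
import re

censored = "What the ****?"  # made-up line, for illustration only

# An unescaped "*" is a quantifier with nothing to repeat, so it won't compile.
try:
    re.compile("*")
    bare_star_ok = True
except re.error:
    bare_star_ok = False

# Escaped, it matches a literal asterisk. The raw string keeps the backslash intact.
literal_hit = re.search(r"\*", censored) is not None

# re.escape builds the same pattern without hand-written backslashes.
escaped = re.escape("*")

print(bare_star_ok, literal_hit, escaped)
```

`re.escape` is the safer habit whenever the thing being searched for is a literal string that might contain regex punctuation.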
Digging around for specific words also yielded some surprising results.
For instance, that Meredith is, curiously, only the sixth-highest user of the word "Drink"
line_search(df, 'drink')['speaker'].value_counts().head(6)
Image('images/meredith.jpg')
Or that Michael spends as much time putting his foot in his mouth about gay culture as the only gay character spends talking about it.
line_search(df, 'gay')['speaker'].value_counts().head()
And, less surprisingly, that Andy is far and away the biggest user of the word 'Tuna'.
line_search(df, 'tuna')['speaker'].value_counts()
Phrases
Of course, the best and most obvious use of our ability to zero-in on keywords in every script is to run some numbers on Michael's many, many "That's what she said" jokes.
His track record of often-ill-timed, always-inappropriate innuendos began early in the second season of the show and lasted consistently through his tenure on The Office, cresting in perhaps the dumbest, most-touching context the joke will ever see.
First, let's get ahold of each instance of the phrase
twss = line_search(df, 'that\'s what she said')
twss = twss[twss['speaker'] == 'michael']
twss
A respectable 23 times.
It's also worth a nod to the fact that on at least 13 occasions, someone else shared Michael's burden.
copycats = line_search(df, 'that\'s what she said')
copycats = copycats[copycats['speaker'] != 'michael']
len(copycats)
And that when we do a similar search on 'he' instead of 'she', we've shared the love 4 times.
twhs = line_search(df, 'that\'s what he said')
len(twhs)
One thing I was curious about, though, was who was teeing Michael up for these jokes. Getting at that was easy enough. We already have each row of data where he says it. All we have to do is grab the row right before it.
Your leaderboard:
df.loc[twss.index - 1]['speaker'].value_counts()
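Subtracting 1 from the index works as long as the frame still carries its default 0, 1, 2, ... integer index, so that "label minus one" really is the previous row. If that isn't guaranteed (say, after filtering or re-indexing), one alternative is to shift the `speaker` column down a row and filter on that instead. This is a sketch of that swapped-in approach on a made-up toy frame, with `quote` as an assumed column name; it is not how the post computed the leaderboard.

```python
import pandas as pd

# Stand-in rows (not real script lines); "quote" is an assumed column name.
toy = pd.DataFrame({
    "quote": ["setup a", "that's what she said", "setup b", "that's what she said"],
    "speaker": ["jim", "michael", "michael", "michael"],
})

# Speaker of the preceding row; the first row has no predecessor, so it gets NaN.
prev_speaker = toy["speaker"].shift(1)

is_twss = toy["quote"].str.contains("that's what she said", case=False, regex=False)
by_michael = toy["speaker"] == "michael"

# Count who spoke immediately before each of Michael's punchlines.
setups = prev_speaker[is_twss & by_michael].value_counts()
print(setups)
```

Because `shift` works by position, it gives the same answer no matter what the index labels look like.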
Unsurprisingly, Michael sets himself up often (and astute readers will have noticed that some lines above have him doing just that within the span of a sentence or two.)
But who's Lester?
A bit of poking around, and it turns out that Lester is the name of the attorney who was deposing him in season 4, lol
YouTubeVideo('ClzJkv3dpY8', start=240, end=288, width=500)
Michael Hates Toby
Another interesting avenue within the text data is looking at how vocabulary changes character-to-character.
If you've ever seen these two interact, it wouldn't surprise you to learn that Michael's fiery hatred for his colleague in HR comes with a good deal of colorful language. And not just the word "No." (when he learns that Toby has abruptly come back to work after being away for a season).
YouTubeVideo('NHh0rf0ojEc', width=500)
And so here, I found every line that Michael delivered within a line or two of Toby.
# keep only Michael's lines that come shortly after one of Toby's
michaelToToby = df[a_spoke_after_b(df, 'michael', 'toby')]
# sample the first few records
michaelToToby.head()
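`a_spoke_after_b` isn't defined in the post either. One plausible way to build that kind of boolean mask is to look back a fixed number of rows with `shift`. The sketch below assumes a two-row lookback (my reading of "a line or two") exposed as a hypothetical `window` parameter, and it ignores scene and episode boundaries, which a more careful version would want to respect.

```python
import pandas as pd


def a_spoke_after_b(df, a, b, window=2):
    """Boolean mask: True where `a` speaks and `b` spoke in the `window` rows before."""
    b_spoke_recently = pd.Series(False, index=df.index)
    for lag in range(1, window + 1):
        # shift(lag) lines each row up with the speaker `lag` rows earlier
        b_spoke_recently = b_spoke_recently | (df["speaker"].shift(lag) == b)
    return (df["speaker"] == a) & b_spoke_recently


# Stand-in speaker sequence, not taken from the real scripts.
toy = pd.DataFrame({
    "speaker": ["toby", "jim", "michael", "pam", "pam", "michael"],
})

mask = a_spoke_after_b(toy, "michael", "toby")
print(toy[mask])
```

The first Michael row qualifies because Toby spoke two rows earlier; the second doesn't, since only Pam spoke in the two rows before it.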
Then I extracted all of the unique words that he used throughout the course of the show when speaking to or shortly after him
michaelsWordsToToby = set(extract_corpus(michaelToToby))
print(len(michaelsWordsToToby), 'unique words')
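For the curious, a word extractor along these lines can be as simple as lowercasing each line and pulling out runs of letters. This is only a sketch under assumptions: the `quote` column name is hypothetical, the tokenizing rule (letters and apostrophes) is my own choice, and the two lines of dialogue are stand-ins.

```python
import re

import pandas as pd


def extract_corpus(df, text_col="quote"):
    """Return every lowercase word token in the frame's dialogue, duplicates included."""
    words = []
    for line in df[text_col].dropna():
        # keep apostrophes so contractions like "that's" stay in one piece
        words.extend(re.findall(r"[a-z']+", line.lower()))
    return words


# Stand-in dialogue, for illustration only.
toy = pd.DataFrame({"quote": ["No. No, no, no!", "Why are you here again?"]})

corpus = extract_corpus(toy)
unique_words = set(corpus)
print(len(corpus), "words,", len(unique_words), "unique")
```

Returning a list and leaving the `set()` call to the caller keeps the word counts available for anyone who wants frequencies rather than just vocabulary.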
Then, I grabbed every line that Michael said after people that he liked. Most notably, his lovers... work friends... daughter-figures... Ryan.
And I compiled every unique word that he uses with them.
peopleMichaelLikes = ['ryan', 'jim', 'dwight', 'pam', 'holly',
                      'darryl', 'erin', 'oscar', 'david', 'jan']
niceWords = set()
for person in peopleMichaelLikes:
    # pool Michael's vocabulary across everyone on the list
    michaelToPerson = df[a_spoke_after_b(df, 'michael', person)]
    niceWords = niceWords.union(set(extract_corpus(michaelToPerson)))
print(len(niceWords), 'unique words')
This allowed me to generate words that Michael uniquely uses in reaction to someone he hates so damn much...
"If I had a gun with two bullets and was in a room with Hitler, Bin Laden, and Toby, I would shoot Toby twice."
-Michael Scott
...that he's never used with anyone else on the show.
print(michaelsWordsToToby - niceWords)
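The whole trick boils down to a union followed by a set difference. Here it is on a handful of invented words, just to make the mechanics concrete; none of these are results from the real data.

```python
# Invented vocabularies, purely to show the set arithmetic.
toby_words = {"no", "why", "evil", "you"}
words_by_person = {
    "pam": {"you", "great", "no"},
    "jim": {"why", "you", "buddy"},
}

# Union: every word used with anyone on the "liked" list.
nice = set()
for words in words_by_person.values():
    nice |= words

# Difference: words that show up with Toby and with nobody else.
toby_only = toby_words - nice
print(sorted(toby_only))
```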
Links
All of the source code for this post can be found at https://github.com/napsterinblue/the-office-lines
Credit to /u/misunderstoodpoetry for actually pulling the dataset together