That's What They Said

Intro

The Office has been one of my all-time favorite shows since I caught the "Michael Scott's Dunder Mifflin Scranton Meredith Palmer Memorial Celebrity Rabies Awareness Pro-Am Fun Run Race for the Cure" episode back in 2007. It's absurdly quotable, it's lousy with memorable characters, and it's got one of the best finales I've seen of any show.

In [2]:
Image('images/finale1.PNG')
Out[2]:
No Spoilers.

So imagine my delight when I stumbled across a dataset cataloguing every single line of dialogue across its 9 seasons. I had to do a bit of data cleaning before I started tinkering around with it (GitHub link below), but ultimately it came out looking like this:

In [3]:
df = do_all_data_loading()
df.head()
Out[3]:
id season episode scene line_text speaker
0 1 1 1 1 All right Jim. Your quarterlies look very good... michael
1 2 1 1 1 Oh, I told you. I couldn't close it. So... jim
2 3 1 1 1 So you've come to the master for guidance? Is ... michael
3 4 1 1 1 Actually, you called me in here, but yeah. jim
4 5 1 1 1 All right. Well, let me show you how it's done. michael

Where each row represents a line of dialogue, labeled by when it happened in the script and who spoke it.

With the help of some simple text-parsing methods, we can better investigate this file for classic lines from the show.

In [4]:
line_search(df, 'bears. beets. battlestar galactica')
Out[4]:
id season episode scene line_text speaker
13409 15346 3 20 1 Fact. Bears eat beets. Bears. Beets. Battl... jim
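
The `line_search` helper lives in the linked repo and isn't shown in this post. As a rough sketch only, a helper like it could be a case-insensitive regex match on the `line_text` column. The name `line_search_sketch`, the whitespace collapsing, and the toy rows below are all assumptions for illustration, not the author's implementation.

```python
import pandas as pd


def line_search_sketch(df, pattern):
    """Return rows whose line_text matches `pattern` as a case-insensitive regex."""
    # Assumption: collapse runs of whitespace so a single-spaced query can
    # still match text that has double spaces after a sentence.
    text = df['line_text'].str.replace(r'\s+', ' ', regex=True)
    mask = text.str.contains(pattern, case=False, regex=True, na=False)
    return df[mask]


# Hypothetical three-row stand-in for the real dataset
toy = pd.DataFrame({
    'line_text': [
        'Fact.  Bears eat beets.  Bears.  Beets.  Battlestar Galactica.',
        "That's a ridiculous question.",
        'This is bull****!',
    ],
    'speaker': ['jim', 'dwight', 'michael'],
})

hits = line_search_sketch(toy, 'bears. beets. battlestar galactica')
print(hits['speaker'].tolist())

# Regex metacharacters need escaping to be matched literally
censored = line_search_sketch(toy, r'\*')
print(censored['speaker'].tolist())
```

Because the pattern is treated as a regex, the periods in the query match any character, which is harmless here but worth knowing for stricter searches.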

And filter down to the rows of data around them.

In [5]:
bbbg = get_dialogue(df, season=3, episode=20, scenes=[1, 4])
bbbg
Out[5]:
id season episode scene line_text speaker
13405 15342 3 20 1 [Dressed as Dwight] It's kind of blurry. [puts... jim
13406 15343 3 20 1 That's a ridiculous question. dwight
13407 15344 3 20 1 False. Black bear. jim
13408 15345 3 20 1 Well that's debatable. There are basically tw... dwight
13409 15346 3 20 1 Fact. Bears eat beets. Bears. Beets. Battl... jim
13410 15347 3 20 1 Bears do not--- What is going on--- What are y... dwight
13411 15348 3 20 2 Last week, I was in a drug store and I saw the... jim
13412 15349 3 20 3 You know what? Imitation is the most sincere ... dwight
13413 15350 3 20 3 ... MICHAEL! jim
13414 15351 3 20 3 Oh, that's funny. MICHAEL! dwight
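
`get_dialogue` is also defined in the repo rather than in this post. One plausible reading, sketched below, is a plain boolean filter on season, episode, and a range of scenes. Treating `scenes` as a half-open `[start, stop)` range is a guess, made only because the output above stops at scene 3; the function name and toy data are hypothetical.

```python
import pandas as pd


def get_dialogue_sketch(df, season, episode, scenes):
    """Return lines from one episode whose scene is in [scenes[0], scenes[1])."""
    first_scene, stop_scene = scenes
    mask = (
        (df['season'] == season)
        & (df['episode'] == episode)
        & (df['scene'] >= first_scene)
        & (df['scene'] < stop_scene)  # assumption: upper bound is exclusive
    )
    return df[mask]


# Hypothetical stand-in for the real dataset
toy = pd.DataFrame({
    'season':  [3, 3, 3, 3, 3],
    'episode': [20, 20, 20, 20, 21],
    'scene':   [1, 2, 3, 4, 1],
    'speaker': ['jim', 'jim', 'dwight', 'michael', 'pam'],
})

excerpt = get_dialogue_sketch(toy, season=3, episode=20, scenes=[1, 4])
print(excerpt['scene'].tolist())
```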

And print them for your captioning convenience.

In [6]:
import textwrap

# Build the wrapper once; every wrapped line is indented with a tab
wrapper = textwrap.TextWrapper(initial_indent='\t',
                               subsequent_indent='\t')

for _, row in bbbg.iterrows():
    *_, quote, speaker = row
    print(speaker + ':')
    print(wrapper.fill(quote))
jim:
	[Dressed as Dwight] It's kind of blurry. [puts on his glasses] That's
	better. [exhales] Question.  What kind of bear is best?
dwight:
	That's a ridiculous question.
jim:
	False.  Black bear.
dwight:
	Well that's debatable.  There are basically two schools of thought---
jim:
	Fact.  Bears eat beets.  Bears.  Beets.  Battlestar Galactica.
dwight:
	Bears do not--- What is going on--- What are you doing?!
jim:
	Last week, I was in a drug store and I saw these glasses.  Uh, four
	dollars.  And it only cost me seven dollars to recreate the rest of
	the ensemble.  And that's a grand total of... [Jim calculates the
	total on his calculator-watch] eleven dollars.
dwight:
	You know what?  Imitation is the most sincere form of flattery, so I
	thank you. [Jim places a bobble-head on his desk]  Identity theft is
	not a joke, Jim!  Millions of families suffer every year!
jim:
	... MICHAEL!
dwight:
	Oh, that's funny.  MICHAEL!
In [7]:
YouTubeVideo(id='WaaANll8h18', width=500)
Out[7]:

Awesome.

Looking at Text

Words

Given that the show aired on NBC, it's a safe bet that hard searching for F-Bombs and other usual suspects will be fruitless. However, if we instead search for asterisks, as they would show up in the script, we can get to the bottom of who's got the foulest mouth at Dunder Mifflin.

In [8]:
profanity = line_search(df, r'\*')  # raw string: match a literal asterisk
profanity['speaker'].value_counts()
Out[8]:
kevin       2
michael     2
kelly       1
deangelo    1
oscar       1
darryl      1
phyllis     1
jo          1
pam         1
toby        1
brandon     1
andy        1
robert      1
ryan        1
Name: speaker, dtype: int64

By the by, I'd highly encourage looking these up: this show makes excellent use of the censor bleep, IMO.

In [9]:
profanity
Out[9]:
id season episode scene line_text speaker
7662 9564 3 1 31 I watch the L Word. I watch, Queer as F***, s... michael
11959 13896 3 15 25 This is bull****! michael
16147 18084 4 2 68 [pointing] Um... [camera reveals that 'RYAN' i... oscar
28409 30346 5 25 21 Oh! Mother ******! phyllis
29304 31241 6 2 18 I can't believe this. [mistaking Darryl's sist... toby
29361 31298 6 2 26 You come to my house, bust up my trash cans, c... darryl
41753 43690 7 18 15 My mom makes the best pesto in the world. And ... ryan
41815 43752 7 18 25 Holy s*** is that real? pam
42672 44609 7 21 34 Gimme that damn dog you f***ing thief! Don't e... deangelo
43460 45397 7 24 43 I don't have any, assh***. kelly
43621 45558 7 24 58 That's f***ing crazy. [Jo and Dwight both smil... jo
44116 46053 8 2 8 Wait, I thought she was a **** and you *******... kevin
44995 46932 8 5 15 I just got a text from Broccoli Rob - 'Boo!' S... andy
48436 50373 8 16 22 Must be doing real good since you're f***ing m... brandon
48438 50375 8 16 22 Dude, you didn't tell me you were f***ing Val.... kevin
50548 52485 8 23 35 Ah, well, I will not be blackmailed by some in... robert

Digging around for specific words also yielded some surprising results.

For instance, that Meredith is, curiously, only the sixth-highest user of the word "Drink"

In [10]:
line_search(df, 'drink')['speaker'].value_counts().head(6)
Out[10]:
michael     38
dwight      22
jim         19
pam         12
andy        10
meredith     9
Name: speaker, dtype: int64
In [11]:
Image('images/meredith.jpg')
Out[11]:

Or that Michael spends as much time putting his foot in his mouth about gay culture as the only gay character spends talking about it.

In [12]:
line_search(df, 'gay')['speaker'].value_counts().head()
Out[12]:
oscar      31
michael    31
dwight     17
andy       15
pam         8
Name: speaker, dtype: int64

And, less surprisingly, that Andy is far and away the biggest user of the word 'Tuna'.

In [13]:
line_search(df, 'tuna')['speaker'].value_counts()
Out[13]:
andy                63
jim                 13
dwight               9
michael              7
walter jr            2
erin                 2
kevin                2
holly                2
gabe                 2
mark                 1
both                 1
david wallace        1
creed                1
angela               1
robert               1
front desk clerk     1
kelly                1
Name: speaker, dtype: int64

Phrases

Of course, the best and most obvious use of our ability to zero-in on keywords in every script is to run some numbers on Michael's many, many "That's what she said" jokes.

His track record of often-ill-timed, always-inappropriate innuendos began early in the second season of the show and lasted consistently through his tenure on The Office, cresting in perhaps the dumbest, most-touching context the joke will ever see.

First, let's get ahold of each instance of the phrase

In [14]:
twss = line_search(df, "that's what she said")
twss = twss[twss['speaker'] == 'michael']
twss
Out[14]:
id season episode scene line_text speaker
1963 2545 2 2 24 That's what she said. Pam? michael
2012 2594 2 2 34 THAT'S WHAT SHE SAID! michael
4090 5325 2 10 2 A, that's what she said, and B, I wanted it to... michael
6026 7644 2 17 6 That's what she said! michael
7044 8872 2 21 22 That's what she said. [Jim mouths these words ... michael
7722 9624 3 1 49 I am glad that today spurred social change. T... michael
10692 12594 3 10 49 Oh. [She whispers in his ear. Michael starts t... michael
12365 14302 3 17 9 That's what she said. michael
13449 15386 3 20 11 No, no. I need two men on this. That's what ... michael
15633 17570 4 2 6 Hey. Can you make that straighter? That's what... michael
17023 18960 4 4 45 And the best way to start is to hit start. And... michael
18185 20122 4 7 56 That's what I said. That's what she said. michael
18333 20270 4 8 23 That's what she said. michael
18335 20272 4 8 23 That's what she said. michael
18341 20278 4 8 23 Come again? That's what she said? I don't know... michael
18779 20716 4 9 19 [yells] THAT'S WHAT SHE SAID! [Jan gets an evi... michael
19544 21481 4 12 2 [muffled] That's what she said. michael
21213 23150 5 1 111 [from his office] That's what she said. michael
21974 23911 5 4 30 It squeaks when you bang it, that's what she s... michael
22260 24197 5 5 25 That's what she said. michael
34434 36371 6 18 9 That's what she said. michael
38650 40587 7 8 29 [grimacing] That's what she said. [leaves] michael
42759 44696 7 21 51 [putting his shoes back on, talking to the cam... michael

A respectable 23 times.

It's also worth noting that on at least 13 occasions, someone else shared Michael's burden.

In [15]:
copycats = line_search(df, "that's what she said")
copycats = copycats[copycats['speaker'] != 'michael']
len(copycats)
Out[15]:
13

And a similar search on 'he' instead of 'she' shows the love being shared 4 times.

In [16]:
twhs = line_search(df, "that's what he said")
len(twhs)
Out[16]:
4

One thing I was curious about, though, was who was teeing Michael up for these jokes. Getting at that was easy enough. We already have every row where he says it. All we have to do is grab the row right before each one.

Your leaderboard:

In [17]:
# the row just before each joke; relies on df's consecutive integer index
df.loc[twss.index - 1, 'speaker'].value_counts()
Out[17]:
jim             5
michael         3
lester          3
jan             2
kevin           1
holly           1
angela          1
darryl          1
pam             1
second cindy    1
phyllis         1
gabe            1
andy            1
dwight          1
Name: speaker, dtype: int64

Unsurprisingly, Michael sets himself up often (and astute readers will have noticed that some lines above have him doing just that within the span of a sentence or two.)

But who's Lester?

A bit of poking around, and it turns out that Lester is the name of the attorney who was deposing him in season 4, lol

In [18]:
YouTubeVideo('ClzJkv3dpY8', start=240, end=288, width=500)
Out[18]:
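
A side note on the "grab the row right before it" trick used for the leaderboard: subtracting 1 from the index labels only works while the frame's index is a run of consecutive integers. A hedged alternative, sketched here on made-up lines with a deliberately gappy index, is to shift the speaker column by one position and then select the joke rows.

```python
import pandas as pd

# Hypothetical lines with a non-consecutive index
toy = pd.DataFrame(
    {
        'speaker': ['jim', 'michael', 'pam', 'jan', 'michael'],
        'line_text': [
            'I need two men on this.',
            "That's what she said.",
            'Michael, please.',
            'Can you make that straighter?',
            "THAT'S WHAT SHE SAID!",
        ],
    },
    index=[10, 20, 30, 40, 50],
)

# Rows where the joke is delivered
is_joke = toy['line_text'].str.contains("that's what she said", case=False)

# shift(1) lines each row up with the speaker of the row positioned before it,
# regardless of what the index labels are
previous_speaker = toy['speaker'].shift(1)
setups = previous_speaker[is_joke]
print(setups.tolist())
```

Calling `value_counts()` on `setups` would then give the same kind of leaderboard.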

Michael Hates Toby

Another interesting avenue within the text data is looking at how vocabulary changes character-to-character.

If you've ever seen these two interact, it wouldn't surprise you to learn that Michael's fiery hatred for his colleague in HR engenders a good deal of language all its own. And not just the word "No." (when he learns that Toby has abruptly come back to work after being away for a season).

In [19]:
YouTubeVideo('NHh0rf0ojEc', width=500)
Out[19]:

And so here, I found every line that Michael delivered within a line or two of Toby.

In [20]:
# sample the first few records
michaelToToby = df[a_spoke_after_b(df, 'michael', 'toby')]
michaelToToby.head()
Out[20]:
id season episode scene line_text speaker
340 382 1 2 16 Get out. michael
342 384 1 2 16 No, this is not a joke. OK? That was offensive... michael
343 385 1 2 17 [on the tape] Hi. I'm Michael Scott. I'm in ch... michael
1277 1651 1 6 12 Toby, Katy. michael
1283 1657 1 6 12 Toby's divorced. He uh, guh recently, right? michael
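
The `a_spoke_after_b` helper isn't shown in the post. Going by the "within a line or two" description, one way to build such a mask is to compare the speaker column against shifted copies of itself. The two-line window, the function name, and the toy data are assumptions for illustration.

```python
import pandas as pd


def a_spoke_after_b_sketch(df, a, b, window=2):
    """Boolean mask: rows spoken by `a` with `b` among the previous `window` lines."""
    speaker = df['speaker']
    b_recent = pd.Series(False, index=df.index)
    for lag in range(1, window + 1):
        # True where the line `lag` rows earlier was spoken by `b`
        b_recent = b_recent | (speaker.shift(lag) == b)
    return (speaker == a) & b_recent


# Hypothetical speaker sequence
toy = pd.DataFrame({
    'speaker': ['toby', 'michael', 'michael', 'michael', 'jim', 'michael'],
})

mask = a_spoke_after_b_sketch(toy, 'michael', 'toby')
print(toy[mask].index.tolist())
```

Note that this version ignores scene boundaries, so a line at the start of one scene can be paired with a speaker from the end of the previous one.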

Then I extracted all of the unique words that he used throughout the course of the show when speaking to him or shortly after him.

In [21]:
michaelsWordsToToby = set(extract_corpus(michaelToToby))
print(len(michaelsWordsToToby), 'unique words')
1222 unique words
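
`extract_corpus` is another repo helper that isn't reproduced here. A minimal sketch of one way to turn a frame of lines into a flat word list: lowercase everything, drop characters other than letters, hyphens, and whitespace, then split. The exact cleaning rules are a guess, and the function name and sample lines are hypothetical.

```python
import re

import pandas as pd


def extract_corpus_sketch(df):
    """Return every word in df['line_text'], lowercased, as a flat list."""
    words = []
    for line in df['line_text']:
        # Assumption: keep letters, hyphens and whitespace; drop everything else
        cleaned = re.sub(r'[^a-z\-\s]', '', line.lower())
        words.extend(cleaned.split())
    return words


toy = pd.DataFrame({'line_text': ['Get out.', "Toby's divorced. Get OUT!"]})
corpus = extract_corpus_sketch(toy)
print(sorted(set(corpus)))
```

Wrapping the result in `set()`, as the cell above does, reduces it to unique words.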

Then, I grabbed every line that Michael said after people that he liked. Most notably, his lovers... work friends... daughter-figures... Ryan.

And I compiled every unique word that he uses with them.

In [22]:
peopleMichaelLikes = ['ryan', 'jim', 'dwight', 'pam', 'holly',
                      'darryl', 'erin', 'oscar', 'david', 'jan']

niceWords = set()

for person in peopleMichaelLikes:
    michaelToPerson = df[a_spoke_after_b(df, 'michael', person)]
    niceWords = niceWords.union(set(extract_corpus(michaelToPerson)))
    
print(len(niceWords), 'unique words')
7680 unique words

This allowed me to generate words that Michael uniquely uses in reaction to someone he hates so damn much...

"If I had a gun with two bullets and was in a room with Hitler, Bin Laden, and Toby, I would shoot Toby twice."

-Michael Scott

...that he's never used with anyone else on the show.

In [23]:
print(michaelsWordsToToby - niceWords)
{'jerk-face', 'nutcases', 'retarded', 'souls', 'shaolin', 'lift', 'heartwarming', 'dawg', 'work-associated', 'assuming', 'culturally', 'principles', 'inferring', 'sream', 'molest', 'slepping', 'twisted', 'binder', 'mediators', 'welcoming', 'smack', 'mornin', 'pufnstuf', 'imploring', 'insisting', 'temple', 'bored', 'pitcher', 'grimaces', 'farting', 'immature', 'shine', 'icebreaker', 'retards', 'creedstanley', 'status', 'anti-christ', 'aaaah', 'benefit', 'zip', 'throats', 'climate', 'meantime', 'interruption', 'influence', 'heartless', 'plague', 'racist', 'anticipation', 'disorder', 'overstating', 'infected', 'mamas', 'balers', 'neve', 'punishment', 'noun', 'overstaying', 'sassy', 'committed', 'beeps', 'jeff', 'sweating', 'dealt', 'virus', 'joshin', 'heh', 'whomevers', 'resistance', 'radon', 'borientationtalks', 'affecting', 'gives-what-what', 'chosen', 'erics', 'lobster', 'interim', 'nigeria', 'crumbles', 'horribleness', 'cutie-pie', 'primary', 'includes', 'snail', 'powerful', 'doll', 'veterinarian', 'affective', 'lamaze', 'director', 'pretended', 'soulless', 'tan', 'flashed', 'insists', 'nile', 'seasonal', 'styles', 'taxpayer', 'villages', 'counseling', 'air-condition---', 'goal', 'slack', 'relatively', 'explicit', 'images', 'involving', 'abraham', 'freedoms', 'uncalled', 'stutter', 'cutoff', 'collect', 'psychological', 'ads', 'probed', 'fatigue', 'agenda-actually', 'pukeys', 'justdrag', 'red-headed', 'caprese', 'exhalesi', 'all-in', 'legitimate', 'pagers', 'alcoholic', 'perv', 'donut', 'verdict', 'hammer', 'carpools', 'discreet', 'witty', 'answered', 'progress', 'undergoing', 'rightful', 'charming', 'meenie', 'been--', 'nuisance', 'wha-wha-wha-wha-what', 'redacted', 'peas', 'psych', 'shifting', 'counselor', 'alf', 'pristine', 'nyeh', 'failure', 'mothers', 'miney', 'unlocks', 'despite', 'squat', 'cruisin', 'mutiny', 'irritability', 'eeny', 'jerky', 'crabs', 'opener', 'air-conditioner', 'flasher', 'alien', 'foliage', 'select', 'skull', 'kills', 'boredom', 'slate', 
'cop', 'aaaahhh', 'disrespectful', 'bruisin', 'bedpost', 'tested', 'albeit', 'blow-up', 'whisper', 'winnings', 'puppet', 'beaches', 'respected', 'conflict', 'smelly', 'conflictin', 'biologically', 'comedic', 'fortunate', 'yanks', 'pulp', 'campbell', 'quitter', 'instructed', 'pukey', 'bent', 'steeped', 'inception', 'towel', 'cornerstone', 'peed', 'blabbering', 'notches', 'donuts'}

Links

All of the source code for this post can be found at https://github.com/napsterinblue/the-office-lines

Credit to /u/misunderstoodpoetry for actually pulling the dataset together