My Foray Into Prepping Data for Analysis

Recently I saw a really great analysis of data called ‘The Office’ Characters’ Most Distinguishing Words. I dug around some and the original data is available here.

I wanted to try to do something like this on my own, but there’s not a lot of data available for a lot of shows. My favorite TV is Frasier, which had an eleven season run in the late 90’s. Did you know that Kelsey Grammer was on-air for 20 years between Cheers and Frasier? I didn’t know that.

After doing some research, I found http://www.kacl780.net/ where users have transcribed all of the episodes of Frasier. At first I thought that I would have to build a scraper using rvest or maybe scrapy but instead I was able to find a repo on Github where someone else did all of the heavy lifting. It’s available here.

The majority of my professional work is done using APIs or packages written by other people, so I don’t have to spend a lot of time getting and cleaning data. After looking at the transcripts of one show, I realized this might be more challenging that I imagined.

Let’s take a look at a example transcript:

Act One.

THE JOB

Scene One - KACL
The Frasier Crane Show.  Dr. Frasier Crane, the host, is at his console,
admonishing a caller; Roz Doyle, his call-screener, is in her booth.

Frasier: [firmly] Listen to yourself, Bob!  You follow her to work,
you eavesdrop on her calls, you open her mail.  The minute
you started doing these things, the relationship was over!
[polite] Thank you for your call. [presses a button; to Roz]
Roz, I think we have time for one more?

Roz speaks in a soothing radio voice.

Roz: Yes, Dr Crane.  On line four, we have Russell from Kirkland.
Frasier: [presses a button] Hello, Russell.  This is Dr Frasier Crane;
I'm listening.
Russell: [v.o.] Well, I've been feeling sort of, uh, you know,
depressed lately. [Roz looks at the clock] My life's not
going anywhere and-and, er, it's not that bad.  It's just
the same old apartment, same old job...


My goal was to parse through each transcript and turn those characters lines into something that can be analyzed:

Character Line
Frasier Listen to yourself, Bob! You follow her to work, you eavesdrop on her calls, you open her mail. The minute you started doing these things, the relationship was over! Thank you for your call. Roz, I think we have time for one more?
Roz Yes, Dr Crane. On line four, we have Russell from Kirkland.
Frasier Hello, Russell. This is Dr Frasier Crane; I’m listening.
Russell Well, I’ve been feeling sort of, uh, you know, depressed lately. My life’s not going anywhere and-and, er, it’s not that bad. It’s just the same old apartment, same old job…

Intially I tried tackling this with R using a combination of for loops, grep and gsub but I was getting absolutely no where.

I turned my attenion to BASH and found success maily using sed and a loop. Here’s the script:

I needed to remove lines like Roz speaks in a soothing radio voice.  that aren’t actual characters lines.

I removed those lines using sed 's/[A-z]*:/!&/g; s/^ \{1,\}/!/g. This will look for lines that don’t start with a line like Fraiser:. It turned the line into !Roz speaks in a soothing radio voice.

Then I was able to remove those lines with s/^[^!*]*//; s/!\{1,\}//g; s/$.*$//g since each offending line started with !.

The next piece:

tr '\n' ' ' | sed 's/[A-z]*:/\

&/g' | sed 's/ \{2,\}/ /g'


Turns our characters lines into: Frasier: Listen to yourself, Bob! You follow her to work, you eavesdrop on her calls, you open her mail. The minute you started doing these things, the relationship was over! Thank you for your call. Roz, I think we have time for one more?.

It hasn’t been perfected yet, but it’s a least a start. My next steps include trying to clean up the formatting of the data and adding additional meta data from the script.

I don’t think it will be possible create a CSV file in the format I want, I might have to make multiple files and join them together in R.

If you would like to follow along, my project is available here.

Analyzing Frasier: