Text Analysis, Part II: More Adventures in Cleaning Data

By Chip Oglesby

In my previous post we took a look at a couple of different ways to parse unstructured text from the internet and prepare it for analysis in R using tidytext.

Yesterday evening I took a look at the subtools package in R. It looked like a promising tool that lets you format and analyze subtitles.

Here’s a raw example of the subtitle information from season one, episode one:

00:00:08,842 --> 00:00:11,220
Listen to yourself, Bob!

00:00:11,221 --> 00:00:15,284
You follow her to work, you eavesdrop on her calls, you open her mail

00:00:15,285 --> 00:00:18,994
The minute you started doing these things, the relationship was over!

00:00:19,574 --> 00:00:20,720
Thank you for your call.

00:00:23,487 --> 00:00:25,282
Roz, I think we have time for one more?

That’s a lot different from yesterday’s example:

Frasier: Listen to yourself, Bob!  You follow her to work, you eavesdrop on her
calls, you open her mail.  The minute you started doing these things, the
relationship was over! Thank you for your call. Roz, I think we have time for
one more?

subtools does a great job of parsing data into a data frame with a result like:

ID  Timecode.in   Timecode.out  Text                                                                   season  episode_num  speakingTime
1   00:00:08.842  00:00:11.220  listen to yourself, bob                                                1       1            2.75e-05
2   00:00:11.221  00:00:15.284  you follow her to work, you eavesdrop on her calls, you open her mail  1       1            4.70e-05
3   00:00:15.285  00:00:18.994  the minute you started doing these things, the relationship was over   1       1            4.29e-05
4   00:00:19.574  00:00:20.720  thank you for your call                                                1       1            1.33e-05
5   00:00:23.487  00:00:25.282  roz, i think we have time for one more                                 1       1            2.08e-05
6   00:00:25.283  00:00:26.571  yes, dr crane                                                          1       1            1.49e-05

I added the speakingTime column manually. I’ve reached out to the author of the package to see if the subtitle data might be usable in some form, but for now I don’t think I’ll be able to use it.
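As a side note on that column: the speakingTime values above are consistent with each cue’s duration expressed as a fraction of a 24-hour day (2.378 seconds / 86,400 ≈ 2.75e-05 for the first row). Here’s a minimal sketch of that arithmetic in shell and awk. The speaking_time function name is hypothetical, and this is only an assumption about how the numbers could be derived, not the subtools API or the R code used for this post.

```shell
# Hypothetical helper: duration of one subtitle cue as a fraction of a day.
# Input is an SRT timing line such as "00:00:08,842 --> 00:00:11,220".
speaking_time() {
  awk -v line="$1" '
    # Convert "HH:MM:SS,mmm" to seconds.
    function to_seconds(tc,    p) {
      sub(",", ".", tc)
      split(tc, p, ":")
      return p[1] * 3600 + p[2] * 60 + p[3]
    }
    BEGIN {
      split(line, parts, " --> ")
      printf "%.2e\n", (to_seconds(parts[2]) - to_seconds(parts[1])) / 86400
    }'
}

speaking_time "00:00:08,842 --> 00:00:11,220"   # prints 2.75e-05
```

Multiplying the result by 86,400 gives the duration back in seconds, which is probably the friendlier unit for plotting.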

If I just wanted to do a sentiment and text analysis on the words alone and not include any character information, then the subtitle information would be all I need.

The goal, though, is to replicate yesterday’s example of the visualization from ‘The Office’.

After downloading the data from the internet and getting it into R, I thought of creating a key and attempting to join the two data frames together. I would take a string like:

Hello Russel. This is Dr. Frasier Crane; I'm listening.

making it:


This worked for some of the data, but not all. Once the episode and subtitle datasets were combined, I was left with a data frame of ~315,000 observations. Most of those rows were Cartesian products: because I was joining only on the “key” I created, every episode line with a given key was matched against every subtitle with the same key. Most of them came from very short lines that recur constantly, such as “what”, “yes”, “thank you”, “hello”, “oh” and “no”.
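To see why a key-only join balloons like this, here’s a toy sketch using the standard sort and join utilities. The file names, the pipe-delimited layout, and the rows themselves are all made up for illustration; the real join in this post was done in R.

```shell
# Hypothetical toy data: "key|value" rows, with the key "yes" repeated on both sides.
count_joined_rows() {
  local tmp rows
  tmp=$(mktemp -d)
  # Four episode lines, three of which share the key "yes".
  printf '%s\n' 'yes|ep-line-1' 'yes|ep-line-2' 'yes|ep-line-3' 'im listening|ep-line-4' \
    | LC_ALL=C sort -t'|' -k1,1 > "$tmp/episode.txt"
  # Three subtitle cues, two of which share the key "yes".
  printf '%s\n' 'yes|sub-cue-1' 'yes|sub-cue-2' 'im listening|sub-cue-3' \
    | LC_ALL=C sort -t'|' -k1,1 > "$tmp/subtitle.txt"
  # join pairs every left row with every right row that has the same key.
  rows=$(LC_ALL=C join -t'|' "$tmp/episode.txt" "$tmp/subtitle.txt" | wc -l)
  rm -rf "$tmp"
  echo $((rows))
}

count_joined_rows   # prints 7
```

Four rows joined to three rows yield seven, because the three “yes” rows on one side pair with both “yes” rows on the other (3 × 2 = 6), plus the single unique match. Scale that up to every “yes” and “oh” in the whole series and the row count explodes.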

In theory, I should have been able to use the key to get the season and episode from the subtitle information to give me a bit more metadata.

Since I couldn’t join on key, season and episode together, I had to go back to the drawing board. What else could I do?

Going back to bash, I realized that I already had the season and episode information conveniently stored in the name of each text file. 0101.txt gives me season one, episode one, so let’s set something up in bash.

Here’s what I’ve come up with:

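# Strip the directory and the .txt extension: path/to/0101.txt -> 0101
seasonShow=$(echo "${filename##*/}" | sed 's/\.txt$//')
# Append the code as a quoted column to every line read from standard input
sed "s/.*/&,\"$seasonShow\"/" >> ../data/csv/lines.csv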

This does two things. First, seasonShow captures the file name without its directory or .txt extension. Second, the other sed command appends that value to every line of the output as a new column in my pseudo-CSV file.
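Here’s a self-contained sketch of how those two steps could be wrapped in a loop over the transcript files. The tag_lines function, the temporary directory, and the layout of the sample CSV row are assumptions for illustration, and it uses basename in place of the echo-and-sed step to strip the extension.

```shell
# Hypothetical helper: print every line of a transcript file with the
# season/episode code (taken from the file name) appended as a quoted column.
tag_lines() {
  local filename=$1
  local seasonShow
  seasonShow=$(basename "$filename" .txt)   # path/to/0101.txt -> 0101
  sed "s/\$/,\"$seasonShow\"/" "$filename"
}

# Usage sketch with a throwaway file; the directory layout is assumed.
workdir=$(mktemp -d)
printf '%s\n' '"Frasier","Listen to yourself, Bob!"' > "$workdir/0101.txt"
for f in "$workdir"/*.txt; do
  tag_lines "$f"
done > "$workdir/lines.csv"
cat "$workdir/lines.csv"   # prints "Frasier","Listen to yourself, Bob!","0101"
```

Redirecting the whole loop once, rather than appending inside it, means a rerun starts from a clean file instead of duplicating rows.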

Later on in R, I’ll use dplyr’s mutate to create columns based on that new information:

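library(dplyr)     # mutate()
library(magrittr)  # %<>% assignment pipe
# R strings are 1-indexed: characters 1-2 are the season, 3-4 the episode
episodes %<>%
  mutate(season = as.integer(substr(seasonEpisode, 1, 2)),
         episode = as.integer(substr(seasonEpisode, 3, 4)))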
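The same split can be sanity-checked in the shell before the data ever reaches R. This is a minimal sketch with a hypothetical split_code function; it assumes every code is exactly four digits, as in 0101.

```shell
# Hypothetical helper: split a four-digit code such as "0108" into
# "season episode" with the zero padding removed.
split_code() {
  local code=$1
  local season=${code%??}    # drop the last two characters  -> "01"
  local episode=${code#??}   # drop the first two characters -> "08"
  # Strip one leading zero as text; shell arithmetic such as $((08))
  # would treat the leading zero as an octal prefix and fail.
  echo "${season#0} ${episode#0}"
}

split_code 0101   # prints 1 1
split_code 0809   # prints 8 9
```

The string-based stripping is a deliberate choice here, since episodes 08 and 09 are exactly the ones that trip up arithmetic on zero-padded numbers.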

Now we’re getting closer to a product that we can finally start to analyze. There’s still more text cleanup that needs to be done, but I’m feeling pretty good so far!

I’ve been working on a couple of examples because I just can’t keep my hands off of the data.

Here’s an example of spoken lines between men and women (all characters) in the show throughout the seasons.

If you’re interested, here are the counts of characters whose gender was not NA. I used the gender package to create this information, and some manual tweaking still needs to be done. Once the text has been cleaned, we should see a minimal number of NAs.

Gender  Unique Character Count
Female  247
Male    285

The counts are fairly close (247 female and 285 male characters), so there’s no large imbalance between male and female characters. There is a rise in women’s dialogue towards the end of the show, which could be attributed to Frasier’s romance with Julia Wilcox in seasons 10 and 11 or to his desire to settle down.
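For anyone who wants to double-check a table like the one above outside of R, here’s a rough sketch of counting distinct characters per gender. It assumes a hypothetical two-column character,gender input with one row per spoken line and NA for unknown genders; the unique_by_gender name and the sample rows are made up, and it does no gender inference of its own.

```shell
# Hypothetical helper: read "character,gender" rows (one per spoken line) on
# stdin and print "gender count" for the distinct characters of each gender.
unique_by_gender() {
  # sort -u collapses repeated lines so each character is counted once;
  # rows whose gender is NA are skipped.
  sort -u \
    | awk -F',' '$2 != "NA" { n[$2]++ } END { for (g in n) print g, n[g] }' \
    | sort
}

printf '%s\n' 'Frasier,male' 'Roz,female' 'Frasier,male' 'Niles,male' 'Caller,NA' \
  | unique_by_gender
```

With the sample rows this prints one female and two male characters, since Frasier’s repeated line is only counted once and the NA row is dropped.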

Analyzing Frasier:

  1. Part One
  2. Part Two
  3. Part Three