Sentiment Analysis of Frasier

By Chip Oglesby

Previously:

If you’ve been following along with my previous posts, you know that I’ve been working on prepping data from the television show Frasier. That involved a few different ways of collecting and shaping the data, including cleaning it in Bash and scraping and augmenting it in R.

Once I had the data ready, I was able to dive in. In the first part of my analysis, I used the subtools package in R along with data from IMDB.com to create a time-series sentiment analysis. Originally I thought I would do this with tidytext, but I found that the package’s unigram approach didn’t really capture the full picture of each sentence. The sentimentr package proved to be a better fit for what I was going for. You can check it out here.
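To give a sense of why sentence-level scoring mattered, here’s a minimal sketch of the sentimentr approach. The example lines below are just illustrative strings, not rows from my transcript data:

library(sentimentr)

lines <- c("I'm listening.",
           "Oh dear God, this is not a good day at all.")

# Score each sentence, with negators and amplifiers taken into account,
# which is exactly what a unigram-only lexicon join misses.
sentiment(get_sentences(lines))

# Or average the sentence scores back up to one value per line of dialogue.
sentiment_by(get_sentences(lines))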

In the second analysis, I tackled more detailed information from the transcripts at kacl780.net. I augmented those transcripts with data from IMDB and the gender package in R. There’s more information available than I actually have time for at this point, so I’m hoping I can revisit it in the future.
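For the gender piece, the package takes a vector of first names and returns a predicted gender based on historical birth-record data. Here’s a quick sketch using a few names from the show rather than the actual transcript table:

library(gender)

# Predict gender from first names using the U.S. Social Security data ("ssa").
gender(c("Daphne", "Martin", "Roz"), method = "ssa")

The result includes the predicted gender along with the underlying proportions, which can then be joined back to a cast or character list by first name.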

I’m also hoping I can combine some of the info that I didn’t use into a shiny app that will present it in a nice tidy format.

If you remember from my very first post, I really just wanted to duplicate what I saw on reddit, ‘The Office’ Characters’ Most Distinguishing Words, with data from Frasier.

The person who posted that analysis didn’t include links to their code or methodology, which makes reproducing it pretty difficult. Here’s what I came up with on my own in R:

library(dplyr)
library(tidytext) # provides the stop_words data frame

tidyTranscripts %>%
  # keep only the five main characters
  filter(characterType == 'main') %>%
  # drop common stop words ("the", "and", etc.)
  anti_join(stop_words) %>%
  # n = how many times each character says each word
  count(character, word, sort = TRUE) %>%
  # totalTimesSaid = how many times anyone says the word
  group_by(word) %>%
  mutate(totalTimesSaid = sum(n)) %>%
  ungroup() %>%
  # the character's share of the word's total uses, weighted by their raw count
  mutate(percentage = n / totalTimesSaid * n) %>%
  # characterSpoken = number of distinct word rows for each character
  group_by(character) %>%
  mutate(characterSpoken = n()) %>%
  ungroup() %>%
  # anyoneSpoken = total word rows; this boosts characters with fewer rows
  mutate(anyoneSpoken = n(),
         uniquenessOfWord = percentage * anyoneSpoken / characterSpoken) %>%
  # credit each word to the single character who uses it most distinctively
  group_by(word) %>%
  top_n(1, uniquenessOfWord) %>%
  ungroup() %>%
  # keep each character's top five words
  group_by(character) %>%
  top_n(5, uniquenessOfWord) %>%
  ungroup() %>%
  select(character, word, uniquenessOfWord) %>%
  arrange(character, desc(uniquenessOfWord))

Which gives us:

character  word     uniquenessOfWord
Daphne     crane    1943.0577
Daphne     dr       1857.6753
Daphne     mum      479.5264
Daphne     bloody   238.4252
Daphne     nice     168.8845
Frasier    niles    4413.9584
Frasier    dad      3442.1692
Frasier    roz      2905.5846
Frasier    god      1102.0595
Frasier    time     861.8832
Martin     yeah     2141.1908
Martin     hey      1785.0858
Martin     fras     853.3099
Martin     eddie    770.8073
Martin     guys     625.8700
Niles      frasier  1445.4581
Niles      maris    762.1647
Niles      wait     227.8986
Niles      mel      136.2381
Niles      maris’s  120.6549
Roz        alice    348.1129
Roz        line     217.9855
Roz        martin   135.2960
Roz        roger    125.1669
Roz        weird    102.8421

Now this is pretty close to the point of the original reddit analysis, but it was a bit of a challenge since some characters share the same words. For example, Frasier and Niles both say “dad” a lot, but Frasier says it about twice as often as Niles, so I simply gave each word to the character who says it most distinctively. Daphne, for her part, says “Dr. Crane” a lot, referring to either brother, but that’s hard to see since we’re using unigrams instead of bigrams; you might consider removing one or the other of those references, or tokenizing by bigrams instead, as sketched below.
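If you wanted to go the bigram route, here’s roughly what that would look like with tidytext. I’m assuming a data frame of raw lines with character and rawText columns; those column names are my guess, so adjust them to whatever your transcript table actually uses:

library(dplyr)
library(tidytext)

transcripts %>%
  # two-word tokens; unnest_tokens lowercases and strips punctuation,
  # so "Dr. Crane" becomes the bigram "dr crane"
  unnest_tokens(bigram, rawText, token = "ngrams", n = 2) %>%
  filter(bigram == "dr crane") %>%
  count(character, bigram, sort = TRUE)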

There’s still so much more that can be done with this data; I’m just going to work on it in small chunks as I go.