Sentiment Analysis of Frasier

Previously:

If you’ve been following along with my previous posts, you know that I’ve been prepping data from the television show Frasier. Collecting that data took a few different approaches, including cleaning it in Bash and scraping and augmenting it in R.

Once I had the data ready, I was able to dive in. In the first part of my analysis, I used the subtools package in R along with data from IMDB.com to create a time-series sentiment analysis. Originally I thought I could do this with tidytext, but I found that scoring only unigrams with that package didn’t capture the full picture of each sentence. The sentimentr package, which scores whole sentences, proved more useful for what I was going for. You can check it out here.
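For a rough idea of the difference, here’s a minimal sketch. It is not the code from that post: the example sentence is made up, and I’m assuming the Bing lexicon that ships with tidytext for the unigram side. The unigram approach only sees the word “happy”, while sentimentr also takes the “not” in front of it into account.

```r
library(dplyr)
library(tidytext)
library(sentimentr)

# Hypothetical example line, not taken from the transcripts
line <- tibble(text = "I am not happy about this.")

# Unigram approach: look up each word on its own in the Bing lexicon
unigram_hits <- line %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word")

# Sentence-level approach: sentimentr adjusts for valence shifters such as "not"
sentence_score <- sentiment(get_sentences(line$text))

unigram_hits
sentence_score
```

The unigram lookup flags the sentence as positive because of “happy”, while the sentence-level score comes out negative.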

In the second analysis I tackled more detailed information from the transcripts at kacl780.net. I augmented those transcripts with data from IMDB and the gender package in R. There’s more information available than I have time for right now, so I’m hoping to revisit it in the future.

I’m also hoping I can combine some of the info that I didn’t use into a shiny app that will present it in a nice tidy format.

If you remember from my very first post, I really just wanted to duplicate something I saw on reddit, ‘The Office’ Characters’ Most Distinguishing Words, using information from Frasier.

The person who originally posted it didn’t include links to their analysis, which makes reproducing it pretty difficult. Here’s what I came up with on my own in R:

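library(dplyr)
library(tidytext)  # provides the stop_words data set

tidyTranscripts %>%
  filter(characterType == 'main') %>%
  # drop common stop words
  anti_join(stop_words, by = "word") %>%
  count(character,
        word,
        sort = TRUE) %>%
  # total uses of each word across all main characters
  group_by(word) %>%
  mutate(totalTimesSaid = sum(n)) %>%
  ungroup() %>%
  # character's share of the word, weighted by how often they say it
  mutate(percentage = n/totalTimesSaid * n) %>%
  # number of distinct words each character uses
  group_by(character) %>%
  mutate(characterSpoken = n()) %>%
  ungroup() %>%
  # scale by total character-word pairs relative to the character's own word count
  mutate(anyoneSpoken = n(),
         uniquenessOfWord = percentage * anyoneSpoken/characterSpoken) %>%
  # credit each word to the character with the highest score for it
  group_by(word) %>%
  top_n(1, uniquenessOfWord) %>%
  ungroup() %>%
  # keep each character's five highest-scoring words
  group_by(character) %>%
  top_n(5, uniquenessOfWord) %>%
  ungroup() %>%
  select(character, word, uniquenessOfWord) %>%
  arrange(character, desc(uniquenessOfWord))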

Which gives us:

character	word	uniquenessOfWord
Daphne	crane	1943.0577
Daphne	dr	1857.6753
Daphne	mum	479.5264
Daphne	bloody	238.4252
Daphne	nice	168.8845
Frasier	niles	4413.9584
Frasier	dad	3442.1692
Frasier	roz	2905.5846
Frasier	god	1102.0595
Frasier	time	861.8832
Martin	yeah	2141.1908
Martin	hey	1785.0858
Martin	fras	853.3099
Martin	eddie	770.8073
Martin	guys	625.8700
Niles	frasier	1445.4581
Niles	maris	762.1647
Niles	wait	227.8986
Niles	mel	136.2381
Niles	maris’s	120.6549
Roz	alice	348.1129
Roz	line	217.9855
Roz	martin	135.2960
Roz	roger	125.1669
Roz	weird	102.8421
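To make the scoring a little more concrete, here’s a toy sketch of the first part of the calculation. The counts are invented for illustration and are not from the transcripts, and it leaves out the scaling by vocabulary size that the full pipeline applies afterwards.

```r
library(dplyr)

# Hypothetical counts, invented for illustration only
toy <- tribble(
  ~character, ~word,   ~n,
  "Frasier",  "dad",   200,
  "Niles",    "dad",   100,
  "Niles",    "maris",  50
)

scored <- toy %>%
  group_by(word) %>%
  mutate(totalTimesSaid = sum(n)) %>%
  ungroup() %>%
  # share of the word's uses that belong to this character, weighted by the raw count
  mutate(percentage = n / totalTimesSaid * n) %>%
  # keep only the highest-scoring character for each word
  group_by(word) %>%
  slice_max(percentage, n = 1) %>%
  ungroup()

scored
```

With these made-up numbers, Frasier scores 200/300 * 200 on “dad” and Niles scores 100/300 * 100, so “dad” is credited to Frasier and drops out of Niles’s list.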

Now this is pretty close to the original point of the reddit analysis, but I found it a bit of a challenge since some characters share the same words. For example, Frasier and Niles both say “dad” a lot, but Frasier says it about twice as often as Niles. To get around that, I gave credit for each word only to the character with the highest score for it. Also, Daphne says “Dr. Crane” a lot, referring to both brothers, but that’s tough to see since we’re using unigrams instead of bigrams, so “dr” and “crane” show up as separate words. You might consider removing one of the two.

There’s still so much more that can be done with the data. I’m just going to work on it in small chunks as I go.
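One way to keep “Dr. Crane” together would be to tokenize into bigrams instead of unigrams. Here’s a minimal sketch using unnest_tokens from tidytext. The two lines of dialogue are made up for illustration, and this is not something I ran against the full transcripts.

```r
library(dplyr)
library(tidytext)

# Hypothetical lines, invented for illustration only
lines <- tibble(
  character = c("Daphne", "Daphne"),
  text = c("Good morning, Dr. Crane.",
           "Dr. Crane, your father is asking for you.")
)

# Split each line into overlapping two-word tokens and count them per character
bigrams <- lines %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(character, bigram, sort = TRUE)

bigrams
```

Here “dr crane” comes out as a single token at the top of the counts, so it could be scored or filtered as one phrase.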

Chip Oglesby

Data Engineer in Canton, NC | Taking it one day at a time
