In my previous post we looked at an unsuccessful way of trying to join data from the subtitles and the transcripts.
We also took a peek at counts of lines between men and women across all eleven seasons.
Since then, I’ve been able to use rvest to scrape data from IMDB and Wikipedia
and add it to the subtitles and transcripts data frames. Adding the character
names by season and episode from IMDB was particularly helpful in removing
I did hear back from the creator of the
subtools package and he gave me a
really interesting idea.
"Hi! Not sure I understand the question. You can do what you want with the timecode. It's a precious information for time series analysis" (Francois Keck, @FrancoisKeck, April 13, 2018)
When I used mutate to combine the original air date with the timecode-out value, I was able to create a new variable that would let me perform a time-series analysis using sentiment data.
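As a rough illustration of that step, here is a minimal sketch of how an air date and a timecode could be combined into a single date-time. The column names `airDate` and `timecodeOut` are assumptions made for this example (only `dateTimeOut` appears in my actual code), and the two rows are copied from the sample data shown in this post.

```r
library(dplyr)
library(lubridate)

# Two rows taken from the sample shown in this post.
# `airDate` and `timecodeOut` are hypothetical column names.
subs <- data.frame(
  id = c(90, 91),
  timecodeOut = c("00:06:03.729", "00:06:06.091"),
  airDate = c("1993-10-28", "1993-10-28")
)

# Paste the date and the timecode together, then parse the result
# as a date-time (ymd_hms accepts fractional seconds).
subs <- subs %>%
  mutate(dateTimeOut = ymd_hms(paste(airDate, timecodeOut), tz = "UTC"))

# The minute of the show can then be pulled out with lubridate::minute()
minute(subs$dateTimeOut)
```

Treating the timecode as a time of day on the air date is just a convenient trick: it gives every line a real timestamp that date-time tools can work with.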
Let’s take a look at the data I’ll use for this portion of the analysis:
|90||00:06:03.729||for once you’ll face the consequences of hanging up on callers||1||7||Call Me Irresponsible||James Burrows||Anne Flett & Chuck Ranberg||1993-10-28||27||439||7.9|
|91||00:06:06.091||what consequences||1||7||Call Me Irresponsible||James Burrows||Anne Flett & Chuck Ranberg||1993-10-28||27||439||7.9|
|92||00:06:10.108||l’m marco’s girlfriend.. excuse me, ex-girlfriend, thanks to you.||1||7||Call Me Irresponsible||James Burrows||Anne Flett & Chuck Ranberg||1993-10-28||27||439||7.9|
|93||00:06:12.639||the marco who didn’t want to commit||1||7||Call Me Irresponsible||James Burrows||Anne Flett & Chuck Ranberg||1993-10-28||27||439||7.9|
|94||00:06:17.502||you damned radio shrinks you couldn’t just tell him to stick with it!||1||7||Call Me Irresponsible||James Burrows||Anne Flett & Chuck Ranberg||1993-10-28||27||439||7.9|
We’ll perform a sentiment analysis using the
tidytext package. After the
analysis, we’ll use this data for the next step:
With the Bing lexicon, we unnest each line into its individual words and then
assign each matched word a binary value of positive or negative, like the example below:
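A minimal sketch of that unnest-and-join step, assuming a data frame with `id` and `text` columns (the name `subLines` is made up for this example; the two lines come from the sample above). It uses `unnest_tokens()` and `get_sentiments("bing")` from tidytext:

```r
library(dplyr)
library(tidytext)

# Two subtitle lines from the sample above; `subLines` is a hypothetical name.
subLines <- data.frame(
  id = c(90, 94),
  text = c("for once you'll face the consequences of hanging up on callers",
           "you damned radio shrinks you couldn't just tell him to stick with it!")
)

# The Bing lexicon: one row per word, labelled "positive" or "negative"
bing <- get_sentiments("bing")

scored <- subLines %>%
  unnest_tokens(word, text) %>%    # one row per word, lower-cased
  inner_join(bing, by = "word")    # keep only words that appear in the lexicon

scored
```

Words that are not in the lexicon simply drop out of the join, so only the words carrying a positive or negative label remain.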
After creating the
dateTimeOut variable, we can then create another variable
for each minute of the show and compute the net polarity (positive words minus negative words) for that minute.
R code for reference:
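library(dplyr)
library(tidyr)
library(lubridate)

# bing is assumed to hold the Bing lexicon, e.g. tidytext::get_sentiments("bing")
tidyFrasier %>%
  filter(season == 1, episode == 7) %>%
  inner_join(bing) %>%
  arrange(season, episode, id) %>%
  mutate(minute = as.numeric(minute(dateTimeOut)),
         sentiment = as.factor(sentiment),
         episode = as.factor(episode)) %>%
  group_by(season, episode, minute) %>%
  count(sentiment) %>%                    # word counts per minute and sentiment
  spread(sentiment, n, fill = 0) %>%
  mutate(polarity = positive - negative)  # net polarity per minute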
Now we can plot the data, giving us this chart for this particular episode:
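For anyone who wants to draw something similar, here is a hedged sketch using ggplot2. It is not necessarily the exact code behind my chart; the function name `plot_minute_polarity` is made up for this example, and it assumes a data frame with numeric `minute` and `polarity` columns like the one the pipeline above produces.

```r
library(ggplot2)

# Hypothetical helper: expects numeric columns `minute` and `polarity`.
plot_minute_polarity <- function(df, title = NULL) {
  ggplot(df, aes(x = minute, y = polarity, fill = polarity >= 0)) +
    geom_col(show.legend = FALSE) +   # one bar per minute of the episode
    geom_hline(yintercept = 0) +      # baseline between positive and negative
    labs(x = "Minute of episode",
         y = "Polarity (positive - negative)",
         title = title)
}
```

Passing the result of the pipeline above to `plot_minute_polarity()` gives one bar per minute, with the fill colour showing whether that minute came out net positive or net negative.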
There’s more that you can do with this data, but we’ll take a look at that later.
In the next post we’ll dive into the transcript data and take a deeper look there.