About three years ago, I received a letter in the mail from Nielsen inviting me to participate in one of their panels. After spending a while on the phone with a representative to determine that it wasn’t a scam, I figured I’d give it a go. I tend to take great interest in knowing where data come from (especially when reporters and media sources try to use statistics to make a point), and as an avid tv watcher, it was cool to learn more about how Nielsen generates ratings and estimates program viewership. My randomly sampled “household” would represent thousands of similarly characterized “households” - white male, early 20s, unmarried, lives alone, etc. - and as long as I kept a small meter in my pocket, my televison-watching habits would contribute to “the ratings” (in retrospect, I’m sure I watched way more RuPaul’s Drag Race and Project Runway than the average early 20s male…👀).
After completing the panel, I’ve been more curious about how companies use Nielsen data to inform decision making (i.e. renewing/canceling shows, optimizing time slots), and about how the data are shared publicly (snippets of ratings information get shared on Wikipedia, on blogs, etc). Since I’ve been wanting to start blogging more, I thought it would be fun to write my first post about how to wrangle and analyze TV ratings data in R!
Finding and tidying data
I happened to come across this blog, which has a wealth of data on tv/movie ratings, such as weekly reports on the viewership of every program on a major broadcast network (here’s an example of one such report, which I’ll try to download and tidy).
In order to read in the PDF text in R, I decided to use
pdftools. There are a few other packages out there that do a similar job, like
tabularize, but I found
pdftools to be the simplest and easiest to use. Given the formatting of the table (variable names and values containing spaces, nothing cleanly or consistently delimited), there was no clean and simple way to turn the text into a data frame, so I had to use some clever regular expressions to turn this:
” Complete Nielsen Ratings: Jan. 29 - Feb 4, 2018\nP2+ R A18-34 R A18-49 R A25-54 R Day Time Net Program P2+ * A18-34 A18-49 A25-54 P2 +/-\n 19 5 8 11 Mon 8:00 ABC The Bachelor 6.36 1.5 1.7 2.0\n 32 45 49 47 Mon 10:00 ABC The Good Doctor [R] 5.00 0.5 0.8 1.1 -21%\n 12 22 15 10 Mon 8:00 CBS Kevin Can Wait 7.28 0.6 1.2 2.0\n 17 37 21 13 Mon 8:30 CBS Man with a Plan 6.64 0.5 1.1 1.8 -9%\n 25 42 30 18 Mon 9:00 CBS Superior Donuts 5.80 0.5 1.0 1.7 -13%\n 34 52 40 32 Mon 9:30 CBS 9JKL 4.82 0.4 0.9 1.4 -17%\n 24 51 38 29 Mon 10:00 CBS Scorpion 5.82 0.4 0.9 1.4 21%\n”
into something nice and tidy like this!
latest_date = "2018-01-29" get_ratings(latest_date)
While the full code for the function
get_ratings can be found in this Github repo, I’ll explain a little about the regular expressions I used. Let’s use this one line of text as an example:
" 32 45 49 47 Mon 10:00 ABC The Good Doctor [R] 5.00 0.5 0.8 1.1 -21%"
First, I identified some groups of elements to extract together. Since I have a fairly limited knowledge of regex semantics, the process of extracting elements from these lines was made significantly easier by using the
rebus package in R. Here’s a flow of how the
rebus expressions I wrote translated as regular expressions, followed by the elements extracted:
rebus 🙏🏻. Next, I removed all of the elements above from the string to leave the network and program name, which I then split up into separate elements. All in all, it took a lot of trial and error to figure out the best regular expressions patterns to use, and while this may not be the prettiest or most efficient way to extract the data, it got the job done!
Visualizing the data
Once the data were tidy, the fun could begin! First, I wanted to see what the 20 most viewed programs were last week. I decided to exclude any reruns of shows to keep the list about new TV content.
data %>% filter(rerun == FALSE) %>% top_n(n=20,wt = viewers_millions) %>% select(network,program,viewers_millions,P2_ranking) %>% ggplot()+geom_bar(aes(x = reorder(program,viewers_millions),y = viewers_millions,fill=network),stat = 'identity')+coord_flip()+ labs(x = "TV Show", y = "Viewers (millions)", title="Most Watched Programs on Broadcast TV from January 29-February 4, 2018") + theme_minimal() + scale_y_continuous(expand=c(0,0))
Unsurprisingly, the Super Bowl and its Post Game commentary dominated the weekly ratings, and This Is Us probably benefitted a great deal from the post-Super Bowl time slot. CBS seems to have a lot of popular shows, while Fox and the CW didn’t even make the list. Unfortunately, data from each network’s airing of the State of the Union were not included in the PDF, so they did not make the cut either (though contrary to what one man thinks, this was not the most watched SOTU ever).
Next, I was curious to see which network received the highest average number of views on a daily basis. I decided to just look at Monday-Saturday, since we all know what everyone was watching this past Sunday…
data %>% group_by(network,weekday) %>% filter(!weekday == "Sun") %>% summarise(avgviews = mean(viewers_millions,na.rm=TRUE)) %>% ggplot()+geom_line(aes(x = weekday,y = avgviews,group=network,col=network))+ scale_x_discrete(limits = c("Mon","Tue","Wed","Thu","Fri","Sat"))+ labs(y = "Average Number of Viewers (millions)",title = "Average Daily Number of Viewers by Network")+theme_minimal() + scale_y_continuous(expand=c(0,0))
It looks like the CW consistently underperforms in ratings compared to the other major broadcast networks. Also, Thursday night is clearly the best night for ABC (Shondaland, anybody??🍷🍿).
Lastly, I was wondering how viewership and ratings have changed over the past month. By defining a vector of dates over the past month when the PDFs were posted, and then using the
map_df function in the
purrr package, I was able to pretty easily gather all of the past month’s data in a single data frame. Again, I restricted the rankings to only show new content, and I compiled lists of the 10 most viewed programs by week (I used
tidyr to distribute the resulting names across four columns, which formed a neat table to display!)
dates = as.Date(latest_date)-7*c(0:3) dates %>% map_df(~get_ratings(.)) %>% filter(rerun == FALSE) %>% group_by(start_of_week) %>% top_n(n=10,wt = viewers_millions) %>% arrange(desc(start_of_week),desc(viewers_millions)) %>% select(program,start_of_week) %>% ungroup() %>% bind_cols(ranking = rep(1:10,4)) %>% spread(start_of_week,program) %>% kable("html") %>% kable_styling()
|1||NFL Playoff Overrun||NFC Championship Game: PHI v MIN||60th Grammy Awards||Super Bowl LII: PHI v NE|
|2||NFL Div. Playoff: TEN v NE||NFC Post-Game||NCIS||SB Post Game|
|3||The OT||NFC Trophy||BULL||This Is Us|
|4||The Big Bang Theory||The Big Bang Theory||The Good Doctor||The Big Bang Theory|
|5||NCIS||Young Sheldon||This Is Us||Young Sheldon|
|6||Young Sheldon||This Is Us||NCIS: New Orleans||Blue Bloods|
|7||BULL||Blue Bloods||Grey's Anatomy||Mom|
|8||Blue Bloods||The Good Doctor||Ellen's Game of Games||Grey's Anatomy|
|9||This Is Us||MOM||Kevin Can Wait||Hawaii Five-O|
|10||MOM||Hawaii Five-O||Grammy Red Carpet||Ellen's Game of Games|
Again, no shock that Americans looooooove their football… 🏈
So there you have it! I had a lot of fun exploring these data, and there are tons of other visualizations and analyses that these data can be used for. Again, the code with the function to obtain the data can be found on Github - I would love to see what else you can come up with while using it! 😬🎉