Periscope Data
The world’s fastest visualization platform for data analysts.

The High-Performance SQL Blog

IMDb vs RottenTomatoes Ratings with SQL Trendlines

IMDb vs RottenTomatoes Ratings with SQL Trendlines
July 6, 2016

In this post, we’ll explore how to calculate trendlines from a scatter plot of data. Sometimes the trend is obvious, but other times not so obvious. Calculating the precise linear trendline can give us a good idea of how one variable correlates with another. We can use SQL operations to do this!

Let’s work through creating the following chart comparing IMDb and RottenTomatoes ratings.

First, consider the dataset of the top 50 movies by domestic gross (not adjusted for inflation) from each year since 1980.

select * from movie_data order by gross_dollars desc limit 10;

We have some interesting data. Let’s look at the ratings.

Finding the Correlation Between IMDb and Tomatometer Ratings

How do IMDb and Tomatometer ratings relate? We’d like to understand the relationship between IMDb ratings and the RottenTomatoes Tomatometer. Does a movie rating on one scale translate directly to the other, or does one of the databases skew towards a different scale?

select
  title
  , tomatometer
  , imdbrating * 10 as imdbrating_n
from
  movie_data
order by
  tomatometer asc

This query returns each title with its Tomatometer and IMDb ratings. IMDb ratings are between 0.0 and 10.0, so we scale it by 10 to normalize to the Tomameter scale from 0 to 100. With Periscope Data, we can easily take this tabular data and build a scatter plot.

It appears the IMDb ratings correlate linearly with Tomatometer scores. With the help of easy calculations guide, we can calculate the the slope and intercept of the least square regression line with a few lines of SQL.

with
  -- define the dataset with X and Y columns
  dataset as (
  select
    title
    , tomatometer as X
    , imdbrating * 10 as Y
  from
    movie_data
  order by
    tomatometer asc
  )
  -- calculate all the primary values
  , terms as (
    select
      count(*) as cnt
      , sum(X) as sum_X
      , sum(Y) as sum_Y
      , sum(X * Y) as sum_X_times_Y
      , sum(X * X) as sum_X_squared
    from
      dataset
  )
  -- use the primary values to calculate the slope
  , slope as (
    select
      (
        cnt * sum_X_times_Y - sum_X * sum_Y
      )
      / (cnt * sum_X_squared - sum_X * sum_X) as slope
    from
      terms
  )
  -- use the primary values and slope to calculate the intercept
  , intercept as (
    select
      (
        terms.sum_Y - slope.slope * terms.sum_X
      )
      / terms.cnt as intercept
    from
      terms
      , slope
   )
-- select the slope and interface
select
  *
from
  slope
  , intercept

Results:

Cool! Now we know the parameters. It appears that each IMDb rating earns about 49 points on the Tomatometer for free, but for every point increase on the Tomatometer, the IMDb rating increases just .27 points. This means movies with low Tomatometer scores tend to be overrated on IMDb and those with high scores tend to be underrated on IMDb. We can calculate the point at which the trend goes from overrated to underrated by finding the point of intersection of the trendline and the line y = x. This comes to about 68 points on the Tomatometer.

Instead of just returning the slopes and intercept, we can alter the query to plot points on the trendline by replacing

select
  *
from
  slope
  , intercept

with

select distinct
  tomatometer
  , (tomatometer * slope) + intercept as imdbrating_predicted
from
  movie_data
  , slope
  , intercept
order by
  tomatometer asc

Finally, we can overlay the trendline on the scatter plot.

How Have the Ratings of the Top Grossing Movies Changed Over Time?

Now that we have all the tools we need to build trendlines with SQL, let’s try to answer one more question: how have the ratings of the top grossing movies in each year changed over time? First, we define a query to get the top 15 grossing movies for each year.

with
  ranks as (
    select
      DATE_PART('year', year)
        - DATE_PART('year', '1980-01-01'::date)
          as years_since_1980
      , title
      , gross_dollars
      , tomatometer
      , rank() over(partition by year order by gross_dollars desc)
        as rank
    from
      movie_data
  )
select
  years_since_1980
  , title
  , gross_dollars
  , tomatometer
from
  ranks
where
  rank <= 15
order by
  years_since_1980 asc

Replace the dataset of the early calculation with the output of this query and you get the following trendline.

This shows the average rating of the top 15 films has not changed over time. There’s virtually no correlation. With these two examples, we’ve shown how SQL can be used compute trendlines to analyize your data.

Haven't tried Periscope Data yet?
Start a trial and we’ll send you one of our famous coffee mugs.
Read More
Haven’t tried Periscope Data yet?