## The High-Performance SQL Blog

# IMDb vs RottenTomatoes Ratings with SQL Trendlines

IMDb vs RottenTomatoes Ratings with SQL TrendlinesIn this post, we’ll explore how to calculate trendlines from a scatter plot of data. Sometimes the trend is obvious, but other times not so obvious. Calculating the precise linear trendline can give us a good idea of how one variable correlates with another. We can use SQL operations to do this!

Let’s work through creating the following chart comparing IMDb and RottenTomatoes ratings.

First, consider the dataset of the top 50 movies by domestic gross (not adjusted for inflation) from each year since 1980.

select*frommovie_dataorderbygross_dollarsdesclimit10;

We have some interesting data. Let’s look at the ratings.

### Finding the Correlation Between IMDb and Tomatometer Ratings

How do IMDb and Tomatometer ratings relate? We’d like to understand the relationship between IMDb ratings and the RottenTomatoes Tomatometer. Does a movie rating on one scale translate directly to the other, or does one of the databases skew towards a different scale?

select

title

, tomatometer

, imdbrating*10asimdbrating_n

from

movie_data

orderby

tomatometerasc

This query returns each title with its Tomatometer and IMDb ratings. IMDb ratings are between 0.0 and 10.0, so we scale it by 10 to normalize to the Tomameter scale from 0 to 100. With Periscope Data, we can easily take this tabular data and build a scatter plot.

It appears the IMDb ratings correlate linearly with Tomatometer scores. With the help of easy calculations guide, we can calculate the the slope and intercept of the least square regression line with a few lines of SQL.

with

-- define the dataset with X and Y columns

datasetas(

select

title

, tomatometerasX

, imdbrating*10asY

from

movie_data

orderby

tomatometerasc

)

-- calculate all the primary values

, termsas(

select

count(*)ascnt

,sum(X)assum_X

,sum(Y)assum_Y

,sum(X*Y)assum_X_times_Y

,sum(X*X)assum_X_squared

from

dataset

)

-- use the primary values to calculate the slope

, slopeas(

select

(

cnt*sum_X_times_Y-sum_X*sum_Y

)

/(cnt*sum_X_squared-sum_X*sum_X)asslope

from

terms

)

-- use the primary values and slope to calculate the intercept

, interceptas(

select

(

terms.sum_Y-slope.slope*terms.sum_X

)

/terms.cntasintercept

from

terms

, slope

)

-- select the slope and interface

select

*

from

slope

, intercept

Results:

Cool! Now we know the parameters. It appears that each IMDb rating earns about 49 points on the Tomatometer for free, but for every point increase on the Tomatometer, the IMDb rating increases just .27 points. This means movies with low Tomatometer scores tend to be overrated on IMDb and those with high scores tend to be underrated on IMDb. We can calculate the point at which the trend goes from overrated to underrated by finding the point of intersection of the trendline and the line y = x. This comes to about 68 points on the Tomatometer.

Instead of just returning the slopes and intercept, we can alter the query to plot points on the trendline by replacing

select

*

from

slope

, intercept

with

selectdistinct

tomatometer

, (tomatometer*slope)+interceptasimdbrating_predicted

from

movie_data

, slope

, intercept

orderby

tomatometerasc

Finally, we can overlay the trendline on the scatter plot.

### How Have the Ratings of the Top Grossing Movies Changed Over Time?

Now that we have all the tools we need to build trendlines with SQL, let’s try to answer one more question: how have the ratings of the top grossing movies in each year changed over time? First, we define a query to get the top 15 grossing movies for each year.

with

ranksas(

select

DATE_PART(,'year'year)

-DATE_PART(,'year'::date)'1980-01-01'

asyears_since_1980

, title

, gross_dollars

, tomatometer

, rank() over(partitionbyyearorderbygross_dollarsdesc)

asrank

from

movie_data

)

select

years_since_1980

, title

, gross_dollars

, tomatometer

from

ranks

where

rank<=15

orderby

years_since_1980asc

Replace the dataset of the early calculation with the output of this query and you get the following trendline.

This shows the average rating of the top 15 films has not changed over time. There’s virtually no correlation. With these two examples, we’ve shown how SQL can be used compute trendlines to analyize your data.