# Using Spline Interpolation in SQL to Analyze Sparse Data

#### Spline Interpolation using SQL

Splines, piecewise polynomials segmented by discrete points, are known to be a good approximation for many real-world scenarios. Data scientists often use spline interpolation to produce smooth graphs and estimate missing values by “filling in” the space between discrete points of data.

Consider the following graph, which shows the temperature every 4 hours in Yosemite Park in early July.

The graph appears jagged as we have data at 4 hour intervals. Spline interpolation can be used to fill in the areas between points while producing a much smoother graph:

Here we explore how to interpolate the data with spline interpolation using just SQL. We make use of Common-Table Expressions (CTEs) and time windowing functions.

**The Method**

Spline interpolation produces a piecewise cubic polynomial function for each pair of points. These distinct polynomial functions join smoothly at each data point. This is distinct from a rolling average, where the line rarely intersects a data point. In practice, a rolling average is good to smooth high variance data, while spline interpolation is used to fill in sparse data. This wikipedia article on spline interpolation is a good introduction to the topic, and we’ll be referring to the various equations in that article here.

Our primary task is to solve for the variables in 3 equations - equation 15 constrains the polynomial functions to be smooth at each data point, and equations 16 and 17 constrain the curvature at the end points to be straight lines.

As the article shows, these 3 equations can be represented in matrix form. Conveniently, the coefficient matrix is a tridiagonal matrix which can be solved using the Thomas Algorithm, as explained here.

The interpolation algorithm normally uses the entire data set to determine the polynomial functions. In our implementation, however, we consider a subset of 4 nearby data points when determining the coefficients of each polynomial (also called windowing).

The resulting matrices for our study (n=4) are useful for explanatory purposes and produce the smooth graphs we are looking for, but you may find the need to increase the window size when using real world data.

**The Source Data**

To begin, we first download temperature data from the National Oceanic and Atmospheric Administration (NOAA) website, cut it down to the first 2 days of July 2017, and insert into a table as follows:

create tabletemperatures (hourinteger, temperaturedouble precision);insert intotemperatures (hour, temperature)values

(1,19.5),

(2,19.5),

(3,19.7),

…

(176,14.6);

In order to simulate a rough sampling, we extract the values every 4 hours:

withxyas(

select

cast(hourasdouble precision)asx

, temperatureasyfromtemperatureswhere

cast(hourasinteger) % 4 = 0order by

hourlimit

12

)

This is what we plot to get the graph in fig. 1 above.

**Generating the Series**

In order to interpolate we need additional points in the x-axis to produce a higher resolution. We join with our original data set so we can see the actual data set. We use a * recursive* function here to add ad a point for each 1/10 of an hour.

with recursive

, x_series(input_x)as(

select

min(x)

from

xy

union all

select

input_x + 0.1

from

x_series

where

input_x + 0.1 < (select max(x)fromx)

)

, x_coordinateas(

select

input_x

,max(x)over(order byinput_x)asprevious_x

from

x_series

left join

xyon abs(x_series.input_x - xy.x) < 0.001

)

**Preparing the Input Variables - The 4 Points in Each Window**

As mentioned, we will use a window of 4 data points around each data point. This is easy to do by creating columns for the (x, y) coordinate for each of the 4 data points by leveraging the * lag* and

*functions.*

**lead**fullxyas(

select

x

, y

,lag(x, 1)over(order byx)asx0

,lag(y, 1)over(order byx)asy0

, xasx1

, yasy1

,lead(x, 1)over(order byx)asx2

,lead(y, 1)over(order byx)asy2

,lead(x, 2)over(order byx)asx3

,lead(y, 2)over(order byx)asy3

from

xy

order by

x

)

This produces a table like the following:

As we can see, for each point (x, y), we also duplicate the (x, y) coordinates of the point before it (x0, y0) and the two points following it (x2, y2) and (x3, y3), for a total of 4 data points per row.

**Producing the Tridiagonal Coefficients**

, tridiagonalas(

select*

, 3*(y1-y0) /power(x1-x0, 2)asd0

, 3*((y1-y0)/(power(x1-x0, 2)) + (y2-y1)/(power(x2-x1, 2)))asd1

, 3*((y2-y1)/(power(x2-x1, 2)) + (y3-y2)/(power(x3-x2, 2)))asd2

, 3*(y3-y2) /power(x3-x2, 2)asd3

, 1 / (x1-x0)asa1

, 1 / (x2-x1)asa2

, 1 / (x3-x2)asa3

, 2 / (x1-x0)asb0

, 2 * (1/(x1-x0) + 1/(x2-x1))asb1

, 2 * (1/(x2-x1) + 1/(x3-x2))asb2

, 2 / (x3-x2)asb3

, 1 / (x1-x0)asc0

, 1 / (x2-x1)asc1

, 1 / (x3-x2)asc2

fromfullxyorder byx)

Now, for each row, we have the necessary *a*, *b*, *c* and *d* vectors for the tridiagonal matrix.

**Finding the Coefficients for the Tridiagonal Matrix Using the Thomas Algorithm**

The Thomas Algorithm can now be used to solve the tridiagonal matrix we created. The algorithm uses a series of 'sweeps' across the matrix to find a solution. First there is a forward sweep, where the c prime and d prime values are calculated, and then a backwards substitution, where the solution slope vector is solved. We unroll the calculation into distinct sweeps and solve in a series of CTEs behaving like nested subqueries.

forward_sweep_1as(

select

*

, c0 / b0asc_prime_0

, c1 / (b1 - a1 * c0 / b0)asc_prime_1

, c2 / (b2 - a2 * (c1 / (b1 - a1 * c0 / b0)))asc_prime_2

, d0 / b0 as d_prime_0

, (d1 - a1 * d0 / b0) / (b1 - a1 * c0 / b0)asd_prime_1

from

tridiagonal

)

, forward_sweep_2as(select

*

, (d2 - a2 * d_prime_1) / (b2 - a2 * c_prime_1)asd_prime_2

, (d3 - a3 * ((d2 - a2 * d_prime_1) / (b2 - a2 * c_prime_1)))

/ (b3 - a3 * c_prime_2)asd_prime_3

from

forward_sweep_1

)

, backwards_substitution_1as(

select

*

, d_prime_3ask3

, d_prime_2 - c_prime_2 *(d_prime_3)ask2

, d_prime_1 - c_prime_1 * (d_prime_2 - c_prime_2 *(d_prime_3))ask1

from

forward_sweep_2

)

, backwards_substitution_2as(select

*

, d_prime_0 - c_prime_0 * k1ask0

from

backwards_substitution_1

)

**Determining the Polynomial Coefficients**

Now, we’re ready to determine the polynomial coefficients *a* and *b* (equations 10 & 11)

, coefficientsas(

select

x, y, x0, y0, x1, y1, x2, y2, k0, k1, k2, k3

k1 * (x2-x1) - (y2-y1)asa,

-k2 * (x2-x1) + (y2-y1)asbfrom

backwards_substitution_2

)

**Final Step: Calculating the Temperature at Each Interpolated Point**

Here we calculate *t* (equation 2) and output *y* values (equation 1).

, interpolated_tableas(

select

*

, (input_x - x1) / (x2 - x1)ast

from

x_coordinate,

coefficients

where

abs(coefficients.x1 - x_coordinate.previous_x) < 0.001

)

, final_tableas(

select

*

, (1-t) * y1 + t*y2 + t*(1-t)*(a*(1-t)+b*t)asoutput_y

from

interpolated_table

order byinput_x

)

The following is the final * select* clause.

*output_y*contains the interpolated values. We also include the original (

*x*,

*y*) coordinate as well as

*scatter_y*so we can display the original data points on a scatter plot.

selectround(cast(input_xas numeric), 2)asinput_x

, x

, y

,round(cast(output_y as numeric), 2)asoutput_y

,case

whenabs(input_x - x) < 0.01theny

else null

end asscatter_yfrom

final_tableorder by

x

The output is shown below:

**Final Considerations**

And there you have it! Using only SQL, we have used spline interpolation to generate smooth graphs. For the sake of brevity, we excluded data points on the left and right most of the data set but they can be added back into your analysis. Using splines for your graphs makes time series easier to read and pushes the limits of SQL.

Happy scripting!