Geographic Analysis in SQL: Measuring Polygon Area from Latitude and Longitude

Computing the area of polygons on the Earth's surface can be difficult, especially when the polygons are complex and have a large number of vertices. Here is a pure SQL solution to help you get started with your geographic analysis.

Read More
SQL Text Analysis with Donald Trump’s Tweets

There were many unique aspects of the 2016 U.S. presidential election, with one of the most interesting being Donald Trump’s use of Twitter. As luck would have it, tweets happen to be very easy to extract and put into relational database format.

Read More
Tracking KPIs using SQL for Effective Revenue Growth Analysis

Tracking revenue is crucial to understanding the health of your business. Whether your long-term growth is declining, has plateaued, or is heading upwards helps you define your long term strategy and goals.

Read More
Calculating Proportional Values in SQL

SQL is great for aggregating: counting, summing, and averaging numbers are straightforward tasks in SQL. When looking at a distribution, it is often useful to look at them in context and compare the relative size of each part of the proverbial pie.

Read More
IMDb vs RottenTomatoes Ratings with SQL Trendlines

In this post, we’ll explore how to calculate trendlines from a scatter plot of data. Sometimes the trend is obvious, but other times not so obvious.

Read More
Cohorted Engagement: 7-day and 30-day Metrics

When tracking user growth, it is helpful to look at multiple time periods. In a previous post, Cohorted Engagement with Comparable Time Windows in SQL, we covered creating charts of the form: “How many times does a user play our game in the first 30 days, and how has this varied for the users that started in January, February, and so

Read More
Calculating Distance between Data Centers on a Globe

Calculating distance between two points on a flat plane is straightforward thanks to the Pythagorean theorem. However, what if your plane isn’t flat? What if you need to find the distance between two points on a sphere, like the earth?

Read More
Understanding Outliers with Skew and Kurtosis in SQL

When you approach a distribution of data for the first time, it’s often helpful to pull out summary statistics to understand the domain of the data.

Read More
Predicting Churn using Hypothesis Testing

When dealing with massive quantities of customer data, it can be difficult to answer simple questions like: is a customer going to churn or not?

Read More
Coalesce Down using Correlated Subqueries

Filling in gaps in your dataset usually takes one of two forms. You can interpolate, where intermediate points are generated based on their neighbors, or you can carry values forward.

Read More
Impact or No? -- Simple Event Study in SQL Part I

Timeseries charts can be quite powerful for depicting changes of a metric over time.

Read More
Portfolio Risk Analysis in SQL

Any investment involves some amount of risk. Different people have different amounts of risk that they’re willing to take on when investing. We will do risk analysis of an investment portfolio in SQL in this post.

Read More
Calculating Expected Value vs. Actual Results for Super Bowl Contenders

With the Super Bowl quickly approaching, what better time than now to explore expected value and results in SQL!

Read More
Binomial Options Pricing in SQL

Trees are a great way to visualize different states of a process when measuring uncertainty and making predictions.

Read More
Finding Nearest Neighbors in SQL

It has been almost twenty years since Major League Baseball added a new team. Now there’s talk of one being added in the West.

Read More
Outlier Detection in SQL

Inevitably, the unexpected happens. A historically low-traffic channel brings in 10x the normal amount of users. Or your user login rate drops by half.

Read More
Multi-dimensional Clustering Using K-Means in Postgres SQL

In our previous blog post on one dimensional clustering, we used a known distance between two points to cluster the data in one dimension. However, data can be more complicated in many cases and may need to be clustered using multiple dimensions.

Read More
One Dimensional Clustering in Postgres

Many events tables contain low level information that isn’t always easy to reason about. In this post we’ll cluster low level delivery event data into trips, making further analysis much easier.

Read More
Adding Summary Statistics to your SQL Table

At Periscope Data we continuously help people build, diagnose, and update their data. When approaching a table of data for the first time, it helps our understanding to see summarizing statistics.

Read More
Calculating Significance of A/B Tests in Redshift

“A/B test” and “statistically significant” have quickly become part of the standard business vocabulary as running tests has become more popular in growing businesses.

Read More
Solving the Traveling Salesman Problem with Postgres Recursive CTEs

Many SQL implementations don’t have loops, making some kinds of analysis very difficult. Postgres, SQL Server, and several others have the next best thing — recursive CTEs!

Read More
Extrapolating Data with Day-of-Week Effects

At Periscope, we need to plan ahead to make sure we have enough server capacity. This means predicting future server loads.

Read More
Medians in SQL

Finding the min, max, and avg price of an order in your purchases table is easy. But what about the median price?

Read More
Estimating Conversions with Poisson Distributions

Following up on last week’s post on normal distributions, this week we’re going to show off another common probability distribution.

Read More
Beyond Random() — Normal Distributions in SQL

If you’ve ever had to generate sample data in your database, odds are you’ve started with random() or its equivalent. It’s fast and it’s reliable. Unfortunately, it’s also evenly distributed.

Read More
Use Subqueries and Window Functions to Compute Running Averages

By now, we’re all familiar with the ubiquitous Daily Active Users metric. It’s a wonderful, concrete way to get a sense of your engagement.

Read More
Predicting Exponential Growth with SQL

SQL is great at grouping and counting the data you already have, and with a little help from regressions, SQL can help you project that data into the future.

Read More
HyperLogLog in Pure SQL

Earlier we showed how to make count distinct 50x faster with subqueries. Using probabilistic counting we’ll make count distinct even faster, trading a little accuracy for the increase in speed.

Read More
How to Calculate Confidence Intervals in SQL

Imagine you have a small online business. This month 200 users signed up on your website, and 10 of them bought your $800 service. Great! You’ve made $8k of income. How much should you expect to make this year?

Read More