Automated Identification and Graphing of SQL Dependencies

Over time, your company’s data model becomes increasingly complex. Visualizing your data model can help you understand how to optimize your ETL processes, but requires time and effort to code into a graph. Here’s a way to do it in an automated fashion by using Python, GraphViz, and some clever RegEx

Read More
How to Get Started With AWS Spectrum in Minutes

AWS Redshift Spectrum is Amazon’s newest database technology, allowing exabyte scale data in S3 to be accessed through Redshift. This is not simply file access; Spectrum uses Redshift's brain, deploying workers by the thousands to order, join and aggregate your data before sending the minimum amount of data needed back to your Redshift cluster to finish your query. Learn how to get started!

Read More
Using Spline Interpolation in SQL to Analyze Sparse Data

Data scientists use spline interpolation to produce smooth graphs and estimate missing values by “filling in” the space between discrete points of data. We can use SQL to better approximate our real world data.

Read More
Net Promoter Score and Survey Analysis in SQL

Introduced by Fred Reichheld in 2003, Net Promoter Score (NPS) is a simple method for measuring the likelihood your customers will recommend your product or service. It requires that you survey respondents and ask one simple question: “How likely are you to recommend [Company/Product/Service] to a friend or colleague?”

Read More
Calculating Proportional Values in SQL

SQL is great for aggregating: counting, summing, and averaging numbers are straightforward tasks in SQL. When looking at a distribution, it is often useful to look at them in context and compare the relative size of each part of the proverbial pie.

Read More
Valuing User Subscriptions Over Time with SQL

We recently launched the instant-classic underwater warfare game Scope: Battle for the High Seas, and our investors are excited to see how the game’s subscriptions are growing.

Read More
How to Calculate Confidence Intervals in SQL

Imagine you have a small online business. This month 200 users signed up on your website, and 10 of them bought your $800 service. Great! You’ve made $8k of income. How much should you expect to make this year?

Read More
Cohorted Engagement: 7-day and 30-day Metrics

When tracking user growth, it is helpful to look at multiple time periods. In a previous post, Cohorted Engagement with Comparable Time Windows in SQL, we covered creating charts of the form: “How many times does a user play our game in the first 30 days, and how has this varied for the users that started in January, February, and so

Read More
Calculating Distance between Data Centers on a Globe

Calculating distance between two points on a flat plane is straightforward thanks to the Pythagorean theorem. However, what if your plane isn’t flat? What if you need to find the distance between two points on a sphere, like the earth?

Read More
Predicting Churn using Hypothesis Testing

When dealing with massive quantities of customer data, it can be difficult to answer simple questions like: is a customer going to churn or not?

Read More
Simple Lead Scoring with Enrichment

How can a sales team prioritize which leads to focus on? How can a marketing team measure the quality of the leads they produce?

Read More
Impact or No? -- Simple Event Study in SQL Part I

Timeseries charts can be quite powerful for depicting changes of a metric over time.

Read More
Rolling Classifications by X Previous Aggregations

Rolling averages are useful for comparing a specific point with those around it. However, in some cases we might want to instead classify and group a period’s data based on its contemporary periods.

Read More
Binomial Options Pricing in SQL

Trees are a great way to visualize different states of a process when measuring uncertainty and making predictions.

Read More
Funnel Metrics - Step-over-Step Drop Rates

Funnels are an extremely flexible analysis tool that can provide clear and actionable insights. You can apply them to any situation with a notion of linear progression and where per step drop-off is important.

Read More
Multi-dimensional Clustering Using K-Means in Postgres SQL

In our previous blog post on one dimensional clustering, we used a known distance between two points to cluster the data in one dimension. However, data can be more complicated in many cases and may need to be clustered using multiple dimensions.

Read More
One Dimensional Clustering in Postgres

Many events tables contain low level information that isn’t always easy to reason about. In this post we’ll cluster low level delivery event data into trips, making further analysis much easier.

Read More
Calculating Significance of A/B Tests in Redshift

“A/B test” and “statistically significant” have quickly become part of the standard business vocabulary as running tests has become more popular in growing businesses.

Read More
Analyzing Your Salesforce Data With SQL

When it comes to customer relationship management software, Salesforce is the undisputed king, so it is no surprise that there are many folks looking to integrate Salesforce data with other sources for a deeper look into performance.

Read More
Solving the Traveling Salesman Problem with Postgres Recursive CTEs

Many SQL implementations don’t have loops, making some kinds of analysis very difficult. Postgres, SQL Server, and several others have the next best thing — recursive CTEs!

Read More
Building Your Own User Analytics System In SQL

We love analyzing data, but found some of the existing web analytics solutions like Google Analytics too constraining.

Read More
Explain and Other Tools for Query Optimization

As the datasets you work with get larger, queries that were once lightning fast can slow to a crawl. When this happens, it’s time to optimize.

Read More
Getting the First Row per Group 5X Faster

Previously we’ve written about selecting and joining the first row in each group in the context of analysis queries. While it’s OK for analysis queries to take a few minutes — or a few seconds if you use Periscope’s cache — production queries need to run in tens of milliseconds.

Read More
Extrapolating Data with Day-of-Week Effects

At Periscope, we need to plan ahead to make sure we have enough server capacity. This means predicting future server loads.

Read More
Daily, Weekly and Monthly Charts On The Same Graph

Depending on your data, looking at the same metric with daily, weekly or monthly aggregations can yield startlingly different results.

Read More
Redshift User Defined Functions in Python

Today the Redshift team announced support for User Defined Functions! UDFs allow you to create your own SQL functions, backed by the power and flexibility of Python. Now you can add your favorite functions from other databases, or invent new ones, to make your data analysis much easier.

Read More
The Lazy Analyst's Guide To Postgres JSON

JSON support is the most interesting new Postgres feature of the last few years. It relaxes the primary constraint of SQL databases — the rigid schema structure — by letting you store semistructured data in your tables alongside other data.

Read More
Reuse Calculations in the Same Query with Lateral Joins

Reusing parts of computations has long been a wart in SQL. Let’s say you want to compute confidence intervals for your signup rate. You’ll have to use the standard error calculation twice: once for the upper bound, and once for the lower bound.

Read More
Using Row Numbering And Full Joins To Transpose Wide Tables

Many a MySQL or Postgres DBA, with responsibility for serving a fast and responsive website, will design a schema for a survey that looks like this:

Read More
Fun with Window Functions

Window functions are a wonderfully useful SQL technique. They make complex aggregations simple to build.

Read More
Using Self Joins To Calculate Your Retention, Churn, And Reactivation Metrics

Most sophisticated user analysis looks at repeat usage: How many users come back? How many don’t? How many come back after an extended absence?

Read More
How To Optimize Lifetime Distinct Counts Using Window Functions

Lifetime metrics are a great way to get a long-term sense of the health of business. They’re particularly useful for seeing how healthy a segment is by comparing to others over the long term.

Read More
Predicting Exponential Growth with SQL

SQL is great at grouping and counting the data you already have, and with a little help from regressions, SQL can help you project that data into the future.

Read More
How To Calculate Cohort Retention in SQL

‍Losing users sucks. Losing customers really sucks. If you’re a startup you know that Retention is King.

Read More
HyperLogLog in Pure SQL

Earlier we showed how to make count distinct 50x faster with subqueries. Using probabilistic counting we’ll make count distinct even faster, trading a little accuracy for the increase in speed.

Read More
How To Sample Rows in SQL 273X Faster

Sampling is an incredibly powerful tool to speed up analyses at scale. While it’s not appropriate for all datasets or all analyses, when it works, it really works.

Read More
How to Calculate Confidence Intervals in SQL

Imagine you have a small online business. This month 200 users signed up on your website, and 10 of them bought your $800 service. Great! You’ve made $8k of income. How much should you expect to make this year?

Read More
Use window functions for time-series percentages

When visualizing gameplays by platform over time — a chart every multiplatform game company knows well — it’s typical to start with a simple query:

Read More
2X Your Redshift Speed With Sortkeys and Distkeys

Like a lot of folks in the data community, we’ve been impressed with Redshift, Amazon’s new distributed database. Yet at first, we couldn’t figure out why performance was so variable on seemingly-simple queries.

Read More
Use generate_series to get continuous results

When you’re just starting out, chances are you’ll be asking a very simple question: How many widgets did I sell each day?

Read More
How To Speed Up Where Clauses By 3,000X

Front and center on my Periscope dashboard is a question I ask all the time: How much usage has there been today?

Read More
Use Subqueries to Count Distinct 50X Faster

Count distinct is the bane of SQL analysts, so it was an obvious choice for our first blog post.

Read More