Safety in San Francisco: A Numbers Game

When I first moved to San Francisco, my mother warned me “not to go looking for trouble”, but this can be difficult when you do not know the city well. Luckily, SF OpenData provides a great public safety dataset, and our good friend Redshift has a few helpful tricks to help make sure trouble doesn’t find us.

Read More

Creating Ranked Position Time Series with SQL

The Los Angeles Lakers were the top NBA team in 2008 but the last in 2016. We wanted to compare their downfall to the rise of other teams in the league, so we built a comparison of each team’s rank in the standings over the past ten years.

Let’s explore how to create this chart of rankings of the teams in the western conference!

Read More

Building a Better Pokédex with SQL

Pokémon Go has taken over the world. In the moments I’m not exploring the city trying to catch ‘em all, I’ve found myself idly wondering about Pokémon Theory. What moves are super effective against which types again? What is a critical hit? And how did those multipliers actually work anyway?

With the quest to calculate the attack multiplier for any and every potential matchup between two pokémon, I set off to explore the internet and build the Pokémon Battle Simulator

Read More

IMDb vs RottenTomatoes Ratings with SQL Trendlines

In this post, we’ll explore how to calculate trendlines from a scatter plot of data. Sometimes the trend is obvious, but other times not so obvious. Calculating the precise linear trendline can give us a good idea of how one variable correlates with another. We can use SQL operations to do this!

Let’s work through creating the following chart comparing IMDb and RottenTomatoes ratings.

Read More

Cohorted Engagement: 7-day and 30-day Metrics

When tracking user growth, it is helpful to look at multiple time periods. In a previous post, Cohorted Engagement with Comparable Time Windows in SQL, we covered creating charts of the form: “How many times does a user play our game in the first 30 days, and how has this varied for the users that started in January, February, and so on?”

Today we’ll extend our previous technique. Our goal is to create charts like this one:

Read More

Feature Analysis with Mean-Time-Between-Usage Calculation

In Periscope Data, you can apply filters to charts in your dashboard. Filters allow you to change certain values in the underlying SQL, such as date range. Users can also create a default set of filters they want to apply to the dashboard, so that whenever they access the dashboard, the charts will first display with your default filters applied.

Our engineering team refers to this default set as the User Data View (UDV). Recently, we wanted to evaluate how people use this feature and see if there are any performance optimizations we could make based on the usage.

Read More

Cohorted Engagement with Comparable Time Windows in SQL

Cohort analysis is a powerful tool for evaluating the behavior of groups of users over time. One of the most common ways to cohort users is by when they started using your product or playing your game.

Here’s a graph of how many games the average user plays for a fictitious game company, cohorted by the month the user started playing:

select
  [users.created_at:month]
  , count(distinct gameplays.id) / count(distinct users.id)
from
  [users+gameplays]
group by
  1

This chart makes it look like the game is getting dramatically less engaging over time - but it’s simply not true! Earlier users have had more time to play the game, so their average games per user is much higher.

To correct for this bias, we can introduce a time bound on each user’s gameplays. This way all users will have the same amount of time to play the game in this chart, and simply being an earlier user won’t bias the data.

Read More

Common Salesforce Reports in SQL

At Periscope Data, we depend on our sales team to bring in revenue from the leads our marketing team generates. Like many sales teams, ours uses Salesforce to manage the processes involved in making a sale.

In this blog post, we’ll go over how to calculate some standard Salesforce reports using SQL, and how you can join that data to customer data to give you a more complete view of your customers.

Read More

Calculating Distance between Data Centers on a Globe

Calculating distance between two points on a flat plane is straightforward thanks to the Pythagorean theorem. However, what if your plane isn’t flat? What if you need to find the distance between two points on a sphere, like the earth?

With a little help from the Haversine Formula it can be done pretty easily in SQL. In this post, we’ll use a fictious game, Scope, to demonstrate the formula.

Upon launching, Scope quickly grew to millions of players across the US. The celebrations didn’t last as we started received feedback of bad lag, or data latency.

Locating our Customers

The developers noticed that these complaints had been originating from the midwest of the U.S. and had a hunch that the latency may have been a result of Scope players being too far from a Scope datacenter.

The first step in this investigation is simply to get the locations of all of Scope’s players. We can easily do this like so:

Select latitude, longitude
From users

Charting this out confirms that Scope does have players all across the U.S. If the feedback about lag was an issue with the game code, we would expect to hear it from all players rather than just a geographic subset.

Read More

Converting MySQL Syntax and Functions into Redshift

Because our cache utilizes Redshift clusters, we often receive questions on how to convert a particular query from MySQL syntax to Redshift syntax. While Google turns up a treasure trove of results for syncing a MySQL database into Redshift, it comes up a bit short for showing how to actually translate the syntax of specific queries.

In this post, we will cover some of the more common MySQL functions that differ in syntax from Redshift, as well as various rules and tricks to keep in mind!

General Differences Between MySQL And Redshift

Grouping

One of the most common pitfalls when converting MySQL syntax to Redshift involves the group by requirements. Redshift is more stringent, and requires that all non-aggregate functions in the select statement must be included in the group by clause. As an example, MySQL syntax allows for a query like so:

select
   [created_at:week],
   country,
   count(1)
from
   orders
group by
   1
Read More