Common Statistical Operations
Calculating Distance between Data Centers on a Globe
Calculating distance between two points on a flat plane is straightforward thanks to the Pythagorean theorem. However, what if your plane isn’t flat? What if you need to find the distance between two points on a sphere, like the earth?
With a little help from the Haversine Formula it can be done pretty easily in SQL. In this post, we’ll use a fictious game, Scope, to demonstrate the formula.
Upon launching, Scope quickly grew to millions of players across the US. The celebrations didn’t last as we started receiving feedback of bad lag, or data latency.
Locating our Customers
The developers noticed that these complaints had been originating from the midwest of the U.S. and had a hunch that the latency may have been a result of Scope players being too far from a Scope datacenter.
The first step in this investigation is simply to get the locations of all of Scope’s players. We can easily do this like so:
Select latitude, longitude
Charting this out confirms that Scope does have players all across the U.S. If the feedback about lag was an issue with the game code, we would expect to hear it from all players rather than just a geographic subset.
Calculating Distance with Haversine
After a discussion with the data center vendor, we learned that lag becomes noticeable once someone reaches 750 miles or more from a data center. Knowing this, we decide to calculate the distance of each customer to each data center.
The Haversine formula can be represented as:
We can use 3960 miles as the radius of the earth and phi 1 and lambda 1 as the coordinates of our current data center in NY (40.748817, -73.985428) and SF (37.7749,-122.389977). Phi 2 and lambda 2 will be represented by our players coordinates.
In SQL we can express this as:
select latitude, longitude
2 * 3960 * asin(sqrt((sin(radians((latitude - 37.7749) / 2)))
^ 2 + cos(radians(37.7749)) * cos(radians(latitude))
* (sin(radians((longitude - -122.389977) / 2)))^ 2)) < 750
2 * 3960 * asin(sqrt((sin(radians((latitude - 40.748817) / 2)))
^ 2 + cos(radians(40.748817)) * cos(radians(latitude))
* (sin(radians((longitude - -73.985428) / 2)))^ 2)) < 750
The result of that query shows us which players are within 750 miles of one of the Scope data centers:
Compared to the initial chart of players locations, we can clearly see that some players are not within optimal range of our data center.
Adding a New Data Center
Our goal is for all players is to have a speedy experience. One option we have to alleviate the slowness is to add a data center in Topeka, Kansas.
To determine if this would give us good coverage we can add a third check to our where clause from above to include those folks that are within 750 miles of Topeka.
2 * 3960 * asin(sqrt((sin(radians((latitude - 39.056198) / 2)))
^ 2 + cos(radians(39.056198)) * cos(radians(latitude))
* (sin(radians((longitude - -95.695312) / 2)))^ 2)) < 750
Mapped out we can see that this puts a large portion of our players within 750 miles of one of our data centers:
With this data, we immediately decide to spin up a few servers in a third data center in Topeka. A day later the number of players reporting lag dropped significantly!
Having a means for measuring distance like the Haversine formula allow us to more accurately understand the impact of distance between geolocations on our business.