Common Statistical Operations
Finding Nearest Neighbors in SQL
It has been almost twenty years since Major League Baseball added a new team. Now there’s talk of one being added in the West. Should that happen, one of the many questions MLB will have to answer is: Which of the two West divisions makes the most sense?
Adding a Baseball Team in the West
Several cities have been considered for the next Major League Baseball team, and San Antonio is a frequent nominee. Would San Antonio get added to the National League West, with the San Diego Padres, Colorado Rockies, and San Francisco Giants, or the American League West, with the Texas Rangers, Oakland Athletics, and Seattle Mariners?
Traveling to other cities and into and out of different time zones can be exhausting. This is why travel distance between San Antonio and the other cities in the division is such an important factor.
Building a Distance Metric
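We can compare the effect on travel distance of adding San Antonio to each division. The travel distance can be approximated as the Euclidean distance between cities using their latitude and longitude. Treating degrees as flat coordinates ignores the curvature of the Earth, so the results are a rough comparison rather than true mileage.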
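The division-level comparison described here could be sketched roughly as follows. This is a minimal illustration, not the article's own query: it assumes PostgreSQL-style syntax, the inline table stands in for a real mlb_stadiums table, the division column and the candidate CTE are hypothetical names introduced for the example, and the coordinates are approximate decimal degrees for the teams named above.

```sql
-- Hypothetical sample data: approximate stadium coordinates in decimal degrees.
-- The division column is an assumption added for this illustration.
with mlb_stadiums (team_name, division, lat, long) as (
  values
    ('San Diego Padres',     'NL West', 32.71, -117.16),
    ('Colorado Rockies',     'NL West', 39.76, -104.99),
    ('San Francisco Giants', 'NL West', 37.78, -122.39),
    ('Texas Rangers',        'AL West', 32.75,  -97.08),
    ('Oakland Athletics',    'AL West', 37.75, -122.20),
    ('Seattle Mariners',     'AL West', 47.59, -122.33)
)
-- Hypothetical one-row CTE holding the approximate location of San Antonio.
, candidate (lat, long) as (
  values (29.42, -98.49)
)
select
  s.division
  -- Average and worst-case Euclidean distance (in degrees) from the candidate city
  , avg(sqrt(power(c.lat - s.lat, 2) + power(c.long - s.long, 2))) as avg_distance
  , max(sqrt(power(c.lat - s.lat, 2) + power(c.long - s.long, 2))) as max_distance
from mlb_stadiums s
cross join candidate c
group by s.division
order by avg_distance asc
```

Reporting the maximum next to the average makes it easier to see when a single distant team dominates a division's total.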
Intuitively, teams like Seattle, which are geographically far from all the other teams, will cause problems with this approach. Another metric we can look at is the impact on the nearest neighbors of each team.
For each team, we’ll build the list of all the other teams, ordered by their geographical distance:
sqrt(power(source_lat - lat, 2) + power(source_long - long, 2))
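As a rough sketch of how that expression might be used, the query below orders a handful of stadiums by their distance from San Antonio. It assumes PostgreSQL-style syntax. The inline rows are approximate coordinates standing in for a full mlb_stadiums table, and the source CTE is a hypothetical helper that supplies source_lat and source_long.

```sql
-- Hypothetical sample rows: approximate stadium coordinates in decimal degrees.
with mlb_stadiums (team_name, lat, long) as (
  values
    ('Houston Astros',       29.76,  -95.36),
    ('Texas Rangers',        32.75,  -97.08),
    ('Kansas City Royals',   39.05,  -94.48),
    ('Colorado Rockies',     39.76, -104.99),
    ('Arizona Diamondbacks', 33.45, -112.07),
    ('Seattle Mariners',     47.59, -122.33),
    ('Boston Red Sox',       42.35,  -71.10)
)
-- Hypothetical one-row CTE: the approximate location of San Antonio.
, source (source_lat, source_long) as (
  values (29.42, -98.49)
)
select
  team_name
  , sqrt(power(source_lat - lat, 2) + power(source_long - long, 2)) as distance
from mlb_stadiums
cross join source
order by distance asc  -- nearest stadium first
limit 5
```

The distance column is in degrees rather than miles, so it is only meaningful for ordering the rows.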
Here are the 5 closest stadiums to San Antonio:
With the two other teams in Texas being our closest neighbors, we might be tempted to look to the AL West as San Antonio’s home.
Unfortunately, the Seattle Mariners are also in the AL West, and they rank 29th on San Antonio’s list of nearest neighbors. The only team farther away is the Boston Red Sox!
A better way to think about the effect on the neighborhood is to ask the question in the opposite direction: if a city is added to the league, where does it rank on each team’s list?
Reversing the Ranking
We can build the ranked order for each team individually. First, let us compute the distances between each pair of cities:
with distances as (
  select
    sqrt(power(d.lat - s.lat, 2)
      + power(d.long - s.long, 2)) as distance
    , s.team_name as src
    , d.team_name as dest
  from mlb_stadiums s
    , mlb_stadiums d
  where
    s.team_name != d.team_name
)
Now instead of ordering the entire list by distance, we want to partition the ordering by the source city:
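, ranking as ( -- continues the with clause that defines distances
  select src, dest, distance
    , row_number() over(partition by src order by distance asc) as neighbor_rank
  from distances
)
select * from ranking where dest = 'San Antonio'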
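Putting the two steps together, a self-contained version of the reversed ranking might look like the sketch below. It assumes PostgreSQL-style syntax and that San Antonio has been added as a row of the stadium table. The inline rows are approximate coordinates for a small subset of teams, and neighbor_rank is a name chosen for this example.

```sql
-- Hypothetical sample rows: approximate coordinates in decimal degrees.
with mlb_stadiums (team_name, lat, long) as (
  values
    ('San Antonio',          29.42,  -98.49),  -- hypothetical expansion team
    ('Houston Astros',       29.76,  -95.36),
    ('Texas Rangers',        32.75,  -97.08),
    ('Colorado Rockies',     39.76, -104.99),
    ('San Diego Padres',     32.71, -117.16),
    ('San Francisco Giants', 37.78, -122.39),
    ('Oakland Athletics',    37.75, -122.20),
    ('Seattle Mariners',     47.59, -122.33)
)
-- Distance between every ordered pair of distinct teams
, distances as (
  select
    sqrt(power(d.lat - s.lat, 2) + power(d.long - s.long, 2)) as distance
    , s.team_name as src
    , d.team_name as dest
  from mlb_stadiums s
  cross join mlb_stadiums d
  where s.team_name != d.team_name
)
-- Rank each team's neighbors separately, nearest first
, ranking as (
  select
    src
    , dest
    , distance
    , row_number() over (partition by src order by distance asc) as neighbor_rank
  from distances
)
-- Where does San Antonio fall on each existing team's list?
select src, neighbor_rank, distance
from ranking
where dest = 'San Antonio'
order by neighbor_rank, src
```

With only a subset of teams loaded, the ranks will be lower than they would be against the full league, so the output illustrates the shape of the result rather than the article's figures.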
We can compare San Antonio’s position in the NL West:
and in the AL West:
While San Antonio has its closest neighbors in the American League, everyone in the National League West has San Antonio in their top 8 closest cities. So if you’re listening, MLB, our recommendation is the National League West for San Antonio!