
Getting Started with Data Analysis in Python After Using SQL

SQL is the dominant language for data analysis because most of the time, the data you're analyzing is stored in a database. And most analysis involves a lot of filtering, grouping and counting — actions that SQL makes very easy.

But sometimes you need to go beyond pure SQL. Some analyses require complex business logic or advanced statistics. While you can do advanced statistics in pure SQL, it's often a lot simpler to use Python.

This post is about starting that transition. If you're already comfortable with SQL, and want to get started with Python, this is a look into some of the valuable transformations you can build. We'll look at how to calculate linear regressions using Python, after using SQL to create our dataset.

We'll use a sample video game database and uncover the relationship between how many times frequent players play the game, and how much money they spend in-game.

Generating The Dataset: Gameplays vs Spend

The database has a gameplays table, with one row for every time a player plays the game. Our first CTE will generate [user_id, num_plays]:

with
 user_plays as (
   select
     user_id
     , count(1) as num_plays
   from
     gameplays
   group by
     1
 )

The database also has a purchases table, with one row for every purchase. The second CTE will generate [user_id, total_spent]:

, user_spend as (
   select
     user_id
     , sum(price) as total_spent
   from
    purchases
   group by
     1
 )

Now we'll join those two CTEs together, making the dataset we'll use for our first correlation [num_plays, total_spent]. We'll restrict the dataset to players who have played more than 150 games to focus on frequent players:

select
 num_plays
 , total_spent
from
 user_plays
 join user_spend using (user_id)
where
 num_plays > 150

The output looks like this:


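Before we can plot anything, we need this query's result in a pandas DataFrame. One way to get there is pandas' read_sql, which runs a query against a live connection and returns a DataFrame. Here's a minimal sketch, using an in-memory SQLite database with a few made-up rows standing in for the real tables (your real connection details will differ):

```python
import sqlite3
import pandas as pd

# Stand-in for your real database connection
conn = sqlite3.connect(':memory:')
conn.executescript("""
    create table gameplays (user_id int);
    create table purchases (user_id int, price real);
    insert into gameplays values (1), (1), (2);
    insert into purchases values (1, 9.99), (1, 5.00), (2, 1.99);
""")

# Same CTE structure as the query above (the >150 filter is
# omitted so the tiny sample survives)
query = """
with user_plays as (
    select user_id, count(1) as num_plays
    from gameplays group by user_id
), user_spend as (
    select user_id, sum(price) as total_spent
    from purchases group by user_id
)
select num_plays, total_spent
from user_plays join user_spend using (user_id)
"""
df = pd.read_sql(query, conn)
```

From here, df holds one row per user with their play count and spend, which is the shape the plotting calls in the rest of this post expect.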
The Hello World of Linear Regressions in Python

There are many Python libraries that help with data analysis. We'll start with seaborn and the easiest way to plot a linear regression: a jointplot. As a bonus, this plot type also comes with histograms of each variable.

Just import seaborn and pass the data frame generated from the SQL query to jointplot:

import pandas as pd
import seaborn as sns

sns.jointplot(x='num_plays', y='total_spent', data=df, kind='reg')

Which generates our linear regression:

Easy! As we could have predicted, players who play more also spend more, in general.
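As an aside, the join-and-filter we wrote in SQL has a direct pandas translation, which is handy once more of your pipeline lives in Python. A sketch with hand-made stand-ins for the two CTEs (the column names match the query above):

```python
import pandas as pd

# Illustrative stand-ins for the user_plays and user_spend CTEs
user_plays = pd.DataFrame({'user_id': [1, 2, 3],
                           'num_plays': [200, 90, 151]})
user_spend = pd.DataFrame({'user_id': [1, 2, 3],
                           'total_spent': [49.99, 4.99, 20.00]})

# join user_spend using (user_id)  ->  merge(..., on='user_id')
# where num_plays > 150            ->  a row filter via query()
df = (user_plays.merge(user_spend, on='user_id')
                .query('num_plays > 150')[['num_plays', 'total_spent']])
```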

However, scatter plots aren't the right plot type to show dense clusters of information since they hide density — many data points could be hidden behind a single dot. In these cases, hex binning can tell the data’s story more effectively.

With hex binning, the plot area is divided into equally sized hexagons, and the color shading of each hexagon is based on how many data points fall within that hexagon's boundaries.

Why hexagons and not another shape? Only three regular shapes can tessellate, or cover a surface without gaps or overlaps: triangles, squares and hexagons. Hexagons are generally preferred for binning because, of the three, they are closest to a circle. A circle is the ideal "bin" shape: for a given area, it minimizes the maximum distance between its border and its center point, so the points grouped into a bin are as close together as possible.
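That "closest to a circle" claim is easy to check with a little geometry. For regular polygons of equal area, the circumradius — the farthest a point in the bin can be from the bin's center — follows from the polygon area formula A = ½ · n · R² · sin(2π/n), and it comes out smallest for the hexagon:

```python
import math

def circumradius(n_sides, area=1.0):
    # Regular n-gon area: A = (1/2) * n * R^2 * sin(2*pi/n); solve for R
    return math.sqrt(2 * area / (n_sides * math.sin(2 * math.pi / n_sides)))

radii = {name: circumradius(n) for name, n in
         [('triangle', 3), ('square', 4), ('hexagon', 6)]}
# For unit area: triangle ~0.877, square ~0.707, hexagon ~0.620,
# so the hexagon keeps points closest to the bin center
```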

With seaborn, it's easy to change from a scatter jointplot to one that uses hex binning. Simply change the kind parameter to 'hex'. While we're changing things, let's also change the color from blue to magenta with color='m':

sns.jointplot(x='num_plays', y='total_spent', data=df, kind='hex', color='m')

Making our new plot look like this:

With this plot type, it's easy to see where the density of data points varies, which we couldn't tell from the scatter above.

Segmenting Our Dataset into Multiple Plots

This video game is multi-platform, so let's use Python to make a separate linear regression for each platform: Web, Android and iOS. First, we'll update our query to include platform, in both CTEs and in the outputted data:

with
 user_plays as (
   select
     user_id
     , platform
     , count(1) as num_plays
   from
    gameplays
   group by
     1
     , 2
 )
 , user_spend as (
   select
     user_id
     , platform
     , sum(price) as total_spent
   from
    purchases
   group by
     1
     , 2
 )
select
 num_plays
 , total_spent
 , platform
from
 user_plays
 join user_spend using (user_id, platform)
where
 num_plays > 150

Which has this output:

And, still using seaborn, we'll switch from jointplot to lmplot. The lmplot function makes it easy to create one plot for each of our platforms. We no longer need the kind argument; instead, we pass in the column to segment by, col='platform', and tell lmplot to give each platform its own color with hue='platform':

sns.lmplot(x='num_plays', y='total_spent', data=df, col='platform', hue='platform')

We've created three plots at once:

Segmenting out the data was informative! Web and iOS players have very different play count and spend distributions.
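lmplot draws each platform's fitted line, but sometimes you also want the regression coefficients as numbers. One way to get per-platform slopes is numpy's polyfit combined with a pandas groupby, sketched here on stand-in data (the real df comes from the query above):

```python
import numpy as np
import pandas as pd

# Stand-in data: each platform's spend is an exact linear function
# of plays, so we know which slopes to expect back
df = pd.DataFrame({
    'platform': ['web'] * 3 + ['iOS'] * 3,
    'num_plays': [160, 200, 240] * 2,
    'total_spent': [80.0, 100.0, 120.0,    # web: slope 0.5
                    128.0, 160.0, 192.0],  # iOS: slope 0.8
})

# polyfit with degree 1 returns [slope, intercept]
slopes = {
    platform: np.polyfit(group['num_plays'], group['total_spent'], 1)[0]
    for platform, group in df.groupby('platform')
}
```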

Bonus Lap: 3D Scatter Plots

Sometimes it's helpful to plot a third variable to shed more light on the distribution. In our case, we'll include the number of purchases a player has made to tease out if we're looking at few large purchases vs. many small purchases. In the user_spend CTE we'll add count(1) as num_purchases to the select clause and include that column in the final SQL output as well.

For 3D scatters, we will use matplotlib instead of seaborn. First, we'll import the library and set up the 3D context:

import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

ax = plt.figure().add_subplot(111, projection='3d')

Unlike seaborn, matplotlib won't auto-color the different platforms for us. So we'll add a new column to our dataframe that maps Android to orange, Web to blue and iOS to green:

df['colors'] = df['platform'].replace({ 'android': 'orange', 'web': 'b', 'iOS': 'g'})
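One behavior of replace worth knowing: any platform value missing from the mapping dictionary passes through unchanged, whereas Series.map would turn it into NaN. A quick sketch on stand-in values (the 'windows' entry is hypothetical, just to show the fall-through):

```python
import pandas as pd

platforms = pd.Series(['android', 'web', 'iOS', 'windows'])
colors = platforms.replace({'android': 'orange', 'web': 'b', 'iOS': 'g'})
# 'windows' has no entry in the dict, so replace leaves it untouched
```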

And now for the fun part, making the 3D scatter! Seaborn's arguments were column names and the whole data frame. With matplotlib, we pass in each column whole (as x, y then z) and a fourth parameter sets the colors. After, we'll label the axes:

ax.scatter(df['num_plays'], df['total_spent'], df['num_purchases'], c=df['colors'])
ax.set_xlabel('Number of Plays')
ax.set_ylabel('Total Spent')
ax.set_zlabel('Number of Purchases')

Running this generates our 3D scatter plot:

While they look neat, 3D plots are often hard to read unless they're animated or interactive, so viewers can rotate the perspective. Without that, a flat 2D rendering of a 3D plot makes it very difficult to judge where the points actually sit in the space.

Onward!

With just a few lines of Python, it's easy to build on your SQL expertise to generate analyses that benefit from advanced statistics, especially when those statistics are inconvenient to calculate in SQL. Another benefit of using Python to visualize statistics is that you're not tied to whatever built-in visualizations your SQL environment provides.

Of course, this is just the beginning. Python makes it easy to include complex business logic, more advanced statistics or more advanced visualizations. Periscope supports dozens of R and Python libraries made for data analysis and visualization, ready and waiting for your next data project!

Want to discuss this article? Join the Periscope Data Community!

Tom O'Neill
Periscope Data was started in Tom's apartment, where he built the first version of the product in a weekend. He leads Periscope Data's engineering efforts, and holds the coveted customer-facing bugfix record. (10 minutes!)