
Diabetes Prediction Using Support Vector Machines

In a previous post, we learned what Machine Learning (ML) classification problems are and saw how Naive Bayes can be used to solve one such problem, sentiment analysis: detecting whether text is positive or negative. In this post, we are going to learn about Support Vector Machines (SVM), another popular technique used for classification problems. We will use this technique to predict whether someone is likely to have diabetes using predictor factors such as age, number of pregnancies, insulin levels, glucose levels, and more.

Diabetes is a chronic illness affecting many people and is characterized by the presence of high blood sugar levels. Early detection is important since diabetes detected in early stages can be controlled by lifestyle changes and/or minimal medication. Diabetes prediction serves as a useful reference for doctors because they can order further tests to detect diabetes early.

Preparing Our Training Data

The training data we are going to use for this problem is the Pima Indians Diabetes dataset. The dataset contains several predictor factors for diabetes along with an outcome, which indicates whether the person has diabetes (1) or not (0). In ML terms, these predictor factors are called features.

As usual, the first step in the ML process is preparing the training data. The dataset is available as CSV, so we can import the data using our CSV upload feature. Once the data has been imported, it needs to be filtered to include only the relevant features for training and cleaned to remove duplicates and missing data. This is best done using SQL, the most popular language for data analysts. Here’s the SQL I used to prepare this dataset for ML analysis:

select
 PREGNANCIES
 , GLUCOSE
 , BLOODPRESSURE
 , INSULIN
 , BMI
 , AGE
 , OUTCOME
from
 [pima_indian_diabetes]
where
 PREGNANCIES is not null
 and GLUCOSE is not null
 and BLOODPRESSURE is not null
 and INSULIN is not null
 and BMI is not null
 and AGE is not null
 and OUTCOME is not null

The result of that query is a table like this:

Once the data has been cleaned, we will use it as our training data. It’s ready to be fed into our ML algorithm (Support Vector Machine) to build our model. Before we do that, let me explain Support Vector Machines. 

Understanding Support Vector Machines

N-Dimensional Hypercubes 

In order to understand SVM, we need to understand what an N-dimensional hypercube is. To explain that concept, we'll start with familiar shapes and geometry.

A point is a 0-dimensional shape. It has no axis and no size.

A line is a 1-dimensional shape. There is a single axis. A point on a line is represented by a single variable (x), which represents the distance of the point from some origin.

A square is a 2-dimensional shape. There are 2 axes. A point on a square is represented by two variables (x, y), where (x) represents the distance of the point on the X-axis and (y) represents the distance of the point on the Y-axis.

A cube is a 3-dimensional shape. There are 3 axes. A point on a cube is represented by three variables (x, y, z), where (x) represents the distance of the point on the X-axis, (y) represents the distance of the point on the Y-axis, and (z) represents the distance of the point on the Z-axis.

Extending this idea, an N-dimensional hypercube is an N-dimensional shape. There are N axes, and a point on this hypercube is represented by N variables. Our eyes can only visualize up to three dimensions, so we have to imagine this shape.
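To make that concrete, each patient in our dataset can be thought of as a point in a 6-dimensional space, one coordinate per feature. Here is a minimal sketch using hypothetical values:

import numpy as np

# A hypothetical patient as a point in 6-dimensional space, one coordinate per
# feature: (pregnancies, glucose, blood pressure, insulin, BMI, age)
patient = np.array([2, 120, 70, 80, 32.0, 45])

print(patient.shape)  # (6,) - one value per dimension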

Dividing Shapes

The second concept is dividing our N-dimensional hypercube. Start with the observation that a line can be divided into two sections using a point.

A square can be divided into two sections using a line.

A cube can be divided into two sections using a 2-D plane.

Extending this idea, an N-dimensional hypercube can be divided into two sections using an (N-1)-dimensional hyperplane.
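A hyperplane can be described by a normal vector w and an intercept b; which side of it a point falls on is given by the sign of w · x + b. The snippet below is a small sketch of that idea in 3 dimensions with made-up numbers:

import numpy as np

# Hypothetical hyperplane in 3-dimensional space: w . x + b = 0
w = np.array([1.0, -2.0, 0.5])   # the hyperplane's normal vector
b = -1.0                         # the intercept

point = np.array([3.0, 1.0, 2.0])

# The sign of (w . x + b) tells us which side of the hyperplane the point is on
side = np.sign(np.dot(w, point) + b)
print(side)  # +1.0 for one side, -1.0 for the other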

Training Phase

In the training phase, the SVM algorithm first draws an N-dimensional hypercube by representing each feature as a separate dimension. It then uses the numerical values of those features to plot each training example as a point within the hypercube. Finally, it attempts to find a boundary that separates the two classes of data: for example, points where the outcome is 0 (no diabetes) and points where the outcome is 1 (diabetes). The boundary is an (N-1)-dimensional hyperplane.

Here is an example boundary (a line) when there are two features.

Here is an example boundary (a 2D plane) when there are three features.

If there are more than two classes of data, then the SVM algorithm draws more hyperplanes.
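In scikit-learn, which we use later in this post, the call looks the same for more than two classes: SVC combines several binary separations internally (a one-vs-one scheme). A small sketch with toy data:

from sklearn import svm

# Toy data with three well-separated classes, purely illustrative
X = [[0, 0], [1, 1], [5, 5], [6, 6], [0, 6], [1, 5]]
y = [0, 0, 1, 1, 2, 2]

multi_class_model = svm.SVC(kernel='linear')
multi_class_model.fit(X, y)
print(multi_class_model.predict([[5.5, 5.5]]))  # expected: [1]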

Testing Phase

In the testing phase, we can start with real-time data about a patient such as age, number of pregnancies, insulin levels, and so on. The SVM algorithm determines a 1/0 outcome about diabetes based on which side of that boundary the data falls on. 
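In code, this boils down to a single predict call. A minimal sketch, assuming model is the fitted scikit-learn SVC we build in the "Applying Support Vector Machines" section below and the hypothetical values follow the training feature order (pregnancies, glucose, blood pressure, insulin, BMI, age):

new_patient = [[2, 150, 85, 200, 33.5, 50]]

# decision_function returns the value of w . x + b; its sign indicates which
# side of the hyperplane the point falls on, and predict reports the class
print(model.decision_function(new_patient))
print(model.predict(new_patient))  # 1 = diabetes, 0 = no diabetes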

Support Vectors

Why is this algorithm called Support Vector Machines? To accurately classify all the data points, the SVM algorithm needs to find the optimum hyperplane between the two classes: the one that maximizes the margin between them. The data points, also known as vectors, that lie closest to the hyperplane are called Support Vectors, which is what gives the algorithm its name.

Support Vectors are the most important data points in the training dataset: if they were removed, the position of the dividing hyperplane would change. They are also the data points that are the most difficult to classify.
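Once a model has been fit, scikit-learn exposes these points directly. Assuming model is the fitted svm.SVC we build later in this post:

print(model.support_vectors_)   # the support vector points themselves
print(model.support_)           # their row indices in the training data
print(model.n_support_)         # how many support vectors each class has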

An ideal SVM analysis produces a hyperplane that perfectly separates the data points into two non-overlapping classes, as in the picture above. However, perfect separation is not always possible, and forcing it can produce an overly rigid model that misclassifies many new data points. In these situations, SVM finds the hyperplane that maximizes the margin while minimizing the misclassifications.
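In scikit-learn, the C parameter controls this trade-off: a smaller C tolerates more misclassified training points in exchange for a wider margin, while a larger C penalizes misclassifications more heavily. For example:

from sklearn import svm

# Smaller C favors a wider margin; larger C favors fewer training errors
tolerant_model = svm.SVC(kernel='linear', C=0.1)
strict_model = svm.SVC(kernel='linear', C=1000.0)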

The simplest way to separate data into two classes is a straight line when there are 2 features, a 2-D plane when there are 3 features, or an N-D hyperplane when there are (N+1) features. These are called linear separations. There are many situations, however, where a non-linear boundary separates the data more efficiently, with fewer misclassifications. SVM handles these cases using non-linear kernel functions. The most common is the RBF (Radial Basis Function) kernel; others include the polynomial and sigmoid kernels. When performing a deeper analysis, it is important to try different kernel functions and pick the one that provides the best results for the training data.
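One simple way to compare kernels is cross-validation. The sketch below assumes features and outcomes as defined in the "Applying Support Vector Machines" section; the scores are a starting point rather than a definitive ranking:

from sklearn import svm
from sklearn.model_selection import cross_val_score

# Compare average cross-validated accuracy across kernel functions
for kernel in ['linear', 'rbf', 'poly', 'sigmoid']:
    scores = cross_val_score(svm.SVC(kernel=kernel), features, outcomes, cv=5)
    print(kernel, scores.mean())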

Below is an example where non-linear separation performs better than any linear separation.

Applying Support Vector Machines

The next step is to build our model using Support Vector Machines. The output of the SQL query above is available as a dataframe (df). The scikit-learn package provides an SVM implementation, which we import. The code for building our model is below: we select the features we want to include and pass them, along with the outcomes, to the fit method of SVC (Support Vector Classifier). This builds the model. Note that we are using the linear kernel function.

# SQL output is imported as a dataframe variable called 'df'
import pandas as pd
from sklearn import svm

# Labels (1 = diabetes, 0 = no diabetes) and feature columns
outcomes = df['OUTCOME']
features = df[['PREGNANCIES', 'GLUCOSE', 'BLOODPRESSURE', 'INSULIN', 'BMI', 'AGE']].values

# Train a Support Vector Classifier with a linear kernel
model = svm.SVC(kernel='linear')
model.fit(features, outcomes)

After that Python code has been run, we are ready to test our model. Values can be entered manually in the Python code or automatically by setting up filters to pass values from the dashboard.
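For example, entering hypothetical values manually (in the same order as the training features) looks like this:

# Manually entered test values: pregnancies, glucose, blood pressure, insulin, BMI, age
manual_input = [[3, 145, 80, 160, 34.2, 47]]
result = model.predict(manual_input)
periscope.text('DIABETES' if result[0] == 1 else 'NO DIABETES')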

Once the filters are set up, modify the SQL to pass the input values from a filter into Python code.

select
 PREGNANCIES
 , GLUCOSE
 , BLOODPRESSURE
 , INSULIN
 , BMI
 , AGE
 , OUTCOME
 , '[INPUT_PREGNANCIES]' AS INPUT_PREGNANCIES
 , '[INPUT_GLUCOSE]' AS INPUT_GLUCOSE
 , '[INPUT_BLOOD_PRESSURE]' AS INPUT_BLOOD_PRESSURE
 , '[INPUT_INSULIN]' AS INPUT_INSULIN
 , '[INPUT_BMI]' AS INPUT_BMI
 , '[INPUT_AGE]' AS INPUT_AGE
from
 [pima_indian_diabetes]
where
 PREGNANCIES is not null
 and GLUCOSE is not null
 and BLOODPRESSURE is not null
 and INSULIN is not null
 and BMI is not null
 and AGE is not null
 and OUTCOME is not null

In the Python code, we reference the values passed from the dashboard through the filters.

result = model.predict([[df['INPUT_PREGNANCIES'][0], df['INPUT_GLUCOSE'][0], df['INPUT_BLOOD_PRESSURE'][0], df['INPUT_INSULIN'][0], df['INPUT_BMI'][0], df['INPUT_AGE'][0]]])
periscope.text('DIABETES' if result[0] == 1 else 'NO DIABETES')

This allows us to invoke the diabetes predictor by supplying values directly from the dashboard.

Visualizing the Hyperplane and Support Vectors

Since we cannot visualize data in that many dimensions, let's pick only 2 dimensions for visualizing our hyperplane: insulin levels and age. For illustrative purposes, we can filter the data to include only patients who are at least 30 years old and have serum insulin levels over 350 mu U/ml, which yields a separation without misclassifications.

select
 PREGNANCIES
 , GLUCOSE
 , BLOODPRESSURE
 , INSULIN
 , BMI
 , AGE
 , OUTCOME
 , '[INPUT_PREGNANCIES]' as INPUT_PREGNANCIES
 , '[INPUT_GLUCOSE]' as INPUT_GLUCOSE
 , '[INPUT_BLOOD_PRESSURE]' as INPUT_BLOOD_PRESSURE
 , '[INPUT_INSULIN]' as INPUT_INSULIN
 , '[INPUT_BMI]' as INPUT_BMI
 , '[INPUT_AGE]' as INPUT_AGE
from
 [pima_indian_diabetes]
where
 PREGNANCIES is not null
 and GLUCOSE is not null
 and BLOODPRESSURE is not null
 and INSULIN is not null
 and BMI is not null
 and AGE is not null
 and OUTCOME is not null
 and INSULIN > 350
 and AGE > 30
limit 10

Now let’s plot insulin levels vs. age and see what the visualization looks like. The code for that analysis is below.

import pandas as pd
import seaborn as sns

# Scatter plot of insulin vs. age, colored by outcome (no regression line)
data_plot = sns.lmplot(x='INSULIN', y='AGE', data=df, hue='OUTCOME', fit_reg=False)
periscope.image(data_plot)

The output is a chart like this:

Next, let’s use the piece of code below to draw the separating hyperplane and parallels that pass through the Support Vectors for this data.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import svm
import numpy as np

data_plot = sns.lmplot(x='INSULIN', y='AGE', data=df, hue='OUTCOME', fit_reg=False)

outcomes = df['OUTCOME']
features = df[['INSULIN', 'AGE']].values
model = svm.SVC(kernel='linear')
model.fit(features, outcomes)

# Plot the separating hyperplane
w = model.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(30, 800)
yy = a * xx - (model.intercept_[0]) / w[1]
plt.plot(xx, yy, linewidth=2, color='black')

# Plot the parallels to the hyperplane that pass through the support vectors
b = model.support_vectors_[0]
yy_down = a * xx + (b[1] - a * b[0])
b = model.support_vectors_[-1]
yy_up = a * xx + (b[1] - a * b[0])
plt.plot(xx, yy_down, 'k--')
plt.plot(xx, yy_up, 'k--')
plt.scatter(model.support_vectors_[:, 0], model.support_vectors_[:, 1],
           s=80, facecolors='none')

# Output the chart with the data points, hyperplane, and parallels
periscope.image(data_plot)

This results in the hyperplane and parallels (dotted lines) below, passing through the Support Vectors.
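As a side note, for a linear SVM the width of that margin (the gap between the two dotted parallels) can be computed directly from the model's coefficients as 2 / ||w||. A small sketch using the 2-feature model fitted above:

# Margin width for a linear SVM: 2 / ||w||, where w = model.coef_[0]
margin_width = 2 / np.linalg.norm(model.coef_[0])
print(margin_width)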

Summary

Using a few lines of SQL, we prepared our diabetes training data for analysis; using a few lines of Python, we trained a model capable of predicting whether a person is likely to have diabetes. Predictions like these give doctors an efficient way to direct medical resources toward identifying and treating the patients most likely to have diabetes, and they show the power of today's most advanced data analysis tools. Periscope Data by Sisense supports dozens of R and Python libraries made for data analysis and visualization, ready and waiting for your next data project!



Govind Rajagopalan
Govind is Senior Engineering Manager at Sisense. He is a passionate software professional living in the East Bay with his wife and daughter. He is excited to teach, help his teammates thrive and have fun improving his craft. In his leisure time, he enjoys an outing with his family, hiking, and exploring parks around the Bay Area.