Precision Recall and ROC Curves for Pregnancy Tests
I’m a data nerd who hates taking chances, so it’s only natural that I’d use this opportunity to talk about analyzing risk in SQL using a major life milestone for many women: the pregnancy test. Whether you want to be pregnant or not, having an incorrect prediction is not something you want to risk.
Precision-Recall curves and ROC curves are frequently used to measure algorithm performance in machine learning and diagnostic healthcare. We can use these tools to measure the predictive accuracy of a model given that the model uses 'binary classification'. Tasks with binary classification can be sorted into two groups on the basis of a classification rule - 'relevant or not relevant', 'true or not true', or in this case ‘pregnant or not pregnant’.
Precision measures the relevancy of the result set - what fraction of the data was in our target group.
Recall measures what proportion of the relevant set was returned - how much of what we were looking for that we found.
As an example, the results of pregnancy tests can have four classifications:
- True positive: a woman is pregnant and is predicted as pregnant
- True negative: a woman is not pregnant and is predicted as not pregnant
- False positive: a woman is not pregnant but is predicted as pregnant, also known as a 'Type 1' error
- False negative: a woman is pregnant but is predicted as not pregnant, also known as a 'Type 2' error
Now imagine if 400 women are tested: 200 are pregnant, 200 are not. If a pregnancy test predicts that 100 women out of the sampled 400 are pregnant, but only 80 out of that 100 predicted are truly pregnant, we can measure precision and recall.
Precision tells us the proportion of the women we’ve predicted as pregnant who are actually pregnant. In this case 80/100 or 80%.
Precision = True Positive * (True Positive + False Positive)
Recall tells us the proportion of women who were pregnant who we accurately predicted were pregnant. In this case it is 80/200 or 40%.
Recall = True Positive * (True Positive + False Negative)
Most pregnancy tests determine whether or not a woman is pregnant by detecting the presence of human chorionic gonadotropin (hCG), a hormone produced by the placenta following implantation. Twenty-five mIU/ml of hCG in the blood is a normal level during the sixth week of pregnancy. However, during the third or fourth week a pregnant woman may have concentrations as low as 5 mIU/ml hCG. Moreover, women who are not pregnant may also exhibit low levels of hCG. hCG can be produced by the pituitary gland in menopausal women as well as by women taking certain prescription medications. With this in mind, at what concentration of hCG should a woman be declared pregnant? A pregnancy test that uses the concentration of hCG as a classifier must find the optimal balance between precision and recall.
High precision signifies that an algorithm returns substantially more relevant results than irrelevant ones, while high recall means that an algorithm returned most of the relevant results. For example, a pregnancy test that uses a classifier of 50 mIU/mL of hCG will have high precision and low recall: they will have very few false negatives but will only capture a small portion of pregnant women. A pregnancy test that uses a classifier of 5 mIU/mL of hCG will have high recall and low precision: they will capture almost the entire proportion of pregnant women but will also capture many who are not pregnant, resulting in false positives. We can use Precision-recall and ROC to compute what hCG concentration is the optimal classifier.
Calculating Precision-Recall in SQL
Now let’s look at the following data. Using hCG concentration as our classifier, we predict whether or not a woman is pregnant and compare that to her actual diagnosis. If we were to sample 1000 women using different concentrations and record their results, we might arrive at a table like:
Using our formulas above, we can use SQL to easily calculate the precision and recall.
, true_positive * 1.0 / (true_positive + false_positive) as precision
, true_positive * 1.0 / (true_positive + false_negative) as recall
Plotting the precision-recall curve, we can confirm that there is a general trendline where the lower the threshold, the greater the recall and the lower the precision. However, we also see that there is a point in the graph, where we can optimize recall without lowering our precision. For example, around 0.2 - 0.86 recall, we keep a fairly consistent precision of 0.5. We want to choose the point where we have the highest recall without sacrificing precision, which is around a recall of 0.87.
We can then cross-reference our precision-recall curve with our hCG level to find the concentration at which we’re getting 0.56 precision and 0.86 recall: 40 mIU/mL hCG.
Thus, using our precision-recall curve, we’re able to find a pretty optimal concentration that we should be using to maximize both precision and recall.
Calculating ROC in SQL
When the response of a diagnostic test is continuous or on an ordinal scale with a minimum of 5 categories we can also measure the sensitivity and specificity to create an ROC curve. A ROC curve is the plot that reveals the full trade-off between the sensitivity and (1-specificity) across a series of cut-off points. Sensitivity, like precision, measures the conditional probability of correctly identifying positive characteristics. It can often be referred to as the true positive test. Specificity, on the other hand, is the true negative test and measures the conditional probability of correctly identifying negative characteristics.
The area under a ROC curve is an effective measure of the validity of a diagnostic test. In a clinical setting it is useful for finding the optimal cut-off to least misclassify diseased or non-diseased subjects, comparing the efficacy of two or more tests for assessing the same disease, and comparing two or more observers measuring the same test.
We can again use SQL to measure both our sensitivity and our 1 - specificity.
, true_positive * 1.0 / (true_positive + false_negative) as sensitivity
, 1 - (true_negative * 1.0 / (true_negative + false_positive)) as “1- specificity”
Using those results, we can measure our true negative rate against our true positive rate. Plotting our hCG levels on the Y2-axis, we can now observe that lowering our threshold to 5 mIU/mL achieves high true positive and false positive rates. Raising the threshold to 50 mIU/mL gives us both low true positive and low false positive rates.
Once again, let’s take this one step further and use our ROC curve to compare the efficacy of two tests for assessing the same condition. As any pregnant woman (or her partner) could tell you, there are many hormones involved when creating a human. hCG is the key hormone that tells the ovaries to shut off ovulation. Progesterone levels rise rapidly in the first trimester, causing the uterine lining to thicken. Estrogen starts to rise rapidly at the end of the first trimester, playing a key role in the development of the fetus.
Let’s take this one step further and use our ROC curve to compare the efficacy of two tests for assessing the same condition. As any pregnant woman (or her partner) could tell you, there are many hormones involved during pregnancy. hCG is the key hormone that tells the ovaries to shut off ovulation. Progesterone levels rise rapidly in the first trimester, causing the uterine lining to thicken. Estrogen starts to rise rapidly at the end of the first trimester, playing a key role in the development of the fetus.
Imagine if a pharmaceutical company decides it now wants to use progesterone levels rather than hCG levels as its classifier. How do we measure if this new classifier should replace the classic hCG? We can plot the two ROC Curves and compare the area under the curves.
The area under the curve measures discrimination, or the test’s ability to correctly classify those with and without the condition. Imagine if we have a group of women already correctly classified as pregnant or non-pregnant. We randomly pick one from the pregnant group and one from the non-pregnant group and run the pregnancy test on both. The patient predicted as ‘non-pregnant’ should be the one from the non-pregnant group. The area under our ROC curve is the percentage of randomly drawn pairs that the test can correctly classify or discriminate.
At a glance, the progesterone curve appears to have a greater area than the hCG curve, which would signify greater accuracy. We do see a point, however, where the two curves cross and we will have to make a decision.
The choice of sensitivity and specificity depend upon clinical setting. For example, to diagnose serious diseases such as cancer in a high risk group tests with greater sensitivity are preferred even if the false positive rate is high, as having false negatives is extremely dangerous. On the other hand, when screening a low risk group where a positive diagnosis would involve an invasive and costly treatment, high specificity would be preferred. Thus using ROC curves to determine the optimal cut-off point of a classifier requires subjective understanding of the risks involved with a misdiagnosis.
Want to discuss this article? Join the Periscope Data Community!