A Digital Magazine from the IT Department

Confusion Matrix and Cyber Crimes - By Tanmay Chauhan

What is a Confusion Matrix? A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known. The confusion matrix itself is relatively simple to understand, but the related terminology can be confusing.

I wanted to create a “quick reference guide” for confusion matrix terminology because I couldn’t find an existing resource that suited my requirements: compact in presentation, using numbers instead of arbitrary variables, and explained both in terms of formulas and sentences.

Let’s start with an example confusion matrix for a binary classifier (though it can easily be extended to the case of more than two classes):

    n = 165         Predicted: NO    Predicted: YES
    Actual: NO           50                10
    Actual: YES           5               100

What can we learn from this matrix?

  • There are two possible predicted classes: “yes” and “no”. If we were predicting the presence of a disease, for example, “yes” would mean they have the disease, and “no” would mean they don’t have the disease.
  • The classifier made a total of 165 predictions (e.g., 165 patients were being tested for the presence of that disease).
  • Out of those 165 cases, the classifier predicted “yes” 110 times, and “no” 55 times.
  • In reality, 105 patients in the sample have the disease, and 60 patients do not.

Let’s now define the most basic terms, which are whole numbers (not rates):

  • true positives (TP): These are cases in which we predicted yes (they have the disease), and they do have the disease.
  • true negatives (TN): We predicted no, and they don’t have the disease.
  • false positives (FP): We predicted yes, but they don’t actually have the disease. (Also known as a “Type I error.”)
  • false negatives (FN): We predicted no, but they actually do have the disease. (Also known as a “Type II error.”)

I’ve added these terms to the confusion matrix, and also added the row and column totals:

    n = 165         Predicted: NO    Predicted: YES    Total
    Actual: NO         TN = 50          FP = 10          60
    Actual: YES        FN = 5           TP = 100        105
    Total                 55              110
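In practice, these four counts are usually computed directly from the lists of actual and predicted labels rather than by hand. Here is a minimal sketch, assuming scikit-learn is available and using a tiny illustrative set of labels (not the 165-patient example above):

    # Counting TP, TN, FP, FN from predicted vs. actual labels.
    # The tiny label lists below are illustrative only.
    from sklearn.metrics import confusion_matrix

    y_actual =    ["yes", "no", "yes", "no", "yes", "no",  "no", "yes"]
    y_predicted = ["yes", "no", "no",  "no", "yes", "yes", "no", "yes"]

    # Rows are actual classes, columns are predicted classes.
    # With labels=["no", "yes"] the layout is [[TN, FP], [FN, TP]].
    cm = confusion_matrix(y_actual, y_predicted, labels=["no", "yes"])
    tn, fp, fn, tp = cm.ravel()
    print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")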

This is a list of rates that are often computed from a confusion matrix for a binary classifier:

Accuracy: Overall, how often is the classifier correct?

  • (TP+TN)/total = (100+50)/165 = 0.91

Misclassification Rate: Overall, how often is it wrong?

  • (FP+FN)/total = (10+5)/165 = 0.09
  • equivalent to 1 minus Accuracy
  • also known as “Error Rate”

True Positive Rate: When it’s actually yes, how often does it predict yes?

  • TP/actual yes = 100/105 = 0.95
  • also known as “Sensitivity” or “Recall”

False Positive Rate: When it’s actually no, how often does it predict yes?

  • FP/actual no = 10/60 = 0.17

True Negative Rate: When it’s actually no, how often does it predict no?

  • TN/actual no = 50/60 = 0.83
  • equivalent to 1 minus False Positive Rate
  • also known as “Specificity”

Precision: When it predicts yes, how often is it correct?

  • TP/predicted yes = 100/110 = 0.91

Prevalence: How often does the yes condition actually occur in our sample?

  • actual yes/total = 105/165 = 0.64
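All of these rates follow directly from the four basic counts, so they are easy to verify in a few lines of Python using the example numbers above (TP = 100, TN = 50, FP = 10, FN = 5):

    # Reproducing the rates above from the example counts.
    TP, TN, FP, FN = 100, 50, 10, 5
    total = TP + TN + FP + FN        # 165
    actual_yes = TP + FN             # 105
    actual_no = TN + FP              # 60
    predicted_yes = TP + FP          # 110

    accuracy = (TP + TN) / total                # 0.91
    misclassification_rate = (FP + FN) / total  # 0.09, i.e. 1 - accuracy
    true_positive_rate = TP / actual_yes        # 0.95 (Sensitivity / Recall)
    false_positive_rate = FP / actual_no        # 0.17
    true_negative_rate = TN / actual_no         # 0.83 (Specificity)
    precision = TP / predicted_yes              # 0.91
    prevalence = actual_yes / total             # 0.64

    for name, value in [("Accuracy", accuracy),
                        ("Misclassification Rate", misclassification_rate),
                        ("True Positive Rate", true_positive_rate),
                        ("False Positive Rate", false_positive_rate),
                        ("True Negative Rate", true_negative_rate),
                        ("Precision", precision),
                        ("Prevalence", prevalence)]:
        print(f"{name}: {value:.2f}")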

Confusion Matrix Errors

Confusion matrices have two types of errors: Type I and Type II. There are two easy ways to remember which is which.

The first way is to re-write False Positive and False Negative. False Positive = “False True,” which has only one F, so it is a Type I error. False Negative = “False False,” which has two F’s, so it is a Type II error. (Kudos to Riley Dallas for this method!)

The second way is to consider the meanings of these words. False Positive contains one negative word (False) so it’s a Type I error. False Negative has two negative words (False + Negative) so it’s a Type II error.

Cyber Crimes

Particularly in the last decade, Internet usage has grown rapidly. However, as the Internet becomes part of day-to-day activities, cybercrime is also on the rise: according to a 2020 Cybersecurity Ventures report, cybercrime will cost nearly $6 trillion per annum by 2021. Cybercriminals use networked computing devices as their primary means of reaching victims’ devices, exploiting system vulnerabilities for financial gain, publicity, and other benefits. Evaluating cybercrime attacks and providing protective measures through manual methods, existing technical approaches, and investigations has often failed to keep these attacks under control. The existing literature on cybercrime offenses also lacks computational methods for predicting cybercrime, especially on unstructured data. This study therefore proposes a flexible computational tool that uses machine learning techniques to analyze state-wise cybercrime rates in a country and to classify cybercrimes. Security analytics, combined with data analytics approaches, helps analyze and classify offenses from India-based integrated data that may be either structured or unstructured. The main strength of this work is its testing analysis reports, which show the offenses being classified with 99 percent accuracy.
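The study’s exact pipeline is not described above, but the general shape of such a classifier is familiar: turn the offense text into features, fit a model, and evaluate it with a confusion matrix. The categories, sample texts, and model choice below are hypothetical illustrations, not the study’s actual data or method:

    # A generic text-classification sketch evaluated with a confusion matrix.
    # Offense categories and descriptions here are invented for illustration.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.metrics import confusion_matrix

    train_texts = [
        "credit card details stolen through a phishing email",
        "fake online shopping site took payment and disappeared",
        "victim harassed repeatedly on social media",
        "abusive messages sent from anonymous accounts",
    ]
    train_labels = ["fraud", "fraud", "harassment", "harassment"]

    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(train_texts, train_labels)

    test_texts = ["phishing link asked for the victim's bank password",
                  "threatening messages posted on the victim's profile"]
    test_labels = ["fraud", "harassment"]
    predictions = model.predict(test_texts)

    # Rows are actual classes, columns are predicted classes.
    print(confusion_matrix(test_labels, predictions, labels=["fraud", "harassment"]))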

Confusion Matrix and Cyber Security

In May 2016, ProPublica published a study entitled “Machine Bias: There’s software used across the country to predict future criminals. And it’s biased against blacks.” [5]. The article assesses an algorithm called COMPAS, which is used to predict recidivism in criminal cases. Accompanying the article is a GitHub repository containing both the data and the data analysis methods used to assess the COMPAS algorithm’s performance.

The ProPublica article makes the following claim:

“We also turned up significant racial disparities, just as Holder feared. In forecasting who would re-offend, the algorithm made mistakes with black and white defendants at roughly the same rate but in very different ways.

  • The formula was particularly likely to falsely flag black defendants as future criminals, wrongly labeling them this way at almost twice the rate as white defendants.
  • White defendants were mislabeled as low risk more often than black defendants.”

Claims about algorithmic bias are concerning and this article has been heavily cited. After all, as a matter of fairness, we might expect errors in both directions to be equally distributed under an unbiased algorithm. ProPublica makes its allegations based on differences in the False Positive Rates and False Negative Rates in the Confusion Matrices for Black and White subpopulations.

Examining the Data Source, Broward Recidivism — COMPAS model, in the Confusion Matrix Dashboard, we verify that, indeed, at a threshold of 4, the False Positive Rate for Black defendants is 0.42 while for White defendants it is only 0.22. And the False Negative Rate for White defendants is 0.5 while for Black defendants it is 0.28.
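These per-group rates can be recomputed from ProPublica’s published data in a few lines. The sketch below assumes the column names used in their released CSV (decile_score, two_year_recid, race) and treats a decile score above 4 as a “high risk” prediction; adjust the file name, columns, and threshold handling to match the actual data:

    # Per-group False Positive Rate and False Negative Rate at a decile
    # threshold of 4. Column names and the threshold rule are assumptions
    # based on ProPublica's published dataset.
    import pandas as pd

    df = pd.read_csv("compas-scores-two-years.csv")
    df["predicted_high_risk"] = df["decile_score"] > 4

    for group in ["African-American", "Caucasian"]:
        sub = df[df["race"] == group]
        actual_yes = sub["two_year_recid"] == 1
        actual_no = ~actual_yes
        predicted_yes = sub["predicted_high_risk"]

        fpr = (predicted_yes & actual_no).sum() / actual_no.sum()     # FP / actual no
        fnr = (~predicted_yes & actual_yes).sum() / actual_yes.sum()  # FN / actual yes
        print(f"{group}: FPR={fpr:.2f}, FNR={fnr:.2f}")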

The COMPAS algorithm assigns prediction scores with a relatively even distribution across the decile range of 1–10; unlike the Apple Snacks synthetic data, the distributions are not bell shaped. The most apparent difference between the Black and White defendant populations, apart from Base Rate, is that Black defendants’ scores are spread approximately uniformly across the deciles, while White defendants are more heavily represented at the lower decile scores, with counts decreasing steadily as the decile score increases. You can approximate these distributions using the Approximately Linear Data Source dropdown of the Confusion Matrix Dashboard, and see how the terms in the Confusion Matrix respond.
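To get a feel for how the score distribution interacts with the threshold, here is a small synthetic experiment (not the COMPAS data): one group’s decile scores are roughly uniform, the other’s are skewed toward the low deciles, and we watch the False Positive Rate and True Positive Rate move as the threshold changes:

    # Synthetic illustration of how confusion-matrix rates respond to the
    # decision threshold under different score distributions.
    import numpy as np

    rng = np.random.default_rng(0)

    # Decile scores 1-10: one group roughly uniform, one skewed toward low scores.
    uniform_scores = rng.integers(1, 11, size=5000)
    weights = np.linspace(10, 1, 10)                  # more weight on low deciles
    skewed_scores = rng.choice(np.arange(1, 11), size=5000, p=weights / weights.sum())

    # Synthetic outcomes: higher scores re-offend more often (illustrative only).
    def outcomes(scores):
        return rng.random(scores.size) < scores / 12.0

    for name, scores in [("uniform", uniform_scores), ("skewed-low", skewed_scores)]:
        actual = outcomes(scores)
        for threshold in (2, 4, 6, 8):
            predicted = scores > threshold
            fpr = (predicted & ~actual).sum() / (~actual).sum()
            tpr = (predicted & actual).sum() / actual.sum()
            print(f"{name:10s} threshold={threshold}: FPR={fpr:.2f}, TPR={tpr:.2f}")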

Tanmay Chauhan