
Confusion Matrix Calculator

Calculates all key classification metrics — accuracy, precision, recall, F1 score, specificity, and MCC — from a binary confusion matrix of TP, FP, TN, and FN values.

Formula

TP (True Positives): correctly predicted positive instances.
TN (True Negatives): correctly predicted negative instances.
FP (False Positives): negative instances incorrectly predicted as positive (Type I error).
FN (False Negatives): positive instances incorrectly predicted as negative (Type II error).

Accuracy measures overall correctness. Precision measures what fraction of predicted positives are truly positive. Recall (Sensitivity) measures what fraction of actual positives were correctly identified. F1 Score is the harmonic mean of precision and recall, balancing both. Specificity measures what fraction of actual negatives were correctly identified. MCC (Matthews Correlation Coefficient) provides a balanced metric even on imbalanced datasets, ranging from -1 (inverse prediction) to +1 (perfect prediction).
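In symbols, the definitions above correspond to:

```latex
\begin{aligned}
\text{Accuracy} &= \frac{TP + TN}{TP + FP + TN + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \qquad
\text{Recall} = \frac{TP}{TP + FN} \qquad
\text{Specificity} = \frac{TN}{TN + FP} \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \\
\text{MCC} &= \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}
\end{aligned}
```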

Source: Powers, D.M.W. (2011). Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation. Journal of Machine Learning Technologies, 2(1), 37–63.

How it works

A confusion matrix is a 2×2 contingency table that summarizes the performance of a binary classification model by comparing its predicted labels against the actual ground-truth labels. The four cells — True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN) — capture every possible outcome of a binary prediction. From these four numbers, a rich suite of metrics can be derived that describe completely different aspects of classifier behaviour, from its tendency to produce false alarms to its ability to detect rare events.

Accuracy is the most intuitive metric, measuring the fraction of all predictions that were correct. However, on imbalanced datasets (e.g., 95% negative examples), a naive classifier that always predicts negative can achieve 95% accuracy while being completely useless. This is why precision, recall, specificity, F1 score, and the Matthews Correlation Coefficient (MCC) are critical complements. Precision (Positive Predictive Value) answers: of all predicted positives, how many were correct? Recall (Sensitivity, True Positive Rate) answers: of all actual positives, how many did the model find? The F1 score is the harmonic mean of precision and recall, penalising extreme imbalances between the two. Specificity (True Negative Rate) measures the model's ability to correctly reject negative instances. The MCC is widely regarded as the single most reliable metric for binary classification on imbalanced data, as it accounts for all four cells of the matrix and yields a value between -1 and +1.

These metrics are used across a vast range of domains. In medical diagnostics, recall (sensitivity) is paramount — missing a disease (FN) is typically far more costly than a false alarm (FP). In spam filtering, precision matters more — falsely flagging legitimate email as spam (FP) is annoying and costly. In fraud detection, both matter greatly. The False Positive Rate (FPR) and False Negative Rate (FNR) are also key inputs to Receiver Operating Characteristic (ROC) analysis and the construction of ROC curves, which visualise trade-offs across decision thresholds.

Worked example

Suppose a medical diagnostic model is tested on 145 patients. Of 55 patients who actually have the disease, the model correctly identifies 50 (TP = 50) and misses 5 (FN = 5). Of 90 healthy patients, the model correctly labels 80 (TN = 80) as healthy but incorrectly flags 10 (FP = 10) as diseased. Enter TP = 50, FP = 10, TN = 80, FN = 5.

Accuracy = (50 + 80) / (50 + 10 + 80 + 5) = 130 / 145 = 89.66%

Precision = 50 / (50 + 10) = 50 / 60 = 83.33% — of patients flagged as diseased, 83.33% truly are.

Recall = 50 / (50 + 5) = 50 / 55 = 90.91% — the model catches 90.91% of all actual disease cases.

Specificity = 80 / (80 + 10) = 80 / 90 = 88.89% — it correctly clears 88.89% of healthy patients.

F1 Score = 2 × (0.8333 × 0.9091) / (0.8333 + 0.9091) = 2 × 0.7576 / 1.7424 = 86.96%

MCC = (50 × 80 − 10 × 5) / √(60 × 55 × 90 × 85) = (4000 − 50) / √(25,245,000) = 3950 / 5024.4 ≈ 0.7862 — a strong positive correlation between predictions and reality.

The high MCC of 0.786 confirms the model is genuinely predictive, not just exploiting class imbalance.
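The arithmetic above can be double-checked in a few lines of Python (the variable names are mine, not the calculator's):

```python
import math

# Worked-example counts: TP = 50, FP = 10, TN = 80, FN = 5.
tp, fp, tn, fn = 50, 10, 80, 5

print(round((tp + tn) / (tp + fp + tn + fn), 4))  # accuracy    -> 0.8966
print(round(tp / (tp + fp), 4))                   # precision   -> 0.8333
print(round(tp / (tp + fn), 4))                   # recall      -> 0.9091
print(round(tn / (tn + fp), 4))                   # specificity -> 0.8889

p, r = tp / (tp + fp), tp / (tp + fn)
print(round(2 * p * r / (p + r), 4))              # F1  -> 0.8696

denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
print(round((tp * tn - fp * fn) / denom, 4))      # MCC -> 0.7862
```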

Limitations & notes

This calculator is designed for binary (two-class) classification only. Multi-class problems require an N×N confusion matrix and per-class metric averaging strategies (macro, micro, weighted), which are not covered here. All metrics assume a fixed decision threshold — typically 0.5 for probabilistic classifiers — and results will vary significantly if the threshold is tuned. When any denominator equals zero (e.g., TP + FP = 0 when no positives are predicted), precision and related metrics are undefined (NaN); the calculator flags this automatically. When any row or column of the confusion matrix sums to zero, the MCC denominator vanishes; the calculator then reports MCC = 0, a widely used convention (adopted, for example, by scikit-learn). For highly imbalanced datasets, even a large MCC should be interpreted alongside domain context. These metrics describe average performance on a test set and do not characterise model calibration, uncertainty, or behaviour on out-of-distribution data. Ensure that the test set is representative of the deployment population to avoid misleading conclusions.
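A sketch of the edge-case handling described above — undefined precision as NaN, and the MCC-equals-zero convention when the denominator vanishes (helper names are mine, not the calculator's):

```python
import math

def safe_precision(tp, fp):
    """Precision is undefined (NaN) when nothing is predicted positive."""
    return tp / (tp + fp) if tp + fp else float("nan")

def safe_mcc(tp, fp, tn, fn):
    """Return 0 by convention when any row/column sum is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(safe_precision(0, 0))    # nan: no positive predictions at all
print(safe_mcc(0, 0, 90, 10))  # 0.0: the predicted-positive column is empty
```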

Frequently asked questions

What is the difference between precision and recall, and which matters more?

Precision measures how often a positive prediction is correct (low FP rate), while recall measures how often actual positives are found (low FN rate). Which matters more is entirely domain-dependent: in cancer screening, high recall is critical to avoid missing cases; in email spam filtering, high precision is preferred to avoid blocking legitimate emails. The F1 score balances both when neither can be prioritised alone.

Why is accuracy a misleading metric for imbalanced datasets?

On an imbalanced dataset where 99% of samples are negative, a classifier that always predicts 'negative' achieves 99% accuracy while having 0% recall — it never detects a single positive case. Metrics like precision, recall, F1, and MCC account for the distribution of classes and provide a more honest picture of model performance.
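The always-negative classifier described above can be reproduced in a few lines (synthetic labels of my own construction):

```python
# 99 negatives, 1 positive; the model predicts negative for everything.
actual    = [0] * 99 + [1]
predicted = [0] * 100

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))

accuracy = (tp + tn) / len(actual)
recall = tp / (tp + fn)
print(accuracy)  # 0.99 -- looks excellent
print(recall)    # 0.0  -- but it never finds a single positive
```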

What is a good MCC value for a binary classifier?

MCC ranges from -1 to +1. An MCC of 0 indicates performance no better than random guessing. Values above 0.5 are generally considered good, above 0.7 are strong, and above 0.9 are excellent. Unlike F1 score, MCC is symmetric with respect to both classes, making it particularly reliable for imbalanced problems.

What is the relationship between recall and sensitivity, and between specificity and the false positive rate?

Recall and sensitivity are exactly the same metric — both equal TP / (TP + FN). Specificity (TNR) and the False Positive Rate (FPR) are complements: FPR = 1 − Specificity = FP / (FP + TN). These relationships are foundational to ROC curve analysis, where the TPR (recall) is plotted against the FPR across decision thresholds.
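The complement identity can be verified numerically with the worked example's counts (FP = 10, TN = 80):

```python
fp, tn = 10, 80
specificity = tn / (tn + fp)  # TNR = 80/90
fpr = fp / (fp + tn)          # FPR = 10/90
print(round(specificity + fpr, 10))  # 1.0 -- FPR = 1 - specificity
```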

How do I choose the right classification threshold to optimise these metrics?

Most probabilistic classifiers output a score between 0 and 1, and the default threshold of 0.5 is rarely optimal. You can plot the ROC curve (TPR vs. FPR) or the Precision-Recall curve across all thresholds and choose the operating point that best matches your application's cost structure. Lowering the threshold increases recall but reduces precision; raising it does the opposite. Use this calculator to evaluate the full metric suite at each candidate threshold.
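As a minimal sketch of that sweep (the scores and labels are synthetic, invented for illustration), evaluating precision and recall at several candidate thresholds makes the trade-off concrete:

```python
# Synthetic classifier scores with their true labels (1 = positive).
scores = [0.95, 0.9, 0.8, 0.6, 0.55, 0.4, 0.3, 0.1]
labels = [1,    1,   1,   0,   1,    0,   1,   0]

for threshold in (0.25, 0.5, 0.7):
    preds = [int(s >= threshold) for s in scores]
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum((not p) and l for p, l in zip(preds, labels))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")

# Lowering the threshold trades precision for recall:
#   threshold=0.25: precision=0.71, recall=1.00
#   threshold=0.5:  precision=0.80, recall=0.80
#   threshold=0.7:  precision=1.00, recall=0.60
```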

Last updated: 2025-01-15 · Formula verified against primary sources.