
Computer Science · Machine Learning · Statistical Learning

Naive Bayes Classifier Calculator

Calculates posterior class probabilities using Bayes' theorem with the naive conditional independence assumption across features.


Formula

P(C_k | x_1,...,x_n) is the posterior probability of class C_k given features x_1 through x_n. P(C_k) is the prior probability of class C_k. P(x_i | C_k) is the likelihood of feature x_i given class C_k. The denominator P(x_1,...,x_n) is the evidence, which acts as a normalizing constant across all classes. Because the denominator is the same for every class, classification reduces to the argmax of P(C_k) times the product of likelihoods.
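Written out, the relationship described above is:

```latex
P(C_k \mid x_1, \ldots, x_n)
  = \frac{P(C_k)\,\prod_{i=1}^{n} P(x_i \mid C_k)}{P(x_1, \ldots, x_n)},
\qquad
\hat{y} = \underset{k}{\arg\max}\; P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)
```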

Source: Mitchell, T. (1997). Machine Learning. McGraw-Hill. Chapter 6: Bayesian Learning.

How it works

Bayes' theorem forms the mathematical core of this classifier. Given an observation described by features x₁, x₂, ..., xₙ, the goal is to find the class C_k with the highest posterior probability P(C_k | x₁,...,xₙ). By Bayes' theorem, this posterior is proportional to the prior P(C_k) multiplied by the likelihood of observing the features under that class. The 'naive' independence assumption allows the joint likelihood to be factored into a simple product of individual feature likelihoods P(x₁|C_k) × P(x₂|C_k) × ... × P(xₙ|C_k), drastically reducing computational complexity and required training data.

The full formula is P(C_k | x) = P(C_k) × ∏ P(xᵢ | C_k) / P(x), where P(x) is the marginal probability of the observed feature vector — a constant for all classes that serves only as a normalizing factor. For classification, we compare the unnormalized scores across classes and predict the one with the highest value. When probabilities become very small with many features, practitioners use log-space arithmetic: log P(C_k) + Σ log P(xᵢ | C_k), which converts products into sums and avoids floating-point underflow.
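The log-space trick can be sketched in a few lines of Python. The feature counts and probability values below are illustrative, not from any real dataset; the point is that the direct product underflows while the log score does not:

```python
import math

def log_score(prior, likelihoods):
    """Log-space Naive Bayes score: log P(C) + sum of log P(x_i | C)."""
    return math.log(prior) + sum(math.log(p) for p in likelihoods)

# Illustrative: 300 features, each with a small per-class likelihood.
score_a = log_score(0.5, [0.01] * 300)
score_b = log_score(0.5, [0.012] * 300)

# The direct product 0.5 * 0.01**300 underflows to exactly 0.0 in
# double-precision floats, so both classes would tie at zero.
underflowed = 0.5 * 0.01 ** 300

# The log scores remain finite and comparable; class B wins here.
print(underflowed, score_a < score_b)
```

Because log is monotonic, comparing log scores picks the same class as comparing the underlying posteriors.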

Naive Bayes comes in several variants depending on the data type. The Bernoulli model suits binary features. The Multinomial model is standard for word-count text data. The Gaussian model handles continuous features by modeling each likelihood as a normal distribution. This calculator uses the discrete likelihood variant, where likelihoods for each feature are entered directly as probabilities. Real-world applications include email spam detection (popularized by Paul Graham's spam filter), news article topic classification, medical symptom-based disease diagnosis, and real-time sentiment analysis in NLP pipelines.
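For the Gaussian variant, each likelihood P(x_i | C_k) is the normal density evaluated at the observed value, using the per-class mean and variance estimated from training data. A minimal sketch, with hypothetical class-conditional statistics for a single continuous feature:

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of N(mean, var) at x, used as P(x_i | C_k) for a continuous feature."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical class-conditional stats (not from any real dataset):
# class A: mean 5.0, variance 1.0; class B: mean 8.0, variance 2.0
like_a = gaussian_pdf(6.0, 5.0, 1.0)
like_b = gaussian_pdf(6.0, 8.0, 2.0)
print(like_a > like_b)  # x = 6.0 is more plausible under class A
```

These density values then slot into the same product (or log-sum) as the discrete likelihoods.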

Worked example

Consider a spam email classifier with two classes: Spam (C₁) and Not Spam (C₂). Historical data shows that 40% of emails are spam, so P(C₁) = 0.4 and P(C₂) = 0.6.

An incoming email contains three features: the word 'free', a link, and an all-caps subject line. The estimated likelihoods from training data are:

Feature 1 — word 'free': P(x₁|Spam) = 0.7, P(x₁|NotSpam) = 0.3
Feature 2 — contains link: P(x₂|Spam) = 0.6, P(x₂|NotSpam) = 0.5
Feature 3 — all-caps subject: P(x₃|Spam) = 0.4, P(x₃|NotSpam) = 0.8

Step 1 — Unnormalized scores:
Score(Spam) = 0.4 × 0.7 × 0.6 × 0.4 = 0.0672
Score(NotSpam) = 0.6 × 0.3 × 0.5 × 0.8 = 0.0720

Step 2 — Evidence (normalizing constant):
P(x) = 0.0672 + 0.0720 = 0.1392

Step 3 — Posterior probabilities:
P(Spam | x) = 0.0672 / 0.1392 = 0.4828 (48.3%)
P(NotSpam | x) = 0.0720 / 0.1392 = 0.5172 (51.7%)

Decision: The classifier predicts Not Spam since P(C₂|x) > P(C₁|x). Note that despite strong spam signals from the word 'free', the all-caps feature is much more common in legitimate emails in this dataset, pulling the posterior toward Not Spam.
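The three steps above can be reproduced directly in Python, using the priors and likelihoods from the worked example:

```python
prior = {"spam": 0.4, "not_spam": 0.6}
likelihoods = {
    "spam":     [0.7, 0.6, 0.4],   # 'free', link, all-caps subject
    "not_spam": [0.3, 0.5, 0.8],
}

# Step 1: unnormalized scores = prior times product of likelihoods
scores = {}
for c in prior:
    s = prior[c]
    for p in likelihoods[c]:
        s *= p
    scores[c] = s

# Step 2: evidence (normalizing constant)
evidence = sum(scores.values())

# Step 3: posterior probabilities
posteriors = {c: scores[c] / evidence for c in scores}

print(round(scores["spam"], 4))             # 0.0672
print(round(scores["not_spam"], 4))         # 0.072
print(round(posteriors["spam"], 3))         # 0.483
print(max(posteriors, key=posteriors.get))  # not_spam
```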

Limitations & notes

The conditional independence assumption is the most significant limitation of this classifier. In practice, features are often correlated — for example, the presence of the word 'free' and the word 'money' in an email are not independent events, and treating them as such can distort probability estimates. However, even with violated independence, the classifier often still identifies the correct class. A second limitation is the zero-frequency problem: if a feature value never appears with a given class in the training data, its likelihood is zero, which zeroes out the entire product regardless of all other features. This is addressed using Laplace (additive) smoothing, which adds a small pseudo-count to all feature counts. Additionally, Naive Bayes produces probability estimates that can be poorly calibrated — the posterior values may be very close to 0 or 1 even when true confidence should be moderate. Calibration techniques like Platt scaling or isotonic regression are used post-hoc. Finally, this calculator operates on only two classes and three features; real-world implementations extend trivially to K classes and N features.
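Laplace smoothing is a one-line fix in code. A minimal sketch with hypothetical counts (a word that never appeared among 100 spam training emails, for a binary present/absent feature):

```python
def smoothed_likelihood(feature_count, class_count, n_values, alpha=1.0):
    """P(x_i | C) with additive (Laplace) smoothing:
    (count + alpha) / (class total + alpha * number of possible feature values)."""
    return (feature_count + alpha) / (class_count + alpha * n_values)

# Hypothetical counts: the word appeared in 0 of 100 spam emails.
p_unsmoothed = 0 / 100                       # zero wipes out the whole product
p_smoothed = smoothed_likelihood(0, 100, 2)  # 1/102, small but nonzero
print(p_smoothed)
```

With alpha = 0 the estimate reduces to the raw frequency; larger alpha pulls all estimates toward a uniform distribution.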

Frequently asked questions

Why is Naive Bayes called 'naive'?

The name refers to the naive assumption that all input features are conditionally independent of each other given the class label. In reality, features are almost never truly independent, but the algorithm still performs well in many domains because the class ranking is often preserved even when probability values are distorted.

How do I get the likelihood probabilities to input into this calculator?

Likelihoods are estimated from labeled training data. For each class, count how many training examples in that class exhibit each feature value, then divide by the total number of examples in that class. For continuous features, fit a Gaussian distribution and evaluate its probability density at the observed value.
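The counting procedure described above can be sketched on a toy labeled dataset (the examples below are made up for illustration):

```python
# Toy training set: (has_word_free, label)
training = [
    (True, "spam"), (True, "spam"), (False, "spam"),
    (True, "not_spam"), (False, "not_spam"),
    (False, "not_spam"), (False, "not_spam"),
]

def estimate_likelihood(data, feature_value, label):
    """Fraction of examples in class `label` whose feature equals `feature_value`."""
    in_class = [x for x, y in data if y == label]
    return sum(1 for x in in_class if x == feature_value) / len(in_class)

print(estimate_likelihood(training, True, "spam"))      # 2/3
print(estimate_likelihood(training, True, "not_spam"))  # 1/4
```

In practice these raw frequencies would be combined with Laplace smoothing so that unseen feature-class pairs do not produce zero likelihoods.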

What is the zero-frequency problem and how is it fixed?

If a feature never appears in training examples for a given class, its likelihood becomes zero, making the entire product zero regardless of other evidence. Laplace smoothing (also called additive smoothing) adds a small constant — typically 1 — to the count of every feature-class combination, ensuring no probability is exactly zero.

Why are log scores useful in Naive Bayes?

When multiplying many small probabilities together, the result can underflow to zero in floating-point arithmetic. Taking logarithms converts the product into a sum — log P(C) + Σ log P(xᵢ|C) — which is numerically stable and computationally cheaper. The class with the highest log score is identical to the class with the highest posterior probability.

How does Naive Bayes compare to logistic regression for classification?

Naive Bayes is a generative model that models the joint distribution P(C, x), while logistic regression is a discriminative model that directly models P(C | x). Logistic regression generally achieves higher accuracy when given enough data, but Naive Bayes trains faster, requires far less data, and performs competitively or better when training samples are scarce or when the independence assumption approximately holds.

Last updated: 2025-01-15 · Formula verified against primary sources.