
Statistical Power Calculator

Calculates the statistical power of a hypothesis test given sample size, effect size, significance level, and test type.


Formula

Power (1 − β) is the probability of correctly rejecting a false null hypothesis. Here, β is the Type II error rate; z_α is the critical z-value corresponding to significance level α (one-tailed: z_α = 1.645 for α = 0.05; two-tailed: z_α/2 = 1.96); δ is the true difference between means (the effect); σ is the population standard deviation; and n is the sample size per group. For a standardized effect size d = δ/σ and two independent groups of n subjects each, the non-centrality parameter λ = d√(n/2) drives the power calculation.
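In symbols, the relationship described above can be written compactly (a restatement of the definitions in this section; Φ is the standard normal CDF, and the negligible opposite-tail term of the two-tailed test is dropped):

```latex
\mathrm{Power} \;=\; 1 - \beta \;\approx\; \Phi\!\left( d\sqrt{\tfrac{n}{2}} \;-\; z_{\alpha/2} \right),
\qquad d = \frac{\delta}{\sigma},
\qquad \lambda = d\sqrt{\tfrac{n}{2}}
```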

Source: Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum Associates.

How it works

Statistical power is the complement of the Type II error rate (β): the probability of rejecting a false null hypothesis. A test with 80% power has a 20% chance of failing to detect a true effect — the conventional minimum acceptable threshold in most research fields. Power depends on four interrelated quantities: the effect size, the sample size, the significance level (α), and whether the test is one-tailed or two-tailed. Increasing any of the first three factors raises power; tightening α (e.g., from 0.05 to 0.01) reduces it.

For a two-sample z-test or large-sample t-test with n subjects per group, the non-centrality parameter is λ = d√(n/2), where d is Cohen's standardized effect size (d = δ/σ). Power is approximately the probability that a standard normal variable exceeds the threshold z_α − λ, i.e. Φ(λ − z_α); the tiny contribution from the opposite rejection tail is ignored. This calculator uses the standard normal CDF approximation (Abramowitz & Stegun 7.1.26) to evaluate that probability, giving results accurate to well within ±0.0001 across typical input ranges. It also back-calculates the per-group sample size required to achieve 80% power, n = 2((z_α + z_β)/d)² with z_β = 0.8416, as a planning reference.
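As a sketch of the same computation (not the site's actual code), Python's standard library `statistics.NormalDist` provides the normal CDF and its inverse directly, so no Abramowitz & Stegun polynomial is needed:

```python
from math import ceil, sqrt
from statistics import NormalDist

_Z = NormalDist()  # standard normal distribution

def power_two_sample(d: float, n: int, alpha: float = 0.05,
                     two_tailed: bool = True) -> float:
    """Approximate power of a two-sample z-test with n subjects per group
    and standardized effect size d = delta/sigma (Cohen's d)."""
    z_alpha = _Z.inv_cdf(1 - alpha / 2) if two_tailed else _Z.inv_cdf(1 - alpha)
    lam = d * sqrt(n / 2)              # non-centrality parameter
    return _Z.cdf(lam - z_alpha)       # opposite-tail term is negligible

def required_n(d: float, power: float = 0.80, alpha: float = 0.05,
               two_tailed: bool = True) -> int:
    """Per-group sample size needed to reach the target power."""
    z_alpha = _Z.inv_cdf(1 - alpha / 2) if two_tailed else _Z.inv_cdf(1 - alpha)
    z_beta = _Z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)
```

For example, `power_two_sample(0.5, 64)` returns about 0.807, and `required_n(0.5)` returns 63, matching the classic benchmark of roughly 64 subjects per group for 80% power at a medium effect.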

Cohen's d benchmarks provide practical guidance: d = 0.2 is considered a small effect, d = 0.5 medium, and d = 0.8 large. Clinical and safety-critical studies often target 90% or 95% power. A priori power analysis performed before data collection is far more defensible than post hoc power computed after a non-significant result: post hoc power is a deterministic function of the observed p-value and carries no additional information.

Worked example

Suppose a researcher is designing a randomized controlled trial comparing a new drug to placebo. Based on prior literature, they expect a standardized mean difference of d = 0.5 (medium effect). They plan to use a two-tailed test at α = 0.05 and recruit n = 64 participants per group.

Step 1 — Critical z-value: For a two-tailed test at α = 0.05, z_α = 1.96.

Step 2 — Non-centrality parameter: λ = d × √(n/2) = 0.5 × √32 ≈ 0.5 × 5.657 ≈ 2.828.

Step 3 — z for power: z_power = λ − z_α = 2.828 − 1.96 = 0.868.

Step 4 — Power from normal CDF: Φ(0.868) ≈ 0.807, meaning the study has approximately 80.7% power to detect this medium effect with n = 64 per group.

Step 5 — Type II error rate: β = 1 − 0.807 = 0.193, a roughly 19% chance of a false negative.

Step 6 — Required n for 80% power: n = 2 × ((z_α + z_β) / d)² = 2 × ((1.96 + 0.8416) / 0.5)² = 2 × 5.6032² ≈ 62.8, rounded up to 63 per group. The researcher's planned n = 64 per group just clears this threshold, confirming the study is adequately powered at the conventional 80% level.
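The steps above can be reproduced with a few lines of Python (a sketch using the standard library's `NormalDist`; hand-rounded intermediate values may differ in the last digit):

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()

d, n, alpha = 0.5, 64, 0.05          # planned design
z_alpha = Z.inv_cdf(1 - alpha / 2)   # Step 1: 1.96 (two-tailed)
lam = d * sqrt(n / 2)                # Step 2: non-centrality, ~2.828
power = Z.cdf(lam - z_alpha)         # Steps 3-4: ~0.807
beta = 1 - power                     # Step 5: ~0.193
z_beta = Z.inv_cdf(0.80)             # 0.8416 for 80% power
n_req = ceil(2 * ((z_alpha + z_beta) / d) ** 2)   # Step 6: 63 per group

print(f"power={power:.3f}, beta={beta:.3f}, required n={n_req}")
```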

Limitations & notes

This calculator assumes a two-sample z-test framework with equal group sizes and a known (or well-estimated) effect size. For small samples (n < 30 per group), a t-distribution with finite degrees of freedom should be used, and power will be slightly lower than reported here. The effect size input (Cohen's d) must be estimated from prior research or pilot data; poorly chosen effect sizes are the most common source of error in power analyses. The required-n output targets exactly 80% power; for higher thresholds (90%, 95%), use n = 2((z_α + z_β)/d)² per group with z_β = 1.2816 or 1.6449 respectively. This tool does not cover non-normal outcomes, paired designs, ANOVA, chi-squared tests, or survival analyses, each of which requires its own power formula. Post hoc power analysis (computing power after collecting data) is discouraged by most statisticians and journal editors, as it provides no information beyond the p-value already observed.
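The back-calculation for higher power targets can be sketched as follows (the z_β values come from the normal inverse CDF rather than the hard-coded constants; d = 0.5 is used purely for illustration):

```python
from math import ceil
from statistics import NormalDist

Z = NormalDist()
z_alpha = Z.inv_cdf(0.975)           # two-tailed test at alpha = 0.05
d = 0.5                              # medium effect, for illustration

results = {}
for target in (0.80, 0.90, 0.95):
    z_beta = Z.inv_cdf(target)       # 0.8416, 1.2816, 1.6449
    results[target] = ceil(2 * ((z_alpha + z_beta) / d) ** 2)
    print(f"{target:.0%} power: n = {results[target]} per group")
```

For a medium effect this yields 63, 85, and 104 subjects per group for 80%, 90%, and 95% power respectively, showing how quickly recruitment costs grow with the power target.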

Frequently asked questions

What is a good statistical power level for a study?

The widely accepted minimum is 80% power (β = 0.20), as established by Cohen (1988). Clinical trials and studies with high-stakes decisions often require 90% or even 95% power to reduce the risk of false negatives. Regulatory agencies such as the FDA typically require at least 80–90% power for confirmatory trials.

What is the difference between Type I and Type II errors?

A Type I error (false positive) occurs when you reject a true null hypothesis; its probability is α. A Type II error (false negative) occurs when you fail to reject a false null hypothesis; its probability is β. Statistical power equals 1 − β, so higher power means fewer false negatives. Researchers must balance both error types depending on the consequences of each mistake.

How do I choose the right effect size for my power analysis?

Effect size should be estimated from previous studies, meta-analyses, or domain expertise — not from your own pilot data, which can be unreliable due to small sample sizes. Cohen's benchmarks (small d = 0.2, medium d = 0.5, large d = 0.8) are useful starting points when no prior data exists. Always use the smallest effect size that would be clinically or practically meaningful, not the largest one you might hope for.

Why does increasing sample size increase statistical power?

Larger samples produce more precise estimates of the population parameter, reducing the standard error (σ√(2/n) for the difference of two group means). This narrows the sampling distribution under the null hypothesis and shifts the non-centrality parameter λ = d√(n/2) upward, making it easier to distinguish the true effect from random noise. The non-centrality parameter scales with √n, so quadrupling the sample size doubles λ (though power itself saturates near 1).
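The √n relationship is easy to confirm numerically (a sketch; each step quadruples n, which exactly doubles λ while power climbs toward its ceiling of 1):

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()
d = 0.3                              # small-to-medium effect, for illustration
z_alpha = Z.inv_cdf(0.975)           # two-tailed alpha = 0.05

rows = []
for n in (25, 100, 400):             # each entry quadruples the previous n
    lam = d * sqrt(n / 2)            # non-centrality parameter
    rows.append((n, lam, Z.cdf(lam - z_alpha)))
    print(f"n={n:4d}  lambda={lam:.3f}  power={rows[-1][2]:.3f}")
```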

Is it valid to calculate power after seeing a non-significant result?

Post hoc (observed) power is strongly discouraged by statisticians because it is a deterministic function of the p-value — a non-significant result always corresponds to low observed power, providing no independent information. If your study was non-significant, report the confidence interval for the effect size instead, which gives readers the range of plausible true effects that are consistent with your data.

Last updated: 2025-01-15 · Formula verified against primary sources.