Bayesian screening · interactive

See the math behind testing & treatment

Watch a population flow through a test and a treatment into outcomes. Change any input and every view updates live — from false positives and predictive value to who is actually helped or harmed. No verdicts, just the numbers.

About this tool — what it is and how to read it

Screening and treatment decisions hinge on a few numbers — how common a condition is, how accurate the test is, and how much the treatment helps or harms. Those numbers interact in ways that are easy to misjudge, especially when a condition is rare.

This is a tool to review those numbers. Set a population, a disease prevalence, a test's sensitivity and specificity, and a treatment's benefit and harm, then watch a single cohort flow from population → test → treatment → outcome — true and false positives, predictive value, and how many people end up helped, harmed, or unchanged. Plug in your own figures, or use the cited examples next to each slider.

Key terms

Prevalence (pre-test probability)
How common the condition is in the group being tested. The starting point for everything downstream.
Sensitivity
Of people who have the condition, the share the test correctly flags positive.
Specificity
Of people who do not have it, the share the test correctly clears.
PPV / NPV
Given a positive (or negative) result, the chance it is right. Unlike sensitivity/specificity, these depend heavily on prevalence.
Likelihood ratio (LR)
How much a result shifts the odds. LR+ for a positive, LR− for a negative — and they do not depend on prevalence.
NNT / NNH
Number needed to treat for one person to benefit; number needed to harm for one to be harmed.
NNS
Number needed to screen for one person to be helped — screening, testing, treatment, and outcomes all folded together.

The cascade

A cohort of 1,000 flows left to right: who has the disease, what the test says, who gets treated, and how they end up. Band height = number of people.

[Cascade diagram, cohort of 1,000. Population: 10 diseased, 990 healthy. Test result: 9 true positives, 1 false negative, 99 false positives, 891 true negatives. Treatment: 108 treated, 892 not treated. Outcome: 1 helped, 5 harmed, 102 no change, 1 missed, 891 cleared.]

What happens to everyone

Each square is one person, colored by how they end up.

The test

2×2 confusion matrix

Counts for 1,000 people. Sensitivity reads across the disease row; PPV reads down the test-positive column — they look at the table from perpendicular directions.

             Test +      Test −      Total
Disease +    TP 9        FN 1        10
Disease −    FP 99       TN 891      990
Total        108         892         1,000
How each measure is built from the four cells
  • Sensitivity = TP ÷ (TP + FN), read across the disease + row →
  • Specificity = TN ÷ (TN + FP), read across the disease − row →
  • PPV = TP ÷ (TP + FP), read down the test + column ↓
  • NPV = TN ÷ (TN + FN), read down the test − column ↓
  • LR+ = sensitivity ÷ (1 − specificity)
  • LR− = (1 − sensitivity) ÷ specificity
PPV (if test +): 8.3%
NPV (if test −): 99.9%
Sensitivity: 90.0%
Specificity: 90.0%
LR+: 9.0
LR−: 0.11
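
The cell formulas above can be sketched in a few lines of Python. This is an illustration only, not the tool's own code: the names `TwoByTwo`, `build_table`, and `measures` are hypothetical, and the sketch assumes expected (fractional) counts rather than whole people.

```python
from dataclasses import dataclass


@dataclass
class TwoByTwo:
    tp: float  # diseased, test positive
    fn: float  # diseased, test negative
    fp: float  # healthy, test positive
    tn: float  # healthy, test negative


def build_table(population, prevalence, sensitivity, specificity):
    """Expected count in each cell; fractional people are allowed."""
    diseased = population * prevalence
    healthy = population - diseased
    tp = diseased * sensitivity
    tn = healthy * specificity
    return TwoByTwo(tp=tp, fn=diseased - tp, fp=healthy - tn, tn=tn)


def measures(t):
    """Derive each measure from the four cells, as in the list above."""
    sens = t.tp / (t.tp + t.fn)  # across the disease + row
    spec = t.tn / (t.tn + t.fp)  # across the disease - row
    return {
        "sensitivity": sens,
        "specificity": spec,
        "ppv": t.tp / (t.tp + t.fp),  # down the test + column
        "npv": t.tn / (t.tn + t.fn),  # down the test - column
        "lr_pos": sens / (1 - spec),  # undefined when specificity is 100%
        "lr_neg": (1 - sens) / spec,
    }


table = build_table(1000, 0.01, 0.90, 0.90)
m = measures(table)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With the starting scenario (1,000 people, 1% prevalence, 90% sensitivity and specificity) this should reproduce the figures shown on this page to within floating-point rounding.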

Probability tree

Split the population by disease, then by test result. Each path multiplies to a joint probability; Bayes just compares the two test-positive leaves.

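[Probability tree. Everyone (1,000) splits into Disease + (1.0%) and Disease − (99%). Disease + splits by sensitivity 90% into true positive (0.90%, 9 people) and false negative (0.10%, 1 person). Disease − splits by specificity 90% into true negative (89.1%, 891 people) and false positive (9.9%, 99 people).]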

P(disease | test +) = 0.90% ÷ (0.90% + 9.9%) = 8.3%

How PPV collapses with prevalence

Holding sensitivity and specificity fixed, the value of a positive result depends almost entirely on how common the disease is. The dashed line marks the current prevalence.

[Chart: predictive value (0% to 100%) against disease prevalence (log scale, 0.01% to 50%), with one curve for PPV and one for NPV. At the current prevalence: PPV 8.3%, NPV 99.9%.]

The base-rate fallacy: at 1.00% prevalence, even a 90%/90% test makes a positive result correct only about 8% of the time. Accuracy alone doesn't tell you what a positive result means; the base rate does most of the work.
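
To see the collapse numerically, here is a minimal sketch that sweeps prevalence while holding the test fixed at 90% sensitivity and 90% specificity. The function names `ppv` and `npv` and the list of prevalences are illustrative choices, not taken from the tool.

```python
def ppv(prevalence, sensitivity, specificity):
    """P(disease | test +) by Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


def npv(prevalence, sensitivity, specificity):
    """P(no disease | test -), the mirror image of PPV."""
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    return true_neg / (true_neg + false_neg)


# Same tick marks as the chart's prevalence axis (assumed for illustration).
PREVALENCES = [0.0001, 0.001, 0.01, 0.10, 0.50]

rows = [(p, ppv(p, 0.90, 0.90), npv(p, 0.90, 0.90)) for p in PREVALENCES]
for p, pos, neg in rows:
    print(f"prevalence {p:7.2%}   PPV {pos:6.1%}   NPV {neg:6.1%}")
```

The PPV column climbs steeply as prevalence rises, while sensitivity and specificity never change.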

Fagan nomogram

The likelihood ratio is the lever that turns a pre-test probability into a post-test one — and it doesn't depend on prevalence. Line shown for a positive result (LR+).

[Fagan nomogram: three vertical axes. Pre-test probability (0.1% to 99%), likelihood ratio (0.01 to 1000), and post-test probability (0.1% to 99%).]
Pre-test 1.00% × LR+ 9.0 → Post-test 8.3%

This is Bayes' theorem, in odds form: prior odds × likelihood ratio = posterior odds. The LR is the strength of the evidence (LR+ > 10 “rules in”, LR− < 0.1 “rules out”).
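
The odds-form update is short enough to write out. The sketch below is one way to do it; `post_test_probability` is a hypothetical helper name, and the inputs are the starting scenario's pre-test probability and likelihood ratios.

```python
def post_test_probability(pre_test_probability, likelihood_ratio):
    """Bayes in odds form: prior odds x LR = posterior odds."""
    prior_odds = pre_test_probability / (1 - pre_test_probability)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)


lr_pos = 0.90 / (1 - 0.90)   # sensitivity / (1 - specificity)
lr_neg = (1 - 0.90) / 0.90   # (1 - sensitivity) / specificity

after_positive = post_test_probability(0.01, lr_pos)
after_negative = post_test_probability(0.01, lr_neg)
print(f"after a positive result: {after_positive:.1%}")
print(f"after a negative result: {after_negative:.2%}")
```

The probability after a positive result equals the PPV, and the probability after a negative result equals 1 − NPV, so the nomogram and the 2×2 table are two views of the same calculation.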

The treatment

Treatment outcomes

108 interventions performed — 99 on people who never had the disease and so could not benefit.

Helped: 1
Harmed: 5
Treated, no change: 102
Missed (untreated): 1
Number needed to screen: 1,112
NNT (input): 10
NNH (input): 20
ARR (= 1 / NNT): 10.0%

Per 1,000 screened: about 1 helped, 5 harmed by treatment, 99 false alarms, and 1 missed.

Watch the relative-vs-absolute trap: a large “relative risk reduction” can still mean a large NNT when the baseline risk is low. Benefit (NNT) only means something next to its harms — that's why helped and harmed are always shown on the same denominator here.
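
The outcome counts can be approximated with the sketch below. It is a guess at the arithmetic rather than the tool's actual code: it assumes benefit applies only to treated true positives at a rate of 1/NNT, harm applies to everyone treated at a rate of 1/NNH, the helped and harmed groups do not overlap, and uptake defaults to 100%. The function names are hypothetical.

```python
import math


def treatment_outcomes(true_pos, false_pos, nnt, nnh, uptake=1.0):
    """Expected outcomes among test-positives who go on to treatment."""
    treated = (true_pos + false_pos) * uptake
    helped = (true_pos * uptake) / nnt   # only true positives can benefit
    harmed = treated / nnh               # anyone treated can be harmed
    return {
        "treated": treated,
        "helped": helped,
        "harmed": harmed,
        "no_change": treated - helped - harmed,  # assumes disjoint groups
    }


def number_needed_to_screen(screened, helped):
    """NNS = screened / helped; infinite if nobody is helped."""
    return math.inf if helped == 0 else screened / helped


out = treatment_outcomes(true_pos=9, false_pos=99, nnt=10, nnh=20)
nns = number_needed_to_screen(1000, out["helped"])
print(out)
print(f"NNS = {nns:.1f}")
```

Under these assumptions the expected counts are fractional (0.9 helped, 5.4 harmed, 101.7 unchanged), which round to the whole-person figures shown above. The NNS works out to about 1,111.1; the displayed 1,112 is consistent with rounding up, though that rounding rule is an assumption.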

Repeat testing

Serial testing — the false-alarm pile-up

Repeat a test on a healthy person and the chance of at least one false alarm climbs: 1 − specificityⁿ after n rounds. This assumes independent rounds, which makes it an upper bound for repeats of the same test, because real repeat results are correlated and the true risk is often lower. Testing for many conditions at once adds more chances each round, so the combined risk can be higher. Hover the chart to read any round count.

[Chart: cumulative false-positive risk (0% to 100%) against number of test rounds (1 to 20), showing 1 − specificityⁿ at the current specificity (65.1% after 10 rounds) alongside real-study reference points.]

Reference points (≥1 false positive): Elmore et al., NEJM 1998 — 49% after 10 mammograms · Croswell et al., Ann Fam Med 2009 — ~60% (men) / 49% (women) after 14 multimodal PLCO tests.
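
A minimal sketch of the independent-rounds curve, assuming the starting specificity of 90%. The function name `cumulative_false_positive_risk` and the chosen round counts are illustrative.

```python
def cumulative_false_positive_risk(specificity, rounds):
    """Chance of at least one false positive in `rounds` independent tests."""
    return 1 - specificity ** rounds


curve = {n: cumulative_false_positive_risk(0.90, n) for n in (1, 5, 10, 15, 20)}
for n, risk in curve.items():
    print(f"{n:2d} rounds: {risk:.1%}")
```

Because the formula treats every round as independent, it overstates the risk for correlated repeat tests, which is one reason the study figures above sit below the curve.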

Bayesian updating

Bayesian updating: learning a rate from data

Where do numbers like sensitivity or prevalence come from? You start with a prior belief, observe data, and get a posterior. For a rate, this is exact and runs right here — no simulation: posterior = Beta(α+k, β+n−k).

[Chart: prior, data, and posterior plotted over the rate (0% to 100%), e.g. sensitivity or prevalence.]
Prior: Beta(2,2) · Data: 7/10 · Posterior: Beta(9,5)
Posterior mean: 64.3%
95% credible interval: 39%–86%
Prior mean → data: 50% → 70%
Inputs: α = 2, β = 2, n = 10, k = 7

The posterior mean sits between your prior mean and the observed rate — and the more data you collect, the more the data wins and the tighter the interval. That shrinking uncertainty is what the screening sliders quietly assume away by treating each rate as a fixed number.
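
The conjugate update can be sketched with the standard library alone. The helper names are hypothetical, and the credible interval here is an equal-tailed interval found by simple numerical integration of the Beta density; that is an assumption about method, since the page does not say how its interval is computed.

```python
import math


def beta_update(alpha, beta, n, k):
    """Posterior Beta parameters after k successes in n trials."""
    return alpha + k, beta + n - k


def beta_mean(alpha, beta):
    return alpha / (alpha + beta)


def beta_credible_interval(alpha, beta, mass=0.95, steps=20_000):
    """Equal-tailed interval via midpoint-rule integration of the density."""
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    width = 1.0 / steps
    tail = (1 - mass) / 2
    cumulative = 0.0
    lower, upper = None, 1.0
    for i in range(steps):
        x = (i + 0.5) * width  # midpoints stay strictly inside (0, 1)
        log_pdf = log_norm + (alpha - 1) * math.log(x) + (beta - 1) * math.log(1 - x)
        cumulative += math.exp(log_pdf) * width
        if lower is None and cumulative >= tail:
            lower = x
        if cumulative >= 1 - tail:
            upper = x
            break
    return lower, upper


posterior = beta_update(2, 2, n=10, k=7)       # prior Beta(2,2), data 7/10
mean = beta_mean(*posterior)
low, high = beta_credible_interval(*posterior)
print(f"posterior Beta{posterior}, mean {mean:.1%}, 95% interval {low:.0%} to {high:.0%}")
```

Raising n while keeping k/n fixed pulls the posterior mean toward the observed rate and narrows the interval, which is the behavior described above.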

How to use this calculator

  1. Set the population — how many people you screen.
  2. Set the prevalence — the pre-test probability, i.e. how common the condition is in that group. This is the single biggest driver of predictive value.
  3. Set the test’s sensitivity and specificity — how well it catches the condition and how well it clears the healthy.
  4. Set treatment uptake and the NNT / NNH — how many test-positives go on to treatment, and how often that treatment helps or harms.

Every view — the cascade, the 2×2 confusion matrix, the Bayes tree, the prevalence→PPV curve, the Fagan nomogram, the serial-testing curve, and the helped / harmed outcomes — recomputes live as you change any input. Use the cited examples next to each control to drop in real-world figures.

A worked example: why a positive test can still be a false alarm

Take 1,000 people, a condition with 1% prevalence, and a test that is 90% sensitive and 90% specific — the values this tool starts with. Of the 1,000, about 10 have the condition and 990 do not. The test correctly flags 9 of the 10 true cases (true positives) but also wrongly flags 99 of the 990 healthy people (false positives). So 108 people get a positive result, yet only 9 actually have the condition — a positive predictive value of about 8%. The result feels alarming, but most positives are false. That gap between a test’s accuracy and what a positive result actually means is the base-rate fallacy, and it is the whole point of this tool.

Frequently asked questions

What is positive predictive value (PPV)?

PPV is the probability that someone who tests positive truly has the condition. Unlike sensitivity and specificity — which are properties of the test — PPV also depends heavily on prevalence: the rarer the condition, the lower the PPV, even for a very accurate test.

Why can a positive test still mean you probably don’t have the condition?

When a condition is rare, the healthy group is so much larger than the sick group that even a small false-positive rate produces more false positives than true positives. In the worked example above, a 90% / 90% test at 1% prevalence gives a PPV of only about 8%.

How do you calculate PPV from sensitivity, specificity, and prevalence?

By Bayes’ theorem: PPV = (sensitivity × prevalence) ÷ [ sensitivity × prevalence + (1 − specificity) × (1 − prevalence) ]. The negative predictive value (NPV) is the mirror image for people who test negative.

What is a likelihood ratio, and why doesn’t it depend on prevalence?

A likelihood ratio summarizes how much a result shifts the odds of disease: LR+ = sensitivity ÷ (1 − specificity); LR− = (1 − sensitivity) ÷ specificity. Because they are built only from the test’s sensitivity and specificity, likelihood ratios are independent of prevalence — which is exactly why a Fagan nomogram can turn any pre-test probability into a post-test probability.

What is a Fagan nomogram?

A three-column chart: draw a line from your pre-test probability through the test’s likelihood ratio and it lands on the post-test probability. It is a visual form of Bayesian updating, and this tool draws one live.

Why does repeated (serial) screening raise the chance of a false positive?

Each additional round is another opportunity for a false alarm. If rounds were independent, the cumulative chance of at least one false positive after n rounds would be 1 − specificityⁿ. Real repeat tests are correlated, so that is an upper bound, but in practice the rates are still high: about 49% after 10 mammograms (Elmore, 1998) and, after 14 multimodal screening tests, roughly 60% in men and 49% in women (Croswell, 2009).

What is the difference between NNT, NNH, and NNS?

NNT (number needed to treat) is how many people must be treated for one to benefit; NNH (number needed to harm) is how many before one is harmed; NNS (number needed to screen) folds the whole chain together — how many must be screened for one person to be helped.

References & sources

This is an educational model, not a description of any specific test. The example figures offered next to each control are drawn from primary literature, and each input’s example callout links its own primary source.

Formulas: PPV / NPV via Bayes; LR+ = sens / (1−spec); LR− = (1−sens) / spec; NNT = 1 / ARR; NNS = screened ÷ helped.

Educational model — not medical advice. It illustrates the statistics of testing and treatment; it does not describe any specific real-world test.

Simplifying assumptions: everyone is screened; test-positives are the ones treated (adjustable via uptake); treatment benefit accrues only to true positives while harm can affect anyone treated, modeled as disjoint groups; the serial-testing curve assumes independent rounds, so it is an upper bound.