The 95 percent confidence threshold is a convention that traces back to Ronald Fisher in 1925. Here is why "have we reached 95 percent?" is the wrong question to ask about A/B test results, and what to ask instead.
Every growth team running A/B tests has encountered the same moment: the experiment has been running for two weeks, the conversion rate for variant B is 5.3 percent against control's 4.9 percent, and the statistical significance meter reads 89 percent. Should you call it? The tool says no — you have not reached the conventional 95 percent threshold. Most teams wait, even if the practical case for shipping the winner already seems clear.
The 95 percent threshold, also expressed as a p-value of 0.05, is not a scientific law. It is an arbitrary convention that entered statistics practice through a series of historical accidents, became embedded in academic publishing requirements, and migrated into A/B testing tools because the tools were built by statisticians trained in academic methods. Understanding where it came from and what it actually means unlocks a more useful way of thinking about experiment results.

Ronald Fisher published Statistical Methods for Research Workers, his seminal text on experimental statistics, in 1925. In it, he suggested that an experimenter might reasonably regard a result as significant when it would occur by chance no more than one time in twenty under the null hypothesis — in other words, when the p-value falls below 0.05. Fisher himself later clarified that this was a convenient threshold for a single experiment, not a universal standard for all scientific inference.
The threshold was subsequently adopted by academic journal editors as a publication criterion. Papers reporting results with p greater than 0.05 were less likely to be published. Researchers responded rationally to this incentive by designing studies to reach the threshold, running additional participants when results approached but did not cross it, and — in some disciplines — reporting only the analyses that cleared it. These practices, now widely recognized as publication bias and p-hacking, emerged directly from treating an arbitrary threshold as a binary gate between valid and invalid results.
The migration of p=0.05 into A/B testing tools was not a deliberate design decision in most cases. It was a default. Tool developers needed a significance threshold, the academic literature used 0.05, so 0.05 became the default. Many tools do not even surface the option to change it. Teams using these tools inherit a century-old convention without any explicit decision-making about whether it is appropriate for their specific testing context.
The p-value answers a specific question: if there were truly no difference between the control and variant conversion rates, what is the probability of observing a difference at least as large as the one we measured? A p-value of 0.05 says: if the null hypothesis is true, there is a 5 percent chance we would see a result this extreme.
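The computation behind that statement can be sketched in a few lines. The following is a minimal illustration of a pooled two-proportion z-test with the normal approximation, the kind of calculation most frequentist A/B tools run under the hood; the visitor counts are hypothetical, chosen to roughly match the article's 4.9 versus 5.3 percent example.

```python
import math

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates,
    using a pooled two-proportion z-test (normal approximation)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # shared rate under the null
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # P(|Z| >= |z|) if the null hypothesis of "no difference" were true
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical counts: 4.9% control vs 5.3% variant, 20,000 visitors per arm
p = two_proportion_p_value(conv_a=980, n_a=20000, conv_b=1060, n_b=20000)
# p comes out around 0.07: if there were truly no difference, results this
# extreme would appear roughly 7 times in 100
```

Note that the function returns the probability of the data given the null hypothesis, not the probability that the variant is better.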
Crucially, the p-value does not answer the question that practitioners actually care about: given the results we observed, what is the probability that the variant is genuinely better than the control? That question is asking for a posterior probability — the probability of a hypothesis given the data — which is what Bayesian methods compute. The p-value is a frequentist construct that inverts the direction of inference: it tells you about the data given a hypothesis, not about the hypothesis given the data.
This inversion is the source of most misinterpretations. A result with p=0.03 does not mean there is a 97 percent probability that the variant is better. It means that if the variant had no effect, there is a 3 percent chance of observing results at least as extreme as these. The practical difference matters: a team that treats p=0.03 as "97 percent confident the variant is better" is drawing a conclusion the statistics do not support.
Choosing a significance threshold is a decision about how to balance two types of errors. A Type I error — a false positive — occurs when you declare a winner that is not genuinely better. A Type II error — a false negative — occurs when you fail to detect an improvement that is real. These errors are in tension: raising the confidence threshold to 99 percent reduces false positives but increases false negatives. Lowering it to 80 percent does the opposite.
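The tension can be made concrete with a small simulation. The sketch below (illustrative base rate, lift, and sample size, not a recommendation) runs many synthetic experiments in a "null world" with no true difference and an "alternative world" with a genuine lift, and counts how often each threshold commits each kind of error.

```python
import math
import random

random.seed(0)

def p_value(c_a, n_a, c_b, n_b):
    # Pooled two-proportion z-test, two-sided (normal approximation)
    p_pool = (c_a + c_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) or 1e-12
    z = (c_b / n_b - c_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))

def conversions(rate, n):
    """Simulate the number of converters among n visitors."""
    return sum(random.random() < rate for _ in range(n))

def error_rates(alpha, base=0.05, lift=0.02, n=2000, trials=400):
    false_pos = false_neg = 0
    for _ in range(trials):
        # Null world: both arms convert at the base rate
        if p_value(conversions(base, n), n, conversions(base, n), n) < alpha:
            false_pos += 1
        # Alternative world: the variant has a genuine lift
        if p_value(conversions(base, n), n, conversions(base + lift, n), n) >= alpha:
            false_neg += 1
    return false_pos / trials, false_neg / trials

fp_strict, fn_strict = error_rates(alpha=0.05)   # 95 percent confidence
fp_loose, fn_loose = error_rates(alpha=0.20)     # 80 percent confidence
# The strict threshold produces fewer false positives but misses more real
# improvements; the loose threshold trades in the opposite direction.
```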
The conventional 95 percent threshold was chosen in scientific contexts where false positives are expensive and false negatives are less costly. In drug trials, for example, a false positive means approving an ineffective treatment, which has direct patient safety implications. A false negative means failing to approve an effective treatment, which is also costly but less directly harmful. The asymmetry in costs justifies a threshold that aggressively controls false positives.
In website conversion experiments, the asymmetry is different. Shipping a losing variant is costly — you are degrading conversion rate for all future visitors until the error is discovered and corrected. But failing to ship a winning variant is also costly — you are leaving conversion improvement unrealized for all future visitors until you eventually run the experiment again. The error costs are more symmetric, which argues for a significance threshold lower than 95 percent in many cases.
If your experiment is testing a low-stakes change — button color, minor copy adjustment — a false positive costs relatively little. Running the experiment longer to achieve 95 percent confidence has a real opportunity cost. In this context, 80 or 85 percent significance may be appropriate. If the experiment tests a major structural change to a core page — a full homepage redesign, a pricing page layout change — the cost of a false positive is higher, and a more conservative threshold is justified.
Bayesian A/B testing reframes the question entirely. Instead of asking "is the p-value below 0.05?", it asks "given everything we have observed, what is the probability that variant B produces a higher conversion rate than the control?" The answer is directly interpretable: an 87 percent posterior probability of improvement means there is an 87 percent chance the variant is genuinely better, given the data collected so far.
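As a sketch of how such a posterior probability can be computed, the snippet below assumes independent uniform Beta(1, 1) priors on the two rates and hypothetical counts of 441/9,000 versus 477/9,000 — the article's 4.9 and 5.3 percent at a smaller sample size.

```python
import random

random.seed(1)

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, samples=100_000):
    """Posterior probability that B's true conversion rate exceeds A's,
    with uniform Beta(1, 1) priors on both rates (Monte Carlo estimate)."""
    wins = 0
    for _ in range(samples):
        rate_a = random.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = random.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / samples

# Hypothetical counts: 4.9% control vs 5.3% variant, 9,000 visitors per arm
p_better = prob_b_beats_a(441, 9000, 477, 9000)
# p_better lands near 0.89: an 89 percent chance the variant is genuinely
# better given the data so far -- directly interpretable, unlike a p-value
```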
This framing has several practical advantages. First, it can be updated continuously as data arrives, without the false positive inflation that occurs when frequentist tests are checked repeatedly before reaching their predetermined sample size. Second, it quantifies uncertainty in a way that is directly useful for decision-making: you can pre-commit to shipping at a 90 percent threshold and stop the experiment when that threshold is crossed, regardless of when it occurs. Third, it naturally incorporates prior information — if you have run similar tests before and have a sense of typical effect sizes for this type of change, a Bayesian prior can incorporate that information and improve the precision of early estimates.
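The third advantage — informative priors — is easy to see in the Beta-Binomial setting. In the sketch below, the prior parameters are hypothetical: Beta(25, 475) encodes a belief of roughly "5 percent, give or take a point" built from past tests, and it stabilizes a noisy early estimate.

```python
def posterior_mean(conv, n, prior_alpha=1.0, prior_beta=1.0):
    """Posterior mean of a conversion rate under a Beta prior:
    the posterior is Beta(prior_alpha + conversions, prior_beta + non-conversions)."""
    return (prior_alpha + conv) / (prior_alpha + prior_beta + n)

# Early data: 14 conversions in 200 visitors
flat = posterior_mean(14, 200)                # flat prior: ~7.4%, very noisy
informed = posterior_mean(14, 200, 25, 475)   # informative prior: ~5.6%
# The informative prior pulls the early estimate toward historically
# plausible values; as data accumulates, the prior's influence fades.
```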
Webyn's analysis engine uses Bayesian updating throughout the experiment lifecycle. The dashboard shows posterior probability of improvement in real time, rather than a binary significant/not-significant status. Teams can set their own probability threshold based on the risk tolerance appropriate to the specific experiment, rather than defaulting to an arbitrary convention.
An extension of Bayesian analysis that is particularly useful for business decision-making is expected loss. Expected loss quantifies the expected cost of making the wrong decision — shipping the inferior variant — given the current state of evidence. It combines the probability that the decision is wrong with the magnitude of the error.
Consider two experiments. In the first, variant B has an 85 percent posterior probability of being better, and the estimated effect size is 0.1 percentage points — a trivial improvement on a 5 percent baseline. The expected loss of shipping B is low because even if the decision is wrong, the cost is small. In the second experiment, variant B has an 85 percent posterior probability of being better, but the estimated effect size is 2 percentage points. The expected loss of shipping B is higher because if the decision is wrong, the cost in foregone conversion is significant.
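A sketch of that comparison, again assuming independent Beta(1, 1) posteriors and hypothetical counts chosen so that both experiments sit near an 85 percent posterior probability of improvement:

```python
import random

random.seed(2)

def expected_loss_of_shipping_b(conv_a, n_a, conv_b, n_b, samples=100_000):
    """Expected drop in conversion rate if we ship B and B is actually worse,
    under independent Beta(1, 1) posteriors (Monte Carlo estimate)."""
    total = 0.0
    for _ in range(samples):
        rate_a = random.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = random.betavariate(1 + conv_b, 1 + n_b - conv_b)
        total += max(rate_a - rate_b, 0.0)   # loss only when B underperforms
    return total / samples

# Experiment 1: 5.0% vs 5.1% on large samples (tiny effect, ~85% prob. better)
loss_small = expected_loss_of_shipping_b(5150, 103_000, 5253, 103_000)
# Experiment 2: 5.0% vs 7.0% on small samples (big effect, ~85% prob. better)
loss_large = expected_loss_of_shipping_b(15, 300, 21, 300)
# loss_large comes out far bigger: a similar probability of being wrong,
# but a much costlier mistake if we are
```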
Decision rules based on expected loss are more aligned with business outcomes than decision rules based on p-value thresholds. They encourage shipping quickly when the expected cost of error is low and holding for more evidence when the expected cost of error is high — which is exactly the behavior a rational decision-maker would exhibit. Implementing expected-loss-based stopping rules requires Bayesian infrastructure but produces decisions that are systematically better calibrated to business risk.
The most immediate practical implication is to stop treating 95 percent as a sacred threshold and start treating significance thresholds as a deliberate choice that should vary by experiment type and risk profile. Document your threshold choices explicitly in your experiment design process. For low-stakes tests, use 80 to 85 percent. For high-stakes tests, use 95 percent or higher. Never default to a threshold you have not consciously chosen.
The second implication is to invest in switching your testing infrastructure to a Bayesian analysis engine if you have not already done so. The practical advantages — continuous monitoring without false positive inflation, interpretable probability outputs, natural incorporation of prior information — are significant enough that the migration cost is nearly always justified for teams running more than a handful of experiments per year.
The third implication is to educate your stakeholders about what experiment results mean. Many of the organizational dysfunctions around A/B testing — pressure to declare winners early, discomfort with non-significant results, over-confidence in significant results — stem from misunderstanding the underlying statistics. Clear communication about what your significance threshold means and why you chose it reduces pressure to cut corners and improves the quality of decisions made from experiment data.
Webyn's Bayesian engine shows probability of improvement in real time — no arbitrary thresholds, no mandatory sample sizes. Set your decision threshold based on your risk tolerance and ship when the evidence is right.
Talk to Our Team