
The "multi-armed bandit" name comes from a casino analogy: imagine a row of slot machines, each with an unknown payout probability. You want to maximize your total payout over a fixed number of pulls. You do not know which machine is best. Do you split pulls evenly to gather information? Or do you concentrate on the ones that have paid out so far? The optimal strategy — and this is provable — is neither extreme. It is a calculated balance of exploration and exploitation.

Applied to A/B testing, each "arm" is a variant. Each "pull" is a visitor. The "payout" is a conversion. And the algorithm's job is to maximize total conversions over the experiment period, not just measure which variant is best at the end.

That objective difference is the key. Traditional A/B testing optimizes for measurement precision. Multi-armed bandit optimizes for performance during the test. They are solving different problems, and which one you should use depends on which problem matters more for your situation.

Thompson Sampling: how it works in practice

Thompson Sampling is the most widely used bandit algorithm in CRO, and for good reason — it performs well across a wide range of conditions and is relatively straightforward to implement and explain.

The core idea: maintain a probability distribution over each variant's conversion rate. At the start, with no data, each variant's distribution is broad and uncertain — it could be anywhere from 0% to 100%. As conversions are recorded, the distributions narrow around the actual observed rates.

To decide where to send the next visitor, the algorithm samples one value from each variant's distribution. Whichever variant produces the highest sampled value gets the next visitor. Because better-performing variants have distributions concentrated at higher values, they are more likely to produce high samples and therefore receive more traffic. But because the distributions still have variance, occasionally a lower-performing variant gets sampled high and receives traffic — this is the exploration component.

As more data arrives, the distributions for clearly inferior variants narrow to low ranges, and those variants receive increasingly little traffic. A variant that is genuinely better eventually has a narrow distribution concentrated above the others, and it receives the vast majority of traffic — but exploration never entirely stops.
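The sampling loop described above can be sketched in a few lines of Python. This is a hypothetical illustration, not any platform's implementation: the variant names and "true" conversion rates are invented, and Beta(1, 1) priors are assumed for each arm.

```python
import random

# Hypothetical sketch of Thompson Sampling: Beta(1, 1) priors per arm,
# invented variant names and "true" conversion rates for illustration.
true_rates = {"control": 0.03, "variant": 0.06}
stats = {arm: {"alpha": 1, "beta": 1} for arm in true_rates}

def choose_arm():
    # Draw one plausible conversion rate from each arm's posterior;
    # the arm with the highest draw gets the next visitor.
    draws = {arm: random.betavariate(s["alpha"], s["beta"])
             for arm, s in stats.items()}
    return max(draws, key=draws.get)

def record(arm, converted):
    # A conversion updates alpha, a non-conversion updates beta.
    stats[arm]["alpha" if converted else "beta"] += 1

random.seed(42)
pulls = {arm: 0 for arm in true_rates}
for _ in range(10000):
    arm = choose_arm()
    pulls[arm] += 1
    record(arm, random.random() < true_rates[arm])
# By the end, the better arm holds the bulk of the traffic,
# but the inferior arm still received some exploratory pulls.
```

Note that exploration is never switched off explicitly: it fades on its own as the posteriors narrow.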

The regret framework: quantifying the cost of losers

In bandit theory, "regret" measures the opportunity cost of exploration: the conversions lost by showing inferior variants instead of always showing the best one. Thompson Sampling achieves provably sub-linear cumulative regret, meaning the average cost per visitor of showing the wrong variant decreases as the experiment grows longer.

For a standard A/B test with 50/50 allocation, regret is linear. Every visitor shown the inferior variant throughout the test period is a lost conversion opportunity. A 4-week test on a page with 3,000 daily visitors, where one variant converts 1 percentage point worse than the control, costs roughly 420 conversions over the test period (28 days × 3,000 visitors × 50% allocation × 1 percentage point), assuming the effect is real. That number scales in direct proportion to experiment length and traffic volume.
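As a sanity check, the linear-regret arithmetic under these assumptions (a 4-week test, 3,000 daily visitors, a 50/50 split, a 1-percentage-point absolute difference) is straightforward:

```python
# Linear-regret arithmetic for the fixed-allocation example above.
days = 4 * 7           # 4-week test
daily_visitors = 3000
inferior_share = 0.5   # 50/50 allocation
effect = 0.01          # 1 percentage point, absolute

visitors_on_loser = days * daily_visitors * inferior_share
lost_conversions = visitors_on_loser * effect
# 42,000 visitors see the inferior variant; roughly 420 conversions lost.
```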

With Thompson Sampling on the same test, traffic shifts toward the better variant as evidence accumulates. By week three, the superior variant might be receiving 80% of traffic, cutting the inferior variant's share from 50% to 20% and reducing the "loser cost" by a factor of 2.5 over the remaining period. The exact savings depend on how quickly the effect manifests and how large it is, but on tests with true effects of 10% or more, the savings are substantial.

When bandit outperforms A/B testing

Bandit algorithms have a clear advantage in specific conditions:

Large effect sizes: When the true difference between variants is large — say, 15% or more in conversion rate — bandit algorithms identify the winner quickly and concentrate traffic aggressively. The opportunity cost of exploration is minimal, and the method gets to exploitation fast.

Short-horizon decisions: If you have a promotional period — a sale running for two weeks, a product launch window — bandit allocation maximizes performance during that window, even if the experiment is not formally "complete" from a measurement standpoint. For time-bounded commercial decisions, this is often the correct objective.

Exploratory testing with many variants: Testing five or six variants simultaneously with fixed, even allocation is impractical because the traffic requirement grows with the number of variants. Bandit algorithms handle many-armed scenarios naturally, quickly funneling traffic away from clearly poor performers and concentrating on the top two or three candidates.

Situations where you care more about outcomes than measurements: If you will implement the winner regardless of confidence interval width, and you do not need an exact estimate of the effect size for downstream modeling, bandit's reduced measurement precision does not matter. You want the winner deployed; the statistical details are secondary.

When A/B testing remains better

Bandit is not always the right tool. Fixed-allocation A/B testing has clear advantages in several contexts:

Precise effect estimation: If you need to know not just which variant wins but by how much (for pricing decisions, revenue forecasting, or convincing stakeholders who want confidence intervals), bandit's uneven allocation makes estimates harder to interpret. The control may have received only 20% of traffic by the end, so its conversion-rate estimate is much noisier than the leading variant's, and the adaptive allocation itself can bias naive effect estimates.

Non-stationary environments: If conversion rates drift significantly over the experiment period due to seasonality, marketing changes, or external events, bandit algorithms can over-exploit early-period winners that are not actually better over the full horizon. Fixed-allocation tests with balanced randomization are more robust to temporal confounds.

Regulatory or audit requirements: Any context requiring formal statistical testing with defined error rates and reproducible methodology. Clinical trials, financial products, insurance pricing — bandit algorithms are not designed to satisfy these requirements, and frequentist methods remain the standard.

Learning-oriented programs: If the goal of your testing program is to build knowledge about what works — not just to maximize current conversions — you want precise measurements that generalize. Bandit algorithms optimize for the present experiment; they do not generate reusable knowledge as efficiently as properly designed A/B tests.

Epsilon-greedy vs. Thompson Sampling

Two algorithms dominate bandit implementations in CRO: epsilon-greedy and Thompson Sampling. They make different tradeoffs.

Epsilon-greedy is simpler: with probability epsilon (say, 10%), send the visitor to a random variant. With probability 1-epsilon (90%), send them to the current best-performing variant. Epsilon is typically fixed or decayed over time. The approach is easy to understand and explain, but it does not adapt exploration intensity to uncertainty — it explores at a fixed rate regardless of how confident the data should make you.
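A minimal epsilon-greedy sketch follows, with invented variant names and "true" rates, and epsilon held fixed at 10% for the whole run:

```python
import random

# Minimal epsilon-greedy sketch; variant names and "true" conversion
# rates are invented, and epsilon is fixed (not decayed) at 10%.
EPSILON = 0.10
visitors = {"control": 0, "variant": 0}
conversions = {"control": 0, "variant": 0}

def observed_rate(arm):
    # Guard against division by zero before an arm has any traffic.
    return conversions[arm] / visitors[arm] if visitors[arm] else 0.0

def choose_arm():
    arms = list(visitors)
    if random.random() < EPSILON:
        return random.choice(arms)       # explore: any arm at random
    return max(arms, key=observed_rate)  # exploit: current best arm

random.seed(7)
true_rates = {"control": 0.02, "variant": 0.10}
for _ in range(10000):
    arm = choose_arm()
    visitors[arm] += 1
    if random.random() < true_rates[arm]:
        conversions[arm] += 1
# Exploration never decays here: roughly 5% of traffic still goes to
# the inferior arm even after the winner is obvious.
```

The final comment is the tradeoff the text describes: a fixed epsilon keeps paying the same exploration tax late in the experiment, when Thompson Sampling would have nearly stopped exploring.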

Thompson Sampling adapts exploration automatically. When uncertainty is high (early in the experiment, with limited data), it explores broadly. As data accumulates and posterior distributions narrow, exploration decreases naturally. This produces better regret properties across a wide range of effect sizes without requiring you to tune an exploration parameter.

In practice, Thompson Sampling outperforms epsilon-greedy in most CRO scenarios. The fixed exploration rate of epsilon-greedy means it continues exploring at the same rate late in experiments when confidence is high — wasting traffic on inferior variants after you already know they are worse.

Implementation considerations

Running a bandit algorithm on a real website requires a few practical decisions that affect results. Traffic allocation must happen at the session level, not the page-load level, to avoid showing the same visitor different variants across page loads. This requires either server-side assignment storage or a persistent client-side identifier.
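Session-level stickiness can be sketched as first-touch assignment. In this assumed illustration, a plain dict stands in for a server-side store keyed by a persistent visitor identifier, and "visitor-123" is an invented example ID:

```python
import random

# Sketch of session-level stickiness: a plain dict stands in for a
# server-side assignment store keyed by a persistent visitor ID.
assignments = {}  # visitor_id -> variant

def variant_for(visitor_id, choose_arm):
    # First exposure: ask the bandit for an arm, then freeze the choice
    # so later page loads reuse it instead of re-running the bandit.
    if visitor_id not in assignments:
        assignments[visitor_id] = choose_arm()
    return assignments[visitor_id]

def random_arm():
    # Stand-in for the bandit's allocation decision.
    return random.choice(["control", "variant"])

first = variant_for("visitor-123", random_arm)
again = variant_for("visitor-123", random_arm)
# Repeated page loads by the same visitor see the same variant.
```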

Conversion events may be delayed — a visitor might convert 24 hours after the initial experiment exposure. Bandit updates need to handle delayed rewards appropriately, or early-experiment allocations are made on incomplete information.

For multi-page experiments where the conversion event happens several steps after the initial exposure (e.g., variant shown on homepage, conversion measured at checkout), bandit algorithms need to correctly attribute conversions back to the variant assignment. This attribution pipeline is more complex than single-page experiments.

Hybrid approaches

Some testing platforms use hybrid approaches: run fixed allocation for an initial "exploration phase" to establish baseline estimates, then switch to bandit allocation for the remainder of the experiment. This trades some of the regret reduction from bandit for more reliable early-phase measurement, which can reduce the risk of early over-exploitation when initial samples are too small to trust.
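A hybrid schedule might look like the following sketch. This is an assumed illustration, not any platform's actual logic: fixed 50/50 allocation for the first 2,000 visitors, then Thompson Sampling over Beta posteriors for the remainder, with invented conversion rates.

```python
import random

# Assumed illustration of a hybrid schedule: balanced fixed allocation
# for an initial exploration phase, then Thompson Sampling afterward.
EXPLORATION_VISITORS = 2000
stats = {"control": {"alpha": 1, "beta": 1},
         "variant": {"alpha": 1, "beta": 1}}

def choose_arm(visitor_index):
    arms = list(stats)
    if visitor_index < EXPLORATION_VISITORS:
        return arms[visitor_index % len(arms)]  # balanced fixed phase
    draws = {a: random.betavariate(s["alpha"], s["beta"])
             for a, s in stats.items()}
    return max(draws, key=draws.get)            # adaptive phase

random.seed(3)
true_rates = {"control": 0.03, "variant": 0.06}  # invented rates
pulls = {a: 0 for a in stats}
for i in range(10000):
    arm = choose_arm(i)
    pulls[arm] += 1
    stats[arm]["alpha" if random.random() < true_rates[arm] else "beta"] += 1
# Each arm is guaranteed 1,000 visitors of balanced baseline data
# before adaptive allocation starts shifting traffic to the winner.
```

The fixed phase buys reliable early estimates; the adaptive phase then recovers most of the regret reduction over the remaining traffic.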

Webyn uses a variant of this approach by default — a minimum exploration period before adaptive allocation begins, with the length calibrated to the site's daily traffic volume. This prevents the algorithm from making aggressive allocation decisions before there is enough data to trust.

Webyn implements Thompson Sampling with automatic exploration tuning, so your experiments balance learning and performance without manual configuration. Talk to our team about how it fits your testing program.
