The more the merrier? The problem of multiple comparisons in A/B Testing

The problem of multiple comparisons represents a fundamental clash between statistical inference and human intuition. In this blog post, we'll explain why our intuition leads us astray and review common methods used to address it.

Allon Korem
Chief Executive Officer

The problem of multiple comparisons represents a fundamental clash between statistical inference and human intuition. After all, how can simply looking at the data multiple times or analyzing several key performance indicators (KPIs) alter the pattern of results? 

The goal of this blog is to explain why our intuition leads us astray in this case and to clarify the multiple comparisons problem. We will then provide a brief overview of common methods used to address it. For a more in-depth exploration of these techniques, we invite you to read our next blog, where we delve deeper into the different correction methods.

The problem: the risk of false positives

Our intuition about the harmlessness of multiple comparisons often arises from a reasonable but flawed understanding of statistical processes. Many people mistakenly believe that statistical tests provide definitive conclusions. However, this is not the case: statistical tests yield probabilistic outcomes, not certainties. If conclusions were definite, then repeatedly analyzing data or testing multiple KPIs wouldn’t affect the results. Unfortunately, that's not how probability works. Each additional comparison increases the likelihood of encountering a false positive, also known as a Type I error.

To illustrate, let’s consider a shooter aiming at a target. Suppose the probability of missing the target with one shot is 10%. If the shooter fires once, the chance of missing at least once is straightforward: 10%. But what happens if the shooter takes two shots, or even 100?

As the number of shots increases, so does the probability that at least one shot will miss the target. With two shots, the chance of at least one miss rises to 1 - 0.9^2 = 19%; with 100 shots, it climbs to 1 - 0.9^100 ≈ 99.997%, a near certainty. Similarly, in multiple hypothesis testing, each additional test carries its own risk of a false positive. The more tests you conduct, the higher the overall probability of encountering at least one significant result, even if there’s no real effect.

How much does this risk increase? Take a look at the illustration below, which shows how the probability of obtaining at least one significant result (at a 5% alpha level) grows as more independent tests are performed. This occurs even though both groups were sampled from the same distribution, meaning there is no real difference between them!

Figure 1. Probability of falsely rejecting the null hypothesis (alpha) as a function of the number of tests.
As shown in the graph, while the error probability is 5% for a single test, it increases dramatically as the number of tests grows. For instance, if you conduct 10 tests, there is about a 40% chance of obtaining a significant result by mistake, even though the null hypothesis is true. In practical terms, this means that if you check 10 different KPIs and decide to switch versions based on a single significant result, there is a 40% chance of making this decision even if the new version is no better than the control. Had you analyzed a single predefined KPI instead, the risk of this error would still exist, but it would be capped at the alpha level (usually 5%).
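
If you want to reproduce these numbers yourself, the curve follows directly from the formula 1 - (1 - alpha)^n for n independent tests. Here is a minimal Python sketch; the alpha value and the list of test counts are just illustrative choices:

```python
# Probability of at least one false positive across n independent tests,
# when every null hypothesis is true and each test uses alpha = 0.05.
alpha = 0.05
for n in (1, 2, 5, 10, 20, 50, 100):
    at_least_one = 1 - (1 - alpha) ** n
    print(f"{n:>3} tests -> {at_least_one:.1%} chance of at least one false positive")
# 10 tests give 1 - 0.95**10 ≈ 40%, matching the figure above.
```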

When multiple comparisons problems arise

Now that we understand why multiple comparisons are problematic, let’s look at some common scenarios in A/B testing where this issue often arises:

  1. Peeking: It’s natural to be curious about a test’s progress before reaching the planned stopping point. However, repeatedly checking results for the same KPI and stopping the test as soon as a significant result appears is a textbook case of multiple comparisons, and it substantially increases the likelihood of false discoveries (a small simulation illustrating this appears right after this list).
  2. Segment Analysis: Analysts often want to examine how a test version impacts specific user subpopulations, such as those based on location, device type, or traffic source. Running tests across multiple segments introduces several statistical comparisons, increasing the chance of errors.
  3. Multiple KPIs: Analysts frequently track multiple KPIs, such as conversion rate, engagement, and revenue per user. Making a decision based on any significant result across multiple KPIs increases the likelihood of a false positive, compared to relying on a single predefined primary KPI.
  4. A/B/C/n Testing: In A/B/C/n testing, where multiple variations are tested instead of just a simple A/B test, the number of statistical comparisons increases. This amplifies the need for adjustments or corrections to mitigate the increased risk of false positives.
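
To make the peeking scenario from item 1 concrete, here is a small simulation sketch. It assumes a simple setting: two groups drawn from the same distribution (so any significant result is a false positive), ten interim looks, and a standard t-test at each look. The sample sizes and number of simulated experiments are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_sims, n_per_look, n_looks, alpha = 2000, 200, 10, 0.05

false_positives = 0
for _ in range(n_sims):
    # Both "versions" come from the same distribution: there is no real effect.
    a = rng.normal(size=n_per_look * n_looks)
    b = rng.normal(size=n_per_look * n_looks)
    for look in range(1, n_looks + 1):
        n = look * n_per_look
        if ttest_ind(a[:n], b[:n]).pvalue < alpha:  # "peek" and stop if significant
            false_positives += 1
            break

# With a single planned look this would be about 5%; with ten peeks it comes out
# close to 20% in this setup, roughly four times the nominal rate.
print(false_positives / n_sims)
```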

How to deal with multiple comparisons

The primary principle in addressing multiple comparisons is to make the criterion for rejecting the null hypothesis more stringent. By doing so, we ensure that even with multiple comparisons, the overall probability of a Type I error remains within the desired threshold. While this may sound simple, it is far from straightforward. Making it harder to reject the null hypothesis also reduces the ability to detect a true effect when one exists, which can lower the test's power.
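
To put a rough number on this trade-off, here is a sketch using a normal approximation for a one-sided two-sample z-test. The effect size, per-group sample size, and the Bonferroni-style adjusted alpha (0.05 divided by 10 tests) are illustrative assumptions, not recommendations.

```python
from scipy.stats import norm

def approx_power(alpha, effect_size, n_per_group):
    """Approximate power: P(reject H0) given a standardized effect of the stated size."""
    z_crit = norm.ppf(1 - alpha)                    # critical value at level alpha
    shift = effect_size * (n_per_group / 2) ** 0.5  # expected value of the z statistic
    return norm.sf(z_crit - shift)

print(approx_power(alpha=0.05,  effect_size=0.1, n_per_group=1000))  # ~0.72
print(approx_power(alpha=0.005, effect_size=0.1, n_per_group=1000))  # ~0.37
```

In this example, cutting alpha tenfold roughly halves the power, which is exactly why correction methods aim to be no stricter than necessary.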

Therefore, statisticians face the challenge of developing methods that keep the probability of false positives below the desired level (alpha) while minimizing the negative impact on the test’s ability to identify true effects. Over the years, several statistical correction methods have been developed to address this challenge. These methods differ in the type of error they control: some limit the family-wise error rate (the probability of making at least one false rejection), while others limit the false discovery rate (the expected proportion of false rejections among all significant results). Here, we provide a brief overview of these methods. If you're interested in a more detailed comparison, stay tuned for our next blog post!

  1. Bonferroni Correction: This method adjusts the significance threshold by dividing the alpha level by the number of tests performed. It is easy to understand, simple to implement, and highly effective at reducing false positives, but it is very conservative and can substantially reduce statistical power, especially when many tests are involved (see the sketch after this list).
  2. Dunnett’s Test: Ideal for comparing multiple variations against a single control group, Dunnett’s test reduces the likelihood of false positives while being less stringent than the Bonferroni correction. This makes it a more powerful option when dealing with multiple comparisons.
  3. Benjamini-Hochberg (BH) Procedure: Particularly useful in exploratory analyses with multiple KPIs, this procedure controls the false discovery rate rather than the family-wise error rate. It preserves more power while still limiting the proportion of false discoveries, making it more appropriate for large-scale testing and scenarios with many comparisons (also illustrated in the sketch after this list).
  4. Sequential Testing: Designed for situations where results are monitored continuously, such as peeking. One approach to sequential testing involves adjusting the thresholds for rejecting the null hypothesis based on the amount of data collected. This allows ongoing monitoring of the data without inflating the error rate as the test progresses.
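
As a concrete illustration of items 1 and 3, here is a hand-rolled sketch of the Bonferroni and Benjamini-Hochberg adjustments; the p-values are hypothetical. In practice you would typically rely on a library implementation (for example, multipletests in statsmodels.stats.multitest) rather than coding these yourself.

```python
import numpy as np

def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0 wherever p < alpha / m, with m the number of tests."""
    p = np.asarray(p_values)
    return p < alpha / p.size

def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    p = np.asarray(p_values)
    m = p.size
    order = np.argsort(p)                         # indices of p-values, ascending
    thresholds = alpha * np.arange(1, m + 1) / m  # per-rank BH thresholds
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()            # largest rank meeting its threshold
        reject[order[: k + 1]] = True             # reject all hypotheses up to that rank
    return reject

# Hypothetical p-values from five KPIs in one experiment.
p_vals = [0.001, 0.012, 0.030, 0.045, 0.200]
print(bonferroni_reject(p_vals))          # [ True False False False False]
print(benjamini_hochberg_reject(p_vals))  # [ True  True  True False False]
```

Note how BH rejects more hypotheses than Bonferroni on the same inputs: it trades the stricter family-wise guarantee for more power while keeping the false discovery rate in check.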

Conclusion

Multiple comparisons problems are an inevitable challenge in A/B testing, but they can be managed with the right statistical techniques. Whether adjusting for multiple KPIs, handling peeking, or testing several variations, applying appropriate corrections ensures reliable and actionable insights. By integrating these methods into your experimentation framework, you can reduce false positives and make more data-driven decisions with confidence.
