The problem of multiple comparisons represents a fundamental clash between statistical inference and human intuition. After all, how can simply looking at the data multiple times or analyzing several key performance indicators (KPIs) change the validity of the results?
The goal of this blog is to explain why our intuition leads us astray in this case and to clarify the multiple comparisons problem. We will then provide a brief overview of common methods used to address it. For a more in-depth exploration of these techniques, we invite you to read our next blog, where we delve deeper into the different correction methods.
Our intuition about the harmlessness of multiple comparisons often arises from a reasonable but flawed understanding of statistical processes. Many people mistakenly believe that statistical tests provide definitive conclusions. However, this is not the case: statistical tests yield probabilistic outcomes, not certainties. If conclusions were definitive, then repeatedly analyzing data or testing multiple KPIs wouldn’t affect the results. Unfortunately, that’s not how probability works. Each additional comparison increases the likelihood of encountering a false positive, also known as a Type I error.
To illustrate, let’s consider a shooter aiming at a target. Suppose the probability of missing the target with one shot is 10%. If the shooter fires once, the chance of missing at least once is straightforward: 10%. But what happens if the shooter takes two shots, or even 100?
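To make the arithmetic concrete, here is a minimal Python sketch using the 10% miss rate from the example; the shot counts are just for illustration:

```python
# Probability of at least one miss in n independent shots,
# where each shot misses with probability 0.10.
p_miss = 0.10

for n_shots in [1, 2, 10, 100]:
    p_at_least_one_miss = 1 - (1 - p_miss) ** n_shots
    print(f"{n_shots:>3} shots -> P(at least one miss) = {p_at_least_one_miss:.1%}")
```

With two shots the chance of at least one miss is already about 19%, and with 100 shots it is practically certain.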
As the number of shots increases, so does the probability that at least one shot will miss the target. Similarly, in multiple hypothesis testing, each additional test carries its own risk of a false positive. The more tests you conduct, the higher the overall probability of encountering at least one significant result, even if there’s no real effect.
How much does this risk increase? Take a look at the illustration below, which shows how the probability of obtaining at least one significant result (at a 5% alpha level) grows as more independent tests are performed. This happens even though both groups were sampled from the same distribution, meaning there is no real difference between them!
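If you would like to recompute the kind of numbers such an illustration shows, the sketch below evaluates the probability of at least one false positive, 1 - (1 - alpha)^m, for m independent tests at alpha = 0.05; the test counts are just illustrative:

```python
# Probability of at least one false positive across m independent tests,
# each run at significance level alpha = 0.05, when the null is true in every test.
alpha = 0.05

for m in [1, 2, 5, 10, 20, 50]:
    fwer = 1 - (1 - alpha) ** m
    print(f"{m:>2} tests -> P(at least one false positive) = {fwer:.1%}")
```

With 10 tests the probability already exceeds 40%, and with 50 tests it is above 92%, even though nothing real is going on.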
Now that we understand why multiple comparisons are problematic, let’s look at some common scenarios in A/B testing where this issue often arises:

- Multiple KPIs: evaluating several metrics in the same experiment means running a separate hypothesis test for each one.
- Peeking: repeatedly checking the results while the experiment is still running gives each interim look another chance to produce a false positive.
- Multiple variations: comparing several treatment variants against the control adds a test for every comparison.
The primary principle in addressing multiple comparisons is to make the criterion for rejecting the null hypothesis more stringent. By doing so, we ensure that even with multiple comparisons, the overall probability of a Type I error stays within the desired threshold. While this may sound simple, it comes at a cost: making it harder to reject the null hypothesis also makes it harder to detect a true effect when one exists, which lowers the test's statistical power.
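As a concrete, deliberately simple example of a stricter criterion, the Bonferroni correction compares each p-value against alpha divided by the number of comparisons. The p-values below are hypothetical, purely to show the mechanics:

```python
# Bonferroni correction: test each comparison at alpha / m instead of alpha.
alpha = 0.05
p_values = [0.001, 0.012, 0.030, 0.200]  # hypothetical p-values from four KPIs
m = len(p_values)

threshold = alpha / m  # 0.0125 for four comparisons
for p in p_values:
    decision = "significant" if p < threshold else "not significant"
    print(f"p = {p:.3f} vs adjusted threshold {threshold:.4f} -> {decision}")
```

Note that p = 0.030 would have been significant at the usual 0.05 level but no longer clears the adjusted threshold; that is exactly the power trade-off described above.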
Therefore, statisticians face the challenge of developing methods that keep the probability of false positives below the desired level (alpha) while minimizing the loss in the test’s ability to identify true effects. Over the years, several statistical correction methods have been developed to address this challenge. They differ in the type of error they control: some limit the family-wise error rate, the probability of making at least one false rejection, while others limit the false discovery rate, the expected proportion of false rejections among all significant results. Here, we provide a brief overview of these methods. If you're interested in a more detailed comparison, stay tuned for our next blog post!
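In practice you rarely need to implement these corrections by hand; libraries such as statsmodels already provide them. As a sketch (again with hypothetical p-values), the snippet below applies a method that controls the family-wise error rate (Bonferroni) and one that controls the false discovery rate (Benjamini-Hochberg) to the same set of results:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from several comparisons within one experiment.
p_values = np.array([0.001, 0.012, 0.030, 0.200])

# Control the family-wise error rate: probability of at least one false rejection.
reject_bonf, pvals_bonf, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

# Control the false discovery rate: expected share of false rejections among rejections.
reject_bh, pvals_bh, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

print("Bonferroni        :", reject_bonf, pvals_bonf.round(3))
print("Benjamini-Hochberg:", reject_bh, pvals_bh.round(3))
```

Because it controls a less strict error criterion, the Benjamini-Hochberg procedure will typically flag at least as many results as Bonferroni, preserving more power at the cost of tolerating a controlled share of false discoveries.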
Multiple comparisons problems are an inevitable challenge in A/B testing, but they can be managed with the right statistical techniques. Whether adjusting for multiple KPIs, handling peeking, or testing several variations, applying appropriate corrections ensures reliable and actionable insights. By integrating these methods into your experimentation framework, you can reduce false positives and make more data-driven decisions with confidence.