In the world of A/B testing, one common challenge is the issue of multiple comparisons. This occurs when multiple hypothesis tests are conducted simultaneously, whether it’s peeking at the data during the experiment, examining several key performance indicators (KPIs), or analyzing different segments of the population. In a previous blog, we explored in detail the risks of inflating false positives (alpha) in these situations. In this post, we will dive into the various methods for addressing multiple comparisons, examining their strengths and weaknesses, and providing guidance on when each method is most appropriate to use.
Before delving into the technical details of each correction method, it’s important to first understand a key theoretical distinction that sets them apart. Specifically, these methods differ in the type of error they aim to control below a predefined threshold. Let’s take a closer look at each of these error rates.
Family-Wise Error Rate (FWER)
The earliest methods for addressing multiple comparisons treated false positives in a binary way. Specifically, when conducting a series of tests, detecting even a single false significant result was enough to count as an error. In other words, the presence of just one significant result would label the entire family of tests as having a false positive, regardless of how many tests were performed.
Building on this rationale, traditional correction methods focused on controlling the family-wise error rate (FWER): the probability of making one or more Type I errors (false positives) across all tests in a family. The goal of FWER control is to ensure that the probability of making at least one false positive does not exceed the pre-specified significance level (typically 0.05).
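To see how quickly this error rate grows, here is a quick back-of-the-envelope sketch in Python, assuming the tests are independent and each one uses an uncorrected significance level of 0.05:

```python
# Illustration: how the family-wise error rate grows with the number of
# independent tests when each one uses an uncorrected alpha of 0.05.
alpha = 0.05
for m in (1, 3, 5, 10, 20):
    fwer = 1 - (1 - alpha) ** m  # P(at least one false positive) under independence
    print(f"{m:>2} tests -> FWER ≈ {fwer:.2f}")
```

With 10 uncorrected tests, the chance of at least one false positive is already around 40%, far above the nominal 5%.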
False Discovery Rate (FDR)
More modern methods for handling multiple comparisons challenge the binary principle at the core of FWER: is rejecting one test in error really the same as rejecting five? Simple intuition tells us that the number of erroneous rejections matters as well. The False Discovery Rate (FDR) is built on exactly this intuition. Rather than controlling the probability of at least one false rejection (as FWER does), FDR aims to control the expected proportion of falsely rejected tests out of all rejected tests. In other words, it ensures that, on average, no more than a predefined proportion of the rejections in each family of tests are mistaken.
FWER vs. FDR
In the following sections, we will describe in detail several methods for addressing multiple comparisons. Some methods, such as Bonferroni or Dunnett, aim to control the FWER, while others, like the Benjamini-Hochberg procedure, focus on controlling the FDR. Understanding the distinction between FWER and FDR is key to understanding these methods. FWER-control methods are more conservative, ensuring a stringent elimination of false positives, but at the cost of potentially failing to detect true effects. On the other hand, FDR methods allow for some false discoveries but work to keep their proportion within an acceptable limit, offering a less conservative but more powerful alternative to traditional methods.
Now that we have a solid understanding of error control methods, let’s dive into some of the most commonly used techniques (Bonferroni, Dunnett, and FDR) and explore the solutions they offer.
Bonferroni Correction
The Bonferroni correction is one of the simplest and most conservative methods for controlling the FWER. It involves adjusting the significance level for each individual test to ensure that the probability of making at least one Type I error (false positive) across all tests remains below a specified threshold (typically 0.05).
Principle: The Bonferroni method keeps the FWER at or below α by adjusting the significance level for each comparison. If you have m tests, the new significance threshold for each test is α_Bonferroni = α / m, where α is the original significance level. This means that, in an uncorrected test, you would reject the null hypothesis if the p-value is less than α; after applying the Bonferroni correction, you reject the null hypothesis only if the p-value is smaller than α / m. For example, with an alpha of 5% and 5 tests, you would reject the null hypothesis for p-values lower than 0.01 instead of 0.05.
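As a minimal sketch of the rule in Python (the p-values below are made up purely for illustration):

```python
# Minimal sketch of the Bonferroni rule: with m tests, compare each p-value
# to alpha / m (equivalently, compare m * p to alpha).
alpha = 0.05
p_values = [0.003, 0.012, 0.041, 0.20, 0.78]  # illustrative p-values
m = len(p_values)

bonferroni_threshold = alpha / m  # 0.01 for 5 tests at alpha = 0.05
reject = [p < bonferroni_threshold for p in p_values]
print(f"Per-test threshold: {bonferroni_threshold:.3f}")
print(f"Rejections: {reject}")  # only p-values below 0.01 survive the correction
```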
Dunnett’s Test
Dunnett’s test is specifically designed for comparing multiple treatment groups to a single control group. Unlike the Bonferroni correction, it does not treat every comparison as if it were unrelated: only comparisons to the control group are tested, and all of them share the same control data. By accounting for this dependency between hypotheses, Dunnett’s correction controls the FWER with a less conservative threshold, which makes it more powerful.
Dunnett's method is more complex than the Bonferroni correction. Its core innovation lies in comparing each test statistic to a critical value drawn from an adjusted (multivariate) t-distribution that accounts for the shared control group. This yields a stricter threshold than the unadjusted t-distribution, but a more relaxed one than the Bonferroni correction.
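For a concrete, hedged illustration, SciPy ships an implementation of Dunnett's test (scipy.stats.dunnett, available from SciPy 1.11 onward); the revenue samples below are simulated purely for illustration:

```python
# A sketch of Dunnett's test using SciPy (scipy.stats.dunnett, SciPy >= 1.11).
# The samples are simulated for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.normal(loc=100, scale=15, size=500)  # current layout
treatments = [
    rng.normal(loc=100, scale=15, size=500),       # no real effect
    rng.normal(loc=103, scale=15, size=500),       # small real uplift
    rng.normal(loc=106, scale=15, size=500),       # larger real uplift
]

# Dunnett's test compares every treatment to the same control while
# accounting for the correlation induced by the shared control group.
result = stats.dunnett(*treatments, control=control)
print(result.pvalue)  # one adjusted p-value per treatment-vs-control comparison
```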
Benjamini-Hochberg (BH)
The Benjamini-Hochberg (BH) procedure focuses on limiting the proportion of false discoveries (incorrect rejections of the null hypothesis) among the rejected hypotheses. This approach is less conservative than FWER control and is more suitable when the researcher is willing to accept some false positives in exchange for greater power.
The BH procedure works as follows: (1) sort the m p-values in ascending order, so that p(1) ≤ p(2) ≤ … ≤ p(m); (2) find the largest rank k for which p(k) ≤ (k / m) · α; (3) reject the null hypothesis for the k tests with the smallest p-values.
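Here is a minimal, self-contained sketch of the step-up rule in Python (the helper name benjamini_hochberg and the example p-values are just for illustration; statsmodels provides an equivalent via multipletests with method='fdr_bh'):

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean array marking which hypotheses the BH procedure rejects."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)                          # rank p-values from smallest to largest
    thresholds = alpha * (np.arange(1, m + 1) / m)  # (k / m) * alpha for each rank k
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()             # largest rank whose p-value passes
        reject[order[: k + 1]] = True              # reject that test and all smaller p-values
    return reject

# Illustrative p-values (made up): with alpha = 0.05 the step-up rule
# rejects the three smallest ones here.
print(benjamini_hochberg([0.001, 0.008, 0.027, 0.30, 0.65]))
```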
If this all sounds a bit abstract, let’s make it concrete with an example. Imagine you're a data analyst at an e-commerce company, tasked with investigating how the layout of product listings on the website influences revenue. The product manager isn't sure which layout strategy works best and proposes testing 10 new layouts against the current one. Your goal is to identify which layouts outperform the control.
What you don’t know, however, is that only 3 of the 10 proposed layouts actually lead to higher revenue, while the remaining 7 perform no better than the control. Since you're comparing 10 treatment groups to a single control, you inevitably face the issue of multiple comparisons. But what does this mean in practice, and how do different correction methods influence your error rates?
To explore this, we simulate a scenario with: one control group (the current layout) and 10 treatment groups (the proposed layouts), where 3 of the treatments genuinely increase revenue and the remaining 7 have no effect.
To examine how different statistical correction methods impact error rates, we compare each treatment group to the control under four conditions: no correction, Bonferroni correction, Dunnett’s test, and the Benjamini-Hochberg (BH) procedure. Using a significance level (α) of 0.05, we compute, for each simulation: the power (the proportion of the 3 truly effective layouts flagged as significant), a family-wise error indicator (whether at least one of the 7 ineffective layouts is falsely flagged), and the false discovery proportion (the share of false positives among all significant results).
To illustrate, suppose a simulation yields four significant results: two from groups with true effects and two from groups without. In this case, the results would be: a power of 2/3 (two of the three true effects detected), a family-wise error indicator of 1 (at least one false positive occurred), and a false discovery proportion of 2/4 = 0.5.
We repeat this procedure across 1,000 simulations and average these measures to obtain reliable estimates of: (1) power, (2) the family-wise error rate (FWER), and (3) the false discovery rate (FDR).
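Below is a sketch of how such a simulation could be coded in Python; the effect sizes, noise level, and per-group sample size are illustrative assumptions rather than the exact parameters behind the table that follows.

```python
# Sketch of the simulation described above. Effect sizes, noise level, and
# sample sizes are illustrative assumptions.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
alpha, n_sims, n_per_group = 0.05, 1000, 200
true_effect = np.array([1.0, 1.0, 1.0] + [0.0] * 7)  # 3 real uplifts, 7 nulls
is_null = true_effect == 0

def summarize(reject):
    """Per-simulation power, FWER indicator, and false discovery proportion."""
    power = reject[~is_null].mean()
    fwer = reject[is_null].any()
    fdr = reject[is_null].sum() / max(reject.sum(), 1)
    return power, fwer, fdr

results = {m: [] for m in ["none", "bonferroni", "dunnett", "bh"]}
for _ in range(n_sims):
    control = rng.normal(0, 5, n_per_group)
    groups = [rng.normal(mu, 5, n_per_group) for mu in true_effect]

    pvals = np.array([stats.ttest_ind(g, control, equal_var=False).pvalue for g in groups])
    results["none"].append(summarize(pvals < alpha))
    results["bonferroni"].append(summarize(multipletests(pvals, alpha, method="bonferroni")[0]))
    results["bh"].append(summarize(multipletests(pvals, alpha, method="fdr_bh")[0]))
    results["dunnett"].append(summarize(stats.dunnett(*groups, control=control).pvalue < alpha))

for method, rows in results.items():
    power, fwer, fdr = np.mean(rows, axis=0)
    print(f"{method:>10}: power={power:.2f}  FWER={fwer:.2f}  FDR={fdr:.2f}")
```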
The table below summarizes the results of the simulation:
In the context of multiple comparisons, peeking refers to repeatedly examining interim results during an experiment and deciding, based on the observed data, whether to stop or keep collecting data. If the number of peeks is defined in advance, traditional correction methods like the Bonferroni adjustment can still be applied by adjusting the rejection threshold accordingly. In practice, however, analysts often do not specify the number of peeks in advance, making these methods inapplicable. Moreover, strict corrections like Bonferroni substantially reduce statistical power.
More powerful methods, such as the BH procedure, are also unsuitable in this context because they require all p-values to be available at once for ranking. With peeking, tests are conducted sequentially rather than simultaneously, so the full set of p-values is never available at any given moment, rendering BH inapplicable. In addition, because the data accumulate over time, the successive tests are strongly dependent on one another. This dependence can be exploited to improve statistical power, an advantage that the previously discussed methods do not take into account.
To address multiple comparisons in peeking scenarios, sequential testing provides a more effective solution. This approach evaluates data continuously or at predefined intervals, rather than waiting for a fixed sample size before making a decision. In this technique, analysts monitor results as data accumulate. At each step, a statistical test is performed, leading to one of three possible actions: stop and reject the null hypothesis (the effect is deemed significant), stop and accept the null hypothesis (the test is deemed futile), or continue collecting data.
This process continues until a predefined stopping rule is met. However, repeated testing increases the risk of Type I errors (false positives), making statistical correction methods essential. One common solution is the use of alpha spending functions, which allocate the overall significance level across the interim analyses to control the cumulative error rate. Another approach is the Mixture Sequential Probability Ratio Test (mSPRT), which continuously evaluates how likely the observed data are under the alternative hypothesis relative to the null. The test stops once this likelihood ratio exceeds a predefined threshold, signaling sufficient evidence to reject the null.
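As a rough sketch of the idea, the snippet below implements the Gaussian-mixture version of the mSPRT for the mean of a single stream of observations (for example, per-user revenue differences), assuming a known variance; the values of sigma2, tau2, and the peeking cadence are illustrative assumptions.

```python
# Minimal mSPRT sketch for a stream of observations with known variance sigma2,
# using a normal mixing distribution with variance tau2 around the null mean.
# Stopping when the mixture likelihood ratio exceeds 1/alpha keeps the false
# positive rate below alpha regardless of how often you peek.
import numpy as np

def msprt_likelihood_ratio(x, sigma2=1.0, tau2=1.0, theta0=0.0):
    """Mixture likelihood ratio after observing the array x (null mean = theta0)."""
    n = len(x)
    xbar = np.mean(x)
    scale = np.sqrt(sigma2 / (sigma2 + n * tau2))
    exponent = (n**2 * tau2 * (xbar - theta0) ** 2) / (2 * sigma2 * (sigma2 + n * tau2))
    return scale * np.exp(exponent)

alpha = 0.05
rng = np.random.default_rng(7)
stream = rng.normal(loc=0.2, scale=1.0, size=5000)  # simulated data with a small true effect

for n in range(100, len(stream) + 1, 100):          # "peek" every 100 observations
    if msprt_likelihood_ratio(stream[:n]) >= 1 / alpha:
        print(f"Stopped at n={n}: enough evidence to reject the null")
        break
else:
    print("Reached the maximum sample size without rejecting the null")
```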
Selecting the right method for multiple comparisons depends on the context of your research, the ease of implementation, and the trade-off between controlling false positives (Type I errors) and preserving statistical power (avoiding Type II errors). Bonferroni is the simplest option and suits a small number of tests where any false positive is costly, at the price of reduced power. Dunnett’s test is the natural choice when several treatment groups are compared to a single control, since it controls the FWER while retaining more power than Bonferroni. The Benjamini-Hochberg procedure fits settings with many tests where a controlled proportion of false discoveries is acceptable in exchange for greater power. Sequential methods, such as alpha spending functions or the mSPRT, are designed for peeking, where results are monitored as the data accumulate.
Addressing the multiple comparisons problem is crucial for making valid statistical inferences. By carefully selecting the appropriate correction method based on the objectives of your analysis and the trade-offs between false positives and false negatives, you can improve both the reliability and the power of your results.