In the world of A/B testing, one common challenge is the issue of multiple comparisons. This occurs when multiple hypothesis tests are conducted simultaneously, whether it’s peeking at the data during the experiment, examining several key performance indicators (KPIs), or analyzing different segments of the population. In a previous blog, we explored in detail the risks of inflating false positives (alpha) in these situations. In this post, we will dive into the various methods for addressing multiple comparisons, examining their strengths and weaknesses, and providing guidance on when each method is most appropriate to use.
Before delving into the technical details of each correction method, it’s important to first understand a key theoretical distinction that sets them apart. Specifically, these methods differ in the type of error they aim to control below a predefined threshold. Let’s take a closer look at each of these error rates.
Family-Wise Error Rate (FWER)
The earliest methods for addressing multiple comparisons treated false positives in a binary way. Specifically, when conducting a series of tests, detecting even a single false significant result was enough to count as an error. In other words, the presence of just one significant result would label the entire family of tests as having a false positive, regardless of how many tests were performed.
Building on this rationale, traditional correction methods focused on controlling the family-wise error rate (FWER): the probability of making one or more Type I errors (false positives) across all tests in a family. The goal of FWER control is to ensure that the probability of making at least one false positive does not exceed the pre-specified significance level (typically 0.05).
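To see how quickly this error rate grows, here is a quick back-of-the-envelope sketch in Python, assuming the tests are independent and each one uses an uncorrected significance level of 0.05:

```python
# Illustration: how the family-wise error rate grows with the number of
# independent tests when each one uses an uncorrected alpha of 0.05.
alpha = 0.05
for m in (1, 3, 5, 10, 20):
    fwer = 1 - (1 - alpha) ** m  # P(at least one false positive) under independence
    print(f"{m:>2} tests -> FWER ≈ {fwer:.2f}")
```

With 10 uncorrected tests, the chance of at least one false positive is already around 40%, far above the nominal 5%.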
False Discovery Rate (FDR)
More modern methods for handling multiple comparisons challenge the binary principle at the core of FWER: is rejecting one test in error really the same as rejecting five? Simple intuition tells us that the number of erroneous rejections matters as well. The False Discovery Rate (FDR) is built on exactly this intuition. Rather than controlling the probability of at least one false rejection (as FWER does), FDR aims to control the expected proportion of falsely rejected tests out of all rejected tests. In other words, it ensures that, on average, no more than a predefined proportion of the rejections in each family of tests are mistaken.
FWER vs. FDR
In the following sections, we will describe in detail several methods for addressing multiple comparisons. Some methods, such as Bonferroni or Dunnett, aim to control the FWER, while others, like the Benjamini-Hochberg procedure, focus on controlling the FDR. Understanding the distinction between FWER and FDR is key to understanding these methods. FWER-control methods are more conservative, ensuring a stringent elimination of false positives, but at the cost of potentially failing to detect true effects. On the other hand, FDR methods allow for some false discoveries but work to keep their proportion within an acceptable limit, offering a less conservative but more powerful alternative to traditional methods.
Now that we have a solid understanding of error control methods, let’s dive into some of the most commonly used techniques (Bonferroni, Dunnett, and FDR) and explore the solutions they offer.
Bonferroni Correction
The Bonferroni correction is one of the simplest and most conservative methods for controlling the FWER. It involves adjusting the significance level for each individual test to ensure that the probability of making at least one Type I error (false positive) across all tests remains below a specified threshold (typically 0.05).
Principle: The Bonferroni method keeps the FWER at or below α by adjusting the significance level for each comparison. If you have m tests, the new significance threshold for each test is α_Bonferroni = α / m, where α is the original significance level. This means that, in an uncorrected test, you would reject the null hypothesis if the p-value is less than α; after applying the Bonferroni correction, you reject the null hypothesis only if the p-value is smaller than α / m. For example, with an alpha of 5% and 5 tests, you would reject the null hypothesis for p-values lower than 0.01 instead of 0.05.
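As a minimal sketch of the rule in Python (the p-values below are made up purely for illustration):

```python
# Minimal sketch of the Bonferroni rule: with m tests, compare each p-value
# to alpha / m (equivalently, compare m * p to alpha).
alpha = 0.05
p_values = [0.003, 0.012, 0.041, 0.20, 0.78]  # illustrative p-values
m = len(p_values)

bonferroni_threshold = alpha / m  # 0.01 for 5 tests at alpha = 0.05
reject = [p < bonferroni_threshold for p in p_values]
print(f"Per-test threshold: {bonferroni_threshold:.3f}")
print(f"Rejections: {reject}")  # only p-values below 0.01 survive the correction
```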
Dunnett’s Test
Dunnett’s test is specifically designed for comparing multiple treatment groups to a single control group. Unlike the Bonferroni correction, it does not treat every comparison as if it were unrelated: only comparisons to the control group are tested, and all of them share the same control data. By accounting for this dependency between hypotheses, Dunnett’s correction controls the FWER with a less conservative threshold, which makes it more powerful.
Dunnett's method is more complex than the Bonferroni correction. Its core innovation lies in comparing each test statistic to a critical value drawn from an adjusted (multivariate) t-distribution that accounts for the shared control group. This yields a stricter threshold than the unadjusted t-distribution, but a more relaxed one than the Bonferroni correction.
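For a concrete, hedged illustration, SciPy ships an implementation of Dunnett's test (scipy.stats.dunnett, available from SciPy 1.11 onward); the revenue samples below are simulated purely for illustration:

```python
# A sketch of Dunnett's test using SciPy (scipy.stats.dunnett, SciPy >= 1.11).
# The samples are simulated for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.normal(loc=100, scale=15, size=500)  # current layout
treatments = [
    rng.normal(loc=100, scale=15, size=500),       # no real effect
    rng.normal(loc=103, scale=15, size=500),       # small real uplift
    rng.normal(loc=106, scale=15, size=500),       # larger real uplift
]

# Dunnett's test compares every treatment to the same control while
# accounting for the correlation induced by the shared control group.
result = stats.dunnett(*treatments, control=control)
print(result.pvalue)  # one adjusted p-value per treatment-vs-control comparison
```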
Benjamini-Hochberg (BH)
The Benjamini-Hochberg (BH) procedure focuses on limiting the proportion of false discoveries (incorrect rejections of the null hypothesis) among the rejected hypotheses. This approach is less conservative than FWER control and is more suitable when the researcher is willing to accept some false positives in exchange for greater power.
The BH procedure works as follows: (1) sort the m p-values in ascending order, so that p(1) ≤ p(2) ≤ … ≤ p(m); (2) find the largest rank k for which p(k) ≤ (k / m) · α; (3) reject the null hypothesis for the k tests with the smallest p-values.
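Here is a minimal, self-contained sketch of the step-up rule in Python (the helper name benjamini_hochberg and the example p-values are just for illustration; statsmodels provides an equivalent via multipletests with method='fdr_bh'):

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean array marking which hypotheses the BH procedure rejects."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)                          # rank p-values from smallest to largest
    thresholds = alpha * (np.arange(1, m + 1) / m)  # (k / m) * alpha for each rank k
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()             # largest rank whose p-value passes
        reject[order[: k + 1]] = True              # reject that test and all smaller p-values
    return reject

# Illustrative p-values (made up): with alpha = 0.05 the step-up rule
# rejects the three smallest ones here.
print(benjamini_hochberg([0.001, 0.008, 0.027, 0.30, 0.65]))
```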
If this all sounds a bit abstract, let’s make it concrete with an example. Imagine you're a data analyst at an e-commerce company, tasked with investigating how the layout of product listings on the website influences revenue. The product manager isn't sure which layout strategy works best and proposes testing 10 new layouts against the current one. Your goal is to identify which layouts outperform the control.
What you don’t know, however, is that only 3 of the 10 proposed layouts actually lead to higher revenue, while the remaining 7 perform no better than the control. Since you're comparing 10 treatment groups to a single control, you inevitably face the issue of multiple comparisons. But what does this mean in practice, and how do different correction methods influence your error rates?
To explore this, we simulate a scenario with: one control group (the current layout) and 10 treatment groups (the proposed layouts), where 3 of the treatments genuinely increase revenue and the remaining 7 have no effect.
To examine how different statistical correction methods impact error rates, we compare each treatment group to the control under four conditions: no correction, Bonferroni correction, Dunnett’s test, and the Benjamini-Hochberg (BH) procedure. Using a significance level (α) of 0.05, we compute, for each simulation: the power (the proportion of the 3 truly effective layouts flagged as significant), a family-wise error indicator (whether at least one of the 7 ineffective layouts is falsely flagged), and the false discovery proportion (the share of false positives among all significant results).
To illustrate, suppose a simulation yields four significant results: two from groups with true effects and two from groups without. In this case, the results would be: a power of 2/3 (two of the three true effects detected), a family-wise error indicator of 1 (at least one false positive occurred), and a false discovery proportion of 2/4 = 0.5.
We repeat this procedure across 1,000 simulations and average these measures to obtain reliable estimates of: (1) power, (2) the family-wise error rate (FWER), and (3) the false discovery rate (FDR).
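Below is a sketch of how such a simulation could be coded in Python; the effect sizes, noise level, and per-group sample size are illustrative assumptions rather than the exact parameters behind the table that follows.

```python
# Sketch of the simulation described above. Effect sizes, noise level, and
# sample sizes are illustrative assumptions.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
alpha, n_sims, n_per_group = 0.05, 1000, 200
true_effect = np.array([1.0, 1.0, 1.0] + [0.0] * 7)  # 3 real uplifts, 7 nulls
is_null = true_effect == 0

def summarize(reject):
    """Per-simulation power, FWER indicator, and false discovery proportion."""
    power = reject[~is_null].mean()
    fwer = reject[is_null].any()
    fdr = reject[is_null].sum() / max(reject.sum(), 1)
    return power, fwer, fdr

results = {m: [] for m in ["none", "bonferroni", "dunnett", "bh"]}
for _ in range(n_sims):
    control = rng.normal(0, 5, n_per_group)
    groups = [rng.normal(mu, 5, n_per_group) for mu in true_effect]

    pvals = np.array([stats.ttest_ind(g, control, equal_var=False).pvalue for g in groups])
    results["none"].append(summarize(pvals < alpha))
    results["bonferroni"].append(summarize(multipletests(pvals, alpha, method="bonferroni")[0]))
    results["bh"].append(summarize(multipletests(pvals, alpha, method="fdr_bh")[0]))
    results["dunnett"].append(summarize(stats.dunnett(*groups, control=control).pvalue < alpha))

for method, rows in results.items():
    power, fwer, fdr = np.mean(rows, axis=0)
    print(f"{method:>10}: power={power:.2f}  FWER={fwer:.2f}  FDR={fdr:.2f}")
```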
The table below summarizes the results of the simulation:
In the context of multiple comparisons, peeking refers to repeatedly examining interim results during an experiment and deciding, based on the observed data, whether to stop or keep collecting data. If the number of peeks is defined in advance, traditional correction methods like the Bonferroni adjustment can still be applied by adjusting the rejection threshold accordingly. In practice, however, analysts often do not specify the number of peeks in advance, making these methods inapplicable. Moreover, strict corrections like Bonferroni substantially reduce statistical power.
More powerful methods, such as the BH procedure, are also unsuitable in this context because they require all p-values to be available at once for ranking. With peeking, tests are conducted sequentially rather than simultaneously, so the full set of p-values is never available at any given moment, rendering BH inapplicable. In addition, because the data accumulate over time, the successive tests are strongly dependent on one another. This dependence can be exploited to improve statistical power, an advantage that the previously discussed methods do not take into account.
To address multiple comparisons in peeking scenarios, sequential testing provides a more effective solution. This approach evaluates data continuously or at predefined intervals, rather than waiting for a fixed sample size before making a decision. In this technique, analysts monitor results as data accumulate. At each step, a statistical test is performed, leading to one of three possible actions: stop and reject the null hypothesis (the effect is deemed significant), stop and accept the null hypothesis (the test is deemed futile), or continue collecting data.
This process continues until a predefined stopping rule is met. However, repeated testing increases the risk of Type I errors (false positives), making statistical correction methods essential. One common solution is the use of alpha spending functions, which allocate the overall significance level across the interim analyses to control the cumulative error rate. Another approach is the Mixture Sequential Probability Ratio Test (mSPRT), which continuously evaluates how likely the observed data are under the alternative hypothesis relative to the null. The test stops once this likelihood ratio exceeds a predefined threshold, signaling sufficient evidence to reject the null.
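As a rough sketch of the idea, the snippet below implements the Gaussian-mixture version of the mSPRT for the mean of a single stream of observations (for example, per-user revenue differences), assuming a known variance; the values of sigma2, tau2, and the peeking cadence are illustrative assumptions.

```python
# Minimal mSPRT sketch for a stream of observations with known variance sigma2,
# using a normal mixing distribution with variance tau2 around the null mean.
# Stopping when the mixture likelihood ratio exceeds 1/alpha keeps the false
# positive rate below alpha regardless of how often you peek.
import numpy as np

def msprt_likelihood_ratio(x, sigma2=1.0, tau2=1.0, theta0=0.0):
    """Mixture likelihood ratio after observing the array x (null mean = theta0)."""
    n = len(x)
    xbar = np.mean(x)
    scale = np.sqrt(sigma2 / (sigma2 + n * tau2))
    exponent = (n**2 * tau2 * (xbar - theta0) ** 2) / (2 * sigma2 * (sigma2 + n * tau2))
    return scale * np.exp(exponent)

alpha = 0.05
rng = np.random.default_rng(7)
stream = rng.normal(loc=0.2, scale=1.0, size=5000)  # simulated data with a small true effect

for n in range(100, len(stream) + 1, 100):          # "peek" every 100 observations
    if msprt_likelihood_ratio(stream[:n]) >= 1 / alpha:
        print(f"Stopped at n={n}: enough evidence to reject the null")
        break
else:
    print("Reached the maximum sample size without rejecting the null")
```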
Selecting the right method for multiple comparisons depends on the context of your research, the ease of implementation, and the trade-off between controlling false positives (Type I errors) and preserving statistical power (avoiding Type II errors). Bonferroni is the simplest option and suits a small number of tests where any false positive is costly, at the price of reduced power. Dunnett’s test is the natural choice when several treatment groups are compared to a single control, since it controls the FWER while retaining more power than Bonferroni. The Benjamini-Hochberg procedure fits settings with many tests where a controlled proportion of false discoveries is acceptable in exchange for greater power. Sequential methods, such as alpha spending functions or the mSPRT, are designed for peeking, where results are monitored as the data accumulate.
Addressing the multiple comparisons problem is crucial for making valid statistical inferences. By carefully selecting the appropriate correction method based on the objectives of your analysis and the trade-offs between false positives and false negatives, you can improve both the reliability and the power of your results.