Note: This post was written in collaboration with Oryah Lancry-Dayan, Lead Statistician
Primum non nocere, "First, do no harm", is a fundamental ethical principle in medicine. In A/B testing, the bar is typically set higher, with companies striving not just to avoid harm but to drive improvements. However, there are cases where merely ensuring no harm is sufficient. This is especially true when changes are motivated by broader factors, such as legal requirements or design refreshes, where the decision to implement the change is already agreed upon, but companies seek to confirm that the modifications do not cause a significant decline in key metrics.
To better understand why demonstrating “no harm” can be sufficient in certain contexts, we can once again look to an example from medicine. During the COVID-19 pandemic, researchers explored whether lower-dose booster shots could be used without compromising efficacy. Fractional doses offered potential advantages, including reduced costs, fewer side effects, and increased public willingness to receive boosters. In this case, the goal was not to prove that the fractional dose was superior to the standard dose, but rather to demonstrate that it was not significantly less effective, thereby justifying a dose reduction in light of its other advantages.
The distinction between “do no harm” and “do good” goals significantly impacts the type of test you choose and the likelihood of obtaining a meaningful result. In this blog, we’ll explore two key statistical approaches related to this distinction: superiority tests and non-inferiority tests (also known as "do no harm" testing). We’ll examine their differences and provide a practical guide to designing and interpreting non-inferiority tests.
The traditional and perhaps more intuitive approach to A/B testing is superiority testing, where the goal is to identify a clear winner by comparing two product versions. In this setup, analysts typically compare the control version to a new version, aiming to show that the new version outperforms the old one.
However, in some cases, a new version is desired, but it must first be verified that its performance is not significantly worse than the existing one. In this scenario, the focus shifts to a non-inferiority test, which seeks to demonstrate that the new version’s performance falls within an acceptable, predefined margin. This margin represents the maximum performance decline that is considered tolerable before adopting the change. If the new version meets this criterion, it can be confidently implemented, even if it’s not strictly superior to the original.
The key distinction between superiority and non-inferiority tests lies in how the test’s hypotheses are defined. In A/B testing, we work with two competing hypotheses: the null and the alternative. The process begins by assuming the null hypothesis is true and then calculating the probability of obtaining data at least as extreme as what was observed under that assumption (the p-value). If this probability is below a predefined threshold (typically 5%), the observed data are considered unlikely under the null hypothesis, and we reject it in favor of the alternative hypothesis.
To better illustrate the difference between superiority and non-inferiority tests, let's consider an example (see also figure 1 below). Imagine an online store evaluating the impact of a branding redesign, such as changing its company logo. To make a well-informed decision, the company conducts a comparison between the revenue generated by the old logo and that of the new one.
In the first scenario, the company will only implement the new design if there is strong evidence that it improves revenue. Thus, the goal is to determine if the new design performs better than the old one. To do this, the company will use a superiority test. The hypotheses are as follows:
H₀: μt - μc ≤ 0
H₁: μt - μc > 0
These hypotheses reflect the goal of demonstrating that the test group's mean revenue (μt) exceeds that of the control group (μc).
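To make this concrete, here is a minimal sketch of such a superiority test in Python, assuming per-user revenue arrays and a one-sided Welch t-test; the data below are simulated purely for illustration.

import numpy as np
from scipy import stats

# Simulated per-user revenue, purely for illustration
rng = np.random.default_rng(7)
control_revenue = rng.gamma(shape=2.0, scale=10.0, size=5_000)    # old logo
treatment_revenue = rng.gamma(shape=2.0, scale=10.3, size=5_000)  # new logo

# H0: mu_t - mu_c <= 0   vs   H1: mu_t - mu_c > 0
t_stat, p_value = stats.ttest_ind(
    treatment_revenue, control_revenue,
    equal_var=False,        # Welch correction: no equal-variance assumption
    alternative="greater",  # one-sided (right-tail) test
)
print(f"superiority p-value: {p_value:.4f}")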
In a second scenario, the company is interested in implementing the new design, but only if it does not cause a revenue loss greater than 2%. The goal is to ensure that the new design isn’t worse than the old one by more than a specified margin (2% in this case). To test this, the company will use a non-inferiority test. The hypotheses are as follows:
H₀: μt - μc ≤ -2%
H₁: μt - μc > -2%
Comparing the two examples, it’s clear that the alternative hypothesis in the non-inferiority test is less stringent than in the superiority test. Specifically, if a superiority test yields a significant result indicating a difference greater than zero, it will automatically imply a significant result for the non-inferiority test, which only requires the difference to be greater than -2%. However, the reverse is not true: a significant result in the non-inferiority test does not necessarily indicate a significant result in the superiority test.
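The sketch below illustrates this relationship on the same kind of simulated data. It assumes the 2% margin is converted into an absolute amount relative to the control mean (one common, but not the only, way to operationalize a relative margin), and it uses a shift trick that applies to mean-difference tests such as the t-test.

import numpy as np
from scipy import stats

# Simulated per-user revenue, purely for illustration
rng = np.random.default_rng(7)
control_revenue = rng.gamma(shape=2.0, scale=10.0, size=5_000)
treatment_revenue = rng.gamma(shape=2.0, scale=10.3, size=5_000)

margin = 0.02 * control_revenue.mean()  # absolute version of the 2% margin (assumption)

# Superiority: H0: mu_t - mu_c <= 0
_, p_sup = stats.ttest_ind(
    treatment_revenue, control_revenue, equal_var=False, alternative="greater"
)

# Non-inferiority: H0: mu_t - mu_c <= -margin.
# Shifting the treatment sample up by the margin turns this into the same
# one-sided test against zero (the shift changes the mean difference but
# leaves the variances untouched).
_, p_noninf = stats.ttest_ind(
    treatment_revenue + margin, control_revenue, equal_var=False, alternative="greater"
)

# For the same data and a positive margin, p_noninf can never exceed p_sup:
# a significant superiority result implies non-inferiority, but not vice versa.
print(f"superiority p = {p_sup:.4f}, non-inferiority p = {p_noninf:.4f}")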
Non-inferiority tests are particularly useful when changes are necessary due to branding, compliance, or backend optimizations. In such cases, there’s usually a general consensus to implement the change, unless there is a strong indication that it would have disastrous consequences. The goal of the test, therefore, is to ensure that the change does not result in a significant degradation of key performance metrics.
Here are some situations where non-inferiority testing can be particularly relevant:
- Branding changes, such as a logo redesign or a visual refresh of the product.
- Changes required by legal or regulatory compliance.
- Backend or infrastructure optimizations that should not affect the user experience.
- Design updates, such as a new font, intended to modernize the product rather than lift a specific metric.
The key step in designing a non-inferiority test is setting the non-inferiority margin (Δ), which defines the maximum acceptable decline in performance for the new version. There is a tradeoff between the size of the margin and the statistical power of the test. If the margin is too small, the ability to detect a significant effect decreases, so an overly conservative criterion increases the risk of forgoing a necessary change. Conversely, if the margin is too large, it may lead to accepting a variation that substantially harms performance, an undesirable outcome even if the new version is highly sought after.
To determine an appropriate margin, two key factors should be considered. First, historical data can provide an estimate of potential performance loss based on past reductions in key performance indicators (KPIs). Second, business insights should inform the decision by evaluating the level of risk the company is willing to accept, specifically, what degree of performance loss remains acceptable. By integrating these two sources, one can establish a margin that balances expected changes with practical business considerations.
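As a rough illustration of the margin-versus-power tradeoff, the sketch below runs a standard one-sided power analysis with statsmodels. The baseline mean, standard deviation, and expected true difference are illustrative assumptions; the exact numbers will differ for your data.

from statsmodels.stats.power import TTestIndPower

baseline_mean = 20.0   # e.g., average revenue per user (assumed)
baseline_sd = 14.0     # standard deviation of the metric (assumed)
expected_diff = 0.0    # planning under "the new version is truly equivalent"

for relative_margin in (0.005, 0.01, 0.02, 0.04):
    margin = relative_margin * baseline_mean
    # For the non-inferiority hypotheses H0: mu_t - mu_c <= -margin, the
    # planning effect size is the distance between the assumed true
    # difference and the margin, in standard-deviation units.
    effect_size = (expected_diff + margin) / baseline_sd
    n_per_group = TTestIndPower().solve_power(
        effect_size=effect_size, alpha=0.05, power=0.8, alternative="larger"
    )
    print(f"margin {relative_margin:.1%}: ~{n_per_group:,.0f} users per group")

# As the margin shrinks toward zero, the planning effect size shrinks with it
# and the required sample size grows without bound.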
After setting the margin (Δ), it is time to formulate the hypotheses for a right-tail test:
H₀: μt - μc ≤ -Δ
H₁: μt - μc > -Δ
Importantly, we are looking for a significant result indicating that the difference between the new version and the control is greater than -Δ, that is, that any decline is smaller than the margin.
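Here is a minimal sketch of how these hypotheses can be tested in practice, assuming per-user values and a Welch t-test; the helper name non_inferiority_test and its structure are illustrative rather than a standard library API. It also returns the one-sided lower confidence bound for μt - μc, which is useful when interpreting the outcomes discussed below.

import numpy as np
from scipy import stats

def non_inferiority_test(treatment, control, margin, alpha=0.05):
    """Test H0: mu_t - mu_c <= -margin with a one-sided Welch t-test."""
    treatment, control = np.asarray(treatment), np.asarray(control)
    diff = treatment.mean() - control.mean()

    # Welch standard error and degrees of freedom for the mean difference
    v_t = treatment.var(ddof=1) / len(treatment)
    v_c = control.var(ddof=1) / len(control)
    se = np.sqrt(v_t + v_c)
    df = (v_t + v_c) ** 2 / (
        v_t ** 2 / (len(treatment) - 1) + v_c ** 2 / (len(control) - 1)
    )

    # One-sided test of H0: mu_t - mu_c <= -margin (right tail)
    t_stat = (diff + margin) / se
    p_value = stats.t.sf(t_stat, df)

    # One-sided lower confidence bound for mu_t - mu_c; non-inferiority is
    # declared when the bound lies above -margin (equivalent to p < alpha).
    lower_bound = diff - stats.t.ppf(1 - alpha, df) * se
    return p_value, lower_bound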
In this context, it's worth addressing a common request we hear from customers: the desire to set the non-inferiority margin to zero. At first glance, this may seem to align with the intuitive goal of "doing no harm", that is, demonstrating that two versions are exactly equal. However, from a statistical standpoint, this is not feasible.
The core issue is that even when two product versions are truly identical, random variation between samples will almost always produce some observed difference. To conclusively prove that two versions are exactly the same, one would need to observe the entire population, a practically impossible task. In statistical terms, requiring a margin of zero would imply an infinite sample size, because you'd be demanding absolute certainty that no meaningful difference exists. In fact, if you run a power analysis with a margin of zero, that’s exactly what you'll find: the required sample size approaches infinity.
While a non-inferiority test with a zero margin may appear mathematically similar to a superiority test, the key difference lies in the assumed effect size used in the power analysis. In a superiority test, you assume a positive expected difference and test whether it exceeds zero. In contrast, a non-inferiority test with a zero margin assumes an expected effect size of zero and tests whether it is not worse than zero. This leaves no room for natural variability and, paradoxically, results in an underpowered test compared to a typical superiority test.
As with any A/B test, there are two possible outcomes:
Rejecting H₀: If the p-value is smaller than the significance level (typically 0.05), we reject the null hypothesis. This suggests that there is sufficient evidence to support the claim that the new variant is not worse than the control by more than the predefined margin, meaning it is non-inferior. Note that the new version could still perform somewhat worse than the control; non-inferiority only means that the difference falls within the acceptable margin. In this case, a confidence interval will most likely exclude the margin (-Δ). (Non-inferiority tests use one-sided hypotheses while confidence intervals are typically two-sided, so the margin may occasionally fall within the interval even when the p-value is significant, though this is unlikely.)
Failing to Reject H₀: If the p-value is larger than alpha, we fail to reject the null hypothesis. This means the results are inconclusive, and we cannot confidently claim that the new treatment is non-inferior to the control. The new version may even perform better than the control in our sample, yet not decisively enough to rule out a decline larger than the margin. In this case, a confidence interval will most likely include the margin (-Δ), as illustrated in the sketch below.
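Continuing the hypothetical helper and simulated arrays from the earlier sketches, the decision logic maps onto these two outcomes as follows. Note that the one-sided 95% lower bound used here coincides with the lower limit of a two-sided 90% confidence interval, which is why a conventional two-sided 95% interval can occasionally still contain the margin even when the test is significant.

# Usage sketch, reusing non_inferiority_test, treatment_revenue,
# control_revenue, and margin defined in the sketches above.
p_value, lower_bound = non_inferiority_test(treatment_revenue, control_revenue, margin)

if p_value < 0.05:   # equivalently: lower_bound > -margin
    print("Non-inferior: any decline is smaller than the predefined margin.")
else:
    print("Inconclusive: a decline larger than the margin cannot be ruled out.")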
Our experience working with companies in the industry has shown us that integrating non-inferiority tests into the A/B testing process requires more than just an understanding of the statistical principles behind them. It also necessitates a cultural shift in how A/B tests are conducted within the organization. We often encounter two key cultural challenges that need to be addressed for successful implementation.
First, there are instances where non-significant results are mistakenly interpreted as evidence of non-inferiority. This can occur in two scenarios: (1) when a superiority test fails to yield a significant result despite a positive observed difference, or (2) when a left-tailed test does not show that the treatment is worse than the control, leading to an incorrect assumption of non-inferiority. A non-significant result is not evidence of equivalence; it simply means the test was inconclusive. Therefore, it's crucial to recognize that failing to find evidence that a version is better (or worse) than the control is not the same as demonstrating that it is at least as good as the control within a predefined margin.
Second, once non-inferiority tests are implemented, some companies may overuse them in situations where the goal is actually to drive improvement. Since non-inferiority tests are generally more powerful, they are more likely to yield significant results. As a result, companies may be tempted to rely on non-inferiority tests to achieve more frequent significant findings and accelerate progress. However, this approach can be problematic in the long term, as it encourages the acceptance of changes that are merely "good enough," potentially leading to decisions that could harm revenue or long-term success.
Taken together, the best practice is to clearly define the goal of each test upfront and select either a superiority or non-inferiority test based on that objective. Non-inferiority tests should be used when appropriate, but their application must be thoughtful and not overused. As our CRO expert, Israel Ben Baruch, wisely put it, “Non-inferiority tests are a necessary part of the conversion rate optimization life-cycle. Not a glorious one - but essential. In my experience, they should comprise about 10% of your experiments, which can increase to 20% if you're operating in a highly regulated industry.
“Beyond regulatory requirements, the most common non-inferiority tests revolve around design changes, where product designers believe the current experience is not good enough or outdated. In these cases, I tell my design teams - 'Let's move forward, but we must do no harm.' One example we face just now is testing a new font that we believe will improve readability and create a more updated feeling”.
Non-inferiority tests provide a valuable framework for A/B testing when the goal is to confirm that a change does not degrade performance beyond an acceptable level. By carefully defining hypotheses, setting appropriate margins, and interpreting results correctly, businesses can make informed decisions about product updates, algorithm modifications, and regulatory changes while minimizing the risk of harming the product.