10/90 vs 10/10 in A/B Testing: How to Split Users When Unequal Allocation Is Required

Allon Korem

Chief Executive Officer

Note: This post was written in collaboration with Oryah Lancry-Dayan, Lead Statistician

Prologue

Running an A/B test involves a long chain of decisions for an analyst: sharpening the hypothesis, selecting the right primary KPI, defining acceptable error rates, and calculating the required sample size. Yet one decision that often flies under the radar is how to allocate users between groups.

The default approach is to split users equally between control and treatment (50/50). This rule of thumb exists for a good reason: for a given total sample size, an equal allocation maximizes statistical power when both groups have similar variance. However, theory and practice do not always align, and there are situations where deviating from equal allocation is necessary.

In these cases, a natural follow-up question arises: how should the groups be structured? Suppose you want only 10% of traffic exposed to the treatment. One option is to keep equal group sizes but run the test on a smaller subset of users (a 10%–10% split). Another is to assign a small fraction to treatment while using the remainder of the population as control (a 10%–90% split).

In what follows, we examine the trade-off between these two designs. We begin by understanding the statistical implications of deviating from the classical 50/50 split, and then compare balanced (e.g., 10–10) and unbalanced (e.g., 10–90) allocation approaches in terms of statistical error and practical considerations.

The Impact of Balanced and Unbalanced Allocation on Test Duration

To understand how allocation affects experiment runtime, two key points matter. First, for a fixed power and effect size, total sample size is minimized under equal allocation; a 50/50 split is the most statistically efficient (Figure 1).

**Figure 1.** The total sample size multiplier (N/Nmin) required to maintain statistical power as the allocation to the control group deviates from the optimal 50/50 split (indicated by the vertical dashed line). A 10/90 split is highlighted to show a 2.8x increase in the required sample size.
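
The curve in Figure 1 can be reproduced in a few lines. This is a minimal sketch, assuming equal variance in both groups and a comparison of means; the function name `total_size_multiplier` is ours and does not come from any library.

```python
def total_size_multiplier(control_share: float) -> float:
    """Total sample size relative to a 50/50 split, at equal power.

    Assumes equal variance in both groups, so the variance of the
    difference in means is proportional to 1/n_control + 1/n_treatment.
    """
    if not 0.0 < control_share < 1.0:
        raise ValueError("control_share must be strictly between 0 and 1")
    treatment_share = 1.0 - control_share
    # With shares p and 1 - p of a total N, the variance is proportional to
    # 1 / (N * p * (1 - p)); p * (1 - p) peaks at 1/4 when p = 0.5.
    return 1.0 / (4.0 * control_share * treatment_share)


for share in (0.5, 0.7, 0.9, 0.99):
    print(f"control share {share:.2f}: N/Nmin = {total_size_multiplier(share):.2f}")
```

For a control share of 0.9 the multiplier is 1 / (4 × 0.9 × 0.1) ≈ 2.78, which is the 2.8x highlighted in the figure.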

Second, with unequal allocation, while the total sample size grows, the treatment group itself becomes smaller. This happens because a larger control group provides a more precise baseline estimate, allowing fewer treated users while maintaining statistical power. Analytically, the required treatment size scales with the allocation ratio as:

\( N_{\text{unbalanced}} = N_{\text{balanced}} \cdot \frac{1+r}{2r} \)

Here \( r \) is the ratio of control to treatment users, \( N_{\text{unbalanced}} \) is the treatment group size under that ratio, and \( N_{\text{balanced}} \) is the per-group size under equal allocation. The relationship follows from matching the variance of the difference in means when both groups have equal variance: \( \frac{1}{N_{\text{unbalanced}}} + \frac{1}{r \, N_{\text{unbalanced}}} = \frac{2}{N_{\text{balanced}}} \). The total sample size, \( (1+r) \, N_{\text{unbalanced}} \), therefore grows by a factor of \( \frac{(1+r)^2}{4r} \) relative to the balanced total, which is the multiplier plotted in Figure 1.

How does this translate into the difference between a balanced (10–10) and an unbalanced (10–90) design? A balanced design corresponds to equal allocation, where users are split evenly between treatment and control. This allocation minimizes the total number of users required to achieve a given level of statistical power. However, when treatment exposure is constrained (e.g., only 10% of users can receive the treatment), a balanced design effectively uses only a subset of the available traffic, since a large portion of users (in the 10–10 example, 80%) remains unassigned to any experimental condition.

In contrast, an unbalanced design corresponds to unequal allocation, where users are intentionally split asymmetrically between groups. This approach increases the total sample size required to maintain statistical power, but it reduces the number of users exposed to the treatment while making full use of the available traffic.

To illustrate this trade-off, consider a scenario where only 10% of users can be exposed to the treatment. Suppose a power analysis shows that a balanced design (10–10 allocation) requires 10,000 users per group, for a total of 20,000 users. Under a 10–90 allocation, the control group is nine times larger (r = 9), and the scaling relationship implies that the treatment group requires only about 56% of the per-group sample size needed under equal allocation (10/18 ≈ 0.56). In this case, the treatment group would include approximately 5,556 users, with roughly 50,000 users in control, for a total of about 55,600 users, or roughly 2.8 times the balanced total, as in Figure 1.

Although this total is larger than in the balanced design, the unbalanced approach makes use of the entire traffic pool, whereas the balanced design leaves 80% of users outside the experiment.

To make this trade-off more concrete, assume the product receives 1,000 users per day. The table below compares the expected runtime under each scenario.

| Allocation | Users enrolled per day | Total users required | Expected runtime |
|---|---|---|---|
| 50–50 | 1,000 | 20,000 | 20 days |
| 10–10 | 200 | 20,000 | 100 days |
| 10–90 | 1,000 | about 55,600 | about 56 days |

Unsurprisingly, a 50–50 allocation yields the fastest test. But once unequal allocation is required, an interesting pattern emerges: in this example, a 10–10 design would take almost twice as long to complete as a 10–90 design (100 days versus about 56)!
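
The runtime comparison can be checked with a short script. This is a sketch under stated assumptions rather than a full power analysis: it takes the balanced per-group size of 10,000 as given, assumes equal variance in both groups, and sizes the unbalanced design by matching the variance of the difference in means, which gives a treatment group of n_balanced · (1 + r) / (2r). The helper names `unbalanced_sizes` and `runtime_days` are hypothetical.

```python
import math


def unbalanced_sizes(n_per_group_balanced: int, r: int) -> tuple[int, int]:
    """Treatment and control sizes that match the power of a balanced design.

    r is the control-to-treatment ratio. Assumes equal variances, so that
    1/n_t + 1/(r * n_t) must equal 2/n_per_group_balanced.
    """
    n_treatment = math.ceil(n_per_group_balanced * (1 + r) / (2 * r))
    n_control = r * n_treatment
    return n_treatment, n_control


def runtime_days(total_users: int, enrolled_per_day: int) -> int:
    """Whole days needed to enroll total_users at a fixed daily rate."""
    return math.ceil(total_users / enrolled_per_day)


N_BALANCED = 10_000   # per-group size from the power analysis in the example
DAILY_USERS = 1_000   # daily traffic in the example

n_t, n_c = unbalanced_sizes(N_BALANCED, r=9)
designs = {
    "50-50": (2 * N_BALANCED, DAILY_USERS),        # all traffic enrolled
    "10-10": (2 * N_BALANCED, DAILY_USERS // 5),   # only 20% of traffic enrolled
    "10-90": (n_t + n_c, DAILY_USERS),             # all traffic enrolled
}
for name, (total, per_day) in designs.items():
    print(f"{name}: {total:,} users, {runtime_days(total, per_day)} days")
```

Rounding the treatment group up to a whole user gives 5,556 treated and 50,004 control users, so the 10–90 design finishes in 56 days, against 100 days for 10–10 and 20 days for 50–50.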

The Impact of Balanced and Unbalanced Allocation on Error Rate

In the previous section, we saw that unbalanced allocation can reduce runtime compared with a balanced design. But before declaring a clear winner, it’s important to remember that unbalanced designs can slow the convergence of the test statistic to normality. With equal group sizes, the skewness of the two sample means largely cancels in their difference; with unequal group sizes it does not. This can affect test validity and increase the likelihood of errors.

Which errors? In A/B testing, statistical inference generally serves two purposes:

- Hypothesis testing evaluates whether a difference between groups is real or simply due to chance. The key is controlling the false positive rate, or alpha: the probability of incorrectly detecting an effect when none exists.
- Confidence intervals estimate the difference between groups, providing a range constructed to contain the true effect with probability 1 − alpha. A coverage error occurs if the true value falls outside the interval, which should happen with probability alpha.

To examine how allocation affects false positives and coverage errors, we simulated 5,000 tests using balanced (r = 1) and unbalanced (r = 9) designs with three highly skewed revenue datasets (skewness: 11.82, 28.63, 40.17) to ensure ecological validity. For false positives, control and treatment groups were sampled from the same distribution. For coverage, a true effect was simulated by scaling treatment observations by 1 ± MDE (depending on hypothesis direction), and we assessed whether the confidence intervals captured this effect. We examined three factors that interact with skewness and influence error rates:

- Sample size: 1K, 5K, 15K users in treatment.
- Hypothesis direction: left-, right-, and two-tailed tests.
- Outlier treatment: winsorization applied or not.
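
The false positive part of this design can be sketched with the standard library alone. This is an illustration, not the code behind the figures: it substitutes a lognormal distribution (sigma = 1.5, theoretical skewness of roughly 33) for the revenue datasets, uses a normal approximation for the p-value, caps both groups at a pooled 99th percentile when winsorizing, and runs far fewer and smaller simulations so that it finishes in seconds. All function names and defaults are assumptions.

```python
import random
from statistics import NormalDist, fmean


def mean_and_variance(values):
    """Sample mean and unbiased sample variance."""
    m = fmean(values)
    return m, sum((v - m) ** 2 for v in values) / (len(values) - 1)


def winsorize_pooled(treatment, control, upper_quantile=0.99):
    """Cap both groups at one upper quantile computed on the pooled data."""
    pooled = sorted(treatment + control)
    cap = pooled[min(len(pooled) - 1, int(upper_quantile * len(pooled)))]
    return [min(v, cap) for v in treatment], [min(v, cap) for v in control]


def p_value(treatment, control, tail):
    """Normal-approximation p-value for mean(treatment) - mean(control)."""
    mean_t, var_t = mean_and_variance(treatment)
    mean_c, var_c = mean_and_variance(control)
    z = (mean_t - mean_c) / (var_t / len(treatment) + var_c / len(control)) ** 0.5
    cdf = NormalDist().cdf(z)
    if tail == "left":    # H1: treatment mean is lower
        return cdf
    if tail == "right":   # H1: treatment mean is higher
        return 1.0 - cdf
    return 2.0 * min(cdf, 1.0 - cdf)  # two-sided


def false_positive_rate(n_treatment, r, tail="two-sided", n_sims=300,
                        alpha=0.05, sigma=1.5, winsorize=False, seed=0):
    """Share of simulated A/A tests that reject H0 at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        # Same distribution in both groups, so every rejection is a false positive.
        treatment = [rng.lognormvariate(0.0, sigma) for _ in range(n_treatment)]
        control = [rng.lognormvariate(0.0, sigma) for _ in range(n_treatment * r)]
        if winsorize:
            treatment, control = winsorize_pooled(treatment, control)
        rejections += p_value(treatment, control, tail) < alpha
    return rejections / n_sims


for ratio in (1, 9):
    rate = false_positive_rate(n_treatment=200, r=ratio, tail="left")
    print(f"r = {ratio}: left-tailed false positive rate = {rate:.3f}")
```

With only a few hundred simulations the estimates are noisy; raising `n_sims` and `n_treatment` toward the values used in the post gives more stable rates at the cost of runtime.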

**Figure 2.** False positive rate as a function of treatment sample size for three datasets (rows). Columns show results for two-sided (left), left-tailed (middle), and right-tailed (right) hypotheses. Colors indicate allocation ratio: balanced (10–10, red) and unbalanced (10–90, blue). Line type denotes whether winsorization was applied (dashed) or not (solid). The black horizontal line marks the nominal error rate (α = 0.05).

Our simulations confirm that, as expected, datasets with higher skewness are more prone to inflated false positive rates when using an unbalanced design. This effect is observed for two-sided and left-tailed hypotheses, whereas for right-tailed hypotheses, the error rate is actually below the nominal threshold. Furthermore, applying winsorization markedly reduces skewness and brings the false positive rate to the intended level.
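
To see why winsorization helps, it is enough to measure skewness before and after capping the upper tail. The sketch below uses a simulated lognormal sample as a stand-in for revenue and caps values at the empirical 99th percentile; both the distribution and the cutoff are illustrative assumptions, not the settings used in our simulations.

```python
import random
from statistics import fmean


def sample_skewness(values):
    """Moment-based sample skewness: m3 / m2 ** 1.5."""
    m = fmean(values)
    m2 = fmean([(v - m) ** 2 for v in values])
    m3 = fmean([(v - m) ** 3 for v in values])
    return m3 / m2 ** 1.5


def winsorize_upper(values, upper_quantile=0.99):
    """Replace values above the empirical upper quantile with that quantile."""
    ordered = sorted(values)
    cap = ordered[min(len(ordered) - 1, int(upper_quantile * len(ordered)))]
    return [min(v, cap) for v in values]


rng = random.Random(42)
revenue = [rng.lognormvariate(0.0, 1.5) for _ in range(20_000)]
capped = winsorize_upper(revenue)

print(f"skewness before winsorization: {sample_skewness(revenue):.2f}")
print(f"skewness after winsorization:  {sample_skewness(capped):.2f}")
```

Capping changes the metric being compared (the mean of capped revenue rather than raw revenue), so the cutoff should be fixed before the test starts.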

**Figure 3.** Coverage as a function of treatment sample size for three datasets (columns). Colors indicate allocation ratio: balanced (10–10, red) and unbalanced (10–90, blue). Line type denotes whether winsorization was applied (dashed) or not (solid). The black horizontal line marks the nominal coverage rate (0.95, for α = 0.05).

A similar pattern appears for confidence interval coverage: unbalanced designs without winsorization show lower-than-desired coverage. However, as sample size increases and winsorization is applied, coverage approaches the intended level.
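
A coverage check follows the same recipe: simulate a known effect, build the interval, and count how often it contains the truth. The sketch below is a simplified stand-in for our setup, with assumptions labeled: lognormal data instead of real revenue, a two-sided normal-approximation interval, an effect created by scaling treatment observations by 1 + MDE, and no winsorization. Function names and defaults are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean


def mean_and_variance(values):
    """Sample mean and unbiased sample variance."""
    m = fmean(values)
    return m, sum((v - m) ** 2 for v in values) / (len(values) - 1)


def diff_confidence_interval(treatment, control, alpha=0.05):
    """Two-sided normal-approximation CI for mean(treatment) - mean(control)."""
    mean_t, var_t = mean_and_variance(treatment)
    mean_c, var_c = mean_and_variance(control)
    se = math.sqrt(var_t / len(treatment) + var_c / len(control))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    diff = mean_t - mean_c
    return diff - z_crit * se, diff + z_crit * se


def coverage(n_treatment, r, mde=0.05, n_sims=300, alpha=0.05, sigma=1.5, seed=0):
    """Share of simulated intervals that contain the true difference in means."""
    rng = random.Random(seed)
    # The mean of a lognormal(0, sigma) variable is exp(sigma^2 / 2), so scaling
    # treatment by (1 + mde) makes the true difference mde * exp(sigma^2 / 2).
    true_diff = mde * math.exp(sigma ** 2 / 2)
    covered = 0
    for _ in range(n_sims):
        control = [rng.lognormvariate(0.0, sigma) for _ in range(n_treatment * r)]
        treatment = [(1 + mde) * rng.lognormvariate(0.0, sigma)
                     for _ in range(n_treatment)]
        low, high = diff_confidence_interval(treatment, control, alpha)
        covered += low <= true_diff <= high
    return covered / n_sims


for ratio in (1, 9):
    print(f"r = {ratio}: coverage = {coverage(n_treatment=200, r=ratio):.3f}")
```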

The Impact of Balanced and Unbalanced Allocation on Differences between Groups

So far, we have focused on the statistical properties of unbalanced allocation, highlighting its ability to reduce experiment duration and the importance of sample size, hypothesis type and winsorization for maintaining test validity. In practice, however, unbalanced designs can make experiments more vulnerable to operational effects that would otherwise affect groups more evenly. When group sizes differ substantially, system dynamics can interact with the allocation itself, introducing unintended biases.

For example, in an unbalanced design, the smaller treatment group may be more prone to cookie churn. Experiments typically rely on cookies to track users across sessions. When a user loses or refreshes a cookie, they are randomized again. If re-randomization uses the same allocation probabilities, a returning treatment user in a 10–90 split lands in control 90% of the time, whereas a returning control user switches groups only 10% of the time. This can have two key consequences:

- Sample Ratio Mismatch (SRM): Users from the smaller group may end up in the larger group more often than intended, causing the actual allocation to deviate from the planned design and potentially biasing treatment effect estimates.
- User Experience Inconsistency: Users in smaller groups have a higher chance of encountering multiple variants, which can lead to inconsistent features, messaging, or interfaces, potentially influencing engagement or behavior.
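
A sample ratio mismatch can be detected by comparing the observed group counts with the planned allocation. One common way to do this is a chi-square goodness-of-fit test; the sketch below implements it for two groups using only the standard library. The function name `srm_p_value` and the example counts are hypothetical.

```python
import math


def srm_p_value(n_treatment: int, n_control: int,
                expected_treatment_share: float) -> float:
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for the split."""
    total = n_treatment + n_control
    expected_t = total * expected_treatment_share
    expected_c = total - expected_t
    chi2 = ((n_treatment - expected_t) ** 2 / expected_t
            + (n_control - expected_c) ** 2 / expected_c)
    # For 1 degree of freedom, P(chi2 > x) equals erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# Planned 10-90 split, but treatment has lost 300 users to control.
p = srm_p_value(n_treatment=9_700, n_control=90_300, expected_treatment_share=0.10)
print(f"SRM p-value: {p:.5f}")
```

A very small p-value indicates that the observed split is unlikely under the planned allocation, which is a signal to investigate assignment and tracking before interpreting the results.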

Another potential issue arises when groups share system resources. Unbalanced designs can introduce bias if users compete for limited resources, such as LRU caches. The larger variant naturally occupies more cache entries, which can give it a performance advantage (e.g., faster page loads) over the smaller variant. This may confound experimental results by creating an artificial boost unrelated to the treatment itself.
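
The cache effect is easy to reproduce in a toy simulation. The sketch below is an illustration under strong simplifying assumptions: each variant has its own set of cacheable items, items are requested uniformly at random, and both variants share one LRU cache. The function name and all parameter values are hypothetical.

```python
import random
from collections import OrderedDict


def simulate_shared_lru(treatment_share, n_requests=50_000,
                        items_per_variant=500, cache_size=300, seed=0):
    """Cache hit rate per variant when both variants share one LRU cache."""
    rng = random.Random(seed)
    cache = OrderedDict()
    hits = {"treatment": 0, "control": 0}
    requests = {"treatment": 0, "control": 0}
    for _ in range(n_requests):
        variant = "treatment" if rng.random() < treatment_share else "control"
        # Each variant caches its own version of an item, so keys never overlap.
        key = (variant, rng.randrange(items_per_variant))
        requests[variant] += 1
        if key in cache:
            hits[variant] += 1
            cache.move_to_end(key)          # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)   # evict the least recently used entry
    return {v: hits[v] / requests[v] for v in hits}


for share in (0.5, 0.1):
    rates = simulate_shared_lru(treatment_share=share)
    print(f"treatment share {share:.0%}: "
          f"treatment hit rate {rates['treatment']:.2f}, "
          f"control hit rate {rates['control']:.2f}")
```

In the 10–90 run the control variant's entries are refreshed far more often, so they survive eviction and the control hit rate is much higher, even though the two variants are otherwise identical.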

To Balance or Not to Balance: Choosing the Right Experiment Design

The goal of this post was to highlight key considerations when unequal allocation is needed for business reasons. While unbalanced designs (e.g., a 10–90 split) can reduce experiment duration, several factors require careful attention. Let’s summarize the main takeaways:

- Skewness: The primary statistical feature affecting the validity of an unbalanced test is the level of skewness. Highly skewed distributions converge more slowly to normality, increasing the likelihood of errors. Several factors determine how strongly skewness affects the error rate:
  - Sample size: Larger sample sizes improve the convergence of the sampling distribution to normality, helping maintain the desired error rate even under unbalanced designs.
  - Hypothesis type: Ensure the hypothesis direction aligns with the skew of the metric. For highly skewed metrics, testing in the opposite direction of skew can inflate false positives, especially when sample size is small. In many A/B testing scenarios, this condition is naturally satisfied: for example, right-tailed hypotheses (treatment increases revenue) on right-skewed metrics such as revenue.
  - Outlier treatment: Handling outliers with winsorization can reduce skewness and bring both the false positive rate and the coverage rate closer to their nominal levels.
- User identification: Consider how users are tracked. When experiments rely on cookies, unbalanced allocation can amplify cookie churn, leading to sample ratio mismatches or inconsistent user experiences. These risks are mitigated when users are identified through persistent IDs, such as logins.
- Shared resources: Check whether variants compete for shared system resources like caches or recommendation engines. In unbalanced designs, larger variants may gain disproportionate access, artificially boosting performance metrics. Where possible, resources should be partitioned or configured independently of variant size.

While the main advantage of unbalanced allocation is faster experiments, operational factors can introduce unintended biases. As a best practice, running an A/A test with the intended allocation ratio helps ensure the design does not create sample ratio mismatches or other unintended differences between variants. Ultimately, there is no one-size-fits-all answer: choosing between balanced and unbalanced designs depends on the context and the characteristics of the testing environment and system.
