Randomization is one of the cornerstones of AB testing. You may choose the correct statistical method for your test, carefully plan the sample size needed to achieve adequate power, and thoroughly confirm that your assumptions are met, but none of these efforts will be worth it if your samples are not random.
Why is randomization so important, and how can we achieve it? These are the two key questions we’ll address in this blog. If you’re unsure about the answers, this post is a must-read for you: without proper randomization, no A/B test can be considered valid!
Let’s imagine that Tom and Jerry have a debate: Is eating meat good for you? To answer this question empirically, they decide to conduct an experiment. Tom gathers a group of meat eaters and records their health scores, while Jerry does the same for vegetarians. After collecting the data, they perform a t-test and find that Jerry’s group of vegetarians appears statistically healthier than Tom’s group of meat eaters.
At first glance, this result might suggest that a vegetarian diet is healthier. However, what if you were told that women tend to follow a vegetarian diet more than men? This would change the interpretation: the difference between the two groups may not be due to their diet, but rather to other lifestyle factors that vary between genders.
This example underscores the critical importance of random allocation. To accurately isolate the impact of a change made to your product, the experimental manipulation must be the only systematic difference between the groups. If this condition isn't met, then even if all procedures are followed correctly and a statistically significant result is observed, we cannot confidently attribute the outcome to the manipulation itself. Instead, unaccounted-for confounding factors may be driving the observed difference.
Methods for achieving randomized sampling span two extremes. On one end, simple randomization requires almost no intervention: you let chance do the work. On the other end, more structured approaches actively ensure that both groups are carefully balanced to share similar characteristics. In the following sections, we will explore the differences between these approaches, their respective pros and cons, and when each method is recommended.
Before diving into different randomization methods, we find it helpful to distinguish between two related, but different, motivations in the field of randomization. Historically, the primary goal of randomization techniques was to ensure that a sample accurately represents the broader population (for example, ensuring that participants in a political survey reflect the full population of voters).
In the context of A/B testing, however, there is another motivation: not only to mirror the overall population, but also to ensure that the control and treatment groups are comparable to each other. To achieve this, randomization methods used in A/B testing build on the same foundational principles as traditional approaches, but the technical implementation and focus often differ.
Adding to the confusion, some terms are used interchangeably across these contexts, whether the goal is population representativeness or groups’ comparability. One of the aims of this blog is to bring clarity to this space. To that end, we’ll walk through three core categories of randomization methods used in A/B testing and introduce the terminology we're most familiar with for each. While some of these terms also appear in the context of single-sample randomization for representativeness, keep in mind that similar names don’t always imply the same method or objective.
The idea behind simple randomization is straightforward: if you randomly select users and assign them to groups, the groups will likely have a similar mix of characteristics. Since simple randomization does not actively ensure that the two groups have similar profiles, relying instead on the assumption that randomness will balance them out, it is crucial that the assignment mechanism be genuinely random: unaffected by the treatment and not fixed across experiments. For example, a fixed mechanism that consistently assigns users to the same groups (such as assigning users based on the modulus of their user_id) may work fine for the first experiment. In subsequent tests, however, it can introduce bias, making your conclusions specific to that fixed user mixture rather than generally applicable.
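To make the pitfall concrete, here is a minimal sketch contrasting a fixed modulus-based assignment with a per-experiment salted hash. The function names and the salt scheme are illustrative, not a reference to any particular experimentation platform; the key point is that adding an experiment-specific salt keeps assignment deterministic within one test yet independent across tests.

```python
import hashlib

def fixed_assignment(user_id: int) -> str:
    # Pitfall: the same users land in the same group in every experiment,
    # so any bias or carry-over effect repeats across tests.
    return "A" if user_id % 2 == 0 else "B"

def salted_assignment(user_id: int, experiment_salt: str) -> str:
    # Hashing the user_id together with a per-experiment salt keeps the
    # assignment stable within a test but uncorrelated between tests.
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"
```

With the salted version, rerunning the same experiment yields the same split, while a new experiment (a new salt) produces a fresh, effectively independent split of the same users.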
One potential drawback of simple randomization is that, although it often produces groups with similar characteristics, it does not guarantee comparability. The inherent randomness can lead to imbalances in key attributes, which becomes especially problematic when certain characteristics in the population have a strong influence on the test outcome. For instance, simple randomization cannot ensure that users’ countries of origin are evenly distributed across groups. There is a chance that one group ends up with a higher proportion of users from the USA. If users from the USA tend to have a higher likelihood of making purchases, this imbalance could confound the results. In such cases, it becomes difficult to determine whether observed effects are due to the experimental manipulation or to underlying group differences unrelated to the test itself.
To address this issue, analysts may use seed randomization, an approach that sits between purely randomized and strictly controlled methods. Rather than relying on a single random split, seed randomization generates multiple random groupings and uses the users' historical data to choose a balanced one. In practice, users are repeatedly divided into groups before the experiment begins. For each division, historical user data is used to evaluate group balance based on key factors relevant to the test. If the groups are not sufficiently comparable on those factors, a new random seed is applied to generate a different division. This process is repeated until a grouping is found that achieves acceptable balance across the specified attributes.
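The loop described above can be sketched as follows. This is a simplified illustration, assuming a single historical metric per user and a plain mean-difference balance criterion; in practice, analysts often check several covariates, possibly with statistical tests. The function name and thresholds are hypothetical.

```python
import random
from statistics import mean

def seed_randomize(user_metrics, max_abs_diff=0.5, max_tries=1000):
    """Try successive seeds until the two groups are balanced on a
    historical metric (here: absolute mean difference below max_abs_diff)."""
    user_ids = list(user_metrics)
    for seed in range(max_tries):
        rng = random.Random(seed)
        shuffled = user_ids[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        group_a, group_b = shuffled[:half], shuffled[half:]
        diff = abs(mean(user_metrics[u] for u in group_a)
                   - mean(user_metrics[u] for u in group_b))
        if diff <= max_abs_diff:
            # Record the winning seed so the split is reproducible.
            return seed, group_a, group_b
    raise RuntimeError("no sufficiently balanced split found")
```

Returning the accepted seed is the point of the technique: the split can be regenerated exactly, and the search can be audited or rerun with stricter balance criteria.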
The final randomization approach moves away from simple randomization altogether, opting instead for controlled allocation. In this method, historical user data is used beforehand to deliberately assign users to groups with similar characteristics. This stands in contrast to seed randomization, where users are first randomly assigned to groups, and those groupings are later evaluated for balance. In controlled allocation, the process is reversed: we assess user characteristics first, then actively construct comparable groups based on those attributes.
This method effectively integrates two traditional approaches: stratified sampling and block randomization. As a result, you may encounter terms like stratified randomization, blocked randomization, or randomized block design to describe this technique.
The concept is straightforward. First, key characteristics are selected for balancing, and users are grouped into different strata based on combinations of those characteristics. For example, suppose an analyst wants to balance groups based on two factors: whether a participant tends to make payments in the app (paying vs. non-paying user), and their level of experience with the app (more than one month vs. less than one month). This results in four strata: (1) Paying user + long-term user, (2) Paying user + recent user, (3) Non-paying user + long-term user and (4) Non-paying user + recent user.
Within each stratum, participants are then assigned to groups using a block randomization method. To that end, a block size is predetermined and a randomized sequence of group labels (e.g., A and B) is generated. Users within the stratum are ordered and assigned sequentially according to this randomized label sequence. Once a block is filled, a new randomized sequence is generated for the next block.
To illustrate, let’s revisit our earlier example: suppose the block size is 4 and the groups are A and B. One possible sequence could be ABBA. The first user in the stratum would be assigned to group A, the second to B, the third to B, and the fourth to A. Once this block is complete, a new sequence is generated, say, BBAA, and the next four users are assigned accordingly. This process ensures that the proportion of users in each group remains balanced within each stratum.
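Putting the two pieces together, stratification plus within-stratum block randomization can be sketched like this. The helper names and the `stratum_of` mapping are illustrative; a real implementation would derive strata from the balancing attributes chosen for the test (e.g., paying status and tenure).

```python
import random
from collections import defaultdict

def block_sequence(block_size, labels=("A", "B"), rng=random):
    # One block contains each label an equal number of times, in random order
    # (e.g., block_size=4 might yield ABBA or BBAA).
    assert block_size % len(labels) == 0
    seq = list(labels) * (block_size // len(labels))
    rng.shuffle(seq)
    return seq

def stratified_block_assign(users, stratum_of, block_size=4, seed=42):
    """Assign users to groups, balancing within each stratum.
    `stratum_of` maps a user to its stratum, e.g. ('paying', 'long-term')."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for u in users:
        strata[stratum_of(u)].append(u)
    assignment = {}
    for stratum_users in strata.values():
        seq = []
        for u in stratum_users:
            if not seq:  # current block exhausted: draw a fresh one
                seq = block_sequence(block_size, rng=rng)
            assignment[u] = seq.pop(0)
    return assignment
```

Because every completed block contains equal counts of A and B, group sizes stay balanced within each stratum as users are assigned.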
If you are still a bit confused about the different randomization methods, no worries! The illustration below will help tie everything together.
Now that you're familiar with the different randomization methods used in A/B testing, it's time to consider which method is most appropriate for your experiment. Specifically, there are three key questions every analyst should ask when choosing a randomization strategy:
1. Which users are participating in the experiment?
One of the most important distinctions between randomization methods lies in whether or not you have prior knowledge about the users. Simple randomization requires no historical information: it can be applied to any users as they arrive. In contrast, seed randomization and stratified randomization depend on knowing the cohort in advance and having access to relevant user data.
Therefore, the first consideration when choosing a randomization strategy is the availability of historical data. If you're testing on new users and no pre-test data is available, simple randomization may be the only viable option.
2. Which type of allocation is involved in the experiment?
User allocation into testing conditions can be carried out in two main ways: online or offline. In online allocation, users are assigned to groups dynamically as they interact with the system. In contrast, offline allocation involves assigning users to groups in advance, before they engage with the system. The choice between these approaches often depends on the characteristics of the test population, for example, offline allocation isn’t feasible for new users without prior data. It may also be influenced by infrastructure constraints (e.g., if the company’s allocation system only supports online assignment) or performance considerations (e.g., offline assignment may reduce runtimes). While a detailed discussion of online vs. offline allocation is beyond the scope of this blog, it’s important to recognize that the chosen methodology directly impacts which randomization strategies are available. In particular, online allocation generally limits you to simple randomization, since more advanced techniques, such as stratified or seed randomization, require grouping users in advance.
3. Do I really need that?
When selecting a randomization method, analysts must consider whether a more complex approach is necessary, or if a simpler one will suffice. The answer to this question is less binary than the previous ones and often relies on the analyst's judgment and experience. Key factors to consider include the sensitivity of the Key Performance Indicator (KPI) to specific characteristics of the groups. For instance, if test results are highly sensitive to a particular feature (e.g., paying vs. non-paying users), relying solely on simple randomization could introduce unintended imbalances. In such cases, more advanced methods like stratified or seeded randomization may be more appropriate.
In short, choosing the right randomization method requires balancing practical feasibility against implementation complexity, ensuring the method aligns with the specific needs of the test and the company.
Ultimately, the choice between simple randomization and more advanced methods depends on your study’s objectives and the characteristics of your test population. If your priority is to ensure group comparability and reduce bias, techniques like seeded or stratified randomization offer greater control and precision. However, these approaches may not be practical when dealing with new users who lack historical data. If you're interested in a deeper comparison of these methods, including their impact on statistical power, group balance, and KPI distributions, stay tuned for our next blog post, where we’ll explore these trade-offs in detail!