This post is the third in our blog series on testing for digital organizers. Today I’ll be talking a bit about what an A/B test is and explain how to determine the sample size (definition below) you’ll need to conduct one.
Hey, pop quiz! Is 15% greater than 14%?
My answer is “well, kind of.” To see what I mean, let’s look at an example.
Let’s say you have two elevators, and one person at a time enters each elevator for a ride. After 100 people ride each elevator, you find that 15 people sneezed in elevator 1, and 14 people sneezed in elevator 2.
Clearly, a higher percentage of people sneezed in elevator 1 than elevator 2, but can you conclude with any certainty that elevator 1 is more likely to induce sneezing in its passengers? Or, perhaps, was the difference simply due to random chance?
In this contrived example, you could make a pretty good case for random chance just with common sense, but the real world is ambiguous so decisions can be trickier. Fortunately, some basic statistical methods can help us make these judgments.
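If you'd like to see what one of those methods says about the elevators, here is a minimal sketch in R (the free tool used in the appendix). It assumes the counts from the example above and uses `prop.test`, R's built-in test for comparing two proportions. The variable names are just illustrative.

```r
# Elevator example: 15 of 100 riders sneezed in elevator 1, 14 of 100 in elevator 2.
sneezes <- c(15, 14)
riders  <- c(100, 100)

# Two-sample test for equality of proportions (from R's built-in stats package).
result <- prop.test(x = sneezes, n = riders)

# A p-value far above 0.05 means the data give no real evidence of a difference.
print(result$p.value)
```

The p-value here comes out far above the usual 0.05 cutoff, which matches the common-sense reading: one extra sneeze out of 100 riders is easily explained by chance.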
One specific type of test for determining differences in proportions[1] is commonly called an A/B test. I’ll give a simple overview of the concepts involved and include a technical appendix for instructions on how to perform the procedures I discuss.
Let’s recall what we already said: we can perform a statistical test to help us detect a difference (or lack thereof) between the action rate in two samples. So, what’s involved?
I’ll skip over the nitty-gritty statistics of this, but it’s generally true that as the number of trials[2] increases, it becomes easier to tell whether the difference (if there’s any difference at all) between the two variations’ proportions is likely to be real, or just due to random chance. Or, slightly more accurately, as the number of trials increases, smaller differences between the variations can be more reliably detected.
What I’m describing is actually something you’ve probably already heard about: sample size. For example, if we have two versions of language on our contribution form, how many people do we need to have land on each variation of the contribution form to reliably detect a difference (and, consequently, decide which version is statistically “better” to use going forward)? That number is the sample size.
To determine the number of people you’ll need, there are a few closely related concepts (which I explain in the appendix), but for now, we’ll keep it simple. The basic idea is that as the percent difference between variations you wish to reliably detect decreases, the sample size you’ll need increases. So, if you want to detect a relatively small (say, 5%) difference between two variations, you’ll need a larger sample size than if you wanted to be able to detect a 10% difference.
How do you know the percent difference you’d like to be able to detect? Well, a good rule of thumb to start with is that if it’s a really important change (like, say, changing the signup flow on your website), you’d want to be able to detect really small changes, whereas for something less important, you’d be satisfied with a somewhat larger change (and therefore less costly test).
Here’s what that looks like:
For example, if you’re testing two versions of your contribution form language to see which has a higher conversion rate, your typical conversion rate is 20%, and you want to be able to detect a relative difference of around 5% (that is, 20% versus 21%), you’d need about 26k people in each group.
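To put rough numbers on that relationship, here is a small sketch in R using `power.prop.test`, the function walked through in the appendix. It assumes the 20% conversion rate from the example, a significance level of 0.05, and a power of 0.8. The particular list of differences is only an illustration.

```r
base_rate <- 0.20                       # typical conversion rate from the example
lifts     <- c(0.05, 0.10, 0.15, 0.20)  # relative differences to detect (illustrative)

# Required people per group for each relative difference.
n_per_group <- sapply(lifts, function(lift) {
  power.prop.test(p1 = base_rate,
                  p2 = base_rate * (1 + lift),
                  sig.level = 0.05,
                  power = 0.8)$n
})

print(data.frame(relative_difference = lifts,
                 people_per_group = ceiling(n_per_group)))
```

The first row corresponds to the roughly 26k figure above, and each larger difference needs a noticeably smaller group.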
For instructions on how to find that number, see the appendix below. Once you have determined your required sample size, you’ll be ready to set up your groups and variations, run the test, and evaluate the results of your test. Each of those will be upcoming posts in this series. For now, feel free to email info [at] actblue [dot] com with any questions!
[1] Note that this should be taken strictly as “proportions”. Of course, there are many things to be interested in other than the percentage of people who did an action vs. didn’t (e.g., donated vs. didn’t donate), like values of actions (e.g., contribution amounts), but for now, we’ll stick to the former.
[2] I.e., the number of times something happens. For example, this could be the number of times someone reaches a contribution form.
Statistics is a big and sometimes complicated world, so I won’t explain this in too much detail. There are many classes and books that will dive into the specifics, but I want you to have a working knowledge of a few important concepts you’ll need to complete an accurate A/B test. I’m going to outline four closely related concepts necessary for determining your sample size, and walk through how to find this number. Even though I’m sticking to the basics, this section will be a bit on the technical side of things. Feel free to shoot an email our way with any questions; I’m more than happy to answer any and all.
Like I said, there are four closely related concepts when it comes to this type of statistical test: significance level, power, effect size, and sample size. I’ll talk about each of these in turn, and while I do, remember that our goal is to determine whether we can reject the assumption that the two versions are equal (or, in layman’s terms, figure out that there is a real statistical difference between the two versions).
Significance level can be thought of as the (hopefully small) likelihood of a false positive. Specifically, it’s the probability that you falsely reject the assumption that the two versions are equal (i.e., claim that one version is actually better than the other, even if it’s not). When you hear someone talk about a p-value, they’re referencing a closely related concept: the p-value is the number your test produces, and you call the result significant when it falls below the significance level you chose in advance. The most commonly used significance level is 0.05, which is akin to saying “if there’s actually no real difference, there’s a 5% chance that I’ll claim one anyway”.
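One way to build intuition for this is to simulate many tests where both versions are truly identical and count how often the test claims a difference anyway. The sketch below does that in R. The seed, number of simulated tests, group size, and action rate are all made-up values for illustration.

```r
set.seed(42)         # arbitrary seed, only so the run is repeatable
n_tests   <- 2000    # number of simulated tests (illustrative)
n_group   <- 1000    # people per group (illustrative)
true_rate <- 0.10    # both versions share the same true action rate

# Run each simulated test and keep its p-value.
p_values <- replicate(n_tests, {
  a <- rbinom(1, n_group, true_rate)  # actions in version A
  b <- rbinom(1, n_group, true_rate)  # actions in version B
  prop.test(c(a, b), c(n_group, n_group), correct = FALSE)$p.value
})

# Share of tests that wrongly claim a difference at the 0.05 level.
false_positive_rate <- mean(p_values < 0.05)
print(false_positive_rate)
```

Because there is no real difference in this simulation, every "significant" result is a false positive, and the share should land in the neighborhood of 0.05.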
Power is the probability that you’ll avoid a false negative. Or, said another way, the probability that if there’s a real difference there, you’ll detect it. The standard value to use for this is 0.8, meaning there is an 80% chance you’ll detect it. 0.8 is by no means always the best value to choose for power, and there can be really good reasons for adjusting it, but it’s generally a good idea to change it only if you know exactly why you’re doing so. 0.8 will work for our purposes, though. Why not just pick a value of 0.9999, which is similar to saying “if there’s a real difference, there’s a 99.99% chance that I’ll detect it”? Well, that would be nice, but as you increase this value, the sample size required increases. And sample size is likely to be the limiting factor for an organization with a small (say, fewer-than-100k-member) list.
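To see that tradeoff in numbers, here is a rough sketch using R's `power.prop.test` (the same function introduced further down). It assumes a 10% base action rate and a 10.5% treatment rate, and varies only the power. The specific power values are just for illustration.

```r
powers <- c(0.8, 0.9, 0.99, 0.9999)  # power values to compare (illustrative)

# Required people per group at each power level, everything else held fixed.
n_per_group <- sapply(powers, function(pw) {
  power.prop.test(p1 = 0.1, p2 = 0.105, sig.level = 0.05, power = pw)$n
})

print(data.frame(power = powers,
                 people_per_group = ceiling(n_per_group)))
```

Each step up in power raises the number of people needed in each group, which is why 0.9999 is rarely practical for a small list.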
Effect size. Of the two versions you’re testing against each other, typically you’d call one the ‘control’ and the other the ‘treatment’, so we’ll use those terms. Effect size asks: what do you expect the proportion of actions (e.g., contributions) to be for the control, and what do you expect it to be for the treatment? The percent difference between the two is the effect size. How this affects sample size is demonstrated in the above graph. But the whole point of running this test is that you don’t know what the two proportions will be in advance, so how can you pick those values? Well, you start by estimating what your base action rate will be. For example, if your donation rate from an email is typically 5%, then you can use that as your base action rate. Then, for the second proportion, pick the smallest difference you’d like to be able to detect. As with power, you might find yourself asking “well, why wouldn’t I just pick the smallest possible difference?” Again, the answer is that as you decrease the magnitude of the difference, the sample size you need will increase.
Finally, we have sample size, or the number of people we need to run the test on. If we have values for the above three things, we can figure out how big of a sample we need!
So how do we do that? Well, there are many ways to do it, but one of the easiest, best, and most accessible is R. It’s free, open-source, and has an excellent community for support (which really helps as you’re learning). Some might ask, “well that has a relatively high learning curve, doesn’t it? And, isn’t there some easier way to do this?” The answer to both of those questions is “maybe,” but I’ll give you everything you need in this blog post. There are also online calculators of varying quality that you can use, but R is really your best bet, no matter your tech level.
Doing this in R is actually pretty simple (and you’ll pick up another new skill!). After you download, install, and open R, enter the following command:
power.prop.test(p1 = 0.1, p2 = 0.105, sig.level = 0.05, power = 0.8)
and press enter. You’ll see a printout with a bunch of information, but you’re concerned with n. In this example, it’s about 58k. That number is the sample size for each group you’d need to detect, in this case, a 5% relative difference (10% versus 10.5%) at a significance level of 0.05, a power of 0.8, and a base action rate of 10%. So, just to be certain we’re on the same page, a quick explanation:
p1: Your ‘base action rate’, or the value you’d expect for the rate you’re testing. If your donation rate is usually 8%, then p1 = 0.08.
p2: Your base action rate plus the smallest percent difference you’d like to be able to detect. If you only care about noticing a 10% difference, and your ‘base action rate’ is 8%, then p2 = 0.088 (that is, 0.08 + (0.08 * 0.10)).
Of course, your base action rate will likely be different, as will be the percent difference you’d like to be able to detect. So, substitute those values in, and you’re all set! Playing around with different values for these can help you gain a more intuitive sense of what happens to the required sample size as you alter certain factors.
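If you expect to try many combinations, one option is to wrap the command in a small function. The sketch below is only a convenience: `sample_size_per_group` is a made-up helper name, not something built into R, and it assumes you think of the difference as a percent of your base action rate, as in the p2 example above.

```r
# Hypothetical helper (not part of R): people needed in each group.
sample_size_per_group <- function(base_rate, relative_lift,
                                  sig.level = 0.05, power = 0.8) {
  result <- power.prop.test(p1 = base_rate,
                            p2 = base_rate * (1 + relative_lift),
                            sig.level = sig.level,
                            power = power)
  ceiling(result$n)  # round up, since you can't test a fraction of a person
}

# The example above: 8% base action rate, 10% difference (p2 = 0.088).
print(sample_size_per_group(0.08, 0.10))

# Same base rate, but a smaller 5% difference needs a larger sample.
print(sample_size_per_group(0.08, 0.05))
```

Calling it with a few different base rates and differences is a quick way to see how sensitive the required sample size is to each input.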