This post is the fourth in our blog series on testing for digital organizers. Today I’ll be talking about implementing your A/B test. This post will be full of helpful, quick tips.
So, we’ve discussed some things you might want to test, and some other things you might not want to test. Then, we walked through a simple way to figure out the number of people you’ll need in each of your test groups, a number that depends on the smallest difference you’d like to reliably detect.1 Now what?
Well, the short answer is “run the test”, but of course it’s never that simple. Your next specific steps depend on what you’re testing, as well as which platform you’re using to run the test. There are too many possibilities for me to go through each one, but I can provide a few quick tips that should apply to you regardless of your specific situation.
First, make sure you have a reliable method of tracking your variations’ performance, such as reference codes or an A/B testing tool (here are instructions for using ours), and make sure you actually implement that method. This may sound like a no-brainer, but we’ve seen plenty of people start what would otherwise be an excellently set-up test with nothing to measure the variations’ relative performance! Is there a joke here about the “results” of the test?
Groaners aside, pointing out that error isn’t at all to make fun of the people who have committed it. Rather, we’re all busy, and things can get hectic. Having this on your pre-send checklist2 will save you from the realization that a lot of time spent thinking up a test, creating the content, and so on ad nauseam was all for naught.
What’s an example? Well, say you’re testing email content for donations. And of course, you want to use the best online fundraising software in the whole wide world, so you’re using ActBlue. Well, we have a handy feature that allows you to generate reference codes to track donations. We have a full instruction guide for using reference codes in our tutorial, found here. If you’re testing two different versions of your email, you could attach the URL param3 refcode=variation_a to the links in your first email and refcode=variation_b to those in your second email. Then, when you go to https://actblue.com/pages/[YOUR_PAGE_NAME]/statistics, you can measure the performance of each email. The information will also appear in a .csv download of your contribution form donations.
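To make the link tagging concrete, here is a minimal sketch in Python of appending a refcode param to a link. The form URL and its existing source=email param are made-up placeholders, and add_refcode is a hypothetical helper name, not part of any ActBlue tool; only the refcode param and the two variation names come from the example above.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def add_refcode(url, refcode):
    """Return url with a refcode query param added (or replaced if present)."""
    parts = urlparse(url)
    # Keep any params the link already has, then set refcode.
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query["refcode"] = refcode
    return urlunparse(parts._replace(query=urlencode(query)))


# Placeholder URL (an assumption): substitute your own contribution form link.
FORM_URL = "https://example.com/donate/your-page?source=email"

link_a = add_refcode(FORM_URL, "variation_a")  # goes in the first email
link_b = add_refcode(FORM_URL, "variation_b")  # goes in the second email
print(link_a)
print(link_b)
```

Building the links with a helper like this, rather than by hand, is one way to make sure every link in a given email carries the same code.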
We also allow a handy refcode2 URL param if you want to conveniently subdivide your tracking. Conceptually, it’s the exact same thing as refcode; its value lies in the fact that it’s an extra place to store information. Think of a backpack with an extra divider on the inside for sorting your stuff. This is the internet version of that. For example, we use this for tracking link placement in the email. The need for refcode2, however, indicates that your test might be a bit complicated (i.e., there are more than just two variations), so setup and evaluation of the test is a bit outside the scope of the tips in this testing series. That’s no problem, but you might want to shoot us an email at digital [at] actblue [dot] com to have a chat about test setup and design.
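If you do end up tracking with both params, here is a rough sketch of tallying a donations .csv by refcode and refcode2. The column names (refcode, refcode2, amount) and the sample rows are assumptions made up for illustration; check the headers in your actual download before borrowing any of this.

```python
import csv
import io
from collections import defaultdict


def tally_by_refcode(csv_file):
    """Count donations and sum amounts for each (refcode, refcode2) pair."""
    totals = defaultdict(lambda: {"count": 0, "amount": 0.0})
    for row in csv.DictReader(csv_file):
        # Column names here are assumed, not taken from a real export.
        key = (row["refcode"], row["refcode2"])
        totals[key]["count"] += 1
        totals[key]["amount"] += float(row["amount"])
    return dict(totals)


# Made-up sample rows standing in for a real .csv download.
SAMPLE = """refcode,refcode2,amount
variation_a,top_link,10.00
variation_a,bottom_link,5.00
variation_b,top_link,25.00
variation_a,top_link,15.00
"""

totals = tally_by_refcode(io.StringIO(SAMPLE))
for (refcode, refcode2), stats in sorted(totals.items()):
    print(refcode, refcode2, stats["count"], stats["amount"])
```

With a real file you would pass open("your_download.csv", newline="") instead of the in-memory sample.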
My second tip is related to groups. Taking your list—or some subset of your list—and dividing it up into smaller, randomized groups is a step that you’ll likely do in your CRM or email tool. Unfortunately, I can’t provide detailed instructions for each one. Chances are, though, that your CRM has an instruction page on how to do this within their software.4 In any case, this step is critical: without at least randomizing before conducting your trial, you’re setting yourself up for failure.
Here’s an example of how to do it wrong: let’s say you’re testing two emails, and even though you’re not sure which one is better, you have a hunch that email B is better than email A. So, not wanting to lose out on money, you decide to assign 20,000 people with the highest previous donations to group B and 20,000 people with the lowest previous donations to group A. That way, you can conduct the test to find out which email is definitely better, but not have to lose too much money along the way, right? Well, that’d be great, but unfortunately it’s all wrong. Assigning your groups that way would all but ensure you draw false conclusions about your test–email B is all but certain to bring in more donations, but it’s because it was assigned high-propensity donors, not necessarily because it’s the better email. Make sure you’re at least randomizing (with a proper algorithm, q.v. footnote 4) before splitting your groups and implementing your test.
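For the curious, here is a minimal sketch of what “randomize, then split” looks like if you were doing it by hand in Python rather than in your CRM. The function name, the example addresses, and the seed are all made up; your email tool’s own randomizer may well work differently.

```python
import random


def split_into_groups(recipients, n_groups=2, seed=None):
    """Shuffle the recipients, then deal them out into n_groups groups."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    shuffled = list(recipients)  # copy so the original list is untouched
    rng.shuffle(shuffled)
    # Every n_groups-th person after the shuffle goes to the same group,
    # so group sizes differ by at most one.
    return [shuffled[i::n_groups] for i in range(n_groups)]


# Made-up list standing in for your real one.
emails = [f"supporter{i}@example.com" for i in range(10)]
group_a, group_b = split_into_groups(emails, n_groups=2, seed=2024)
print(len(group_a), len(group_b))
```

The point is that nothing about a person (like donation history) influences which group they land in; only the shuffle does.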
My third tip is short and sweet. After you do all of this legwork, how do you know that the right variations were sent to the right number of people? What if you’re working with eight groups instead of just two? Well, the answer is that you don’t, really. But that can (and should!) be remedied. Place your own email address in each of the test groups. This won’t significantly affect the results of the test, but it will allow you to be sure that the right variations were sent. “But I only have one email address. How can I put myself in multiple test groups without the hassle of creating new emails?”, you ask. Use the old email-campaigner’s trick of adding a “+” to your email address if you have a Gmail-based address. For example, if your email address is email@example.com, you can add email+groupA@example.com to group A and email+groupB@example.com to group B; they’ll both be delivered to your inbox, and you’ll be able to see at a glance whether the variations were sent correctly.
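If you have more than a couple of groups, generating those “+” addresses by hand gets tedious. Here is a small sketch; the address, the group names, and the seed_address helper are all hypothetical, and the trick itself only works if your mail provider supports “+” addressing (as Gmail does).

```python
def seed_address(address, tag):
    """Insert +tag before the @ so mail still lands in the same inbox."""
    local, at, domain = address.rpartition("@")
    if not at or not local or not domain:
        raise ValueError(f"not a valid email address: {address!r}")
    return f"{local}+{tag}@{domain}"


# Hypothetical address and group names.
my_address = "yourname@gmail.com"
group_names = ["groupA", "groupB", "groupC"]

seeds = {name: seed_address(my_address, name) for name in group_names}
for name, address in seeds.items():
    print(name, address)
```

Add each generated address to its matching group before you send.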
My fourth and last tip of the day is the most important one of all. Remember going through the process of determining your required sample size? Well, we did that for a (lengthily explained) reason. Don’t deviate from that now. What the hell am I talking about? I’m talking about peeking at the results too early (viz., before you reach your necessary sample size).
I get it. You spent a lot of time setting up a test for these awesome variations of, say, a contribution form, and even though you know you need to wait until 15,000 people land on the form to see results, you want to check what’s happening. Has either taken an early lead? And so on.
You can check what’s happening along the way, but you should definitely not stop the test early because it looks like one variation is performing better.5 This is a really common mistake, but a deadly one. I can’t stress this enough. The more times you test two variations for significance (which we’ll talk about in a future post) before the required sample size is hit, the more likely you are to get a false positive, because every extra look is another chance for random noise to cross the significance threshold. In fact, you can pretty quickly render your test effectively useless. So, if you just have to see what’s going on, fine, but promise yourself and statisticians everywhere that you won’t act on what you see!
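If you’d like to see the peeking problem for yourself, here is a rough simulation sketch. It runs a bunch of pretend tests where both variations are identical (so any “significant” result is a false positive), and compares testing once at the end against testing at several peeks along the way. Every number in it (group size, conversion rate, number of peeks) is an arbitrary assumption, and the two-proportion z-test is just one common significance check swapped in for illustration, not necessarily the one we’ll use later in this series.

```python
import math
import random


def is_significant(conv_a, conv_b, n, z_crit=1.96):
    """Two-sided two-proportion z-test for equal group sizes (about 5% level)."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return False  # no variation in the data yet, nothing to compare
    return abs(conv_a - conv_b) / n / se > z_crit


def simulate(n_tests=400, n_per_group=2000, n_peeks=10, rate=0.05, seed=1):
    """Return (false positive rate with peeking, rate when testing only at the end)."""
    rng = random.Random(seed)
    step = max(1, n_per_group // n_peeks)
    peeking_hits = 0
    final_hits = 0
    for _ in range(n_tests):
        conv_a = conv_b = 0
        any_peek_significant = False
        final_significant = False
        for n in range(1, n_per_group + 1):
            # Both groups convert at the same true rate: there is no real winner.
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            at_end = n == n_per_group
            if n % step == 0 or at_end:
                significant = is_significant(conv_a, conv_b, n)
                any_peek_significant = any_peek_significant or significant
                if at_end:
                    final_significant = significant
        peeking_hits += any_peek_significant  # would have stopped at some peek
        final_hits += final_significant       # waited for the full sample
    return peeking_hits / n_tests, final_hits / n_tests


peeking_rate, final_rate = simulate()
print(f"false positives when acting on any peek: {peeking_rate:.1%}")
print(f"false positives when waiting until the end: {final_rate:.1%}")
```

Because the final look is itself one of the peeks, the peeking rate can never come out lower than the wait-until-the-end rate; the question is only how much higher it gets.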
Ok, that’s it for today! Next we’ll talk about evaluating your results and, even more importantly, learning from them!
1 as well as your tolerance for the probability of getting a false positive and false negative, though using standard values can take some of the difficulty of this decision making away
2 if you don’t have a pre-send checklist (we prefer old-fashioned paper, big check boxes, and sharpies!), you should make one ASAP
3 A way of passing messages from the URL back to the website which it can use to customize the display or data recorded.
4 Now, this is generally the most basic possible insurance for proper group setup, as most tools will do nothing more than randomize and divide. There are other steps that should be taken for running anything more complex than a simple A/B test, and those steps tend to be best done with a statistical tool such as R. If you think something more complicated is right for your program, don’t hesitate to shoot us an email (digital [at] actblue [dot] com). We’d love to work with you to see if something more complicated is in order, and if so, we’d be glad to help.
5 Saying “definitely” in a conversation about statistics is, if also delightfully ironic, a bit misleading. This is actually a really complicated topic with plenty of proffered solutions, which range from minor adjustments in your calculations to an entirely different philosophical approach to statistics (I mean, who knew, right?). Those are all great discussions to have, but for now, it’s probably best to just assume you shouldn’t repeatedly evaluate your test variations before you hit your required sample size. Ok? Cool.