Email Testing: What works (and what doesn’t) when deciding on fundraising ask amounts?

We’ve always used the ActBlue blog as a place to report back on our work and let everyone who’s part of the larger ActBlue community know about the huge impact their organizing work and their donations have on the progressive movement. Primarily, that means sharing stats and insights into the donations that run through our platform. But as a nonprofit, we also put a lot of resources into running our own email fundraising program, just like so many of the groups using ActBlue to fundraise. And we use our tools to do it!

During this election season, which is already proving to be busier than ever, we want to give everyone using ActBlue as much info and data as possible to help guide your strategies. That’s why we’re sharing two tests on the way we generate ask amounts for our ActBlue fundraising emails (and why we plan to keep sharing relevant test results throughout this year). If you have questions or thoughts on these tests, feel free to comment on this post. We hope this gets the gears turning on some new tests for the other fundraising programs out there!


For our email fundraising, we have been re-evaluating how we use a donor’s previous contribution history to tailor the ask amounts in our emails. We use ActBlue Express Lane for all of our fundraising emails, which allows ActBlue Express users to give simply by clicking a button in an email. Over the years, we’ve optimized the number of ask amounts we include in an email, as well as the range and distribution of those amounts.

 

express_box

 

We’ve also tested different sets of ask amounts for donors who tend to give different amounts. For example, the buttons for a donor who tends to give $30 once a year will be tailored to that amount and will look different from the ones a donor who regularly gives us $3 contributions will see.

To determine those asks, we typically look at our donors’ previous contribution history and use their highest previous contribution amount (HPC) to decide which set of ask amounts to use. If a donor’s HPC was below $25, they would be grouped into the segment that saw the first set of buttons (top) in the image below, whereas if their HPC was between $25 and $50, they would be in the group that saw the middle set of ask amounts.

ask_amounts
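If you’re curious what that kind of segmentation looks like in practice, here’s a minimal sketch in R. The $25 and $50 cutoffs come from the example above, but the donor data and the button amounts in each set are invented for illustration (and the exact boundary handling is approximate):

# Hypothetical donor list with each donor's highest previous contribution (HPC)
donors <- data.frame(email = c("a@example.com", "b@example.com", "c@example.com"),
                     hpc   = c(12, 35, 120))

# Bucket donors into segments using the HPC cutoffs described above
donors$segment <- cut(donors$hpc, breaks = c(0, 25, 50, Inf),
                      labels = c("under_25", "25_to_50", "over_50"))

# Each segment gets its own set of ask amounts (these button values are made up)
ask_sets <- list(under_25   = c(3, 8, 17, 25),
                 "25_to_50" = c(10, 25, 35, 50),
                 over_50    = c(25, 50, 100, 250))

ask_sets[[as.character(donors$segment[1])]]   # the buttons donor 1 would see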

 

We fine-tuned these default ask amounts for each group of HPC amounts over many email tests until we saw the gains in our conversion rates plateau. This made us wonder if the conversion rate we had reached based on testing HPC amounts was the highest rate we could achieve, or whether we could reach a higher peak by starting with a different premise. With about six months left in the election cycle, we’ve been focused on maximizing recurring sign-ups rather than one-time contributions, so we started to think about alternatives to HPC that were better tailored towards action rates and recurring signups.

First we tested average previous contributions (APC) for donors who have given more than one contribution. Our hypothesis was that a donor would be more likely to take action when given ask amounts tailored towards the average of their previous contributions rather than their highest contribution amount. We tested this hypothesis by splitting our donors and assigning half of them ask amounts based on their HPC and half based on their APC. We used the same ask amounts we typically use with both of these newly sorted groups. For our list, this proved to be an effective method — we increased the action rate by 20% and the total amount raised by over 4%. We successfully raised more money by getting more contributors to give smaller amounts!
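Here’s a rough sketch of that split in R, assuming you’ve already pulled each donor’s contribution history into a data frame. The column names and amounts below are placeholders, not what our actual data looks like:

library(dplyr)

# Hypothetical contribution history: one row per contribution
history <- data.frame(email  = c("a@example.com", "a@example.com", "b@example.com", "b@example.com"),
                      amount = c(5, 25, 10, 10))

per_donor <- history %>%
  group_by(email) %>%
  summarise(hpc = max(amount), apc = mean(amount), n_gifts = n()) %>%
  filter(n_gifts > 1)                      # APC only applies to donors with more than one gift

set.seed(42)                               # random 50/50 assignment to the two test groups
per_donor$group <- sample(c("HPC", "APC"), nrow(per_donor), replace = TRUE)
per_donor$ask_basis <- ifelse(per_donor$group == "HPC", per_donor$hpc, per_donor$apc)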

The increase in our action rate indicates our ask amounts based on HPC may have been too high. Donors may be inspired by your ask but not always able to match their own highest contribution. By lowering our ask amounts, we kept more donors engaged and decreased the likelihood they would be alienated by receiving ask amounts greater than their ability or willingness to contribute, while still raising more overall.

From this experiment, we’ve continued to build on our results. We are now working on optimizing ask amounts based on our donors’ APC and have also tested ask amounts based on the type of contribution a donor has given previously. After all, if your goal is to maximize recurring signups, why base your ask on the highest previous contribution when that is likely to reflect a one-time contribution rather than a recurring amount? It’s also in line with what we know about small-dollar online giving — people want to give when they’re moved to support your cause, and that is often at a lower level multiple times throughout a cycle. That’s a useful lesson for smaller campaigns and organizations that don’t have the list size to run an involved test like this.

To improve recurring signups, we designed a test that was a bit more complex. We took all of our donors and first split them into three groups: those who had only given one-time contributions, those who had only given recurring contributions, and those who had given both kinds. We then took the group of donors who had given both kinds of contributions and split them in two again: half were assigned ask amounts based on their HPC and half received amounts based on their highest previous recurring contribution (HPR).

ask_amounts_2

 

The impact? We raised nearly twice as much from donors in the HPR group as from those in the HPC group. We found that donors were far more likely to give recurring contributions when their ask amounts were tailored specifically to their recurring contribution amounts.

The results of these tests reinforced a few things: first, how critical it is to question our basic assumptions about email practices in order to keep building on our results; and second, how important it is to think about the kind of program you want to build. Do you really just want to maximize a single donation from each person? Or are you more focused on building a robust, sustainable recurring program? Is your campaign in the news a lot, giving you plenty of opportunities to re-engage people who have already donated? All of these factors should play into how you think about your ask amounts.

If you’re an admin at a campaign or organization, you can conduct these tests too! We recommend starting by setting up a webhook to collect and analyze more information on your group’s contribution history. If you’re running a smaller campaign and building your list, you may not have enough data yet to test based on your donors’ contribution history. If that’s the case, we recommend using our Smart Recurring tool to ask one-time donors to add a smaller recurring contribution. This will help you build a robust, small-dollar program and provide more insight into how your donors respond to different ask amounts!
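As an illustration of the webhook idea mentioned above, here’s a very minimal receiver written with the R plumber package. This is only a sketch: the endpoint path and the idea of logging raw payloads to a local file are our own choices for the example, and you should check the webhook documentation for the actual payload fields before building anything on top of it.

# Save as webhook.R, then run with: plumber::plumb("webhook.R")$run(port = 8000)

#* Receive contribution notifications and append them to a local log
#* @post /contributions
function(req) {
  # req$postBody holds the raw JSON sent to the webhook
  write(req$postBody, file = "contributions.jsonl", append = TRUE)
  list(status = "ok")
}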

If you have any thoughts or questions about setting up a test like this, please reach out to stern@actblue.com!

A big mobile boost from one simple test

We run optimization tests on our forms constantly, but it’s rare that we see winners as big as our recent mobile test.

At this point, around 40% of all donations are made on mobile devices. Given the massive transition in traffic from desktop to mobile, anything we can do to keep that number going up is a win.

In this test, when donors landed on a form, some were moved directly to the point on a form where they’d choose an amount to give — skipping past the contribution blurb they would normally read first, as you can see below. Others got the normal form with the headline and blurb up top.

form

 

The forms that skipped over the blurb won by a landslide: the variation produced a statistically significant 5.2% increase in conversions (p < 0.05). It was such a big winner that we automatically rolled it out sitewide.

Most of our tests see a tiny bump, but any increase in conversions makes a big difference when we’re running tens of millions of contributions through our forms each month. And a 5.2% bump? That’s real money.

There were more than 250k new mobile contributions made sitewide in July, with an average contribution size of about $30. That means our mobile form test resulted in about 400k extra dollars. Like I said, real money.
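For anyone who wants to check the back-of-the-envelope math:

250000 * 30 * 0.052   # ~250k July mobile contributions x ~$30 average x a 5.2% lift, or roughly $390,000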

We had previously tested a contribution form without a blurb with our email list, on both desktop and mobile, and the results were abysmal. That’s why this time we ran a sitewide test with mobile donors specifically.

By nature, our mobile users are looking for ways to save time and make giving more convenient when they’re making contributions. And that’s exactly what this switch gives them.

Testing those beautiful brandings

This is the one you’ve all been asking for. We just added the ability to A/B test your brandings!

Now there’s no need to guess whether a giant photo of your candidate or a simple branding will fare better. And with our awesome multi-armed bandit A/B testing suite, you’ll get results as fast as possible.

To set up a branding test, create or open up a contribution form. Navigate to the A/B test tab and give your test a name. Next, check off “Branded layout.” Your first variation will be the branding your form is currently set to (or your default branding). You can choose your variation to test from the dropdown menu. You can also test having no branding by selecting “ActBlue default.” If you need a refresher on creating a branding, check out our guide.

Once you’ve made your selection, click “Create test.” You can monitor the results of your test once you start sending it out to donors.

Unlike a regular A/B test, our multi-armed bandit system of A/B testing will start sending more traffic to the winning version of the test as results come in, so you’ll get results faster and won’t lose money on a subpar variation. And there’s no need to pick a winner — the multi-armed bandit will take care of it for you!

If you have questions about setting up a beautiful branding or an A/B test, let us know at info@actblue.com.

One-stop shop: Visualize your data and create targeted lists

At ActBlue we’re committed to dreaming up new ways for you to visualize all of your data. In that spirit, our tech team just rolled out a new way for campaigns and organizations to analyze contributor data by tracking HPCs (highest previous contributions) and total contribution amounts by individual donors.

This new visualization provides amazing insight not just for technically skilled data people, but also for smaller campaigns and organizations. The chart allows users to select groups of donors based on their donation history and download their email address and contributor information. That means teams that don’t currently have the ability to segment and target donors can now instantaneously create a segmented email list!

Here’s how it works:

If you navigate to the “Donors” tab of your Dashboard (previously called the “Uniques” tab), you’ll see a graph that looks something like this:

You can also select “Total Contribution Amount” in the “Show your donors by” menu to see this:

You can toggle between the HPC and total contribution views of the graphed data at the top. This chart uses a log scale for both its x- and y-axes. A log scale increases by orders of magnitude rather than by a fixed amount, which allows us to present a clearer picture of your data. What does that mean? If you look at the x-axis, you’ll see there’s more space between 1 contribution and 2 contributions than there is between 8 and 9. There are far more people who gave just 1 contribution, but on a regular graph all those dots would be stacked on top of each other. So, by choosing a log scale, we’re able to show you more of your actual data.

The x-axis shows the total number of contributions a donor has made to your campaign or organization in their lifetime on ActBlue. The y-axis shows the donor’s highest previous contribution, or the total amount that they’ve donated, depending on which view of the chart you are looking at.

Values on the y-axis are rounded. For values from $1 to $5, amounts are rounded to the nearest dollar. From $5 to $25, they are rounded to the nearest $5, and from there on, values in the tens are rounded to the nearest ten, hundreds to the nearest hundred, and so on.*

For values on the x-axis, contribution counts above ten are rounded to the nearest ten. It’s unlikely that you have donors with contribution counts in the hundreds, but if you do, those are rounded to the nearest hundred.
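If it helps to see that rounding spelled out, here’s one way to express the y-axis scheme in R. This is our own reading of the rule described above rather than the exact code behind the chart, and boundary cases may differ slightly:

round_dollar <- function(x) {
  if (x <= 5)  return(round(x))            # $1-$5: nearest dollar
  if (x <= 25) return(round(x / 5) * 5)    # $5-$25: nearest $5
  step <- 10 ^ floor(log10(x))             # tens to the nearest ten, hundreds to the nearest hundred, ...
  round(x / step) * step
}
sapply(c(3.6, 17, 42, 730), round_dollar)  # 4 15 40 700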

The graph itself gives you insight into highest previous contributions across your entire donor base, along with information on how many donations people have made. But it doesn’t end there. Click a dollar value or a number of donations to highlight a row or column. You can then switch to the other visualization to see how that group’s total volume corresponds.

More importantly, you can download a CSV of the email addresses and the corresponding contribution data from a column, row, or selected range of the graph. To select a column or row, just click on the corresponding value and click “Download selected” in the upper right hand corner. 

To select a custom set of data, you can drag your mouse to draw a box around your desired values and then download the data.

This allows you to do some pretty sophisticated targeting without needing to do the backend work. You can easily target donors based on their highest previous contribution and frequency of donating without knowing a line of SQL.

For example, we’ve seen a lot of success in our program by targeting donors based on their HPC. For low-dollar donors, we’ll ask them for $5, while higher-dollar donors are asked for $10 or $15. With this chart, you could download a list of $3 and $5 donors and then send a personalized ask to that group. If you have a big enough email list, you could try sending a $5 ask and a $7 ask, to see if donors would be willing to give just a bit more.
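Once you’ve downloaded a list like that, setting up the split is straightforward. Here’s a sketch in R; the file name and the ask amounts are just placeholders for whatever your export and test plan look like:

low_dollar <- read.csv("hpc_3_and_5_donors.csv")                      # the CSV downloaded from the Donors tab
set.seed(7)
low_dollar$ask <- sample(c(5, 7), nrow(low_dollar), replace = TRUE)   # randomly assign a $5 or $7 ask
write.csv(subset(low_dollar, ask == 5), "ask_5_group.csv", row.names = FALSE)
write.csv(subset(low_dollar, ask == 7), "ask_7_group.csv", row.names = FALSE)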

You can also toggle the graph to show outliers (people who fall outside the scale of the graph), if you’re interested in targeting those donors.

We hope that this new tool will allow you to get to know your donors in a more nuanced way and run an even better email program.

If you have questions about applications or how to read the graph, we’re happy to answer them. Just drop us a line at info@actblue.com.


*We chose this rounding scheme to simultaneously maximize the granularity of useful information and minimize unimportant visual clutter.

The Multi-armed Bandit: New and much improved A/B testing tools

The A/B test tool on ActBlue, which allows you to test out contribution form titles and pitches, among other variables, has gotten a significant upgrade, just in time for campaign season.

The old A/B testing tool worked great, but it also forced you to wait around for both test variations to get enough traffic to gain statistical significance. If one version was performing way better than the second one, that meant you were losing out on potential contributions in order to gain valuable insight.

This is how most A/B testing tools work, and it’s a good system. But with the new ActBlue testing tools, which use a more advanced statistical algorithm than typical A/B testing, you can still achieve statistical significance without having to sacrifice a ton of traffic to a losing form.

As the test runs and one variation begins performing better, we’ll start sending more traffic to that form, roughly in proportion to how they’re trending. You can see the traffic allocation listed just above each variation on the “A/B Test” tab of your contribution form. The traffic allocation will change continuously as donations come in. It’s important to note that if a variation is receiving 75% of the traffic, that does not necessarily mean its conversion rate is 3X as high as the other variation(s). If you’re curious what it actually does mean and want to talk complicated stats, you can get in touch with us here.
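If you want an intuition in the meantime for how a bandit shifts traffic, here’s a toy Thompson-sampling sketch in R. To be clear, this is a generic illustration of the technique rather than our production code, and the conversion counts are made up:

# Conversions and non-conversions observed so far for two variations (invented numbers)
conversions <- c(A = 40, B = 55)
misses      <- c(A = 960, B = 945)

# Draw from each variation's Beta posterior and see how often each one "wins" the draw
draws <- 10000
a <- rbeta(draws, 1 + conversions["A"], 1 + misses["A"])
b <- rbeta(draws, 1 + conversions["B"], 1 + misses["B"])

mean(b > a)   # the share of traffic a Thompson-sampling bandit would send to variation B right now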

If there’s a false positive and the losing form starts doing better, the traffic allocation will begin to reverse. The test will continue to run indefinitely until you click “Make Winner.” The A/B testing tool will eventually send 100% of volume to the winner if you don’t make either version the winner manually.

The new A/B testing tool makes your tests more efficient, which means you can try out more of them. If you have radically different language you want to try on a form, alongside three more standard pitches, there’s little risk. If it doesn’t work out, we’ll send fewer and fewer people to that losing form.

We wanted to give special thanks to Jim Pugh from ShareProgress for sharing notes on the multi-armed bandit method used in their software and helping us out with building this tool (and for hanging out in the ActBlue office for a week)!

As always, let us know what tests you’re running and what’s working for you at info@actblue.com!

Losing is awesome

Here at ActBlue, we’re always optimizing our contribution form by testing different variations against each other to see which performs best. And, whenever possible, we like to share our results. Needless to say, it’s great to discuss tests that end up winning; every percentage point increase in conversion rate we bring to our contribution form benefits every committee — of which there are currently over 11,000 active — that fundraises on ActBlue.

An equally important part of this process, however, is the tests that fail to bring about a positive change to our contribution form. Failing to openly discuss and reflect upon losing tests belies the experimental nature of optimization. Thus, I’m here to talk about an A/B test that we just ran on our contribution form that lost. (Bonus: it lost twice!)

We tried coalescing our “First name” and “Last name” fields into one “Full name” input. The theory was that one fewer input would reduce friction along the contribution path, thereby increasing conversions. Here’s what it looked like:

Control
Variation

The control version, it turns out, was actually associated with a higher conversion rate than the “Full name” variation, though the difference was not statistically significant.1 We even tested another slight variation of the “Full name” field with slightly different placeholder text and a more expressive label, but it lost again.

If you’re wondering why it lost, that makes two of us; in a case like this, it’s tough to say what actually happened. Was it aesthetics? An anti-novelty effect? If we speculate like this ad infinitum, we’ll end up with more questions than answers — the world is full of uncertainty, after all. But that’s no reason to avoid this kind of reflection; quite the opposite. Speculation like this is the origin story of many new testing ideas.

Footnotes:

1: Pr(>|t|) > .05, n = 63159

Tandem Contribution Forms Just Got WAY Better

Our team is always thinking through ways to make our contribution forms easier to fill out and more streamlined. When donors face too many options, they can freeze up and abandon a form; that’s known as choice paralysis. Eliminating that choice paralysis is a big part of building better contribution forms.

Tandem contribution forms list multiple candidates, which means donors have more decisions to make. But the vast majority of people choose to just split their contribution evenly among all the candidates on the form. That used to look like this:

Too many options and too many boxes for our liking. Do you want to give more to candidate A than to organization B? How much do you want to give in total?

We boiled the form down to that last question — how much do you want to give? This made it a lot easier for donors to give (spoiler alert: this A/B test was a huge success).

Now, when you land on a tandem form, you’ll see the normal amount buttons with a note underneath saying who the donation will be split among. You can still click a button to allocate different amounts to each candidate, but donors are less overwhelmed when they land on the page.

Here’s the new form:

So how successful was our A/B test? We saw a 7.16% overall improvement in conversion. That’s unheard-of-huge. We’ve done so many optimizations of our forms that we cheer for a test that leads to a 0.5% increase in conversions.

Part of that overall group consisted of non-Express users (people who haven’t saved their payment information with us) who land on our traditional multi-step form. Among that group we saw a 26% improvement in getting people to move from the first step of the process (choosing an amount to give) to the second step (entering their information).

There are so many candidates and organizations running really thoughtful tandem fundraising campaigns, and this is going to mean a huge bump for them. If you have questions, or want to tell us about a tandem campaign you’ve run, let us know at info AT actblue DOT com. We want to hear from you!

It’s crunch time so optimize those weekly recurring asks!

We’re fewer than six weeks from the election. That means, among other things, that optimal fundraising strategies become even more important than usual. Here at ActBlue, we’ve been running tests on a nearly daily basis on all kinds of Express Lane strategies.

Typically, we see the largest (statistically significant) improvements when optimizing factors related to the Express Lane askblock structure, like the amounts, the number of links, and the intervals between the links. For our own list, we find that, statistically speaking, the flashier aspects you see in some fundraising emails — emojis in subject lines, for example — do not do much (if anything) to improve donation outcomes. Here’s a tactic we recently tested, though, that’s a bit more on the fun side of things and definitely brought in a lot more money.

A little while ago, we started using our weekly recurring feature to great success. (By the way, if you haven’t tried this feature yet, shoot us an email at info [at] actblue [dot] com and we’ll turn it on for you.) After testing which amounts brought in the most money, we landed on this1:

We wanted to see if we could raise more money by asking for “$7 because there are 7 weeks until the election!” Gimmicky? Sure, but we had a hunch that it would perform well.2 Here’s what it looked like:

So what happened? The segment with the ‘7 for 7’ ask performed much better than the control; it brought in 87.6% more money, a statistically and practically significant improvement.3 Cool!

What’ll be interesting to me is seeing when this tactic loses its optimality. The key factor is that $7 (with the gimmick) performed better than $10 (the control and previously optimal ask amount) despite being a lower dollar amount. At some point, though, a too-low combination of weeks-to-election and dollar ask amount will negate the positive ceteris paribus effect of the gimmick. Based on other testing we’ve done, my guess is that that point will be around 4 weeks and $4. We’re doing follow-up testing on this “n weeks until the election!” tactic, so we’ll see!

If you decide to test something similar, send me an email and we can chat! Emails to info [at] actblue [dot] com with my name in the subject line will be directed to me.

P.S. Doing a lot of testing in the election run-up? Want a tool to help you manage your test groups? I wrote something in R for you! I’ll post something on the blog about it soon, but if you want it in the meantime, shoot me a note (emails to info [at] actblue [dot] com with my name in the subject line will be directed to me).

FOOTNOTES:

1 Actually, we built a model that predicts how a given Express user will respond to different types of donation requests based on previous donation information. Using those predicted values, we decide what type of donation ask they receive (of one-time, weekly recurring, monthly recurring) and for how much money they are asked. Math! The point: this is what we landed on for a certain subset of our list.

2 Of course, all else equal, it’s tough to distinguish whether any difference was due to the gimmick or because $7 is lower than $10. The theory would be that with a lower amount, more people would give, and even though the mean donation amount would likely be lower, the increase in number of donors would outweigh the decrease in mean donation size. This is definitely possible, but so is the opposite; it’s all about finding the optimal point.

In fact, we included a segment in the test which received an askblock starting with a lower amount and saw this dynamic in action, though the overall treatment effect was not statistically significantly different from the control. This lends support to interpreting the effect in the gimmick segment as the gimmick per se, but a detailed discussion is excluded from the body of the post for the sake of brevity. More rigorous follow-up testing on this “n weeks until the election!” tactic is already in the field — shoot us an email to chat!

3 Pr(>|t|) < .01, controlling for other significant factors, including previous donation history.

Weekly recurring is back baby!

We’re less than 8 weeks out from Election Day and are now making the weekly recurring feature available to campaigns and organizations. Just drop us a line at info [AT] actblue [DOT] com and we’ll turn it on for you.

Yep, weekly recurring is exactly what it sounds like. You can ask your donors to sign up to make a recurring contribution that processes on that same day of the week every week until Election Day. After Election Day, the recurring contribution automatically ends.

So, if you get someone to sign up today for a weekly recurring contribution, they’d then have 7 more contributions scheduled to process every Friday.

Election Day is getting closer and closer though, so if you’re going to use weekly recurring, we suggest getting started soon.

Once we turn on the feature for you, create a new contribution form and open the “Show recurring options” section in the edit tab. You will see a new option there for weekly recurring. Make sure you also turn off popup recurring if you have it enabled — these two features aren’t compatible (yet!).

It looks like this:

We’ve run a few tests on weekly recurring this week with our own email list and have had a good deal of success. As always, a donor needs to know exactly what amount and for how long they’ll be charged before they click a link. If you’re going to use weekly recurring with Express Lane (and you should!), here is the disclaimer language we used and recommend you use as well:

Based on our testing, certain segments of your list will respond better than others to a weekly recurring ask (not exactly a shocking revelation). We sort our list into those likely to give to a recurring ask and those who are more likely to give a one-time gift. For the recurring pool, the weekly ask has been performing strongly. Unsurprisingly, the same can’t be said for our one-time folks.

Test it out with the portion of your list that is more likely to give recurring gifts. And try fun things, like offering a small package of swag (bumper stickers, for example) in return for signing up for a weekly recurring gift.

And if you find an angle that’s working really well for weekly recurring, let us know!

Testing Basics: Let’s talk sample size

This post is the third in our blog series on testing for digital organizers. Today I’ll talk a bit about what an A/B test is and explain how to determine the sample size (definition below) you’ll need to conduct one.

Hey, pop quiz! Is 15% greater than 14%?

My answer is “well, kind of.” To see what I mean, let’s look at an example.

Let’s say you have two elevators, and one person at a time enters each elevator for a ride. After 100 people ride each elevator, you find that 15 people sneezed in elevator 1, and 14 people sneezed in elevator 2.

Clearly, a higher percentage of people sneezed in elevator 1 than elevator 2, but can you conclude with any certainty that elevator 1 is more likely to induce sneezing in its passengers? Or, perhaps, was the difference simply due to random chance?

In this contrived example, you could make a pretty good case for random chance just with common sense, but the real world is ambiguous, so decisions can be trickier. Fortunately, some basic statistical methods can help us make these judgments.
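In fact, you can ask R (the same tool we’ll use in the appendix) to weigh in on the elevator example with a standard two-sample proportion test:

prop.test(x = c(15, 14), n = c(100, 100))   # 15 of 100 sneezed in elevator 1, 14 of 100 in elevator 2

The p-value in the printout comes out far above the conventional 0.05 cutoff, which is a formal way of saying we can’t rule out random chance here.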

One specific type of test for determining differences in proportions1 is commonly called an A/B test. I’ll give a simple overview of the concepts involved and include a technical appendix for instructions on how to perform the procedures I discuss.

Let’s recall what we already said: we can perform a statistical test to help us detect a difference (or lack thereof) between the action rates in two samples. So, what’s involved?

I’ll skip over the nitty-gritty statistics of this, but it’s generally true that as the number of trials2 increases, it becomes easier to tell whether the difference (if there’s any difference at all) between the two variations’ proportions is likely to be real, or just due to random chance. Or, slightly more accurately, as the number of trials increases, smaller differences between the variations can be more reliably detected.

What I’m describing is actually something you’ve probably already heard about: sample size. For example, if we have two versions of language on our contribution form, how many people do we need to have land on each variation of the contribution form to reliably detect a difference (and, consequently, decide which version is statistically “better” to use going forward)? That number is the sample size.

To determine the number of people you’ll need, there are a few closely related concepts (which I explain in the appendix), but for now, we’ll keep it simple. The basic idea is that as the percent difference between variations you wish to reliably detect decreases, the sample size you’ll need increases. So, if you want to detect a relatively small (say, 5%) difference between two variations, you’ll need a larger sample size than if you wanted to be able to detect a 10% difference.

How do you know the percent difference you’d like to be able to detect? Well, a good rule of thumb to start with is that if it’s a really important change (like, say, changing the signup flow on your website), you’d want to be able to detect really small changes, whereas for something less important, you’d be satisfied with a somewhat larger change (and therefore less costly test).

Here’s what that looks like:

Sample Size Graph
Required sample size varies by the base action rate and percent difference you want to be able to reliably detect. Notice the trends: as either of those factors increases, holding all else equal, the sample size decreases.

For example, if you’re testing two versions of your contribution form language to see which has a higher conversion rate, your typical conversion rate is 20%, and you want to be able to detect a difference of around 5%, you’d need about 26k people in each group.

For instructions on how to find that number, see the appendix below. Once you have determined your required sample size, you’ll be ready to set up your groups and variations, run the test, and evaluate the results of your test. Each of those will be upcoming posts in this series. For now, feel free to email info [at] actblue [dot] com with any questions!

Footnotes:
1 Note that this should be taken strictly as “proportions”. Of course, there are many things to be interested in other than the percentage of people who did an action vs. didn’t (e.g., donated vs. didn’t donate), like values of actions (e.g., contribution amounts), but for now, we’ll stick to the former.
2 I.e., the number of times something happens. For example, this could be the number of times someone reaches a contribution form.

Appendix:

Statistics is a big and sometimes complicated world, so I won’t explain this in too much detail. There are many classes and books that will dive into the specifics, but I want you to have a working knowledge of a few important concepts you’ll need to complete an accurate A/B test. I’m going to outline four closely related concepts necessary for determining your sample size, and walk through how to find this number. Even though I’m sticking to the basics, this section will be a bit on the technical side of things. Feel free to shoot an email our way with any questions; I’m more than happy to answer any and all.

Like I said, there are four closely related concepts when it comes to this type of statistical test: significance level, power, effect size, and sample size. I’ll talk about each of these in turn, and while I do, remember that our goal is to determine whether we can reject the assumption that the two versions are equal (or, in layman’s terms, figure out that there is a real statistical difference between the two versions).

Significance level can be thought of as the (hopefully small) likelihood of a false positive. Specifically, the probability that you falsely reject the assumption that the two versions are equal (i.e., claim that one version is actually better than the other, even if it’s not.) When you hear someone talk about a p-value, they’re referencing this concept. The most commonly used significance level is 0.05, which is akin to saying “there’s a 5% chance that I claim a real difference, but there’s actually not”.

Power is the probability that you’ll avoid a false negative. Or, said another way, the probability that if there’s a real difference there, you’ll detect it. The standard value to use for this is 0.8, meaning there is an 80% chance you’ll detect it, though there are really good reasons for adjusting this value. 0.8 is by no means always the best value to choose for power; it’s generally a good idea to change it if you know exactly why you’re doing what you’re doing. 0.8 will work for our purposes, though. Why not just pick a value of .9999, which is similar to saying “if there’s a real difference, there’s a 99.99% chance that I’ll detect it”? Well, that would be nice, but as you increase this value, the required sample size increases. And sample size is likely to be the limiting factor for an organization with a small (say, fewer-than-100k-member) list.

Effect Size. Of the two versions you’re testing against each other, typically you’d call one the ‘control’ and the other the ‘treatment’, so we’ll use those terms. Effect size is saying, what do you expect the proportion of actions (e.g., contributions) to be for the control, and what do you expect it to be for the treatment? The percent difference is the effect size. How this affects sample size is demonstrated in the above graph. But the whole point of running this test is that you don’t know what the two proportions will be in advance, so how can you pick those values? Well, actually, you estimate what your base action rate will be. For example, if your donation rate from an email is typically 5%, then you can use that as your base action rate. Then, for the second proportion, pick the smallest difference you’d like to be able to detect. Similarly to power, you might find yourself asking “well why wouldn’t I just pick the smallest possible difference?”. Again, the answer is that as you decrease the magnitude of the difference, the sample size you need will increase.

Finally, we have sample size, or the number of people we need to run the test on. If we have values for the above three things, we can figure out how big of a sample we need!

So how do we do that? Well, there are many ways to do it, but one of the easiest, best, and most accessible is R. It’s free, open-source, and has an excellent community for support (which really helps as you’re learning). Some might ask, “well that has a relatively high learning curve, doesn’t it? And, isn’t there some easier way to do this?” The answer to both of those questions is “maybe,” but I’ll give you everything you need in this blog post. There are also online calculators of varying quality that you can use, but R is really your best bet, no matter your tech level.

Doing this in R is actually pretty simple (and you’ll pick up another new skill!). After you download, install, and open R, enter the following command:

power.prop.test(p1 = 0.1, p2 = 0.105, sig.level = 0.05, power = 0.8)

and press enter. You’ll see a printout with a bunch of information, but you’re concerned with n. In this example, it’s about 58k. That number is the sample size you’d need in each group to detect, in this case, a 5% difference at a significance level of 0.05, a power of 0.8, and a base action rate of 10%. So, just to be certain we’re on the same page, a quick explanation:
p1: Your ‘base action rate’, or the value you’d expect for the rate you’re testing. If your donation rate is usually 8%, then p1 = 0.08
p2: Your base action rate plus the smallest percent difference you’d like to be able to detect. If you only care about noticing a 10% difference, and your ‘base action rate’ is 8%, then p2 = 0.088 (0.08 + (0.08 * 0.10))

Of course, your base action rate will likely be different, as will be the percent difference you’d like to be able to detect. So, substitute those values in, and you’re all set! Playing around with different values for these can help you gain a more intuitive sense of what happens to the required sample size as you alter certain factors.
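For example, to reproduce the figure from the body of this post (a 20% base action rate and the ability to detect a 5% relative difference, so p2 = 0.21):

power.prop.test(p1 = 0.20, p2 = 0.21, sig.level = 0.05, power = 0.8)   # n comes out to roughly 26k per group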