Author Archives: Ross Martin

This post is the fourth in our blog series on testing for digital organizers. Today I’ll be talking about implementing your A/B test. This post will be full of helpful, quick tips.

So, we’ve discussed some things you might want to test, and some other things you might not want to test. Then, we walked through a simple way to figure out the number of people you’ll need in each of your test groups, which number depends on the smallest difference you’d like to reliably detect.1 Now what?

Well, the short answer is “run the test”, but of course it’s never that simple. Your next specific steps depend on what you’re testing, as well as which platform you’re using to run the test. There are too many possibilities for me to go through each one, but I can provide a few quick tips that should apply to you regardless of your specific situation.

First, make sure you have a reliable method of tracking your variations’ performance (like reference codes or an A/B testing tool (here are instructions for using ours)), and make sure you actually implement that method. This may sound like a no-brainer, but we’ve seen plenty of people start what would otherwise be an excellently set-up test with nothing to measure the variations’ relative performance! Is there a joke here about the “results” of the test?

Groaners aside, pointing out that error isn’t at all to make fun of the people who have committed it. Rather, we’re all busy, and things can get hectic. Having this on your pre-send checklist2 will save you from the realization that a lot of time spent thinking up a test, creating the content, and so on ad nauseam was all for naught.

What’s an example? Well, say you’re testing email content for donations. And of course, you want to use the best online fundraising software in the whole wide world, so you’re using ActBlue. Well, we have a handy feature that allows you to generate reference codes to track donations. We have a full instruction guide for using reference codes on our tutorial, found here. If you’re testing two different versions of your email, you could attach the URL param3 refcode=variation_a to the links in your first email and refcode=variation_b to those in your second email. Then, when you go to https://actblue.com/pages/[YOUR_PAGE_NAME]/statistics, you can measure the performance of each email. The information will also appear in a .csv download of your contribution form donations.

We also allow a handy refcode2 URL param if you want to conveniently subdivide your tracking. Conceptually, it’s the exact same thing as refcode; its value lies in the fact that it’s an extra place to store information. Think of a backpack with an extra divider on the inside for sorting your stuff. This is the internet version of that. For example, we use this for tracking link placement in the email. The need for refcode2, however, indicates that your test might be a bit complicated (i.e., there are more than just two variations), so setup and evaluation of the test are a bit outside the scope of the tips in this testing series. That’s no problem, but you might want to shoot us an email at digital [at] actblue [dot] com to have a chat about test setup and design.
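If you build a lot of links, tagging them by hand gets error-prone. Here is a minimal Python sketch of attaching refcode (and, optionally, refcode2) to a link. The form URL and the refcode2 value "top_link" are made-up placeholders, and the helper name add_refcodes is ours for illustration; only the refcode and refcode2 parameter names come from the discussion above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_refcodes(url, refcode, refcode2=None):
    """Return url with refcode (and optionally refcode2) set as URL params."""
    parts = urlsplit(url)
    # Keep any params already on the link, but drop old refcode values
    # so the same param never appears twice.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ("refcode", "refcode2")
    ]
    params.append(("refcode", refcode))
    if refcode2 is not None:
        params.append(("refcode2", refcode2))
    return urlunsplit(parts._replace(query=urlencode(params)))


# Hypothetical placeholder; substitute the link to your own contribution form.
FORM_URL = "https://example.com/your-contribution-form"

link_a = add_refcodes(FORM_URL, "variation_a", refcode2="top_link")
link_b = add_refcodes(FORM_URL, "variation_b", refcode2="top_link")
print(link_a)
print(link_b)
```

Each email then gets its own link, and the two refcode values are what you would compare when you look at the results.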

My second tip is related to groups. Taking your list—or some subset of your list—and dividing it up into smaller, randomized groups is a step that you’ll likely do in your CRM or email tool. Unfortunately, I can’t provide detailed instructions for each one. Chances are, though, that your CRM has an instruction page on how to do this within their software.4 In any case, this step is critical: without at least randomizing before conducting your trial, you’re setting yourself up for failure.

Here’s an example of how to do it wrong: let’s say you’re testing two emails, and even though you’re not sure which one is better, you have a hunch that email B is better than email A. So, not wanting to lose out on money, you decide to assign 20,000 people with the highest previous donations to group B and 20,000 people with the lowest previous donations to group A. That way, you can conduct the test to find out which email is definitely better, but not have to lose too much money along the way, right? Well, that’d be great, but unfortunately it’s all wrong. Assigning your groups that way would all but ensure you draw false conclusions about your test–email B is all but certain to bring in more donations, but it’s because it was assigned high-propensity donors, not necessarily because it’s the better email. Make sure you’re at least randomizing (with a proper algorithm, q.v. footnote 4) before splitting your groups and implementing your test.
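Your CRM will most likely do the randomizing for you, but if you ever need to split an exported list yourself, the idea can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the member addresses are invented, the function name split_into_groups is ours, and it handles nothing beyond a plain shuffle-and-divide.

```python
import random


def split_into_groups(members, n_groups=2, seed=None):
    """Shuffle members at random, then deal them into n_groups groups."""
    shuffled = list(members)  # copy, so the original list is left untouched
    random.Random(seed).shuffle(shuffled)
    # Dealing round-robin keeps group sizes within one of each other.
    return [shuffled[i::n_groups] for i in range(n_groups)]


# Invented stand-in for a list export.
emails = [f"member{i}@example.com" for i in range(40000)]

group_a, group_b = split_into_groups(emails, n_groups=2, seed=2013)
print(len(group_a), len(group_b))  # 20000 20000
```

Because the shuffle ignores donation history entirely, high and low donors should end up spread across both groups rather than stacked in one. Fixing the seed just makes the split reproducible if you need to regenerate it.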

My third tip is short and sweet. After you do all of this legwork, how do you know that the right variations were sent to the right number of people? What if you’re working with eight groups instead of just two? Well, the answer is that you don’t really. But, that can (and should!) be remedied. Place your own email address in each of the test groups. This won’t significantly affect the results of the test, but it will allow you to be sure that the right variations were sent. “But, I only have one email address, how can I put myself in multiple test groups without the hassle of creating new emails?”, you ask. Use the old email-campaigner’s trick of adding a “+” to your email address if you have a Gmail-based address. For example, if your email address is janesmith@actblue.com, you can add janesmith+test_email_a@actblue.com to group A and janesmith+test_email_b@actblue.com to group B; they’ll both be delivered to your inbox, and you’ll be able to perfectly spot whether the variations were sent correctly.
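If you’re working with many groups, you can generate the “+” addresses rather than typing them. A small Python sketch follows; the helper name seed_address is ours, and it assumes your mail provider supports the “+” trick described above.

```python
def seed_address(address, tag):
    """Insert +tag before the @ so mail still lands in the same inbox."""
    local, _, domain = address.rpartition("@")
    if not local or not domain:
        raise ValueError(f"not an email address: {address!r}")
    return f"{local}+{tag}@{domain}"


# One seed address per test group.
for tag in ["test_email_a", "test_email_b"]:
    print(seed_address("janesmith@actblue.com", tag))
```

Add each generated address to its matching group before you send, then check that every variation shows up in your inbox.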

My fourth and last tip of the day is the most important one of all. Remember going through the process of determining your required sample size? Well, we did that for a (lengthily explained) reason. Don’t deviate from that now. What the hell am I talking about? I’m talking about peeking at the results too early (viz., before you reach your necessary sample size).

I get it. You spent a lot of time setting up a test for these awesome variations of, say, a contribution form, and even though you know you need to wait until 15,000 people land on the form to see results, you want to check what’s happening. Has either variation taken an early lead? And so on.

You can check what’s happening along the way, but you should definitely not stop the test early because it looks like one variation is performing better.5 This is a really common mistake, but a deadly one. I can’t stress this enough. The more times you test two variations for significance (which we’ll talk about in a future post) before the required sample size is hit, the more likely you are to detect a false positive. In fact, you can pretty quickly render your test effectively useless. So, if you just have to see what’s going on, fine, but promise yourself and statisticians everywhere that you won’t act on what you see!
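To see why this matters, here is a rough Python simulation of the problem. It runs many pretend tests in which the two variations are truly identical, so every “significant” result is a false positive. It then compares acting on any significant result across ten interim looks with checking only once at the full sample size. Everything here is an assumption made for illustration: the 10% action rate, 1,000 people per group, the number of looks, and the pooled two-proportion z-test used as the significance check.

```python
import random
from statistics import NormalDist


def p_value(successes_a, successes_b, n):
    """Two-sided p-value of a pooled two-proportion z-test, n people per group."""
    pooled = (successes_a + successes_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return 1.0  # no variation at all, so no evidence of a difference
    se = (pooled * (1 - pooled) * 2 / n) ** 0.5
    z = abs(successes_a - successes_b) / n / se
    return 2 * (1 - NormalDist().cdf(z))


def simulate(n_tests=300, n_per_group=1000, rate=0.10, looks=10,
             alpha=0.05, seed=1):
    """Return (share of tests significant at ANY look, share significant at the end)."""
    rng = random.Random(seed)
    step = n_per_group // looks
    flagged_while_peeking = 0
    flagged_at_end = 0
    for _ in range(n_tests):
        a = b = 0
        saw_significance = False
        for i in range(1, n_per_group + 1):
            # Both variations share the same true rate: there is no real difference.
            a += rng.random() < rate
            b += rng.random() < rate
            if i % step == 0 and p_value(a, b, i) < alpha:
                saw_significance = True
        flagged_while_peeking += saw_significance
        flagged_at_end += p_value(a, b, n_per_group) < alpha
    return flagged_while_peeking / n_tests, flagged_at_end / n_tests


peek_rate, final_rate = simulate()
print(f"false positives if you act on any peek: {peek_rate:.1%}")
print(f"false positives if you wait for the end: {final_rate:.1%}")
```

The exact figures depend on the settings, but the first number can never be lower than the second, since the final check is one of the looks, and in practice it tends to be a good deal higher.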

Ok, that’s it for today! Next we’ll talk about evaluating your results and even more importantly, learning from them!

FOOTNOTES:
1 as well as your tolerance for the probability of getting a false positive and false negative, though using standard values can take some of the difficulty of this decision making away

2 which, if you don’t have a pre-send checklist (we prefer old-fashioned paper, big check boxes, and sharpies!), you should make one ASAP

3 A way of passing messages from the URL back to the website which it can use to customize the display or data recorded.

4 Now, this is generally the most basic possible insurance for proper group setup, as most tools will do nothing more than randomize and divide. There are other steps that should be taken for running anything more complex than a simple A/B test, which steps tend to best be done with a statistical tool such as R. If you think something more complicated is in-line for your program, don’t hesitate to shoot us an email (digital [at] actblue [dot] com)– we’d love to work with you to see if something more complicated is in order, and if so, we’d be glad to help.

5 Saying “definitely” in a conversation about statistics is— if also delightfully ironic— a bit misleading. This is actually a really complicated topic with plenty of proffered solutions, which range from minor adjustments in your calculations to an entirely different philosophical approach to statistics (I mean, who knew, right?). Those are all great discussions to have, but for now, it’s probably best to just assume you shouldn’t repeatedly evaluate your test variations before you hit your required sample size. Ok? Cool.

This post is the third in our blog series on testing for digital organizers. Today I’ll be talking a bit about what an A/B test is and explaining how to determine the sample size (definition below) you’ll need to conduct one.

Hey, pop quiz! Is 15% greater than 14%?

My answer is “well, kind of.” To see what I mean, let’s look at an example.

Let’s say you have two elevators, and one person at a time enters each elevator for a ride. After 100 people ride each elevator, you find that 15 people sneezed in elevator 1, and 14 people sneezed in elevator 2.

Clearly, a higher percentage of people sneezed in elevator 1 than elevator 2, but can you conclude with any certainty that elevator 1 is more likely to induce sneezing in its passengers? Or, perhaps, was the difference simply due to random chance?

In this contrived example, you could make a pretty good case for random chance just with common sense, but the real world is ambiguous so decisions can be trickier. Fortunately, some basic statistical methods can help us make these judgments.
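As a concrete illustration, here is a minimal Python sketch that applies one such method, a pooled two-proportion z-test, to the elevator numbers. The choice of this particular test and the function name are ours; it is one common approach, not the only one.

```python
from statistics import NormalDist


def two_proportion_p_value(x1, n1, x2, n2):
    """Two-sided p-value for the difference between two observed proportions."""
    pooled = (x1 + x2) / (n1 + n2)  # overall rate if the elevators are the same
    se = (pooled * (1 - pooled) * (1 / n1 + 1 / n2)) ** 0.5
    z = (x1 / n1 - x2 / n2) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# 15 sneezes out of 100 riders vs. 14 sneezes out of 100 riders.
p = two_proportion_p_value(15, 100, 14, 100)
print(round(p, 2))  # 0.84
```

A p-value that large says a gap of this size would show up very often by chance alone, even if the two elevators were identical, so these numbers give no real grounds for blaming elevator 1.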

One specific type of test for determining differences in proportions1 is commonly called an A/B test. I’ll give a simple overview of the concepts involved and include a technical appendix for instructions on how to perform the procedures I discuss.

Let’s recall what we already said: we can perform a statistical test to help us detect a difference (or lack thereof) between the action rate in two samples. So, what’s involved?

I’ll skip over the nitty-gritty statistics of this, but it’s generally true that as the number of trials2 increases, it becomes easier to tell whether the difference (if there’s any difference at all) between the two variations’ proportions is likely to be real, or just due to random chance. Or, slightly more accurately, as the number of trials increases, smaller differences between the variations can be more reliably detected.

What I’m describing is actually something you’ve probably already heard about: sample size. For example, if we have two versions of language on our contribution form, how many people do we need to have land on each variation of the contribution form to reliably detect a difference (and, consequently, decide which version is statistically “better” to use going forward)? That number is the sample size.

To determine the number of people you’ll need, there are a few closely related concepts (which I explain in the appendix), but for now, we’ll keep it simple. The basic idea is that as the percent difference between variations you wish to reliably detect decreases, the sample size you’ll need increases. So, if you want to detect a relatively small (say, 5%) difference between two variations, you’ll need a larger sample size than if you wanted to be able to detect a 10% difference.

How do you know the percent difference you’d like to be able to detect? Well, a good rule of thumb to start with is that if it’s a really important change (like, say, changing the signup flow on your website), you’d want to be able to detect really small changes, whereas for something less important, you’d be satisfied with a somewhat larger change (and therefore less costly test).

Here’s what that looks like:

Sample Size Graph

Required sample size varies by the base action rate and percent difference you want to be able to reliably detect. Notice the trends: as either of those factors increases, holding all else equal, the sample size decreases.

For example, if you’re testing two versions of your contribution form language to see which has a higher conversion rate, your typical conversion rate is 20%, and you want to be able to detect a difference of around 5%, you’d need about 26k people in each group.

For instructions on how to find that number, see the appendix below. Once you have determined your required sample size, you’ll be ready to set up your groups and variations, run the test, and evaluate the results of your test. Each of those will be upcoming posts in this series. For now, feel free to email info [at] actblue [dot] com with any questions!

Footnotes:
1 Note that this should be taken strictly as “proportions”. Of course, there are many things to be interested in other than the percentage of people who did an action vs. didn’t (e.g., donated vs. didn’t donate), like values of actions (e.g., contribution amounts), but for now, we’ll stick to the former.
2 I.e., the number of times something happens. For example, this could be the number of times someone reaches a contribution form.

Appendix:

Statistics is a big and sometimes complicated world, so I won’t explain this in too much detail. There are many classes and books that will dive into the specifics, but I want you to have a working knowledge of a few important concepts you’ll need to complete an accurate A/B test. I’m going to outline four closely related concepts necessary for determining your sample size, and walk through how to find this number. Even though I’m sticking to the basics, this section will be a bit on the technical side of things. Feel free to shoot an email our way with any questions; I’m more than happy to answer any and all.

Like I said, there are four closely related concepts when it comes to this type of statistical test: significance level, power, effect size, and sample size. I’ll talk about each of these in turn, and while I do, remember that our goal is to determine whether we can reject the assumption that the two versions are equal (or, in layman’s terms, figure out that there is a real statistical difference between the two versions).

Significance level can be thought of as the (hopefully small) likelihood of a false positive. Specifically, it’s the probability that you falsely reject the assumption that the two versions are equal (i.e., claim that one version is actually better than the other, even if it’s not). When you hear someone talk about a p-value, they’re referencing a closely related concept: a result is called significant when its p-value falls below the significance level. The most commonly used significance level is 0.05, which is akin to saying “there’s a 5% chance that I claim a real difference, but there’s actually not”.

Power is the probability that you’ll avoid a false negative. Or said another way, the probability that if there’s a real difference there, you’ll detect it. The standard value to use for this is 0.8, meaning there is an 80% chance you’ll detect it; though there are really good reasons for adjusting this value. 0.8 is by no means always the best value to choose for power; it’s generally a good idea to change it if you know exactly why you’re doing what you’re doing. 0.8 will work for our purposes, though. Why not just pick a value of .9999, which is similar to saying “if there’s a real difference, there’s a 99.99% chance that I’ll detect it”? Well, that would be nice, but as you increase this value, the sample size required increases. And sample size is likely to be the limiting factor for an organization with a small (say, fewer-than-100k-member) list.

Effect Size. Of the two versions you’re testing against each other, typically you’d call one the ‘control’ and the other the ‘treatment’, so we’ll use those terms. Effect size is saying, what do you expect the proportion of actions (e.g., contributions) to be for the control, and what do you expect it to be for the treatment? The percent difference is the effect size. How this affects sample size is demonstrated in the above graph. But the whole point of running this test is that you don’t know what the two proportions will be in advance, so how can you pick those values? Well, actually, you estimate what your base action rate will be. For example, if your donation rate from an email is typically 5%, then you can use that as your base action rate. Then, for the second proportion, pick the smallest difference you’d like to be able to detect. Similarly to power, you might find yourself asking “well why wouldn’t I just pick the smallest possible difference?”. Again, the answer is that as you decrease the magnitude of the difference, the sample size you need will increase.

Finally, we have sample size, or the number of people we need to run the test on. If we have values for the above three things, we can figure out how big of a sample we need!

So how do we do that? Well, there are many ways to do it, but one of the easiest, best, and most accessible is R. It’s free, open-source, and has an excellent community for support (which really helps as you’re learning). Some might ask, “well that has a relatively high learning curve, doesn’t it? And, isn’t there some easier way to do this?” The answer to both of those questions is “maybe,” but I’ll give you everything you need in this blog post. There are also online calculators of varying quality that you can use, but R is really your best bet, no matter your tech level.

Doing this in R is actually pretty simple (and you’ll pick up another new skill!). After you download, install, and open R, enter the following command:

power.prop.test(p1 = 0.1, p2 = 0.105, sig.level = 0.05, power = 0.8)

and press enter. You’ll see a printout with a bunch of information, but you’re concerned with n. In this example, it’s about 58k. That number is the sample size for each group you’d need to detect, in this case, a 5% difference at a significance level of 0.05, a power of 0.8, and a base action rate of 10%. So, just to be certain we’re on the same page, a quick explanation:
p1: Your ‘base action rate’, or the value you’d expect for the rate you’re testing. If your donation rate is usually 8%, then p1 = 0.08
p2: Your base action rate plus the smallest percent difference you’d like to be able to detect. If you only care about noticing a 10% difference, and your ‘base action rate’ is 8%, then p2 = 0.088 (0.08 + (0.08 * 0.10))

Of course, your base action rate will likely be different, as will be the percent difference you’d like to be able to detect. So, substitute those values in, and you’re all set! Playing around with different values for these can help you gain a more intuitive sense of what happens to the required sample size as you alter certain factors.
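If you’d like to see roughly what is going on under the hood, or you’d rather not install R, here is a Python sketch of the standard normal-approximation formula for comparing two proportions. It is an approximation written for illustration; the function name is ours, and its results should land close to, but are not guaranteed to match exactly, what the R command prints.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(p1, p2, sig_level=0.05, power=0.8):
    """Approximate people needed in EACH group for a two-sided test of p1 vs. p2."""
    z_alpha = NormalDist().inv_cdf(1 - sig_level / 2)  # guards against false positives
    z_power = NormalDist().inv_cdf(power)              # guards against false negatives
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p1 - p2) ** 2)


# Same inputs as the R command above: 10% base rate, 5% relative difference.
print(sample_size_per_group(0.10, 0.105))

# The example from the list above: 8% base rate, 10% relative difference.
base_rate = 0.08
p2 = base_rate + base_rate * 0.10  # 0.088
print(sample_size_per_group(base_rate, p2))
```

Try shrinking the difference between p1 and p2, or raising power, and watch the required sample size climb.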

In this post, part of our testing blog series, I’ll talk a bit about some things you might consider testing, and—probably even more importantly—some things you might not want to test. This is all the more relevant if you’re managing a smaller list (say, fewer than 100k active members). As we’ll discuss in future posts, it takes huge sample sizes to reliably detect relatively small differences in two testing segments, so you’ll want to reserve your testing for factors that are likely to have larger effects on your goals, like subject lines, for example.

But to begin, we should be on the same page regarding why we test. It’s pretty simple. We tend to be pretty bad at guessing what will happen, so it’s often better to let data inform our decision making. For instance, when sending an email, should you go with a negative subject line like “This Republican is the worst!”, or a positive one like “Sally Jane is a great Democrat!”?

This trivial example allows us to demonstrate an important testing concept. Testing is only a tool; it’s not the final judge, nor does it say anything about the appropriateness of your content. If “This Republican is the worst!” isn’t in congruence with your campaign/organization’s messaging and mission, then you shouldn’t test that subject line, let alone use it for an email to your entire list.

So, then, assuming the subject matter is in-line with your messaging and mission, what’s something you should test, even with a small list? Subject lines could be one, but there are other things that could have a big impact on your action rates. What comes to mind first and foremost is email content.

By this I mean writing two completely different emails, whether they’re about the same thing or completely different concepts. The varied factor could be anything from your topic and theory of change to your tone and word choice. Even ostensibly similar emails—let alone drastically different ones—can yield very large differences in results. We at ActBlue, for example, regularly test at least three different fundraising emails for every one that we end up sending to our full list.

For one of our most recent email blasts, we sent four different email drafts, a couple of which were quite similar. The results? The best-performing draft brought in over triple the number of donations as the worst-performing drafts! So, here’s a clear case in which performing a simple test can lead to much higher action rates, whether you’re looking for signatures on a petition or donations to your cause.

It might seem that writing three or more email drafts for every send is a bit much for a resource-constrained organizer. If that’s the case, you should still be message testing periodically, say, once a month or so. The goal here is to ascertain the biggest button-pushers for your list members. A standard example is testing the performance of an email highlighting the negative characteristics of your opponent against the performance of an email highlighting the endorsement of your candidate by a local community leader. This is a less resource-intensive way to gauge the temperature of your list and see what resonates with your list members.

So if the content of an email is something that is definitely worth testing, what are some things that small campaigns shouldn’t test? Well, anything that you expect won’t result in a large percentage difference between your test segments. For example, you certainly could test four differently colored donate buttons, but you shouldn’t.

Chances are, you won’t see a significant advantage in one of them over the others. How do I know this? Well, I can’t claim 100% certainty (nor can any honest analyst), but whenever we at ActBlue or some of our larger partners have tested something very small like this, we’ve seen that result.

For example, we wanted to run an A/B test1 on our contribution form to find out whether we could increase the conversion rate by removing the header, “Employment Information” above two of the FEC-required fields. To see what that looked like (and for some more A/B test examples), check out this blog post. We knew that it would take us close to 150,000 page views to reliably detect the small percentage difference in the two segments of the test we required to make a permanent change to our contribution form. I’ll talk more about determining required sample size in a later post, but for now, the point is that it took a lot to get a little.2 If you manage a smaller list, that means sending dozens of emails for a relatively minor gain, and that’s not worth your time.
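To put “dozens of emails” in perspective, here is a back-of-the-envelope Python sketch. The 150,000 page views figure comes from the test above; the 2,500 form views per email is purely an assumption for illustration (think of a 50,000-member list where 5% of recipients land on the form), so plug in your own numbers.

```python
from math import ceil


def emails_needed(required_page_views, views_per_email):
    """Number of email sends needed to collect the required form views."""
    return ceil(required_page_views / views_per_email)


# Assumed: each send brings 2,500 people to the contribution form.
print(emails_needed(150_000, views_per_email=2_500))  # 60
```

Under those assumptions, a single small-detail test would tie up sixty sends.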

Of course, context matters a lot, and in this case, context is your email program and your members. So, the final word is that if you really, really want to know, you should indeed test something for yourself instead of taking someone else’s word for it. But you’re much better off focusing on testing more meaningful factors (like your messaging) that are likely to result in clear and large differences. For the small things, you can learn from the organizations that have the resources to test small nuances. If you subscribe to numerous email lists, you’ll get a good gauge of what community best practices are at a given time.

Testing one email draft against another tells you exactly one thing: which (if either) is better. It doesn’t, however, tell you some things that can be quite valuable: Do members of your list tend to prefer positive emails or scare-to-action emails? Do they tend to respond well to fun, edgy language or slightly more formal language?

One A/B test won’t provide much of an answer to questions like that, but repeatedly testing two different email styles—like short, punchy emails against longer, more descriptive emails—over time can help you understand the style of communication your list members prefer, and therefore help you write emails with better action rates.

As you go on and develop your testing program, examining other questions like how much money to ask for in a fundraising email, how to best segment your list, and so on becomes more important and makes more sense from a cost-benefit perspective, too.

But to start, remember: make sure what you’re testing fits in with your organization’s messaging, plan a test that has a plausible chance of realizing big gains, and, more than anything else, work on honing your messaging. You’ll need to start out with bigger questions—and, therefore, more general tests—about your list members and eventually narrow down to the specifics.

The next post in our series about testing will talk about some essential factors involved in setting up a test, like setting up your groups and determining your required sample size. Expect that one to be published next week, after Netroots Nation.

Footnotes:

1 “A/B test” is an informal term for statistically testing two variations of some singular factor against each other in order to determine which, if either, is better for your desired outcome.

2 We have millions of people land on our contribution forms each month, so for us, there’s a huge payoff to testing minor details that result in small percentage-point gains. It’s thousands of tests like this one over the years that make our contribution forms so successful. But this is our context— running a testing program with a small list is a totally different game.

We on the left have done a great job cultivating a “test, test, test” ethos, and while testing can result in big gains, it takes time and resources that digital organizers often don’t have. And for those working with a smaller list (say, fewer than 100k members or so), the challenges are even greater.

Don’t be discouraged, though; anyone can run an effective testing program, you just need to be aware of your organization’s specific circumstances. For instance, if you have a small list, it’s important to know that there are actually a lot of things that you shouldn’t test (more on this to come in future posts).

To help you get on track toward developing a strong testing program, we’re going to publish a series of blog posts, each focused on a particular aspect of digital testing for small organizations. We’ll be talking about anything from tools and techniques to case studies and getting buy-in from your supervisor.

If there are any specific issues you’d like to see addressed in this series of testing blog posts, please reach out! An email to info@actblue.com with a subject line “ActBlue Blog: Testing” will be directed my way.

A couple of weeks ago, Julia unveiled our new mobile-responsive contribution forms to the world. Since we’ve rolled out mobile-responsive forms, our mobile contribution numbers have been through the roof, so we’re really excited to share them with you.

Check out this graph, in which the red line represents the release date. Notice anything?

ActBlue mobile donation trends

As we’ve mentioned, our initial A/B test yielded some excellent results: our new mobile-responsive forms led to a 49% boost in conversions (a statistically significant improvement at p < .01). And these forms are already making a marked difference.

Since the release, 21.9% of sitewide donations have been made by supporters using a mobile device. For ActBlue Express users– those who have saved their credit card information with us– the number’s even higher at a full 25.9% mobile. According to the stats textbooks I keep on my desk for reference, that number is “insanely high”.1 Seriously though, from the beginning of the year to the day our mobile-responsive contribution forms were released, 9.0% of donations were made via mobile devices (12.3% for Express users). It’s pretty tough to exaggerate how prodigious this jump is, and there’s clearly more growth to come.

The importance of mobile donations is increasing inexorably; we all know that. But, on one of the busiest days of the year, we topped over 30% mobile donations among ActBlue Express users. It’s a whole new world.

Footnotes:
1 Just kidding, of course :-)

This week we officially announced Express Lane, and I’m guessing the fact that it can more than triple your money caught your eye. It can, and the way to raise more money is to learn Express Lane best practices and do your own optimization. We’re here to help you with both.

We’ve done a significant amount of Express Lane testing in our email blasts over the past few months to help you get started on what works– and what doesn’t– with Express Lane. Each email list, of course, is different, so you should probably test and expand upon the takeaways below with your own list. And definitely let us know the results; we’d love to hear about them. It’d be especially great if you wanted to share your results here on the blog– just like the fantastic folks at CREDO Action were happy to do for this post– so that others can learn from your test results.

Here’s a little bit of background: our own email list consists entirely of donors, so it’s a pretty diverse group of folks. Also, we always fundraise to support our own infrastructure, not specific issues or candidates. Further, we spend most of our time optimizing for recurring donations because we’ve found them to be best for our organization, but much of what we say here also applies to one-time donation asks. We are, by the way, totally interested in collaborating with you on testing and optimization efforts– just give us a shout.

For this post, we’re going to discuss the gains you can expect from using Express Lane, results from some of the tests we’ve run on our Express Lane askblocks, and touch on stylistic concerns. Then, we’ll finish up with a summary of our recommendations and where you can go from here.

What to Expect

So, you probably expect to raise a lot more money using Express Lane, but what’s a typical increase? We’ve tested Express Lane vs. non-Express Lane on both recurring and one-time asks among randomly sampled Express users and seen Express Lane bring in more than triple the money for one-time1 asks, and 37.7% more for recurring asks (measured by money donated plus pledged recurring).

That’s quite a big boost, but other partners have seen significant gains, too. For example, here’s a test that was run by our friends at CREDO Action, some of our most sophisticated users. They tested a $5 control ask against a $5, $10, $25 Express Lane askblock. Their Express Lane version brought in 37.4% more donations than the control version. If you don’t see a noticeable increase in your testing, you should definitely reach out.

Results from ActBlue’s April 2013 Express Lane test

Askblock Structure

We have an awesome Express Lane Link Creator tool for you, which you can find by clicking the “Express Lane” tab of your fundraising page. It’s really important that you use the language we provide there so that donors know that they’ll be charged instantly and why that’s possible– if you want to deviate from this, you’ll have to get our approval first. We do think, though, that you should stick with this language since it’s clear and concise.

But, how many Express Lane links should you include in the body of your email, and for what amounts? Should the intervals between amounts be equal? The answer to such questions will depend on your email list members but here are some suggestions, based on tests we’ve run, that should help get you on your way to optimizing your own Express Lane askblock structures!

One approach we’ve seen used by organizations in different contexts is what we refer to as a jump structure. The basic idea is that you set a large interval between the lowest link amount (which should be a low amount relative to your list’s average donation amount) and second-lowest link amount. Here’s an example we’ve used:

Example jump structure

This relatively low-dollar link could encourage a much higher number of donations (if your jump structure amount is, for example, $4 instead of the $5 you’d usually use). This is because it’s a lower absolute dollar amount, but also a lower amount relative to the rest of the structure. Basically, the large jump between the lowest amount and the second-lowest amount makes the first one look small.

We’ve found that in general, this type of jump structure does indeed lead to a higher number of donations, but a lower overall amount of money, than the common structures which we used as controls. While it led to more donations, we didn’t see enough extra donations to outweigh the “cost” of the lower dollar amount and bring in more overall money. If you’re looking to bring in more low-dollar donations in the hopes of larger-dollar donations in the future, however, this might be a good strategy to try.

We’ve also looked at the effect of changing the lowest dollar amount in your askblock. In July, we tested the following three askblock structures against each other:

Structure “A”

Structure “B”

Structure “C”

Obviously, we were trying to see whether we could increase the total money we raised by increasing the amount of the bottom link2. The risk of this approach is that you might lose a certain number of donations by setting the lowest ask amount to be a little bit higher3.

We found that by number of donations, A>B>C, but by overall money raised, C>B>A. The askblock labelled “C”, in fact, raised 21.1% more money than “A” (“B” raised 12.1% more than “A”), even though “A” brought in 15.3% more donations than “C”!

structure_test_graph
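To see the trade-off in one number, you can back out how much larger the average gift must have been under “C” than under “A”. This is just arithmetic on the two percentages reported above, using Revenue = Quantity * Donation Amount; the variable names are made up for illustration, and the resulting figure is derived here rather than something we measured directly.

```python
# Figures reported above for structures "A" and "C".
money_c_over_a = 1.211      # "C" raised 21.1% more money than "A"
donations_a_over_c = 1.153  # "A" brought in 15.3% more donations than "C"

# Revenue = Quantity * Average donation, so:
# avg_C / avg_A = (money_C / money_A) * (donations_A / donations_C)
avg_c_over_a = money_c_over_a * donations_a_over_c

print(f"Implied average gift, C vs. A: {avg_c_over_a - 1:+.1%}")
# prints "Implied average gift, C vs. A: +39.6%"
```

In other words, the fewer donations under “C” were more than offset by a roughly 40% larger average gift.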

The “other amount” Link

A great thing about Express Lane is that users’ donations are processed once they click the link in your email body. However, as much as we try to structure our links perfectly, some donors are always going to want to do their own thing, and that’s okay. Enter the “other amount” link.

An “other amount” link doesn’t process the donation right away; it’s simply a normal ActBlue link that takes the user to your contribution page and allows them to choose a custom donation amount and/or recurring length. This is included as a default in our Express Lane Link Creator tool.

We at ActBlue focus on recurring donation asks because over the long run– and our goal is to be the best technology both today and years into the future– they bring in more money than one-time donation asks, even taking into account imperfect pledge completion rates. So, we worried at first that adding an “other amount” link might draw too many people toward giving one-time donations instead of more valuable recurring donations. But, we also know that it’s important to give people the option to choose their own donation amount, lest they not donate at all. This is why every ActBlue contribution page allows people to easily choose between a one-time donation and a recurring donation.

So we decided to test two things. First, we wanted to know whether the presence of an “other amount” link in our email body would lead to more/fewer donations. Actually, we were almost positive that getting rid of the “other amount” link would be a big loss, but we wanted to run the test anyway. That way, we could confirm this and make sure no one else has to lose money on the test. The result: don’t try this at home. The version which included the “other amount” link brought in 88.3% more money (90.6% more donations) than the version which did not. We’ll accept your thanks in the form of chocolate or wine. Just kidding! Our lawyers won’t allow that.

Second, we’ve performed several tests (and several variations thereof) of whether an “other amount” link which indicated that users could instead give a one-time donation would lead to more/fewer donations than an “other amount” link that made no mention of one-time donations. This matters to us because, as we mentioned, we focus mostly on recurring donation asks, and wanted to see whether we could retain people who would give a one-time donation, but might not know that it was possible.

Typically, an “other amount” link which mentions one-time contributions leads to a statistically significantly higher number of donations, but less overall money raised. While this setup might draw in some people who otherwise wouldn’t have given, it also pulls some would-be recurring donors into giving one-time donations, which bring in less money. This doesn’t mean that such language is a bad thing, but you should consider your fundraiser’s goals and organizational priorities while choosing your link language. If, for example, your goal is to increase participation rather than raise as much money as possible, then mentioning one-time donations in your “other amount” link might be a good idea during a fundraiser focused on recurring donations.

No mention of one-time donations

With mention of one-time donations

Style

Stylistic elements of an email can often have a huge impact on your ask, and since Express Lane links are new, the presentation of them hasn’t yet been set in stone. We started sending emails with our Express Lane askblock simply as an HTML <blockquote> element. We wanted the Express Lane askblock to stand out and to be easily identified, though, so we devised a simple design to make it pop. We put our Express Lane askblock in a gray box and center-aligned the text4. It looked like this:

We tested this against our original structure among several different link structures, and the results were pretty interesting. Among link structures with 4 or 5 links (including “other amount”), the gray box boosted the amount of money raised by up to 37.7%.

Subtle Express Lane askblock styling

The obvious concern is that some stylistic elements are really subject to novelty effects, and the initial boost in action rate will decline or disappear altogether in time. We think the gray box may be an exception, though. First, the gray box is pretty subtle, almost to the point of being too dull, so I doubt that it caused the fervor of a “Hey” subject line or manic yellow highlighting. Second, the box serves a legitimate function, i.e., to identify this new set of links that’s now appearing in emails as a single entity that stands out from the email content.

Where to go from here

You’ve seen how some slight changes– the link amounts, the intervals between them, the number of links, etc.– can seriously affect the performance of your Express Lane email ask. Hopefully, you’ve picked up some tips about how to structure your asks, as well as picked up a few ideas for testing that might prove fruitful for your own organization.

As progressive organizers, we all know how important participation and collaboration are. In this light, I encourage you to get in touch with us if you’d like to work together on running a test. Moreover, if you run a test with interesting results, we would love to hear from you so that we can share them with the larger ActBlue community.

Footnotes:

1N.B.: some of this money came from people giving recurring donations from the “other amount” link in our one-time ask.

2There could be an additional effect from having one fewer link in “C”, but our other testing indicates that this isn’t a particularly important factor.

3Think about it as if it’s a variation of the classic revenue maximizing problem, where Revenue = Quantity * Donation Amount. Of course, donors can still choose their amount by clicking the “other” link, but the suggested amounts do indeed impact behavior.

4style="background-color:#ECEDF0; padding:1.0em 1.5em; text-align:center;"

If you had hundreds of millions of lines of contribution data, what would you want to know? Well here at ActBlue, we have an insane amount of data, and we’re always looking to learn more about our donors and how they use our site.

So we recently posed the question:

Who donates more…men or women?

The answer turns out to be women, but only if you approach things from the right perspective.

Before I go on, I’d like to say that I by no means want to perpetuate the gender binary; everyone at ActBlue respects and values people all across the gender spectrum.

We all know some of the basic election gender data – more women went for Obama, more men for Romney. But, political contributions involve personal investment, so I wanted to see how it breaks down on our site, which is obviously exclusive to Democrats. There was just one hiccup in my data-nerd fantasy: we don’t collect any information on our donors’ gender identification.

The easiest way to get around this problem is to use approximate name-gender matching. While many databases available for this purpose are either costly, unreliable, or both, I did eventually find a source which I felt comfortable using (an academic paper available for free in which the authors explained their methodology). So after digging into our database and crunching the numbers, I came out with some answers. I’ll give an overview of my results first and then explain my methodology and some statistical issues I want to highlight in a bit more detail further down.

I found that for individual contributions, women give about 15.0% smaller dollar amounts than men do. I also found, however, that women are 12.4% more likely to make a recurring contribution than men are. (Assume all of these values are statistically significant, but if you’re interested read more on that below.)

So the obvious question was: what happens once you factor in future installments of a recurring contribution, and not just the initial dollar amount? I crunched the numbers again, but it turned out not to change anything — women still donated about 16.6% smaller dollar donations than men. This was a big surprise, so I started racking my brain for possible explanations.

You’ve probably already figured it out, but I made quite an oversight in my initial assumptions. It’s well documented that the gender wage gap still persists; 77 cents is a popular estimate for how much a woman earns for doing the same amount of work a man is paid one dollar to do. This is incredibly unjust, but it is also directly relevant to my project: women are unfairly earning less income than men, so it makes sense that they’d have less disposable income from which they are willing and able to make political contributions, all else equal.

So I did what every progressive has always dreamed of. I punched a few computer keys and voilà– the gender wage gap disappeared! After this adjustment for equality, women turned out to make about 12.9% higher dollar contributions than men, and when factoring in the entirety of recurring donations, they donated 11.4% more than men. Quite the change from my initial findings, indeed. (This kind of broad and general adjustment is bound to be approximate, but in my opinion it was actually a fairly conservative change. But, see below for some discussion of that.)

Given ActBlue’s focus on grassroots donors, I wondered what would happen if I trimmed my dataset to include only donations that were $100 or less. Well, I did that and was left with about 95% of my original sample, which really does demonstrate the extent to which ActBlue is all about small-dollar donations. After trimming the dataset (and continuing to use adjusted donation amounts), I found that women were donating higher dollar amounts than men to an even greater extent than before, at 21.1%!

As many of you know, ActBlue Express Accounts allow donors to securely store their payment information with us and donate with just one click. I found that women and men in my sample donated using an ActBlue Express Account at a remarkably similar rate– within 1 percentage point. This just goes to show how egalitarian ActBlue Express Accounts are!

Now there are several important takeaways here. It looks like on ActBlue, for example, women tend to donate higher dollar amounts than men (after adjusting for the gender wage gap), and also tend to give recurring contributions more often than men. But for me, the biggest lesson was to be vigilant about understanding what outside factors might be affecting the internal nature of your data.

Before I move on to some nitty-gritty technical comments, I want to say that I really did mean the question that opened this blog post. So, readers, what would you want to know if you had that much data? I really enjoyed sharing these results with you, so please shoot me a note at martin [at] actblue [dot] com to let me know what you’d like our team to dig into for the next post!

My discussion below is a bit more technical and intended for other practitioners or very curious general readers.

As I mentioned above, name-to-gender matching is difficult for several reasons. In “A Name-Centric Approach to Gender Inference in Online Social Networks”, C. Tang et al. combed Facebook pages of users in New York City and, after using some interesting techniques, came up with a list of about 23k names, each of which was associated with the number of times a user with that name identified as male and female. I definitely recommend reading through their study– you might not think it’s perfect, but it could provide some inspiration for the aspiring data miners among you. In any case, I then did some further pruning of their list for suitability reasons, the effects of which were minimal. I combined their name-gender list with an n=500k random sample of contributions made on ActBlue since 2010, matching only names that appear on both lists for obvious reasons.

At that point, I had a dataset that included, on a contribution-basis, the donor’s name, estimated gender (the authors of the study pegged their matching accuracy at about 95%), and some other information about the contribution. Of the 500k sample, the matching spat out about 50.4% females.
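For the curious, the matching step can be sketched roughly like this. Everything here is hypothetical: the names, counts, and contribution rows are invented, and the simple majority rule for assigning an estimated gender is an assumption made for illustration, not necessarily how the study’s list or my pruning handled ambiguous names.

```python
from collections import namedtuple

NameCounts = namedtuple("NameCounts", ["male", "female"])

# Hypothetical name list: first name -> number of users identifying as each gender.
name_counts = {
    "mary": NameCounts(male=3, female=997),
    "james": NameCounts(male=990, female=10),
    "jordan": NameCounts(male=520, female=480),
}

# Hypothetical contribution sample.
contributions = [
    {"first_name": "Mary", "amount": 25.00},
    {"first_name": "James", "amount": 10.00},
    {"first_name": "Xyzzy", "amount": 5.00},   # not on the name list
    {"first_name": "Jordan", "amount": 50.00},
]


def estimate_gender(first_name, counts=name_counts):
    """Return 'female' or 'male' by majority count, or None if the name is unlisted."""
    entry = counts.get(first_name.strip().lower())
    if entry is None:
        return None
    return "female" if entry.female > entry.male else "male"


# Keep only contributions whose name appears on both lists.
matched = []
for row in contributions:
    gender = estimate_gender(row["first_name"])
    if gender is not None:
        matched.append({**row, "gender": gender})

share_female = sum(r["gender"] == "female" for r in matched) / len(matched)
print(f"{len(matched)} of {len(contributions)} matched; {share_female:.1%} estimated female")
```

A name like “Jordan” shows why the accuracy is only approximate: a majority rule has to pick one label even when the counts are nearly even.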

When I say “other information”, I’m specifically referring to factors that I know from past analyses directly affect contribution amount (for instance, whether the donor is an ActBlue Express User or not). I took this extra information since I knew I’d need to control for these factors when evaluating the effect of gender on donation amount. This is a good reminder of why it’s super important to know your data really well by staying current with trends and performing frequent tests– otherwise you might end up omitting important explanatory variables, choosing a misspecified model, or making other common mistakes.

With my dataset ready, I tried a few different types of models, but landed on one in which the dependent variable (contribution amount) was in logarithmic form, so it looked like:

ln(contribution_amount) = β₀ + β₁·female + some other stuff + u

This model was best for a few different, yet boring (even for practitioners) reasons, so I’ll spare you the discussion :)
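If you’d like to play with a model of this shape, here is a minimal sketch on simulated data. It is not my actual analysis: the coefficients are made up, a single `express` dummy stands in for “some other stuff”, and the small `ols` and `solve` helpers are hypothetical stand-ins for whatever statistics package you prefer. The one thing worth taking from it is the interpretation: with a logged dependent variable, the coefficient on `female` translates to a percentage difference of exp(coefficient) − 1.

```python
import math
import random

random.seed(0)


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Simulated data with made-up coefficients: intercept, female, express.
true_beta = [3.0, -0.15, 0.20]
rows, log_amount = [], []
for _ in range(20_000):
    x = [1.0, float(random.randint(0, 1)), float(random.randint(0, 1))]
    u = random.gauss(0.0, 1.0)  # error term
    rows.append(x)
    log_amount.append(sum(b * xi for b, xi in zip(true_beta, x)) + u)

b0, b1, b2 = ols(rows, log_amount)
pct_diff = math.exp(b1) - 1  # percentage difference implied by the female coefficient
print(f"coefficient on female = {b1:.3f}, implied difference = {pct_diff:+.1%}")
```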

As I noted in my general discussion, all of the results I found were “statistically significant”, but there was an issue I wanted to address. In my case, yes, beta coefficients were significant at p<.0001, as was the overall significance of the regression and joint significance of groups of regressors I thought it important to test. But with n=500k, I think saying certain things were “statistically significant” can be a bit insincere or misleading if not explained properly, unless you’re talking to someone fairly comfortable with statistics. What I mean is pretty obvious if you just think about how a t statistic is actually computed, why it’s done that way, and what that means.

At huge sample sizes, very small differences can be “significant” at very high confidence levels, and lead to misinterpreting your results. Moreover, just because something is statistically significant doesn’t mean that it is practically significant. There are a few different ways to deal with this, none of which are perfect, though. In my case, I saw that 95% CIs of the regressor coefficients were really tight, and would certainly consider 10%-14% differences practically significant (don’t get me wrong—of course there are times when small differences like 0.3% can be practically significant, but this isn’t one of them). I’m not bashing large sample sizes here or saying that hypothesis testing is unimportant (it is!), but rather emphasizing caution and clarity in our reporting.
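To make the sample-size point concrete, here is a back-of-the-envelope sketch. The numbers are hypothetical: it assumes two equally sized groups, the same standard deviation of 1.0 in both (think log contribution amounts), and a fixed gap of 0.01, or roughly 1%. The helper `two_sample_t` is just the usual difference-in-means t statistic under those assumptions.

```python
import math


def two_sample_t(mean_diff, sd, n_per_group):
    """t statistic for a difference in means with equal SDs and equal group sizes."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    return mean_diff / standard_error


# Same tiny gap, increasingly large samples.
for n in (1_000, 25_000, 250_000):
    t = two_sample_t(mean_diff=0.01, sd=1.0, n_per_group=n)
    print(f"n per group = {n:>7,}: t = {t:.2f}")
# n per group =   1,000: t = 0.22
# n per group =  25,000: t = 1.12
# n per group = 250,000: t = 3.54
```

The gap never changes, but because the standard error shrinks with the square root of the sample size, the same 1% difference goes from invisible to “highly significant” once you get to a few hundred thousand observations per group.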

Further, there’s another important lesson here. Sometimes, no matter how cleverly we choose our models or carefully we conduct our analysis, the explanatory power of a regression is going to be limited because you simply don’t have enough data. I don’t mean depth of data (i.e. sample size), but rather the breadth of the data (i.e. categories of information). For instance, personal income is clearly going to be an important factor in determining the dollar amount of a given political contribution. We don’t, however, have that kind of information about donors. Does that mean I should have just thrown away the regression and called it a day? Of course not, because obviously partial effects can be estimated fairly precisely with very large sample sizes, even with relatively large error variance. Again, the lesson is to be judicious in your interpretation and reporting of results.

I also noted that I thought my gender wage gap adjustment was fairly conservative. What I did was simple; for all contributions in the dataset made by females, I calculated an “adjusted” contribution amount by dividing the actual contribution amount by 0.77. This implicitly assumes that if women were paid equally for equal work, they would contribute more overall dollars, but at their current ratio of donations/income. In other words, their marginal propensity to donate would be constant as income increases. In fact, I think this is probably false in reality, and women (and men, for that matter) would instead demonstrate an increasing marginal propensity to donate with increased income, and therefore I should have increased the contribution amounts by even more than I did. I haven’t, however, read any study that provides a reliable estimate of a marginal propensity to donate, and therefore decided it best to keep things simple.
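The adjustment itself is a one-liner. Here is a small sketch with invented contribution rows; the 0.77 ratio is the figure discussed above, while the function name and the data layout are hypothetical.

```python
WAGE_RATIO = 0.77  # the wage gap estimate used in this post


def adjusted_amount(amount, gender, wage_ratio=WAGE_RATIO):
    """Scale up contributions from (estimated) female donors; leave others unchanged."""
    if gender == "female":
        return amount / wage_ratio
    return amount


# Invented example rows.
contributions = [
    {"amount": 10.00, "gender": "female"},
    {"amount": 10.00, "gender": "male"},
    {"amount": 77.00, "gender": "female"},
]

for row in contributions:
    row["adjusted"] = adjusted_amount(row["amount"], row["gender"])
    print(f'{row["gender"]:>6}: ${row["amount"]:.2f} -> ${row["adjusted"]:.2f}')
```

Dividing by 0.77 scales each affected contribution up by about 29.9%, which is exactly the constant-ratio assumption described above: more dollars overall, but the same share of income donated.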

I already asked you to reach out and tell me what you’re interested in knowing, but I’ll double down here: I would love to hear from you and get your input so that the next blog post will reflect our community members’ input! So shoot me an email at martin [at] actblue [dot] com.
