Last week I was lucky enough to attend the Hewlett Foundation’s Quality Education in Developing Countries (QEDC) conference in Uganda, which brought together both Hewlett-funded organizations running education interventions and outside researchers tasked with evaluating the projects. (My advisor and I are working with Mango Tree Uganda to evaluate their Primary Literacy Project.) Evaluation was one of the central themes of the conference, with a particular focus on learning from randomized controlled trials (RCTs). While RCTs are clearly the gold standard for evaluations nowadays, we nevertheless had a healthy discussion of their limitations. One limitation that got a lot of attention is that while randomized trials are great for measuring the impact of a program, they typically tell you less about why a program did or did not work well.
We didn’t get into a more fundamental reason that RCTs are seeing pushback, however: the fact that they are framed as answering yes/no questions. Consider the perspective of someone at an NGO who is weighing an RCT framed that way. From that vantage point, a randomized trial is a complicated endeavor that costs a lot of effort and money and has only two possible outcomes: either you (1) learn that your intervention works, which is no surprise and life goes on as usual, or you (2) get told that your program is ineffective. In the latter case, you’re probably inclined to distrust the results: what the hell do researchers know about your program? Are they even measuring it correctly? Moreover, the results aren’t even particularly useful: as noted above, learning that your program isn’t working doesn’t tell you how to fix it.
This yes/no way of thinking about randomized trials is deeply flawed – they usually aren’t even that valuable for yes/no questions. If your question is “does this program we’re running do anything?” and the RCT tells you “no”, what it’s really saying is that no effect can be detected given the size of the sample used for the analysis. That’s not the same as telling you that your program doesn’t work; it’s giving you the best possible estimate of the effect size given the data you collected, and telling you that the best guess is small enough that we can’t rule out no effect at all.
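To see how much detectability hinges on sample size, here’s a quick sketch (with made-up numbers – the function name and figures are mine, not from any particular study): the exact same estimated effect of 5 percentage points is statistically indistinguishable from zero with 100 people per arm, but highly significant with 4,000 per arm.

```python
from math import sqrt

def z_stat(p_treat, p_control, n):
    """Two-sample z statistic for a difference in proportions,
    assuming equal arm sizes of n each."""
    se = sqrt(p_treat * (1 - p_treat) / n + p_control * (1 - p_control) / n)
    return (p_treat - p_control) / se

# Same best estimate of the effect (5 percentage points) ...
small = z_stat(0.55, 0.50, 100)    # 100 per arm
large = z_stat(0.55, 0.50, 4000)   # 4,000 per arm

# ... but only the larger trial can distinguish it from zero at the 5% level
# (|z| > 1.96).
print(f"n=100 per arm:  z = {small:.2f}")
print(f"n=4000 per arm: z = {large:.2f}")
```

The point estimate is identical in both cases; only the precision changes, which is why “not significant” should never be read as “zero effect”.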
It is true that running a randomized trial will get you an unbiased answer to the “yes” side of the yes/no does-this-work question: if you find a statistically significant effect, you can be fairly confident that it’s real. But it also tells you a whole lot more. First off, if properly done it will give you a quantitative answer to the question of what a given treatment does. Suppose you’re looking at raising vaccination rates, and the treatment group in your RCT has a rate that is 20 percentage points higher than the control group, significant at the 0.01 level. That’s not just “yes, it works”, it’s “it does about this much”. This is the best possible estimate of what the program is doing, even if it isn’t statistically significant. Better yet, RCTs also give you a lower and an upper bound on what that “how much” figure is. If your 99% confidence interval is 5 percentage points on either side, then you know with very high confidence that your program’s effect is no less than 15 percentage points (but no more than 25).*
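The vaccination example above can be sketched in a few lines. The numbers here are hypothetical (70% vaccinated in treatment vs. 50% in control, 1,200 children per arm, chosen so the math roughly reproduces the 20-point effect with a ±5-point band), and this uses a simple Wald interval for the difference in proportions:

```python
from math import sqrt

def diff_in_proportions_ci(p_treat, p_control, n_treat, n_control, z=2.576):
    """Wald confidence interval for a difference in proportions.
    z = 2.576 gives a 99% two-sided interval."""
    diff = p_treat - p_control
    se = sqrt(p_treat * (1 - p_treat) / n_treat
              + p_control * (1 - p_control) / n_control)
    return diff - z * se, diff + z * se

# Hypothetical trial: 70% vaccinated in treatment, 50% in control,
# 1,200 children per arm.
lo, hi = diff_in_proportions_ci(0.70, 0.50, 1200, 1200)
print(f"effect: 20pp, 99% CI: ({lo:.1%}, {hi:.1%})")
```

Reporting that whole interval – “somewhere between roughly 15 and 25 percentage points” – is far more useful to an implementer than a bare significance star.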
I think a lot of implementers’ unease about RCTs would be mitigated if we focused more on the magnitudes of measured impacts instead of on significance stars. “We can’t rule out a zero effect” is uninformative, useless, and frankly a bit hostile – what we should be talking about is our best estimate of a program’s effect, given the way it was implemented during the RCT. That alone won’t tell us why a program had less of an impact than we hoped, but it’s a whole lot better than just a thumbs down.