Ceteris Non Paribus

Ceteris Non Paribus is my personal blog, formerly hosted at nonparibus.wordpress.com and now found here. This blog is a place for me to put the ideas I have, and the stuff I come across, that I’ve managed to convince myself other people would be interested in seeing. See the About page for more on the reasons why I maintain a blog and the origin of the blog’s name.

My most recent posts can be found below, and a list of my most popular posts (based on recent views) is on the right.

Your pre-analysis plan should contain the plan for analyzing your data, and nothing else

I am currently reviewing a paper where the authors filed a lengthy analysis plan that does not actually say how they plan to analyze their data. This is an endemic problem in development economics RCTs. As a discipline we have adopted the ritual of filing pre-analysis plans (and the associated hassle cost) but not the practice of genuinely pre-specifying how we are going to look at the data.

On many of my projects, I have had coauthors insist on writing lengthy “pre-analysis plans” that are essentially papers with no data. We then duly file these on the AEA trial registry. Doing these plans is a ton of work: we spend pages and pages describing our experiments and writing about our scientific hypotheses. We cite the literature. We carefully frame our arguments. This has happened so many times, with so many collaborators, that I am not calling out any specific person here—my coauthors are simply following the norms in our field.

Those norms are wrong. None of that busywork actually pins down the specific data analyses we will do or constrains which hypothesis tests we will run. The portion of the analysis plan that is an actual plan for analyzing the data typically runs to 1-2 pages in length. It might be longer if the plan is super detailed about how variables will be pre-processed and cleaned. But most of the long-winded “PAPs” that typify our field do not specify any of that.

Including the extra detail in the pre-analysis plan actually makes it harder to read. As a reviewer, having to dig through a 50-page PDF to see what regressions people said they would run is a huge pain. I have taken to using ChatGPT 5.4 Thinking to find where the list of control variables is hidden (it is great at this sort of thing, by the way).

The welfare benefits of having pre-analysis plans in economics are dubious at best. We still publish many papers using secondary data where pre-specified analyses are typically impossible, and certainly not normative. I think it is better to let analyses of RCTs play with the specification a little bit to get significance stars than to skew the published literature toward less-credible research designs. Publication bias is probably smaller than not-right-in-the-first-place bias.

However, if we are going to have analysis plans then we need to focus them on the actual analysis plan part. That is my plea to development economists: stop with the writing of the contentless papers! Give me your plan for analyzing the data—and nothing else.

Principals as Coaches? An A/B Test suggests it can work

By Erik Andersen, Simon Graffy, Jason T. Kerwin, and Monica Lambon-Quayefio

This post was originally posted on the IPA blog, and is re-posted here with permission: https://poverty-action.org/principals-coaches-ab-test-suggests-it-can-work

IPA’s Partnership for Tech in Education (P4T-Ed) initiative supports the use of data and evidence to drive learning and improvement in the edtech sector. As part of the P4T-Ed initiative, IPA is supporting three randomized controlled trials (RCTs) to help generate rigorous evidence on edtech interventions’ potential impact on learning outcomes.

This is the first blog post in a series highlighting key findings, insights, and lessons learned from the RCTs. The series showcases how evidence is helping bridge the gap between innovation, implementation, and impact in edtech.

In this post, the research team working with Inspiring Teachers shares early results from work enabling school leaders to serve as instructional coaches.

This blog summarizes the results of an A/B test evaluating the effectiveness of using school leaders as coaches within a structured pedagogy program in Ghana. Our results show that training school leaders and equipping them with app-based coaching tools can improve the quality of teaching in a structured pedagogy program.

Structured Pedagogy with Coaching

In early grade classrooms across Africa, children’s learning outcomes are falling short of global benchmarks for quality education. There is a growing consensus that structured pedagogy programs, which give teachers step-by-step lesson guides and aligned student materials, offer a scalable solution. In 2025, the Global Education Evidence Advisory Panel (GEEAP) identified structured pedagogy as a Best Buy for governments seeking to improve foundational learning.

Coaching is widely regarded as an essential component of structured pedagogy. The advice is that coaches should visit classrooms, observe lessons, provide feedback, and model good practices. Often termed “supportive supervision,” the logic is that classroom visits offer a vehicle for guidance and soft accountability that leaves teachers feeling equipped and expected to deliver their daily lessons with fidelity. The problem is that sustaining an army of roaming coaches is costly. IPA’s 2023 Best Bets report notes this: it highlights teacher coaching as a promising intervention for which more research is needed to determine how to achieve impact sustainably and at scale.

Many of the last generation of large-scale foundational literacy programs in Africa, such as those funded by USAID under the All Children Reading initiative, relied on external coaches. In the best of these, district staff were trained, equipped with tablets preloaded with coaching apps, and assigned to visit schools. The approach worked, but questions remained as to whether high-quality coaching could be scaled well within governments.

Implementation research ensued. One RCT showed that tablet-based observation tools could be used to assure coaching quality at scale. Programs with coaching were effective, but a cost analysis of USAID reading programs revealed that coaching was often the largest or second-largest ongoing cost of program delivery.

Optimists started exploring tech-facilitated remote coaching. In South Africa, early work suggested that calling teachers could be a viable alternative, but later data showed that in-person coaching was more effective. Elsewhere, in Senegal, “tele-coaching” was found to be cost-effective, but was ultimately not taken forward by the government, which chose to maintain the status quo approach by continuing to use a roaming staff model that delivered only infrequent school visits. This reminds us that political economy, institutional incentives, and even the way evidence is weighed all shape which reforms move forward.

Today, with foreign aid in retreat, the search for cost-effective approaches to supporting teachers in structured pedagogy programs has come to the fore. The question for anyone developing a program today is: What version of this program could government systems actually deliver with quality?

Inspiring Teachers, a nonprofit working in Ghana, Uganda, and Malawi, is using iterative design and testbed programs to answer this question in their Foundational Learning Improvement (TFLI) program tools. Our research group, with backing from IPA under the P4T-Ed Initiative and the Abdul Latif Jameel Poverty Action Lab (J-PAL) via the Learning for All Initiative, is supporting them in this journey. We are doing this through a series of A/B tests within a randomized controlled trial framework.

One of the innovations for the TFLI program is that its approach to structured pedagogy includes a digital layer: Classroom teachers, school leaders, and field staff are equipped with an app called SmartCoach, which is an offline-first mobile app that helps them run and track child learning assessments, school visits and teacher coaching, as well as providing videos of key pedagogical practices. Meanwhile, Inspiring Teachers program managers and government staff get a dashboard. Having extensive, real-time data on program implementation opens up new possibilities for research. Our current study has leveraged SmartCoach data to run our first A/B test comparing two different delivery models.

Can school leaders take on the role of coach?

One approach to make coaching more cost-effective would be to get school leaders to do it. School leaders (a.k.a. principals in the US or head teachers in much of the anglophone world) are already charged with managing teachers and giving them feedback on their performance. Getting them to support and supervise their teachers in implementing structured pedagogy would, therefore, be a natural extension of their responsibilities. However, school leaders have many other responsibilities—so they may not currently be providing teachers with instructional feedback, and even if they are, getting them to do it for a new program may be difficult.

To study the potential of this approach, we implemented an A/B test across our 40 treatment schools within an RCT testing the Inspiring Teachers TFLI program in Cape Coast, Ghana. Results from the first year of the RCT are very promising, and will be available in a forthcoming working paper!

The idea behind A/B tests is to rapidly try out variations in a program to optimize its performance. They are used widely in the tech sector to provide rapid insights rather than rigorous academic results. We thus adopted a lower initial statistical significance threshold for this test (a 70% rather than 95% confidence level, i.e. p < 0.30 rather than p < 0.05), with a “significant” result guiding decision-making on which options to proceed with. We pre-specified the outcomes we would look at (test scores and teaching quality) internally, rather than posting an analysis plan to a public registry as we did for the main RCT. We analyzed the data in the same way as we will for our main study, to avoid fishing for significant findings.

In our A/B test, we randomly allocated 20 schools to each of two groups:

In Group “A” schools, only classroom teachers were given training on the literacy component (Inspiring Reading) of the TFLI program.
In Group “B,” everything was the same as Group “A”, with one addition: two leaders from each school were invited to a two-day training workshop where they were given an introduction to the science of reading, a walkthrough of the program, and training on using the SmartCoach app to observe and coach teachers.
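The design and decision rule described above can be sketched in a few lines of Python. This is only an illustration, not the study's actual code: the function names, the seed, and the normal-approximation test on school-level averages are all assumptions made for the example, and the real analysis follows the main RCT's specification.

```python
import random
from statistics import NormalDist, mean, variance

def allocate(school_ids, seed):
    """Randomly split schools into two equal-sized groups, A and B."""
    ids = list(school_ids)
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]

def two_sided_p(a_scores, b_scores):
    """Normal-approximation p-value for a difference in school-level means.

    Assumption for illustration: one score per school and no covariates.
    """
    se = (variance(a_scores) / len(a_scores)
          + variance(b_scores) / len(b_scores)) ** 0.5
    z = (mean(b_scores) - mean(a_scores)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def clears_hurdle(p_value, confidence=0.70):
    """A 70% threshold corresponds to requiring p < 0.30."""
    return p_value < 1 - confidence

# 40 hypothetical school IDs, 20 allocated to each arm.
group_a, group_b = allocate(range(40), seed=2025)
```

With a 70% threshold, a result with p = 0.25 would count as promising enough to act on, while the same result would fall well short of the conventional 95% bar.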

We ran the A/B test for about two months, from the beginning of May 2025 through the end of the school year in June, at which time we gathered data on program fidelity, teaching quality, and children’s reading levels across both groups. Our data for this A/B test came from the same endline data collection we used in the main RCT: a set of reading assessments and classroom observations we conducted in June 2025. The classroom observations had enumerators record whether teachers engaged in various teaching activities and behaviors when teaching reading.

Our Results

The school leader training increased teaching quality by 10.6 points on our 0-100 teaching quality scale, as measured via observations by enumerators during the larger RCT at the endline (see Figure 1). Despite only running for two months, this effect clears our internal hurdle (70% significance) for an initially promising program change. What does this effect mean? The control group scored 43.5 points on our teaching quality scale and the treatment increased that number by 24 percent. We see that the quality gains were driven by behavior management and class discussions, with other large but noisily estimated effects from students actively participating in in-class discussions and teachers proactively checking students’ understanding of the material (see Figure 2).

Figure 1: Effect of School Leader Training Overall

Figure 2: Effects on Components of Teaching Quality

Graph showing the effects of components of teaching quality, such as talking to others when directed and being involved and asking questions

An analysis of data from the main RCT suggests that a 0.54-SD improvement in teaching quality could lead to a 0.14-SD improvement in early grade reading assessment (EGRA) test scores, over a year of implementation. That is a meaningful increase—it is as big as the effect of the median education intervention, but comes from a fairly small supplement to an existing program.

This A/B test showed that training school leaders as in-school coaches can improve teaching quality and help teachers deliver structured pedagogy with greater fidelity.

In line with this finding, Inspiring Teachers opted to include school leader training as part of a package of support for 80 government schools in the Cape Coast Metropolitan District in August 2025. This was the first of our A/B tests, and we are now planning our next one, which will focus on parent engagement. Our findings suggest that school leaders have an important role to play in the successful delivery of structured pedagogy programs.

Acknowledgements

This research is part of a study that was supported by the Jacobs Foundation and Innovations for Poverty Action via the Partnership for Tech in Education (P4T-Ed) Initiative and the Abdul Latif Jameel Poverty Action Lab (J-PAL) via the Learning for All Initiative. The Inspiring Teachers TFLI program itself was established with catalytic support by the IDP Foundation and is now being expanded to government schools in partnership with Ghana Education Service, with support from the Global Schools Forum’s Impact at Scale Labs Program and the Prevail Fund.

If you’re going to plot means and CIs, they should be 83% CIs

Lots of papers present their key results as bar charts with 95% confidence intervals shown as whiskers. Here’s an example from one of my own slide decks:

Screenshot 2025-12-09 141238

The problem with doing this is that we can’t actually tell whether the key differences are statistically significant. The reduced-cost group is about a third of a grade level ahead at the end of 2015. Is that statistically significant? It doesn’t look like it, because the CIs overlap.

But just looking at CI overlap is misleading! Here’s an example where the CIs overlap but the p-value for the test of a difference in means is well below 0.05:

Overlapping 95% confidence intervals for two independent sample means showing a statistically significant difference at the 5% level

The key issue is that we don’t really care about where the estimated mean will end up if we re-draw the sample; we care about whether the means are different from one another. One solution, which is a bit involved, is to show the CI for the treatment effect, like we do in this paper:

Screenshot 2025-12-09 141846

But there’s an even easier trick: if the samples are independently drawn and have the same sample size and equal variances, you can just show the 83.4% CI instead.
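The 83.4% figure can be checked with a short calculation. Here is a minimal sketch, assuming two independent means with the same standard error and a normal approximation. The difference is significant at the 5% level when it exceeds 1.96 × √2 standard errors, while two intervals with half-width z standard errors stop overlapping when the difference exceeds 2z standard errors. Setting those equal gives z = 1.96/√2 ≈ 1.386, which is the critical value for a roughly 83.4% interval. The function names and the example numbers below are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def overlap_level(alpha=0.05):
    """CI level at which non-overlap matches a two-sided test of size alpha.

    Assumes two independent means with equal standard errors.
    """
    z_test = STD_NORMAL.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_ci = z_test / sqrt(2)                     # about 1.386
    return 2 * STD_NORMAL.cdf(z_ci) - 1

def interval(mean, se, level):
    """Normal-approximation confidence interval for a single mean."""
    z = STD_NORMAL.inv_cdf(0.5 + level / 2)
    return mean - z * se, mean + z * se

def overlaps(ci_a, ci_b):
    """True if the two intervals share any point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]

# Made-up example: two means, each with a standard error of 1.
mean_a, mean_b, se = 0.0, 3.2, 1.0
level = overlap_level()  # roughly 0.834
p_value = 2 * (1 - STD_NORMAL.cdf(abs(mean_b - mean_a) / (se * sqrt(2))))

overlap_95 = overlaps(interval(mean_a, se, 0.95), interval(mean_b, se, 0.95))
overlap_83 = overlaps(interval(mean_a, se, level), interval(mean_b, se, level))
```

In this example the 95% intervals overlap even though the difference is significant at the 5% level, while the 83.4% intervals do not overlap, which matches the test.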

Looking at whether these overlap genuinely will tell you whether the two means are significantly different. Here’s a ChatGPT 5.1 Thinking derivation of the math in case you’re interested. I also assert without proof that this will probably work pretty well even when the assumptions are slightly violated.

I learned about this clever trick from this blog post by Vanessa Cave.

Why USAID is Great for America and the World

Last Thursday I was invited by UW’s Economics Undergraduate Board to give this quarter’s Paul Heyne Seminar—a public lecture about economics aimed at a general audience. My talk was about why USAID is great for America and also the world, why we should keep it, and how to make it work even better:

Screenshot 2025-06-03 103018

The slides are here; please check them out! But I want to emphasize one point in particular: USAID is a really great brand. Every sack of grain we hand out, every clinic we fund, every school we help build has this logo on it:

Screenshot 2025-06-03 103348

It explicitly tells recipients of foreign aid who donated the money (the American people) and has our country’s name (the USA) built right into the logo. The logos/names of other countries’ aid agencies are pretty terrible:

Screenshot 2025-06-03 103404

China’s is okay; at least it has the name of the country written out in English. The UK’s doesn’t have their country’s name at all! It’s only identifiable if you recognize their seal. Back before they changed DfID to FCDO, the UK briefly rebranded their aid money to “UKAID”, with the subheading “from the British people”, in a direct imitation of our awesome logo.

The US should absolutely continue giving foreign aid to the developing world, and we will certainly do so in some form—everyone agrees that PEPFAR should continue, for example. Since we are going to still be in the business of providing foreign aid, we should not abandon the best brand in the industry.

A fun side note: the seminar is named in honor of the late Paul Theodore Heyne, who was a long-time lecturer at UW. He was one of the undergraduate advisors for Jeff Smith, who was one of my own Ph.D. advisors. Jeff now holds the Paul T. Heyne Distinguished Chair in Economics at the other UW.

Comments and suggestions on my slides are more than welcome!

A LaTeX .sty file for making clean, clear slides

If you’re using Beamer to make slides, your slides have way too much extra crap on them. Built-in navigation buttons at the bottom left, multiple layers of header and footer bars, dots that count the slides within each section, etc.

People can just ignore that extra stuff, right? Wrong. Every change in color and pattern grabs people’s attention. Attention is scarce, and you want to conserve it and focus it on the things that people actually care about.

Your slides probably have a bunch of other problems as well. Maybe you’re using a 4:3 aspect ratio, or the normal dense \itemize environment that crowds text together too much. Maybe you’re stuck with an ugly color scheme or you have a million old commands in your preamble that you’re not actually using.

Indeed, your slides probably look like these:

Just look at all the extra bells and whistles all over this slide

The challenge with fixing this is that you have to make a ton of manual fixes to your Beamer preamble, which is a nightmare. That challenge is now solved, thanks to some elbow grease that I put in and the magic of generative AI (specifically Claude). You can download kerwin.sty here, and all you need to do to use it is upload it to your Overleaf project and then use the following lines:

\documentclass{beamer}

% Load custom style with desired color
% Change Magenta4 to whatever you want here
% Look up color options at https://en.wikibooks.org/wiki/LaTeX/Colors
\usepackage[color=Magenta4]{kerwin}

And here’s a clickable link to the color options for picking your font color: https://en.wikibooks.org/wiki/LaTeX/Colors

Here are the results:

The only manual edit I made to the original .tex file for this slide was to use the \littlegray command that I adopted from Adrienne Lucas (with slight updates).

The .sty file also has Paul Goldsmith-Pinkham’s \wideitemize environment, which you can use instead of \itemize. And it incorporates an ʻokina character so you can properly write Hawaiʻi and other Hawaiian words. It was originally based on beamerthemeFrankfurt.sty by Till Tantau; feel free to edit it as long as you also share your version freely.

Comments or suggestions are also more than welcome!

Don’t pick controls based on stars in your balance table

I just saw someone making this mistake yet again, and realized that this is a bit of applied econometric wisdom that is not widely known. In papers based on RCTs it is standard for Table 1 to be a balance table, showing the means of baseline variables by study arm and testing for the equality of those means. (People often also show joint balance tests across all variables; I have a recent working paper about how to run joint tests of equality correctly.)

A very common and incorrect way that people use their balance table is to pick controls for their regression analysis of the effect of the treatment. Specifically, the process that I often see is to look for variables with t-statistics above 1.96 (or 1.65) and use those as controls. That is, people control for anything with stars in their balance table.

This approach is wrong.

In their classic paper on running and analyzing randomized trials, Bruhn and McKenzie cite earlier work by Permutt showing the problems that this approach can cause. Specifically, the significance level of the test is lower than the nominal level, meaning that your CIs will be too wide and you will under-reject the null:

In addition to giving too-wide CIs, this approach can also lead to incorrect point estimates. Appendix E of my job-market paper (now forthcoming at the Economic Journal) shows that failing to control for baseline values of the outcome variable induces finite-sample bias in the estimates, even if the baseline test for the equality of means is insignificant:

Screenshot 2025-02-21 121329

(The “optimal” estimator here is just an ANCOVA specification where I control for Y measured at baseline).

What should you do instead? The best practice is to control for:

1) stratification cell indicators

2) anything else that was used in the randomization procedure

3) baseline values of the outcome variable

4) additional variables that are selected via the double lasso (-pdslasso- in Stata), although typically this procedure will not select very many variables
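As a rough illustration of items 1 and 3 in the list above, here is a minimal Python sketch of an ANCOVA-style estimate that controls for stratification cells and the baseline outcome. It is not the code from any of the papers mentioned: the function names and the simulated data are made up, and the strata indicators are handled by demeaning within each stratum, which gives the same treatment coefficient as including the indicators directly. The simulated outcome is deliberately noise-free so the estimate can be checked against the true effect of 2.0.

```python
import random

def demean_within(values, groups):
    """Subtract each group's mean; equivalent to including group indicators."""
    sums, counts = {}, {}
    for v, g in zip(values, groups):
        sums[g] = sums.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]

def ancova_effect(y, treat, y_base, strata):
    """OLS coefficient on treat, controlling for y_base and strata indicators."""
    yd = demean_within(y, strata)
    td = demean_within(treat, strata)
    bd = demean_within(y_base, strata)
    # Normal equations for the two remaining regressors (treatment, baseline).
    s_tt = sum(t * t for t in td)
    s_bb = sum(b * b for b in bd)
    s_tb = sum(t * b for t, b in zip(td, bd))
    s_ty = sum(t * v for t, v in zip(td, yd))
    s_by = sum(b * v for b, v in zip(bd, yd))
    det = s_tt * s_bb - s_tb ** 2
    return (s_ty * s_bb - s_by * s_tb) / det

# Hypothetical data: 4 strata of 50 units, half treated within each stratum.
rng = random.Random(0)
strata, treat, y_base, y = [], [], [], []
for s in range(4):
    assign = [1.0] * 25 + [0.0] * 25
    rng.shuffle(assign)
    for t in assign:
        b = rng.gauss(0, 1)
        strata.append(s)
        treat.append(t)
        y_base.append(b)
        y.append(2.0 * t + 0.5 * b + 3.0 * s)  # no noise, true effect is 2.0

estimate = ancova_effect(y, treat, y_base, strata)
```

The point of the sketch is that the set of controls is fixed by the design (strata and baseline outcome) rather than chosen after looking at which baseline differences happen to have stars.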

How to get your paper done

There is a lot of writing advice out there and most of it is bad. Even worse, much writing advice is totally inapplicable to empirical science writing. Considering that, it’s entirely likely that this advice will be bad as well. Thus—as with all advice—you should feel free to exercise free disposal on what follows. But it’s what I teach my students and it works for many of them.

Economics is not a “write a lot of words” discipline. At the margin it’s better to have more papers, but all else equal a shorter paper is better and quality matters much more than quantity.

It’s easy to let yourself get psyched out by having to Write A Paper (or, even worse, having to Write A Dissertation). A common suggestion for overcoming that mental barrier is the “just write a bad draft quickly and then fix it later” strategy: you should just let the words flow out of you, forcing yourself to write lots, and then do a ton of editing after the fact. That might be great in fields where you need to churn out tons of pages, but economics is not like that (and quantitative science is generally not like that either, or at least it shouldn’t be). Another related strategy is to set goals of writing X hundred words per day, or to force yourself to write words for blocks of Y minutes at a time. None of this is conducive to our goal as social scientists, which is to communicate specific things to our audience rather than to flood the zone with tons of content.

Writing an applied microeconomics paper can be broken into the following manageable steps. Other than steps 1 and 3 you can do all the rest in 1 week apiece.

Here are the steps:

1. Get your results figured out. This is easily the most important part. I make publication-quality tables and figures that are easy to read and tell a clear story. You don’t want to waste time writing up results before you know what they are.

– This is the actual research process. Obviously there is a lot that happens before this, but if you are ready to write a paper then you need results to write about.

2. Figure out your story and write it in 100 words maximum (the AER limit), which is 5 sentences. This is your abstract.

– If you can’t tell your story in 100 words then you don’t have a paper yet. You might have several papers, but most commonly you actually have zero.

3. Present your work to try to convince others of your story. This helps you hammer out what the argument is and nail down the exact results. It is iterative with 1 and 2.

4. Write your story in roughly 15 sentences that outline everything you will do in the paper. These will be the topic sentences for your introduction. (Good intros in applied micro have topic sentences for each intro paragraph, that can be read on their own and also say what the paragraph is about. Your paper has to be designed to be skimmed.)

These should discuss the following points in order (many with more than one sentence apiece):

– What is the research question

– What do you do

– What do you find

– Mechanisms for the results, if relevant

– What does this mean?

– How does this contribute to the literature, i.e. how does it build on what we already know?

Note that the first five of these points could also be the five sentences in your abstract. If you have other key things to say then they belong somewhere in the introduction, most likely under “What does this mean?”

5. Fill in the details behind each topic sentence. This is your introduction. I aim for 4 pages but many good recent papers go longer.

– Supreet Kaur’s paper about nominal wage rigidities has a great example of how to write an effective introduction

– More generally, your introduction should emulate the structure of well-published papers in your area—there are plenty of great papers outside the top 5, but top 5 papers are much more likely to have nailed a good introduction

– Reviews of the literature should only go into the contributions paragraphs at the end. Do not start with a lit review; no one cares. Do not write a separate lit review section; no one cares. You should integrate citations into your actual argument or leave them out.

– Get to your results on page 1.

Writing your introduction is the hardest and most important part of writing the paper. Many people will not read anything else, even conditional on opening the paper. Economics papers are all way too long, and part of the reason is that they include many things that would be in the online methods appendix of a paper in the hard sciences. Our introductions are the equivalent of the entire paper in many disciplines.

– Introductions that use topic sentences for each paragraph and communicate the entirety of the paper are the norm in high-quality economics papers. My understanding is that this is how students are trained to write at Harvard and MIT. Once you start noticing this approach you will see it all over the top econ journals.

While this is an implication of the previous points, I want to state explicitly that your results go in the introduction. Do not tell people that your paper will estimate the effect of X on Y; tell them the effect of X on Y. Definitely do not wait until the last paragraph of the introduction to mention your results.

6. Write the data section of the paper

7. Write the methods section

8. Write up the results. This should be a discussion of what the results mean. Do not include a separate “discussion” section since in that case the results section is pointless.

– Robustness checks go in a subsection here

– Limitations should be acknowledged in here, and also in the introduction where relevant. If they’re major (or if a referee/editor demands it) you can make them a separate subsection.

9. If relevant, write the mechanisms section.

10. Write a conclusion section using Marc Bellemare’s conclusion formula (https://marcfbellemare.com/wordpress/12060). I think conclusions are pointless and shouldn’t exist but since you have to have one, Marc’s approach is the constrained optimum.

– My view is that anything that’s truly important in the conclusion should be in the introduction of the paper. Since many people will not read the conclusion, you should state the key parts in your introduction as well.

That’s it: 10 weeks and you have a paper, and you can easily do a bunch of other stuff on the side at the same time. Now you might say “but Jason, I don’t have my results and story figured out” which might be true. In that case step 1 is going to take longer—but your issue is not finishing the paper, but rather doing the research. The good news in that case is that doing research is fun! So at least you can enjoy it.

This post originated as an email that I sent to a student from another institution that I had a meeting with. I thought it might be helpful to other people as well so I am putting it up where more people can find it.

All Trains Lead to Crazy Town: Why I am Not an Effective Altruist (or a Philosopher)

If you are reading this post then you almost certainly have already heard of Effective Altruism, or EA. The EA movement has become increasingly influential over the past decade, and is currently getting a major publicity boost based on Will MacAskill’s new book What We Owe the Future, which among many other things was featured in Time Magazine.

For people who have not heard of EA at all, a brief summary is that it’s a combination of development economics and working to prevent Skynet from The Terminator from taking over the world. There is much more to it than those two things, but the basic idea is to take our moral intuitions and attempt to actually act on them to do the most good in the world. And it happens to be the case that many people’s moral intuitions imply that we should not only try to donate money to highly effective charities in the world’s poorest countries, but also worry about low-probability events that could end the human race. The thrust of “longtermism” is that we should care not only about people who live far away in terms of distance, but also those who live far away in terms of time. There are a lot of potential future humans so even super unlikely disasters that could ruin their lives or prevent them from being born are a big problem.

The fact that these conclusions probably strike you as a little crazy is not a coincidence. EAs are constantly pushing people to do things that seem crazy but are in fact consequences of moral principles that they agree to. For example, a number of EAs have literally donated their own kidneys to strangers in order to set up donation chains that lengthen or save many lives. That’s good! I haven’t donated a kidney myself, but my mother donated hers—to a friend, not a stranger. She’s not an EA and probably hasn’t ever heard of them; she’s just a very good person. But I give some credit to the EA movement for helping normalize kidney donation, which appears to be getting more common. Similarly, EAs have done a ton to push more money toward developing-country charities that have a huge impact on people’s lives, relative to stuff that doesn’t work or (more radically) charities that target people in richer places. When I argue that Americans should care as much about a stranger in Ghana as they do about a stranger in Kansas, they think that sounds kind of crazy. But a) it’s not and b) people find that notion less crazy than they used to. We are winning this argument, and the EAs deserve a lot of credit here.

My issue with EA is that its craziest implications are simply too crazy. One running theme in MacAskill’s PR tour for his new book is the idea of the train to crazy town. You agree to some moral principles and you start exploring the implications, and then the next thing you know you’re agreeing to something absurd, like the repugnant conclusion that a world with 10^1000 totally miserable humans would be preferable to our current world.

The specific longtermist conclusion that seems crazy is that there’s a moral imperative to care almost solely about hypothetical future humans, because there are far more of them than current humans. By extension, we should put a lot of effort into preventing tiny risks of human extinction far in the future. One response here is that we should be discounting these future events, and I agree with that. But it’s hard to come up with time-consistent discount rates that make moral sense and put any value on current humans. Scott Alexander thinks that the train to crazy town is a problem with EA or with moral philosophy.*

I think that’s wrong: the problem isn’t with moral philosophy, it’s that all trains lead to crazy town. I have every impression that this is how philosophy works in general: you start from some premises that seem sensible and then you dig into them until either everything falls apart or your reasoning leads to things that seem nuts. My take on this is limited due to a kind of self-fulfilling prophecy; I didn’t study philosophy in college, but that’s because the basic exposure I got as a college freshman made me think that everything just spirals into nonsense. There are many examples of this. The “Gettier problem” attacks the very definition of knowledge as a justified true belief:

This is what philosophers actually believe

Another example comes from a conversation I had in graduate school, with a burned-out ninth-year philosophy PhD student who studied the reasons people do things. He summarized the debates in his field as “reasons are important because they’re reasons—get it?” He planned to drop out. It’s worth noting here that Michigan’s Philosophy department is among the very best in the world; it’s ranked #6, above Harvard and Stanford. Reasons Guy was at the center of the field, and felt like it was a ridiculous waste of time.

This problem recurs in topic after topic. It feels related to things that we know about the fundamental limitations of formal logic, starting with Gödel’s proof that any sufficiently powerful formal system is either inconsistent or incomplete. The Incompleteness Theorems were pretty cool to learn about but they didn’t exactly motivate me to want to study philosophy.

This clearly isn’t a novel idea—Itai Sher recently tweeted something that’s quite similar. But it’s pretty different from the notion that philosophers waste their time overthinking things that don’t matter. Instead, what’s going on is that if you drill down into any way of thinking about any important problem, you eventually reach a solid bedrock of nonsense.

Why does it matter that philosophy leads to these crazy conclusions? I think they matter for the EA movement for two reasons. First, well, they’re nuts. I think the fact that this is clearly true—everyone involved seems to agree on this—tells us that we should be skeptical of them. We don’t really have all these implications worked out fully. We could be totally wrong about them. We should remain open to the possibility that we are running into the limits of the logical systems we are trying to apply here, and cautious about promoting conclusions that don’t pass the smell test.

Second, they might undermine the real, huge successes of the EA movement. Practically speaking the main effect of EA has been to get a lot more money flowing toward charities like the Against Malaria Foundation that save children’s lives. It seems clearly correct that we should keep that going. The arguments that yield this conclusion might also lead to crazy town, but they aren’t there yet.

It seems as though MacAskill agrees with me on the practical upshot of this, which is to not actually be an effective altruist:

What should we do instead? I think MacAskill is exactly right, and that his suggestion amounts to basically saying we should all act like applied economists. Think at the margin, and figure out which changes could improve things. Do a little better, and don’t feel the need to reason all the way to crazy town.

Full disclosure: I plan to submit this post to this contest for essays criticizing EA, which was part of what originally motivated me to think about why I disagree with the EA movement.

* You might assume that the repugnant conclusion is a specific failing of utilitarianism, but MacAskill claims it’s not and I trust that he’s done his homework here.

The moral imperative for honesty in development economics

There is a lot of bad research out there. Huge fractions of the published research literature do not replicate, and many studies aren’t even worth trying to replicate because they document uninteresting correlations that are not causal. This replication crisis is compounded by a “scaleup crisis”: even when results do replicate, they often do not hold at any appreciable scale. These problems are particularly bad in social science.

What can we do about the poor quality of social science research? There are a lot of top-down proposals. We should have analysis plans, and trial registries. We should subject our inferences to multiple testing adjustments. It is very hard to come up with general rules that will fix these problems, however. Even in a world where every analysis is pre-specified and all hypotheses are adjusted for multiple testing, and where every trial is registered and the results reported, people’s attention and time are finite. The exciting result is always going to garner more attention, more citations, and more likes and retweets. This “attention bias” problem is very difficult to fix.

When you are doing randomized program evaluations in developing countries, however, there is a bottom-up solution to this problem: getting the right answer really matters. Suppose you run an RCT that yields a sexy but incorrect result, be it due to deliberate fraud, a coding error, an accident of sampling error, a pilot that won’t scale, or a finding that holds just in one specific context. Someone is very likely to take your false finding and actually try and do it. Actual, scarce development resources will be thrown at your “solution”. Funding will go toward the wrong answer instead of the right ones. Finite inputs like labor and energy will be expended on the wrong thing.

And more than in any other domain of social science, doing the wrong thing will make a huge difference. The world’s poorest people live on incomes that are less than 1% of what we enjoy here in America. We could take this same budget and just give it to them in cash, which would at a minimum reduce poverty temporarily. The benefits of helping the global poor, in terms of their actual well-being, are drastically higher than those of helping any group in a rich country. $1000 is a decent chunk of change in America, but it could mean the difference between life and death for a subsistence farmer in sub-Saharan Africa. Thus, when you get an exciting result, you have an obligation to look at your tables and go “really?”

This does not mean that no development economics research is ever wrong, or that nobody working in the field ever skews their results for career reasons. Career incentives can be powerful, even in fields with similar imperatives for honesty: witness the recent exposure of fraudulent Alzheimer’s research, which may have derailed drug development and harmed millions of people. What it means is that those career incentives are counterbalanced by a powerful moral imperative to tell the truth.

Truth-telling is important not just in our own work, but (maybe more so) when we are called upon to summarize knowledge more broadly. Literature reviews in development economics aren’t just academically interesting; they have the potential to reshape where money gets spent and which programs get implemented. What I mean by honesty here is that when we talk to policymakers or journalists or lay people about which development programs work, we shouldn’t let our views be skewed by our own research agendas or trends in the field. For example, I have written several papers about a mother-tongue-first literacy program in Uganda, the NULP. The program works exceedingly well on average, although it is not a panacea for the learning crisis. People often ask me whether mother-tongue instruction is the best use of education funds, and I tell them no: I do not think it was the core driver of the NULP’s success, and studies that isolate changes in the language of instruction support that view. Note the countervailing incentives I face here: more spending on mother-tongue instruction might yield more citations for my work, and the approach is very popular so I am often telling people what they don’t want to hear. But far outweighing those is the fact that what I say might really matter, and getting it wrong means that kids won’t learn to read. This is a powerful motive to do my best to get the right answer.

Honesty in assessing the overall evidence also mitigates the “attention bias” problem. Exciting results will still get bursts of attention, but when we are called upon to give our view of which programs work best, we can and should focus on the broader picture painted by the evidence. This is especially critical in development economics, where we aren’t just seeking scientific truths but trying to solve some of the world’s most pressing problems.

Nothing Scales

I recently posted a working paper where we argue that appointments can substitute for financial commitment devices. I’m pretty proud of this paper: it uses a meticulously-designed experiment to show the key result, and the empirical work is very careful and was all pre-specified. We apply the latest and best practices in selecting controls and adjusting for multiple hypothesis testing. Our results are very clear, and we tell a clear story that teaches us something very important about self-control problems in healthcare. Appointments help in part because they are social commitment devices, and, because there are no financial stakes, they don’t have the problem of people losing money when they don’t follow through. The paper also strongly suggests that appointments are a useful tool for encouraging people to utilize preventive healthcare: they increase the HIV testing rate by over 100%.

That’s pretty promising! Maybe we should try appointments as a way to encourage people to get vaccinated for covid, too? Well, maybe not. A new NBER working paper tries something similar for covid vaccinations in the US. Not only does texting people a link to an easy-to-use appointment website not work, neither does anything else that they try, including just paying people $50 to get vaccinated.

Different people, different treatment effects

Why don’t appointments increase covid vaccinations when they worked for HIV testing? The most likely story is that this is a different group of people and their treatment effects are different. I don’t just mean that one set is in Contra Costa County and the other one is in the city of Zomba, although that probably matters. I mean that the Chang et al. study specifically targets the vaccine hesitant, whereas men in our study mostly wanted to get tested for HIV: 92 percent of our sample had previously been tested for HIV at least once. In other words, if you found testing-hesitant men in urban southern Malawi, these behavioral nudges probably wouldn’t help encourage them to get an HIV test either. That makes sense if you think about it: we show that our intervention helps people overcome procrastination and other self-control problems. These are fundamentally problems of people wanting to get tested but not managing to get around to it. The vaccine-hesitant aren’t procrastinating; by and large they just don’t want to get a shot. Indeed, other research confirms that appointments do increase HIV testing rates—just as this explanation would predict.

This is all to say that the treatment effects are heterogeneous: the treatment affects each person—or each observation in your dataset—differently. This is an issue that we can deal with. Our appointments study documents exactly the kind of heterogeneity that the theory above would predict. The treatment effects for appointments are concentrated overwhelmingly among people who want to enroll in a financial commitment device to help ensure they go in for an HIV test. Thus we could forecast that people who don’t want a covid shot at all definitely won’t have their behavior changed much by an appointment.

But trying to analyze this is very rare, which is a disaster for social science research. Good empirical social science almost always focuses on estimating a causal relationship: what is β in Y = α + βX + ϵ? But these relationships are all over the place: there is no underlying β to be estimated! Let’s ignore nonlinearity for a second, and say we are happy with the best linear approximation to the underlying function. The right answer here still potentially differs for every person, and at every point in time.* Your estimate is just some weighted average of a bunch of unit-specific βs, even if you avoid randomized experiments and run some other causal inference approach on the entire population.
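To make the weighted-average point concrete, here is a minimal simulation sketch. Everything in it is hypothetical: the 80/20 split between people who want to act and people who don’t, the unit-specific effects of 2 and 0, and the variable names are made up for illustration and are not taken from our study. With a randomized binary X, the OLS slope equals the difference in means, and it lands on the average of the unit-specific βs in whatever sample you happen to run it on.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical population: 80% would like to act but procrastinate,
# 20% do not want to act at all.
procrastinator = rng.random(n) < 0.8
beta_i = np.where(procrastinator, 2.0, 0.0)  # unit-specific effects (illustrative values)

x = rng.integers(0, 2, size=n)  # randomized binary treatment
y = 1.0 + beta_i * x + rng.normal(0.0, 1.0, size=n)


def diff_in_means(y, x):
    """OLS slope of y on a binary x, i.e. treated mean minus control mean."""
    return y[x == 1].mean() - y[x == 0].mean()


beta_hat_all = diff_in_means(y, x)
beta_hat_hesitant = diff_in_means(y[~procrastinator], x[~procrastinator])

print(f"average of unit-specific betas: {beta_i.mean():.2f}")
print(f"estimate, full sample:          {beta_hat_all:.2f}")
print(f"estimate, hesitant-only sample: {beta_hat_hesitant:.2f}")
```

Neither number is “the” β. The full-sample estimate is close to the average effect for this particular mix of people, and the same treatment run on a sample made up only of the second type comes out near zero.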

This isn’t a new insight: the Nobel prize was just given out in part for showing that an IV identifies a local average treatment effect for some slice of the population. Other non-experimental methods won’t rescue us either: identification is always coming from some small subset of the data. The Great Difference-in-Differences Reckoning is driven, at its core, by the realization that DiDs are identified off of specific comparisons between units, and each unit’s treatment effect can be different. Matching estimators usually don’t yield consistent estimates of causal effects, but when they do it’s because we are exploiting idiosyncrasies in treatment assignment for a small number of people. Non-quantitative methods are in an even worse spot. I am a fan of the idea that qualitative data can be used to understand the mechanisms behind treatment effects—but along with person-specific treatment effects, we need to try to capture person-specific mechanisms that might change over time.

Nothing scales

Treatment effect heterogeneity also helps explain why the development literature is littered with failed attempts to scale interventions up or run them in different contexts. Growth mindset did nothing when scaled up in Argentina. Running the “Jamaican Model” of home visits to promote child development at large scale yields far smaller effects than the original study. The list goes on and on; to a first approximation, nothing we try in development scales.

Estimated effect sizes for the Jamaican Model at different scales

Why not? Scaling up a program requires running it on new people who may have different treatment effects. And the finding, again and again, is that this is really hard to do well. Take the “Sugar Daddies” HIV-prevention intervention, which worked in Kenya, for example. It was much less effective in Botswana, a context where HIV treatment is more accessible and sugar daddies come from different age ranges.** Treatment effects may also vary within person over time: scaling up the “No Lean Season” intervention involved doing it again later on, and one theory for why it didn’t work is that the year they tried it again was marked by extreme floods. Note that this is a very different challenge from the “replication crisis” that has most famously plagued social psychology. The average treatment effect of appointments in our study matches the one in the other study I mentioned above, and the original study that motivated No Lean Season literally contains a second RCT that, in part, replicates the main result.

I also doubt that this is about some intrinsic problem with scaling things up. The motivation for our appointments intervention was that, anecdotally, appointments work at huge scale in the developed world to do things like get people to go to the dentist. I’m confident that if we just ran the same intervention on more people who were procrastinating about getting HIV tests, we could achieve similar results. However, we rarely actually run the original intervention at larger scale. Instead, the tendency is to water it down, which can make things significantly less effective. Case in point: replicating an effective education intervention in Uganda in more schools yielded virtually-identical results, whereas a modified program that tried to simulate how policymakers would reduce costs was substantially worse. That’s the theory that Evidence Action favors for why No Lean Season didn’t work at scale—they think the implementation changed in important ways.

What do we do about this?

I see two ways forward. First, we need a better understanding of how to get policymakers to actually implement interventions that work. There is some exciting new work on this front in a recent issue of the AER, but this seems like very low-hanging fruit to me. Time and again, we have real trouble just replicating actual treatments that work—instead, the scaled-up version almost always is watered down.

Second, every study should report estimates of how much treatment effects vary, and try to link that variation to a model of human behavior. There is a robust econometric literature on treatment effect heterogeneity, but actually looking at this in applied work is very rare. Let’s take education as an example. I just put out another new working paper with a different set of coauthors called “Some Children Left Behind”. We look at how much the effects of an education program vary across kids. The nonparametric Fréchet-Hoeffding lower bounds on treatment effect variation are massive; treatment effects vary from no gain at all to a 3-SD increase in test scores. But as far as I know nobody’s even looked at that for other education programs. Across eight systematic reviews of developing-country education RCTs (covering hundreds of studies), we found just four mentions of variation in treatment effects, and all of them used the “interact treatment with X” approach. That’s unlikely to pick up much: we find that cutting-edge ML techniques can explain less than 10 percent of the treatment effect heterogeneity in our data using our available Xs. The real challenge here is to link the variation in treatment effects to our models of the world, which means we are going to need to collect far better Xs.
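For readers who have not seen bounds like these, here is a rough sketch of the idea; it is not the code from our paper. An experiment identifies the distribution of outcomes in each arm but not how they pair up person by person. The standard deviation of treatment effects is smallest if everyone keeps their rank across arms and largest if ranks are perfectly reversed, so comparing quantiles of the two arms brackets it. The function name, the quantile grid, and the simulated data are all my assumptions for illustration.

```python
import numpy as np


def te_sd_bounds(y_treat, y_control, n_grid=999):
    """Bounds on the SD of individual treatment effects from the two marginals.

    Lower bound: rank-preserving pairing of treated and control quantiles.
    Upper bound: rank-reversing pairing.
    """
    q = np.arange(1, n_grid + 1) / (n_grid + 1)  # interior quantile grid
    qt = np.quantile(y_treat, q)
    qc = np.quantile(y_control, q)
    lower = np.std(qt - qc)        # same rank in both arms
    upper = np.std(qt - qc[::-1])  # ranks perfectly reversed
    return lower, upper


# Simulated arms (illustrative only): treatment shifts the mean and widens the spread.
rng = np.random.default_rng(1)
control = rng.normal(0.0, 1.0, size=5_000)
treated = rng.normal(0.5, 2.0, size=5_000)

lo, hi = te_sd_bounds(treated, control)
print(f"SD of treatment effects is between about {lo:.2f} and {hi:.2f}")
```

In this simulated example the arms have standard deviations of 1 and 2, so the bounds should land near 1 and 3. A lower bound well above zero says that effects must differ across people even under the most favorable pairing of outcomes.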

This latter point means social scientists have a lot of work ahead of us. None of the techniques we use to look at treatment effect variation currently work for non-experimental causal inference techniques. Given how crucial variation in treatment effects is, this seems like fertile ground for applied econometricians. Moreover, almost all of our studies are underpowered for understanding heterogeneous treatment effects, and in many cases we aren’t currently collecting the kinds of baseline data we would need to really understand the heterogeneity—remember, ML didn’t find much in our education paper. That means that the real goal here is quite elusive: how do we predict which things will replicate out-of-sample and which won’t? To get this right we need new methods, more and better data, and a renewed focus on how the world really works.
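One back-of-the-envelope way to see the power problem, under assumptions that are mine rather than anything from our papers: a two-arm trial with equal allocation, a binary covariate that splits the sample in half, a common outcome SD, and normal approximations. The gap between two subgroup effects then has twice the standard error of the overall effect, so detecting a gap of a given size takes roughly four times the sample needed to detect an average effect of that size.

```python
from statistics import NormalDist


def required_n(effect, sd=1.0, alpha=0.05, power=0.8, se_factor=2.0):
    """Total sample size N at which a two-sided z-test detects `effect`.

    Assumes the estimator's standard error is se_factor * sd / sqrt(N).
    """
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return (z * se_factor * sd / effect) ** 2


# Average effect, 50/50 allocation:
# Var = sd^2/(N/2) + sd^2/(N/2), so SE = 2 * sd / sqrt(N).
n_average = required_n(effect=0.2, se_factor=2.0)

# Gap between two equal-sized subgroups: each subgroup effect has
# SE = 2 * sd / sqrt(N/2), so the gap has SE = 4 * sd / sqrt(N).
n_gap = required_n(effect=0.2, se_factor=4.0)

print(round(n_average), round(n_gap), n_gap / n_average)
```

The 0.2 SD effect is an arbitrary illustration; the ratio of four does not depend on it. Real heterogeneity is usually smaller than the average effect and spread across many candidate Xs, which makes the problem worse.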