How to get your paper done

There is a lot of writing advice out there and most of it is bad. Even worse, much writing advice is totally inapplicable to empirical science writing. Considering that, it’s entirely likely that this advice will be bad as well. Thus—as with all advice—you should feel free to exercise free disposal on what follows. But it’s what I teach my students and it works for many of them.

Economics is not a “write a lot of words” discipline. At the margin it’s better to have more papers, but all else equal a shorter paper is better and quality matters much more than quantity.

It’s easy to let yourself get psyched out by having to Write A Paper (or, even worse, having to Write A Dissertation). A common suggestion for overcoming that mental barrier is the “just write a bad draft quickly and then fix it later” strategy: let the words flow out of you, force yourself to write a lot, and then do a ton of editing after the fact. That might be great in fields where you need to churn out tons of pages, but economics is not like that (and quantitative science is generally not like that either, or at least it shouldn’t be). A related strategy is to set goals of writing X hundred words per day, or to force yourself to write for blocks of Y minutes at a time. None of this is conducive to our goal as social scientists, which is to communicate specific things to our audience rather than to flood the zone with tons of content.

Writing an applied microeconomics paper can be broken into the following manageable steps. Other than steps 1 and 3, you can do all the rest in one week apiece.

Here are the steps:

1. Get your results figured out. This is easily the most important part. I make publication-quality tables and figures that are easy to read and tell a clear story. You don’t want to waste time writing up results before you know what they are.

– This is the actual research process. Obviously there is a lot that happens before this, but if you are ready to write a paper then you need results to write about.

2. Figure out your story and write it in 100 words maximum (the AER limit), which is 5 sentences. This is your abstract.

– If you can’t tell your story in 100 words then you don’t have a paper yet. You might have several papers, but most commonly you actually have zero.

3. Present your work to try to convince others of your story. This helps you hammer out what the argument is and nail down the exact results. It is iterative with 1 and 2.

4. Write your story in roughly 15 sentences that outline everything you will do in the paper. These will be the topic sentences for your introduction. (Good intros in applied micro have topic sentences for each intro paragraph that can be read on their own and also say what the paragraph is about. Your paper has to be designed to be skimmed.)

These should discuss the following points in order (many with more than one sentence apiece):

– What is the research question

– What do you do

– What do you find

– Mechanisms for the results, if relevant

– What does this mean?

– How does this contribute to the literature, i.e. how does it build on what we already know?

Note that the first five of these points could also be the five sentences in your abstract. If you have other key things to say then they belong somewhere in the introduction—most likely under “What does this mean?”

5. Fill in the details behind each topic sentence. This is your introduction. I aim for 4 pages but many good recent papers go longer.

– Supreet Kaur’s paper about nominal wage rigidities is a great example of how to write an effective introduction.

– More generally, your introduction should emulate the structure of well-published papers in your area—there are plenty of great papers outside the top 5, but top 5 papers are much more likely to have nailed a good introduction.

– Reviews of the literature should only go into the contributions paragraphs at the end. Do not start with a lit review; no one cares. Do not write a separate lit review section; no one cares. You should integrate citations into your actual argument or leave them out.

– Get to your results on page 1.

Writing your introduction is the hardest and most important part of writing the paper. Many people will not read anything else, even conditional on opening the paper. Economics papers are all way too long, and part of the reason is that they include many things that would be in the online methods appendix of a paper in the hard sciences. Our introductions are the equivalent of the entire paper in many disciplines.

– Introductions that use topic sentences for each paragraph and communicate the entirety of the paper are the norm in high-quality economics papers. My understanding is that this is how students are trained to write at Harvard and MIT. Once you start noticing this approach you will see it all over the top econ journals.

While this is an implication of the previous points, I want to state explicitly that your results go in the introduction. Do not tell people that your paper will estimate the effect of X on Y; tell them the effect of X on Y. Definitely do not wait until the last paragraph of the introduction to mention your results.

6. Write the data section of the paper

7. Write the methods section

8. Write up the results. This should be a discussion of what the results mean. Do not include a separate “discussion” section since in that case the results section is pointless.

– Robustness checks go in a subsection here

– Limitations should be acknowledged in here, and also in the introduction where relevant. If they’re major (or if a referee/editor demands it) you can make them a separate subsection.

9. If relevant, write the mechanisms section.

10. Write a conclusion section using Marc Bellemare’s conclusion formula (https://marcfbellemare.com/wordpress/12060). I think conclusions are pointless and shouldn’t exist but since you have to have one, Marc’s approach is the constrained optimum.

– My view is that anything that’s truly important in the conclusion should be in the introduction of the paper. Since many people will not read the conclusion, you should state the key parts in your introduction as well.

That’s it: 10 weeks and you have a paper, and you can easily do a bunch of other stuff on the side at the same time. Now you might say “but Jason, I don’t have my results and story figured out” which might be true. In that case step 1 is going to take longer—but your issue is not finishing the paper, but rather doing the research. The good news in that case is that doing research is fun! So at least you can enjoy it.

This post originated as an email that I sent to a student from another institution whom I had a meeting with. I thought it might be helpful to other people as well, so I am putting it up where more people can find it.

All Trains Lead to Crazy Town: Why I am Not an Effective Altruist (or a Philosopher)

If you are reading this post then you almost certainly have already heard of Effective Altruism, or EA. The EA movement has become increasingly influential over the past decade, and is currently getting a major publicity boost from Will MacAskill’s new book What We Owe the Future, which among many other things was featured in Time Magazine.

For people who have not heard of EA at all, a brief summary is that it’s a combination of development economics and working to prevent Skynet from The Terminator from taking over the world. There is much more to it than those two things, but the basic idea is to take our moral intuitions and attempt to actually act on them to do the most good in the world. And it happens to be the case that many people’s moral intuitions imply that we should not only try to donate money to highly effective charities in the world’s poorest countries, but also worry about low-probability events that could end the human race. The thrust of “longtermism” is that we should care not only about people who live far away in terms of distance, but also those who live far away in terms of time. There are a lot of potential future humans so even super unlikely disasters that could ruin their lives or prevent them from being born are a big problem.

The fact that these conclusions probably strike you as a little crazy is not a coincidence. EAs are constantly pushing people to do things that seem crazy but are in fact consequences of moral principles that they agree to. For example, a number of EAs have literally donated their own kidneys to strangers in order to set up donation chains that lengthen or save many lives. That’s good! I haven’t donated a kidney myself, but my mother donated hers—to a friend, not a stranger. She’s not an EA and probably hasn’t ever heard of them; she’s just a very good person. But I give some credit to the EA movement for helping normalize kidney donation, which appears to be getting more common. Similarly, EAs have done a ton to push more money toward developing-country charities that have a huge impact on people’s lives, relative to stuff that doesn’t work or (more radically) charities that target people in richer places. When I argue that Americans should care as much about a stranger in Ghana as they do about a stranger in Kansas, they think that sounds kind of crazy. But a) it’s not and b) people find that notion less crazy than they used to. We are winning this argument, and the EAs deserve a lot of credit here.

My issue with EA is that its craziest implications are simply too crazy. One running theme in MacAskill’s PR tour for his new book is the idea of the train to crazy town. You agree to some moral principles and you start exploring the implications, and then the next thing you know you’re agreeing to something absurd, like the repugnant conclusion that a world with 10^1000 humans whose lives are barely worth living would be preferable to our current world.


The specific longtermist conclusion that seems crazy is that there’s a moral imperative to care almost solely about hypothetical future humans, because there are far more of them than current humans. By extension, we should put a lot of effort into preventing tiny risks of human extinction far in the future. One response here is that we should be discounting these future events, and I agree with that. But it’s hard to come up with time-consistent discount rates that make moral sense and put any value on current humans. Scott Alexander thinks that the train to crazy town is a problem with EA or with moral philosophy.*

I think that’s wrong: the problem isn’t with moral philosophy; it’s that all trains lead to crazy town. I have every impression that this is how philosophy works in general: you start from some premises that seem sensible and then you dig into them until either everything falls apart or your reasoning leads to things that seem nuts. My take on this is limited due to a kind of self-fulfilling prophecy; I didn’t study philosophy in college, but that’s because the basic exposure I got as a college freshman made me think that everything just spirals into nonsense. There are many examples of this. The “Gettier problem” attacks the very definition of knowledge as a justified true belief:

[Image: “This is what philosophers actually believe”]

Another example comes from a conversation I had in graduate school, with a burned-out ninth-year philosophy PhD student who studied the reasons people do things. He summarized the debates in his field as “reasons are important because they’re reasons—get it?” He planned to drop out. It’s worth noting here that Michigan’s Philosophy department is among the very best in the world; it’s ranked #6, above Harvard and Stanford. Reasons Guy was at the center of the field, and felt like it was a ridiculous waste of time.

This problem recurs in topic after topic. It feels related to things that we know about the fundamental limitations of formal logic, starting with Gödel’s proof that any sufficiently powerful formal system is either inconsistent or incomplete. The Incompleteness Theorems were pretty cool to learn about but they didn’t exactly motivate me to want to study philosophy.

This clearly isn’t a novel idea—Itai Sher recently tweeted something that’s quite similar. But it’s pretty different from the notion that philosophers waste their time overthinking things that don’t matter. Instead, what’s going on is that if you drill down into any way of thinking about any important problem, you eventually reach a solid bedrock of nonsense.


Why does it matter that philosophy leads to these crazy conclusions? I think it matters for the EA movement for two reasons. First, well, the conclusions are nuts. I think the fact that this is clearly true—everyone involved seems to agree on this—tells us that we should be skeptical of them. We don’t really have all these implications worked out fully. We could be totally wrong about them. We should remain open to the possibility that we are running into the limits of the logical systems we are trying to apply here, and cautious about promoting conclusions that don’t pass the smell test.

Second, they might undermine the real, huge successes of the EA movement. Practically speaking, the main effect of EA has been to get a lot more money flowing toward charities like the Against Malaria Foundation that save children’s lives. It seems clearly correct that we should keep that going. The arguments that yield this conclusion might also lead to crazy town, but they aren’t there yet.

It seems as though MacAskill agrees with me on the practical upshot of this, which is to not actually be an effective altruist.


What should we do instead? I think MacAskill is exactly right, and that his suggestion amounts to basically saying we should all act like applied economists. Think at the margin, and figure out which changes could improve things. Do a little better, and don’t feel the need to reason all the way to crazy town.

Full disclosure: I plan to submit this post to this contest for essays criticizing EA, which was part of what originally motivated me to think about why I disagree with the EA movement.

* You might assume that the repugnant conclusion is a specific failing of utilitarianism, but MacAskill claims it’s not and I trust that he’s done his homework here.

The moral imperative for honesty in development economics

There is a lot of bad research out there. Huge fractions of the published research literature do not replicate, and many studies aren’t even worth trying to replicate because they document uninteresting correlations that are not causal. This replication crisis is compounded by a “scaleup crisis”: even when results do replicate, they often do not hold at any appreciable scale. These problems are particularly bad in social science.

What can we do about the poor quality of social science research? There are a lot of top-down proposals. We should have analysis plans, and trial registries. We should subject our inferences to multiple testing adjustments. It is very hard to come up with general rules that will fix these problems, however. Even in a world where every analysis is pre-specified and all hypotheses are adjusted for multiple testing, and where every trial is registered and the results reported, people’s attention and time are finite. The exciting result is always going to garner more attention, more citations, and more likes and retweets. This “attention bias” problem is very difficult to fix.
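
For readers who haven’t worked with them, here is a minimal sketch of what a multiple-testing adjustment actually does, using the Holm step-down procedure as one standard example; the p-values are made up purely for illustration.

```python
# A minimal illustration of a familywise-error-rate (FWER) adjustment.
# The p-values below are hypothetical, chosen only to show the mechanics.
from statsmodels.stats.multitest import multipletests

raw_pvals = [0.004, 0.020, 0.035, 0.210, 0.650]  # one p-value per outcome tested

# Holm's step-down procedure controls the chance of even one false rejection.
reject, adj_pvals, _, _ = multipletests(raw_pvals, alpha=0.05, method="holm")

for p_raw, p_adj, rej in zip(raw_pvals, adj_pvals, reject):
    print(f"raw p = {p_raw:.3f} -> adjusted p = {p_adj:.3f}, reject at 5%: {rej}")
```

Adjustments like this raise the bar for any individual finding, but they cannot change which findings people pay attention to afterward.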

When you are doing randomized program evaluations in developing countries, however, there is a bottom-up solution to this problem: getting the right answer really matters. Suppose you run an RCT that yields a sexy but incorrect result, be it due to deliberate fraud, a coding error, an accident of sampling error, a pilot that won’t scale, or a finding that holds just in one specific context. Someone is very likely to take your false finding and actually try to implement it. Actual, scarce development resources will be thrown at your “solution”. Funding will go toward the wrong answers instead of the right ones. Finite inputs like labor and energy will be expended on the wrong thing.

And more than in any other domain of social science, doing the wrong thing will make a huge difference. The world’s poorest people live on incomes that are less than 1% of what we enjoy here in America. We could take this same budget and just give it to them in cash, which would at a minimum reduce poverty temporarily. The benefits of helping the global poor, in terms of their actual well-being, are drastically higher than those of helping any group in a rich country. $1000 is a decent chunk of change in America, but it could mean the difference between life and death for a subsistence farmer in sub-Saharan Africa. Thus, when you get an exciting result, you have an obligation to look at your tables and go “really?”

This does not mean that no development economics research is ever wrong, or that nobody working in the field ever skews their results for career reasons. Career incentives can be powerful, even in fields with similar imperatives for honesty: witness the recent exposure of fraudulent Alzheimer’s research, which may have derailed drug development and harmed millions of people. What it means is that those career incentives are counterbalanced by a powerful moral imperative to tell the truth.

Truth-telling is important not just about our own work, but (maybe more so) when we are called upon to summarize knowledge more broadly. Literature reviews in development economics aren’t just academically interesting; they have the potential to reshape where money gets spent and which programs get implemented. What I mean by honesty here is that when we talk to policymakers, journalists, or lay people about which development programs work, we shouldn’t let our views be skewed by our own research agendas or trends in the field. For example, I have written several papers about a mother-tongue-first literacy program in Uganda, the NULP. The program works exceedingly well on average, although it is not a panacea for the learning crisis. People often ask me whether mother-tongue instruction is the best use of education funds, and I tell them no—I do not think it was the core driver of the NULP’s success, and studies that isolate changes in the language of instruction support that view. Note the countervailing incentives I face here: more spending on mother-tongue instruction might yield more citations for my work, and the approach is very popular so I am often telling people what they don’t want to hear. But far outweighing those is the fact that what I say might really matter, and getting it wrong means that kids won’t learn to read. This is a powerful motive to do my best to get the right answer.

Honesty in assessing the overall evidence also mitigates the “attention bias” problem. Exciting results will still get bursts of attention, but when we are called upon to give our view of which programs work best, we can and should focus on the broader picture painted by the evidence. This is especially critical in development economics, where we aren’t just seeking scientific truths but trying to solve some of the world’s most pressing problems.

Nothing Scales

I recently posted a working paper where we argue that appointments can substitute for financial commitment devices. I’m pretty proud of this paper: it uses a meticulously designed experiment to show the key result, and the empirical work is very careful and was all pre-specified. We apply the latest and best practices in selecting controls and adjusting for multiple hypothesis testing. Our results are very clear, and we tell a clear story that teaches us something very important about self-control problems in healthcare. Appointments help in part because they are social commitment devices, and—because there are no financial stakes—they don’t have the problem of people losing money when they don’t follow through. The paper also strongly suggests that appointments are a useful tool for encouraging people to utilize preventive healthcare—they increase the HIV testing rate by over 100%.

That’s pretty promising! Maybe we should try appointments as a way to encourage people to get vaccinated for covid, too? Well, maybe not. A new NBER working paper tries something similar for covid vaccinations in the US.  Not only does texting people a link to an easy-to-use appointment website not work, neither does anything else that they try, including just paying people $50 to get vaccinated.

Different people, different treatment effects

Why don’t appointments increase covid vaccinations when they worked for HIV testing? The most likely story is that this is a different group of people and their treatment effects are different. I don’t just mean that one set is in Contra Costa County and the other one is in the city of Zomba, although that probably matters. I mean that the Chang et al. study specifically targets the vaccine hesitant, whereas men in our study mostly wanted to get tested for HIV: 92 percent of our sample had previously been tested for HIV at least once. In other words, if you found testing-hesitant men in urban southern Malawi, these behavioral nudges probably wouldn’t help encourage them to get an HIV test either. That makes sense if you think about it: we show that our intervention helps people overcome procrastination and other self-control problems. These are fundamentally problems of people wanting to get tested but not managing to get around to it. The vaccine-hesitant aren’t procrastinating; by and large they just don’t want to get a shot. Indeed, other research confirms that appointments do increase HIV testing rates—just as this explanation would predict.

This is all to say that the treatment effects are heterogeneous: the treatment affects each person—or each observation in your dataset—differently. This is an issue that we can deal with. Our appointments study documents exactly the kind of heterogeneity that the theory above would predict. The treatment effects for appointments are concentrated overwhelmingly among people who want to enroll in a financial commitment device to help ensure they go in for an HIV test. Thus we could forecast that people who don’t want a covid shot at all definitely won’t have their behavior changed much by an appointment.
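
To make that concrete, here is a minimal sketch of the standard way to check for this kind of heterogeneity: interact the treatment indicator with a baseline characteristic. The dataset and variable names below are hypothetical placeholders, not the actual study files.

```python
# Sketch of a treatment-by-covariate interaction regression.
# The file and column names are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("hiv_testing_experiment.csv")

# got_tested: outcome (0/1); appointment: randomized treatment (0/1);
# wants_commitment: baseline demand for a financial commitment device (0/1)
model = smf.ols("got_tested ~ appointment * wants_commitment", data=df).fit(
    cov_type="HC1"  # heteroskedasticity-robust standard errors
)
print(model.summary())

# The appointment:wants_commitment coefficient measures how much larger the
# appointment effect is for people who wanted a commitment device.
```

For a group with essentially no underlying demand (like the vaccine-hesitant), the same regression would predict a treatment effect near zero.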

But trying to analyze this is very rare, which is a disaster for social science research. Good empirical social science almost always focuses on estimating a causal relationship: what is β in Y = α + βX + ϵ? But these relationships are all over the place: there is no underlying β to be estimated! Let’s ignore nonlinearity for a second, and say we are happy with the best linear approximation to the underlying function. The right answer here still potentially differs for every person, and at every point in time.* Your estimate is just some weighted average of a bunch of unit-specific βs, even if you avoid randomized experiments and run some other causal inference approach on the entire population.
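
A tiny simulation, not tied to any particular study, makes the point: give every unit its own β, run the usual regression, and the single number you get back is just an average over those unit-specific effects.

```python
# Simulate a world where every unit has its own treatment effect (beta_i),
# then estimate a single "beta" by OLS. There is no one true beta to recover;
# the estimate is an average over the unit-specific effects.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
beta_i = rng.normal(loc=0.5, scale=1.0, size=n)  # unit-specific treatment effects
x = rng.binomial(1, 0.5, size=n)                 # randomized binary treatment
y = 0.2 + beta_i * x + rng.normal(size=n)        # outcome

beta_hat = np.polyfit(x, y, 1)[0]                # OLS slope of y on x
print(f"estimated beta:              {beta_hat:.2f}")      # close to E[beta_i] = 0.5
print(f"sd of unit-specific effects: {beta_i.std():.2f}")  # close to 1.0
```

With a randomized binary treatment the weights are essentially equal, so you recover the average of the βs; with quasi-experimental designs the weights are less transparent and can load heavily on small subsets of the data.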

This isn’t a new insight: the Nobel prize was just given out in part for showing that an IV identifies a local average treatment effect for some slice of the population. Other non-experimental methods won’t rescue us either: identification is always coming from some small subset of the data. The Great Difference-in-Differences Reckoning is driven, at its core, by the realization that DiDs are identified off of specific comparisons between units, and each unit’s treatment effect can be different. Matching estimators usually don’t yield consistent estimates of causal effects, but when they do it’s because we are exploiting idiosyncrasies in treatment assignment for a small number of people. Non-quantitative methods are in an even worse spot. I am a fan of the idea that qualitative data can be used to understand the mechanisms behind treatment effects—but along with person-specific treatment effects, we need to try to capture person-specific mechanisms that might change over time.

Nothing scales

Treatment effect heterogeneity also helps explain why the development literature is littered with failed attempts to scale interventions up or run them in different contexts. Growth mindset did nothing when scaled up in Argentina. Running the “Jamaican Model” of home visits to promote child development at large scale yields far smaller effects than the original study. The list goes on and on; to a first approximation, nothing we try in development scales.

[Figure: Estimated effect sizes for the Jamaican Model at different scales]

Why not? Scaling up a program requires running it on new people who may have different treatment effects. And the finding, again and again, is that this is really hard to do well. Take the “Sugar Daddies” HIV-prevention intervention, which worked in Kenya, for example. It was much less effective in Botswana, a context where HIV treatment is more accessible and sugar daddies come from different age ranges.** Treatment effects may also vary within person over time: scaling up the “No Lean Season” intervention involved  doing it again later on, and one theory for why it didn’t work is that the year they tried it again was marked by extreme floods. Note that this is a very different challenge from the “replication crisis” that has most famously plagued social psychology. The average treatment effect of appointments in our study matches the one in the other study I mentioned above, and the original study that motivated No Lean Season literally contains a second RCT that, in part, replicates the main result.

I also doubt that this is about some intrinsic problem with scaling things up. The motivation for our appointments intervention was that, anecdotally, appointments work at huge scale in the developed world to do things like get people to go to the dentist. I’m confident that if we just ran the same intervention on more people who were procrastinating about getting HIV tests, we could achieve similar results. However, we rarely actually run the original intervention at larger scale. Instead, the tendency is to water it down, which can make things significantly less effective. Case in point: replicating an effective education intervention in Uganda in more schools yielded virtually identical results, whereas a modified program that tried to simulate how policymakers would reduce costs was substantially worse. That’s the theory that Evidence Action favors for why No Lean Season didn’t work at scale—they think the implementation changed in important ways.

What do we do about this?

I see two ways forward. First, we need a better understanding of how to get policymakers to actually implement interventions that work. There is some exciting new work on this front in a recent issue of the AER, but this seems like very low-hanging fruit to me. Time and again, we have real trouble just replicating actual treatments that work—instead, the scaled-up version is almost always watered down.

Second, every study should report estimates of how much treatment effects vary, and try to link that variation to a model of human behavior. There is a robust econometric literature on treatment effect heterogeneity, but actually looking at this in applied work is very rare. Let’s take education as an example. I just put out another new working paper with a different set of coauthors called “Some Children Left Behind”. We look at how much the effects of an education program vary across kids. The nonparametric Fréchet-Hoeffding lower bounds on treatment effect variation are massive; treatment effects vary from no gain at all to a 3-SD increase in test scores. But as far as I know nobody’s even looked at that for other education programs. Across eight systematic reviews of developing-country education RCTs (covering hundreds of studies), we found just four mentions of variation in treatment effects, and all of them used the “interact treatment with X” approach. That’s unlikely to pick up much: we find that cutting-edge ML techniques can explain less than 10 percent of the treatment effect heterogeneity in our data using our available Xs. The real challenge here is to link the variation in treatment effects to our models of the world, which means we are going to need to collect far better Xs.
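
For readers who want to see the mechanics, here is a stylized sketch of the idea behind those bounds, on simulated data rather than our actual data or code: the dispersion of treatment effects is smallest when treatment preserves everyone’s rank, so differencing the quantiles of the treated and control outcome distributions gives a lower bound on how much effects vary.

```python
# Stylized sketch of a Frechet-Hoeffding-type lower bound on treatment effect
# heterogeneity: rank-preserving (quantile-matched) differences give the least
# dispersed treatment-effect distribution consistent with the two marginals.
# Simulated test scores only; not the actual study data.
import numpy as np

rng = np.random.default_rng(1)
y_control = rng.normal(0.0, 1.0, size=2_000)  # control-group scores (SD units)
y_treated = rng.normal(0.6, 1.3, size=2_000)  # treatment-group scores

qs = np.linspace(0.01, 0.99, 99)
te_rank_preserving = np.quantile(y_treated, qs) - np.quantile(y_control, qs)

print(f"average treatment effect:      {te_rank_preserving.mean():.2f} SD")
print(f"lower bound on sd of effects:  {te_rank_preserving.std():.2f} SD")
print(f"rank-preserving effects range: {te_rank_preserving.min():.2f} to "
      f"{te_rank_preserving.max():.2f} SD")
```

Any other way of coupling the two marginal distributions implies at least this much variation in treatment effects, which is why finding a wide range even for this bound is so striking.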

This latter point means social scientists have a lot of work ahead of us. None of the techniques we use to look at treatment effect variation currently work for non-experimental causal inference techniques. Given how crucial variation in treatment effects is, this seems like fertile ground for applied econometricians. Moreover, almost all of our studies are underpowered for understanding heterogeneous treatment effects, and in many cases we aren’t currently collecting the kinds of baseline data we would need to really understand the heterogeneity—remember, ML didn’t find much in our education paper. That means that the real goal here is quite elusive: how do we predict which things will replicate out-of-sample and which won’t? To get this right we need new methods, more and better data, and a renewed focus on how the world really works.

*And potentially on everyone else’s value of X as well, due to spillovers and GE effects.
** This point is not new to the literature on scale-up: Hunt Allcott argues that RCTs specifically select locations with the largest treatment effects.

Pay Me Later: Savings Constraints and the Demand for Deferred Payments

[This is a revised version of this earlier blog post about a previous draft of the same paper]

Why would you ask your employer not to pay you yet? This is something I would personally never do. If I don’t want to spend money yet, I can just keep it in a bank account. But it’s a fairly common request in developing countries: my own field staff have asked this of me several times, and dairy farmers in Kenya will actually accept lower pay in order to put off getting paid.

There is in fact a straightforward logic to wanting to get paid later. In developed economies, savings earns a positive return, but in much of the developing world, people face a negative effective interest rate on their savings. Banks are loaded with transaction costs and hidden fees, and money hidden elsewhere could be stolen or lost. So deferred wages can be a very attractive way to save money until you actually want to spend it.
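
To see how the effective return can turn negative, here is a bare-bones illustration; the fee and cost numbers are hypothetical, not taken from any particular bank.

```python
# Hypothetical illustration of a negative effective return on formal savings.
# Every number below is made up purely for illustration.
deposit = 100.00        # amount set aside at the start of the season
monthly_fee = 1.50      # account maintenance fee
months = 3              # length of the harvest season
withdrawal_fee = 2.00   # flat fee to take the money out
transport_cost = 1.00   # round trip to the branch, per visit
visits = 2              # one trip to deposit, one to withdraw

cash_out = deposit - monthly_fee * months - withdrawal_fee - transport_cost * visits
effective_return = cash_out / deposit - 1

print(f"cash recovered:   {cash_out:.2f}")          # 91.50
print(f"effective return: {effective_return:.1%}")  # -8.5%
```

Against that benchmark, an employer who simply holds your wages at zero interest and pays them out in a lump sum is offering a better deal.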

Lasse Brune, Eric Chyn, and I have a paper now forthcoming in the American Economic Review that takes that idea and turns it into a practical savings product for employees of a tea company in Malawi. (The ungated pre-print is available here.) Workers could choose to sign up and have a fraction of their pay withheld each payday, to be paid out in a lump sum at the end of the three-month harvest season.  About half of workers chose to sign up for the product; this choice was actually implemented at random for half of the workers who signed up. Workers who signed up saved 14% of their income in the scheme and increased their net savings by 23%.

[Figure: Histogram of final accumulated balances in the deferred wages account at the end of the season]

The savings product has lasting effects on wealth. Workers spent a large fraction of their savings on durable goods, especially goods used for home improvements. Four months after the scheme ended, they owned 10% more assets overall, and 34% more of the iron sheeting used to improve roofs. We then let treatment-group workers participate in the savings product two more times, and followed up nine months after the lump sum payout for the last round. Treatment-group workers ended up 10% more likely to have improved metal roofs on their homes.*

[Figure: Treatment effects of the deferred wages scheme on the main outcome variables]

This “Pay Me Later” product was unusually popular and successful for a savings intervention; such interventions typically have low takeup and utilization and rarely have downstream effects.** What made this product work so well? We ran a set of additional choice experiments to figure out which features drove the high demand for this form of savings.

The first key feature is paying out the savings in a lump sum. When we offered a version of the scheme that paid out the savings smoothly (in six weekly installments after the end of the season), takeup fell by 35%. The second is the automatic “deposits” that are built into the design. We offered some workers an identical product that differed only in that deposits were manual: a project staffer was located adjacent to the payroll site to accept deposits. Signup for this manual-deposits version matched the original scheme but actual utilization was much lower. Heterogeneous treatment effects tests suggest that automatic deposits are beneficial in part because they help workers overcome self-control problems.

On the other hand, the seasonal timing of the product was much less important for driving demand: it was just about as popular during the offseason as the main harvest season. Relaxing the commitment savings aspect of the product also doesn’t matter much. When we offered a version of the product where workers could access the funds at any time during the season, it was just as popular as the original version where the funds were locked away.

In summary, letting people opt in to get paid later is a very promising way to help them save money. It can be run at nearly zero marginal cost, once the payroll system is designed to accommodate it and people are signed up. The benefits are substantial: it’s very popular and leads to meaningful increases in wealth.  It could potentially be deployed not just by firms but also by governments running cash transfer programs and workfare schemes.

The success of “Pay Me Later” highlights the importance of paying attention to the solutions people in developing countries are already finding to the malfunctioning markets hindering their lives. Eric, Lasse, and I did a lot of work to design the experiment, and our field team (particularly Ndema Longwe and Rachel Sander) and the management at the Lujeri Tea Estate deserve credit for making the research and the project work.*** But a lot of credit also should go to the workers who asked us not to pay them yet—this is their idea, and it worked extremely well.

*These results are robust to correction for multiple hypothesis testing using the FWER adjustment of Haushofer and Shapiro (2016).
**Two exceptions to this rule are Dupas and Robinson (2013) and Schaner (2018).
***This work also would not have been possible without generous funding from the Financial Services for the Poor Research Fund at Innovations for Poverty Action, sponsored by a grant from the Bill and Melinda Gates Foundation. My own time on the project was funded in part by the USDA National Institute of Food and Agriculture, Hatch project MIN14-164.

Making the Grade: The Sensitivity of Education Program Effectiveness to Input Choices and Outcome Measures

I’m very happy to announce that my paper with Rebecca Thornton, “Making the Grade: The Sensitivity of Education Program Effectiveness to Input Choices and Outcome Measures”, has been accepted by the Review of Economics and Statistics. An un-gated copy of the final pre-print is available here.

Here’s the abstract of the paper:

This paper demonstrates the acute sensitivity of education program effectiveness to the choices of inputs and outcome measures, using a randomized evaluation of a mother-tongue literacy program. The program raises reading scores by 0.64SDs and writing scores by 0.45SDs. A reduced-cost version instead yields statistically-insignificant reading gains and some large negative effects (-0.33SDs) on advanced writing. We combine a conceptual model of education production with detailed classroom observations to examine the mechanisms driving the results; we show they could be driven by the program initially lowering productivity before raising it, and potentially by missing complementary inputs in the reduced-cost version. 

The program we study, the Northern Uganda Literacy Project, is one of the most effective education interventions in the world. It is at the 99th percentile of the distribution of treatment effects in McEwan (2015), and would rank as the single most effective for improving reading. It improves reading scores by 0.64 standard deviations. Using the Evans and Yuan equivalent-years-of-schooling conversion, that is as much as we’d expect students to improve in three years of school under the status quo. It is over four times as much as the control-group students improve from the beginning to the end of the school year in our study.
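
As a back-of-the-envelope version of that conversion (the per-year learning gain below is an illustrative assumption chosen to be consistent with the three-year figure, not the exact Evans and Yuan number):

```python
# Back-of-the-envelope equivalent-years-of-schooling calculation.
# The 0.2 SD/year status-quo learning gain is an illustrative assumption.
effect_sd = 0.64                 # NULP reading effect, in control-group SDs
status_quo_gain_per_year = 0.2   # assumed learning gain per year of schooling

equivalent_years = effect_sd / status_quo_gain_per_year
print(f"equivalent years of schooling: {equivalent_years:.1f}")  # about 3 years
```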

[Figure: Effects of the NULP intervention on reading scores (in control-group SDs)]

It is also expensive: it costs nearly $20 per student, more than twice as much as the average intervention for which cost data is available. So we worked with Mango Tree, the organization that developed it, to design a reduced-cost version. This version cut costs by getting rid of less-essential materials, and also by shifting to a train-the-trainers model of program delivery. It was somewhat less effective for improving reading scores (see above), and for the basic writing skill of name-writing, but actually backfired for some measures of writing skills:

[Figure: Effects of the NULP intervention on writing scores (in control-group SDs)]

This means that the relative cost-effectiveness of the two versions of the program is highly sensitive to which outcome measure we use. Focusing just on the most basic skill of letter name recognition makes the cheaper version look great—but its cost-effectiveness is negative when we look at writing skills.

[Figure: Relative cost-effectiveness of the full-cost and reduced-cost versions of the program]

Why did this happen? The intervention was delivered as a package, and we couldn’t test the components separately for two reasons. Resource constraints meant that we didn’t have enough schools to test all the many different combinations of inputs. More important, practical constraints make it hard to separate some inputs from one another. For example, the intervention involves intensive teacher training and support. That training relies on the textbooks, and could not be delivered without them.

Instead, we develop a model of education production with multiple inputs and outputs, and show that there are several mechanisms that could lead to a reduction in inputs not just lowering the treatment effects of the program, but actually leading to declines in some education outcomes. First, if the intervention raises productivity for one outcome more than for another, this can lead to a decline in the second outcome due to a substitution effect. Second, a similar pattern can occur if inputs are complements in producing certain skills and one is omitted. Third, the program may actually make teachers less productive in the short term, as part of overhauling their teaching methods—a so-called “J-curve”.
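
As a stylized illustration of the first two mechanisms (a sketch in the spirit of the model, not the paper’s exact specification), think of a teacher with total instructional time T allocating it between reading and writing:

```latex
% Stylized two-skill allocation problem (illustrative only)
\max_{t_R,\, t_W \ge 0} \; U(R, W)
\quad \text{s.t.} \quad
R = \theta_R f(t_R), \qquad
W = \theta_W g(t_W), \qquad
t_R + t_W \le T .
```

If the program raises reading productivity θ_R by much more than θ_W, the optimal time allocation shifts toward reading and writing output can fall even though neither productivity declined; dropping an input that complements writing instruction shows up here as a lower θ_W, which pushes in the same direction.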

We find the strongest evidence for this third mechanism. Productivity for writing, in terms of learning gains per minute, actually falls in the reduced-cost schools. It is plausible that the reduced-cost version of the program pushed teachers onto the negative portion of the J-curve, but didn’t do enough to get them into the region of gains. In contrast, for reading (and for both skills in the full-cost version) the program provided a sufficient push to achieve gains.

There is also some evidence that missing complementary inputs were important for the backfiring of the reduced-cost program. Some of the omitted inputs are designed to be complements—for example, slates that students can use to practice writing with chalk. Moreover, we find that classroom behaviors by teachers and students have little predictive power for test scores when entered linearly, but allowing for non-linear terms and interactions leads to a much higher R-squared. Notably, the machine-learning methods we apply indicate that the greatest predictive power comes from interactions between variables.

These findings are an important cautionary tale for policymakers who are interested in using successful education programs, but worried about their costs. Cutting costs by stripping out inputs may not just reduce a program’s effectiveness, but actually make it worse than doing nothing at all.

For more details, check out the paper here. Comments are welcome—while this paper is already published, Rebecca and I (along with Julie Buhl-Wiggers and Jeff Smith) are working on a number of followup papers based on the same dataset.

A Nobel Prize for Development Economics as an Experimental Science

Fifteen years ago I was an undergrad physics major, and I had just finished a summer spent teaching schoolchildren in Tanzania about HIV. The trip was both inspiring and demoralizing. I had gotten involved because I knew AIDS was important and thought addressing it was a silver bullet to solve all of sub-Saharan Africa’s problems. I came away from the trip having probably accomplished little, but learned a lot about the tangled constellation of challenges facing Tanzanians. They lacked access to higher education, to power, to running water. AIDS was a big problem, but one of many. And could we do anything about these issues? Most of my courses on international development were at best descriptive and at worst defeatist. There were lots of problems, and colonialism was to blame. Or maybe the oil curse. Or trade policy. It was hard to tell.

Just as I was pondering these problems and what I could do about them, talk began to spread about the incredible work being done by Abhijit Banerjee and Esther Duflo. They had started an organization, J-PAL, that was running actual experiments to study solutions to economic and social problems in the world’s poorest places. At this point, my undergraduate courses still emphasized that economics was not an experimental science. But I started reading about this new movement to change that, in development economics in particular, by using RCTs to test the effects of programs and answer first-order economic questions.

At the same time, I also learned about the work being done by Michael Kremer, another of the architects of the experimental revolution in development economics. One of the first development RCT papers I read remains my all-time favorite economics paper: Ted Miguel and Kremer‘s Worms. This paper has it all. They study a specific & important program, and answer first-order questions in health economics. They use a randomized trial, but their analysis is informed by economic theory: because intestinal worm treatment has positive externalities, you will drastically understate the benefits of treatment if you ignore that in your data analysis. And the results were hugely influential: Deworm the World is now implementing school-based deworming around the world. I was sold: I changed career paths and started pursuing development economics. And I became what is often called a randomista, a researcher focused on using randomized trials to study economic issues and solve policy problems in poor countries. Kremer is in fact my academic grandfather: he advised Rebecca Thornton, who in turn advised me.

When the Nobel Prize in Economics was awarded to Banerjee, Duflo, and Kremer this Monday, a major reason was their tremendous influence on hundreds if not thousands of people with stories like mine. Without their influence, the field of development economics would look entirely different. A huge share of us wouldn’t be economists at all, and if we were we would be doing entirely different things. Beyond development economics per se, the RCT revolution spilled over into other fields. We increasingly think of economics as an experimental science (which was the title of my dissertation) – even when we cannot run actual experiments, we think about our data analysis as approximating an experimental ideal. Field experiments have been used in economics for a long time, but this year’s prize winners helped make them into the gold standard for empirical work in the field.

They also helped make experiments the gold standard in studying development interventions, and this has been a colossal change in how we try to help the poor. Whereas once policymakers and donors had to be convinced by researchers that rigorous impact evaluations were important, now they actually seek out research partners to study their ideas. This has meant that we increasingly know what actually works in development, and even more important, what doesn’t work. We can rigorously show that many silver bullets aren’t so shiny after all – for example, additional expansions of microcredit do not transform the lives of the poor.

What is particularly striking and praiseworthy about this award is how early it came. There was a consensus that this trio would win a Nobel prize at some point, but these awards tend to be handed out well after the fact, once time has made researchers’ impact on the field clearer. It is a testament to their tremendous impact on the field of economics that it was already obvious that Duflo, Banerjee, and Kremer were worthy of the Nobel prize, and a credit to the committee that they saw fit to recognize the contributions so quickly. I think it’s fitting that Duflo is now the youngest person ever to win a Nobel prize in economics – given her influence on the field, it’s hard to believe she is just 46 years old.

“Pay Me Later”: A simple, cheap, and surprisingly effective savings technology

Why would you ask your employer not to pay you yet? This is something I would personally never do. If I don’t want to spend money yet, I can just keep it in a bank account. But it’s a fairly common request in developing countries: my own field staff have asked this of me several times, and dairy farmers in Kenya will actually accept lower pay in order to put off getting paid.

The logic here is simple. In developed economies, savings earns a positive return, but in much of the developing world, people face a negative effective interest rate on their savings. Banks are loaded with transaction costs and hidden fees, and money hidden elsewhere could be stolen or lost. So deferred wages can be a very attractive way to save money until you actually want to spend it.

Lasse Brune, Eric Chyn, and I just finished a paper that takes that idea and turns it into a practical savings product for employees of a tea company in Malawi. Workers could choose to sign up and have a fraction of their pay withheld each payday, to be paid out in a lump sum at the end of the three-month harvest season.  About 52% of workers chose to sign up for the product; this choice was implemented at random for half of them. Workers who signed up saved 14% of their income in the scheme and increased their net savings by 24%.

[Figure: Accumulation of money in the deferred wages account over the course of the harvest season; the lump-sum payout was on April 30th]

The savings product has lasting effects on wealth. Workers spent a large fraction of their savings on durables, especially goods used for home improvements. Four months after the scheme ended, they owned 10% more assets overall, and 34% more of the iron sheeting used to improve roofs. We then let treatment-group workers participate in the savings product two more times, and followed up ten months after the lump sum payout for the last round. Treatment-group workers ended up 10% more likely to have improved metal roofs on their homes.*

This “Pay Me Later” product was unusually popular and successful for a savings intervention; such interventions usually have low takeup and utilization and rarely have downstream effects.** What made this product work so well? We ran a set of additional choice experiments to figure out which features drove the high demand for this form of savings.

The first key feature is paying out the savings in a lump sum. When we offered a version of the scheme that paid out the savings smoothly (in six weekly installments), takeup fell to just 36%. The second is the automatic “deposits” that are built into the design. We offered some workers an identical product that differed only in that deposits were manual: a project staffer was located adjacent to the payroll site to accept deposits. Signup matched the original scheme, but actual utilization was much lower.

On the other hand, the seasonal timing of the product was much less important for driving demand: it was just about as popular during the offseason as the main harvest season. The commitment savings aspect of the product also doesn’t matter much. When we offered a version of the product where workers could access the funds at any time during the season, it was just as popular as the original version where the funds were locked away.

In summary, letting people opt in to get paid later is a very promising way to help them save money. It can be run at nearly zero marginal cost, once the payroll system is designed to accommodate it and people are signed up. The benefits are substantial: it’s very popular and leads to meaningful increases in wealth.  It could potentially be deployed not just by firms but also by governments running cash programs and workfare schemes.

The success of “Pay Me Later” highlights the importance of paying attention to the solutions people in developing countries are already finding to the malfunctioning markets hindering their lives. Eric, Lasse, and I did a lot of work to design the experiment, and our field team and the management at the Lujeri Tea Estate deserve credit for making the research and the project work. But a lot of credit also should go to the workers who asked us not to pay them yet – this is their idea, and it worked extremely well.

Check out the paper for more about the savings product and our findings (link).

*These results are robust to correction for multiple hypothesis testing using the FWER adjustment of Haushofer and Shapiro (2016).
**A partial exception is Schaner (2018), which finds that interest rate subsidies on savings accounts lead to increases in assets and income. However, the channel appears to be raising entrepreneurship rather than utilization of the accounts.

How Important is Temptation Spending? Maybe Less than We Thought

Poor people often have trouble saving money for a number of reasons: the banks they have access to are low-quality and expensive (and far away), saving is risky, and money that they do save is often eaten away by kin taxes. One reason that has featured prominently in theoretical explanations of poverty traps is “temptation spending” – goods like alcohol or tobacco that people can’t resist buying even though they’d really prefer not to. Intuitively, exposure to temptation reduces saving in two ways. First, it directly drains people’s cash holdings, so money they might have saved gets “wasted” on the good in question. Second, people realize that their future self will just waste their savings on temptation goods, so they don’t even try to save.

But how important is temptation spending in the economic lives of the poor? Together with Lasse Brune and my student Qingxiao Li, I have just completed a draft of a paper that tackles this question using data from a field experiment in Malawi. The short answer is: probably not very important after all.

One of our key contributions in the paper is to measure temptation spending by letting people define it for themselves. We do this in two ways: first, we allow our subjects to list goods they are often tempted to buy or feel they waste money on, and then match that person-specific list of goods to a separate enumeration of items that they purchased. Second, we simply ask people for the total amount of money they spent that they felt was wasted. We also present several other potential definitions of temptation spending that are common in the literature, including the alcohol & tobacco definition, and also a combined index across all the definitions. The correlations between these measures are not very high: spending on alcohol & tobacco correlates with spending on self-designated temptation goods at just 0.07:

[Figure: Correlations between the different measures of temptation spending]

This is the result of people picking very different goods than policymakers or researchers might select as “temptation goods”. For example, people commonly listed clothes as a temptation good, whereas alcohol was fairly uncommon.

We also show that direct exposure to a tempting environment does not significantly affect spending on temptation goods – let alone downstream outcomes. Our subjects were workers who received extra cash income during the agricultural offseason as part of our study. All workers received their pay at the largest local trading center, and some were randomly assigned to receive their pay during the market day (as opposed to the day before). This was the most tempting environment commonly reported by the people in our study. Getting paid at the market didn’t move any of our measures of temptation spending, and we can rule out meaningful effect sizes.

Why not? We go through a set of six possible explanations and find support for two of them. The first is substitution bias: the market where workers were paid was just one of several in the local area, some of which operated on the day the untreated workers were paid. It was feasible for them to go to the other markets to seek out temptation goods to buy, effectively undoing the treatment. This implies a very different model of temptation than we usually have in mind: it would mean that the purchases tempt you even if they are far away and you have to go seek them out.*

The second is pre-commitment to spending plans. If workers can find a way to mentally “tie their hands” by committing to spend their earnings on specific goods or services, they can mitigate the effects of temptation. We see some empirical evidence for this: the effects of the treatment are heterogeneous by whether workers have children enrolled in school. School fees are a common pre-planned expense in our setting; consistent with workers pre-committing to pay school fees, we see zero treatment effects for workers with children in school, and substantial positive effects for other workers.

Both of these explanations suggest that temptation spending is much less of a policy concern than we might have thought. The first story implies that specific exposure to a tempting environment may not matter at all – people will seek out tempting goods whether they are near them or not. The latter suggests that people can use either mental accounting or actual financial agreements to shield themselves from the risk of temptation spending.

There is much more in the paper, “How Important is Temptation Spending? Maybe Less than We Thought” – check it out by clicking here. Feedback and suggestions are very welcome!

*I have personally experienced this sort of temptation for Icees, which aren’t good for me but which I will go out of my way to obtain.

Do Literacy Programs Boost Reading at the Expense of Math Skills?

We recently got access to preliminary data on math exam scores from the randomized evaluation of the NULP. There are no effects of the program on average math scores. Even though that’s a null result, it’s a pretty exciting finding – let me explain why.

Below is a preliminary graph of the math results by study arm and grade level. This is for the main treated cohort of students, so P1 (first grade) is from 2014, P2 is 2015, P3 is 2016, and P4 is 2017. Because the exam changes over time, I am just showing the percent correct. Also, the exam got harder at higher grade levels. Thus you don’t see progress from year to year here, even though fourth-graders can definitely answer harder questions than first-graders. There are potentially some subtasks where a comparison can be done but even the subtasks got harder.

[Figure: Math exam results (percent correct) by study arm and grade level]

There are clearly no treatment effects on math scores in any grade. A regression analysis confirms this pattern.

Why would we have expected any effects? My own prior was a combination of three factors. I’ll explain each, and then what I think now:

  1. Advocates of the “reading to learn” model argue that if you build reading skills that helps you learn other things, so we should see positive spillovers from reading skills onto math.

However, it’s not clear how much reading is really going on in math classes in northern Uganda, so maybe this is not a concern.

  2. The “Heckman equation” model argues that soft skill investments early in life are critical for later-life gains. That might suggest a null effect here, since nothing the NULP does directly targets soft skills. If everything comes through the soft skill channel, other interventions will have limited positive spillovers and not persist.

The counterargument, of course, is that this model does not predict that other interventions will not help.

  3. If teachers are time-constrained, emphasizing reading more could lead to negative spillovers from the NULP onto non-targeted subjects. This is potentially a major concern – for example, Fryer and Holden (2012) find that incentivizing math tests leads to improvements in math ability, but declines in reading ability.

While this is a legitimate concern, it looks like the NULP did not suffer from this problem.

Now that we have the results, I think #1 is probably not a practical consideration in this context and at this grade level. #2 just doesn’t make strong predictions. So that leaves us with #3 as the only viable theory.

This is great news, because we have evidence that the NULP does not cause significant declines in performance on other subjects. That addresses a common question people have about the huge reading gains from the program that are documented in Kerwin and Thornton (2018). Did they happen because teachers stopped teaching math, or put less effort into it? We now know the answer is “no”. That bodes very well for the potential benefits of scaling up the NULP approach across Uganda and beyond.

 

This post was originally published on the Northern Uganda Literacy Program blog, and is cross-posted here with permission.