Ceteris Non Paribus

Ceteris Non Paribus is my personal blog, formerly hosted at nonparibus.wordpress.com and now found here. This blog is a place for me to put the ideas I have, and the stuff I come across, that I’ve managed to convince myself other people would be interested in seeing. See the About page for more on the reasons why I maintain a blog and the origin of the blog’s name.

My most recent posts can be found below, and a list of my most popular posts (based on recent views) is on the right.

Making the Grade: The Sensitivity of Education Program Effectiveness to Input Choices and Outcome Measures

I’m very happy to announce that my paper with Rebecca Thornton, “Making the Grade: The Sensitivity of Education Program Effectiveness to Input Choices and Outcome Measures”, has been accepted by the Review of Economics and Statistics. An un-gated copy of the final pre-print is available here.

Here’s the abstract of the paper:

This paper demonstrates the acute sensitivity of education program effectiveness to the choices of inputs and outcome measures, using a randomized evaluation of a mother-tongue literacy program. The program raises reading scores by 0.64SDs and writing scores by 0.45SDs. A reduced-cost version instead yields statistically-insignificant reading gains and some large negative effects (-0.33SDs) on advanced writing. We combine a conceptual model of education production with detailed classroom observations to examine the mechanisms driving the results; we show they could be driven by the program initially lowering productivity before raising it, and potentially by missing complementary inputs in the reduced-cost version. 

The program we study, the Northern Uganda Literacy Project, is one of the most effective education interventions in the world. It is at the 99th percentile of the distribution of treatment effects in McEwan (2015), and would rank as the single most effective for improving reading. It improves reading scores by 0.64 standard deviations. Using the Evans and Yuan equivalent-years-of-schooling conversion, that is as much as we’d expect students to improve in three years of school under the status quo. It is over four times as much as the control-group students improve from the beginning to the end of the school year in our study.

Effects of the NULP intervention on reading scores (in control-group SDs)

It is also expensive: it costs nearly $20 per student, more than twice as much as the average intervention for which cost data is available. So we worked with Mango Tree, the organization that developed it, to design a reduced-cost version. This version cut costs by getting rid of less-essential materials, and also by shifting to a train-the-trainers model of program delivery. It was somewhat less effective for improving reading scores (see above), and for the basic writing skill of name-writing, but actually backfired for some measures of writing skills:

Effects of the NULP intervention on writing scores (in control-group SDs)

This means that the relative cost-effectiveness of the two versions of the program is highly sensitive to which outcome measure we use. Focusing just on the most-basic skill of letter name recognition makes the cheaper version look great—but its cost effectiveness is negative when we look at writing skills.
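As a rough illustration of this point, the sketch below computes cost-effectiveness as SDs gained per dollar. The effect sizes and the roughly $20 cost of the full program come from the numbers above. The cost of the reduced-cost version is not stated in this post, so `REDUCED_COST_USD` is a placeholder assumption. The sign of the result does not depend on that placeholder: a negative effect gives negative cost-effectiveness at any positive cost.

```python
# Cost-effectiveness measured as standard deviations gained per dollar per student.
FULL_COST_USD = 20.0     # "nearly $20 per student" (from the post)
REDUCED_COST_USD = 5.0   # ASSUMPTION: placeholder; the actual cost is not given in the post


def sd_per_dollar(effect_sd: float, cost_usd: float) -> float:
    """Return the effect size (in SDs) bought by one dollar spent per student."""
    if cost_usd <= 0:
        raise ValueError("cost must be positive")
    return effect_sd / cost_usd


# Effect sizes reported in the abstract.
full_reading = sd_per_dollar(0.64, FULL_COST_USD)
full_writing = sd_per_dollar(0.45, FULL_COST_USD)
reduced_advanced_writing = sd_per_dollar(-0.33, REDUCED_COST_USD)

print(f"Full-cost, reading:            {full_reading:+.3f} SD per dollar")
print(f"Full-cost, writing:            {full_writing:+.3f} SD per dollar")
print(f"Reduced-cost, advanced writing: {reduced_advanced_writing:+.3f} SD per dollar")
```

Which program "wins" on this metric therefore depends entirely on which outcome goes in the numerator.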


Why did this happen? The intervention was delivered as a package, and we couldn’t test the components separately for two reasons. Resource constraints meant that we didn’t have enough schools to test all the many different combinations of inputs. More important, practical constraints make it hard to separate some inputs from one another. For example, the intervention involves intensive teacher training and support. That training relies on the textbooks, and could not be delivered without them.

Instead, we develop a model of education production with multiple inputs and outputs, and show that there are several mechanisms through which a reduction in inputs could not just lower the treatment effects of the program, but actually lead to declines in some education outcomes. First, if the intervention raises productivity more for one outcome than for another, this can lead to a decline in the second outcome due to a substitution effect. Second, a similar pattern can occur if inputs are complements in producing certain skills and one is omitted. Third, the program may actually make teachers less productive in the short term as part of overhauling their teaching methods: a so-called “J-curve”.

We find the strongest evidence for this third mechanism. Productivity for writing, in terms of learning gains per minute, actually falls in the reduced-cost schools. It is plausible that the reduced-cost version of the program pushed teachers onto the negative portion of the J-curve, but didn’t do enough to get them into the region of gains. In contrast, for reading (and for both skills in the full-cost version) the program provided a sufficient push to achieve gains.

There is also some evidence that missing complementary inputs were important for the backfiring of the reduced-cost program. Some of the omitted inputs are designed to be complements, such as slates that students can use to practice writing with chalk. Moreover, we find that classroom behaviors by teachers and students have little predictive power for test scores when entered linearly, but allowing for non-linear terms and interactions leads to a much higher R-squared. Notably, the machine-learning methods we apply indicate that the greatest predictive power comes from interactions between variables.

These findings are an important cautionary tale for policymakers who are interested in using successful education programs, but worried about their costs. Cutting costs by stripping out inputs may not just reduce a program’s effectiveness, but actually make it worse than doing nothing at all.

For more details, check out the paper here. Comments are welcome—while this paper is already published, Rebecca and I (along with Julie Buhl-Wiggers and Jeff Smith) are working on a number of followup papers based on the same dataset.

A Nobel Prize for Development Economics as an Experimental Science

Fifteen years ago I was an undergrad physics major, and I had just finished a summer spent teaching schoolchildren in Tanzania about HIV. The trip was both inspiring and demoralizing. I had gotten involved because I knew AIDS was important and thought addressing it was a silver bullet to solve all of sub-Saharan Africa’s problems. I came away from the trip having probably accomplished little, but having learned a lot about the tangled constellation of challenges facing Tanzanians. They lacked access to higher education, to power, to running water. AIDS was a big problem, but one of many. And could we do anything about these issues? Most of my courses on international development were at best descriptive and at worst defeatist. There were lots of problems, and colonialism was to blame. Or maybe the oil curse. Or trade policy. It was hard to tell.

Just as I was pondering these problems and what I could do about them, talk began to spread about the incredible work being done by Abhijit Banerjee and Esther Duflo. They had started an organization, J-PAL, that was running actual experiments to study solutions to economic and social problems in the world’s poorest places. At this point, my undergraduate courses still emphasized that economics was not an experimental science. But I started reading about this new movement to change that, in development economics in particular, by using RCTs to test the effects of programs and answer first-order economic questions.

At the same time, I also learned about the work being done by Michael Kremer, another of the architects of the experimental revolution in development economics. One of the first development RCT papers I read remains my all-time favorite economics paper: Ted Miguel and Kremer’s Worms. This paper has it all. They study a specific & important program, and answer first-order questions in health economics. They use a randomized trial, but their analysis is informed by economic theory: because intestinal worm treatment has positive externalities, you will drastically understate the benefits of treatment if you ignore that in your data analysis. And the results were hugely influential: Deworm the World is now implementing school-based deworming around the world. I was sold: I changed career paths and started pursuing development economics. And I became what is often called a randomista, a researcher focused on using randomized trials to study economic issues and solve policy problems in poor countries. Kremer is in fact my academic grandfather: he advised Rebecca Thornton, who in turn advised me.

When the Nobel Prize in Economics was awarded to Banerjee, Duflo, and Kremer this Monday, a major reason was their tremendous influence on hundreds if not thousands of people with stories like mine. Without their influence, the field of development economics would look entirely different. A huge share of us wouldn’t be economists at all, and if we were we would be doing entirely different things. Beyond development economics per se, the RCT revolution spilled over into other fields. We increasingly think of economics as an experimental science (which was the title of my dissertation) – even when we cannot run actual experiments, we think about our data analysis as approximating an experimental ideal. Field experiments have been used in economics for a long time, but this year’s prize-winners helped make them into the gold standard for empirical work in the field.

They also helped make experiments the gold standard in studying development interventions, and this has been a colossal change in how we try to help the poor. Whereas once policymakers and donors had to be convinced by researchers that rigorous impact evaluations were important, now they actually seek out research partners to study their ideas. This has meant that we increasingly know what actually works in development, and even more important, what doesn’t work. We can rigorously show that many silver bullets aren’t so shiny after all – for example, additional expansions of microcredit do not transform the lives of the poor.

What is particularly striking and praiseworthy about this award is how early it came. There was a consensus that this trio would win a Nobel prize at some point, but these awards tend to be handed out well after the fact, once time has made researchers’ impact on the field clearer. It is a testament to their tremendous impact on the field of economics that it was already obvious that Duflo, Banerjee, and Kremer were worthy of the Nobel prize, and a credit to the committee that they saw fit to recognize the contributions so quickly. I think it’s fitting that Duflo is now the youngest person ever to win a Nobel prize in economics – given her influence on the field, it’s hard to believe she is just 46 years old.

“Pay Me Later”: A simple, cheap, and surprisingly effective savings technology

Why would you ask your employer not to pay you yet? This is something I would personally never do. If I don’t want to spend money yet, I can just keep it in a bank account. But it’s a fairly common request in developing countries: my own field staff have asked this of me several times, and dairy farmers in Kenya will actually accept lower pay in order to put off getting paid.

The logic here is simple. In developed economies, savings earn a positive return, but in much of the developing world, people face a negative effective interest rate on their savings. Banks are loaded with transaction costs and hidden fees, and money hidden elsewhere could be stolen or lost. So deferred wages can be a very attractive way to save money until you actually want to spend it.

Lasse Brune, Eric Chyn, and I just finished a paper that takes that idea and turns it into a practical savings product for employees of a tea company in Malawi. Workers could choose to sign up and have a fraction of their pay withheld each payday, to be paid out in a lump sum at the end of the three-month harvest season.  About 52% of workers chose to sign up for the product; this choice was implemented at random for half of them. Workers who signed up saved 14% of their income in the scheme and increased their net savings by 24%.

Accumulation of money in the deferred wages account over the course of the harvest season. The lump-sum payout was on April 30th.

The savings product has lasting effects on wealth. Workers spent a large fraction of their savings on durables, especially goods used for home improvements. Four months after the scheme ended, they owned 10% more assets overall, and 34% more of the iron sheeting used to improve roofs. We then let treatment-group workers participate in the savings product two more times, and followed up ten months after the lump sum payout for the last round. Treatment-group workers ended up 10% more likely to have improved metal roofs on their homes.*

This “Pay Me Later” product was unusually popular and successful for a savings intervention; such interventions usually have low takeup and utilization and rarely have downstream effects.** What made this product work so well? We ran a set of additional choice experiments to figure out which features drove the high demand for this form of savings.

The first key feature is paying out the savings in a lump sum. When we offered a version of the scheme that paid out the savings smoothly (in six weekly installments) takeup fell to just 36%. The second is the automatic “deposits” that are built into the design. We offered some workers an identical product that differed only in that deposits were manual: a project staffer was located adjacent to the payroll site to accept deposits. Signup matched the original scheme but actual utilization was much lower.

On the other hand, the seasonal timing of the product was much less important for driving demand: it was just about as popular during the offseason as the main harvest season. The commitment savings aspect of the product also doesn’t matter much. When we offered a version of the product where workers could access the funds at any time during the season, it was just as popular as the original version where the funds were locked away.

In summary, letting people opt in to get paid later is a very promising way to help them save money. It can be run at nearly zero marginal cost, once the payroll system is designed to accommodate it and people are signed up. The benefits are substantial: it’s very popular and leads to meaningful increases in wealth.  It could potentially be deployed not just by firms but also by governments running cash programs and workfare schemes.

The success of “Pay Me Later” highlights the importance of paying attention to the solutions people in developing countries are already finding to the malfunctioning markets hindering their lives. Eric, Lasse, and I did a lot of work to design the experiment, and our field team and the management at the Lujeri Tea Estate deserve credit for making the research and the project work. But a lot of credit also should go to the workers who asked us not to pay them yet – this is their idea, and it worked extremely well.

Check out the paper for more about the savings product and our findings (link).

*These results are robust to correction for multiple hypothesis testing using the FWER adjustment of Haushofer and Shapiro (2016).
**A partial exception is Schaner (2018), which finds that interest rate subsidies on savings accounts lead to increases in assets and income. However, the channel appears to be raising entrepreneurship rather than utilization of the accounts.

How Important is Temptation Spending? Maybe Less than We Thought

Poor people often have trouble saving money for a number of reasons: the banks they have access to are low-quality and expensive (and far away), saving is risky, and money that they do save is often eaten away by kin taxes. One reason that has featured prominently in theoretical explanations of poverty traps is “temptation spending” – spending on goods like alcohol or tobacco that people can’t resist buying even though they’d really prefer not to. Intuitively, exposure to temptation reduces saving in two ways. First, it directly drains people’s cash holdings, so money they might have saved gets “wasted” on the good in question. Second, people realize that their future self will just waste their savings on temptation goods, so they don’t even try to save.

But how important is temptation spending in the economic lives of the poor? Together with Lasse Brune and my student Qingxiao Li, I have just completed a draft of a paper that tackles this question using data from a field experiment in Malawi. The short answer is: probably not very important after all.

One of our key contributions in the paper is to measure temptation spending by letting people define it for themselves. We do this two ways: first, we allow our subjects to list goods they are often tempted to buy or feel they waste money on, and then match that person-specific list of goods to a separate enumeration of items that they purchased. Second, we let people give the simple sum of money they spent that they felt was wasted. We also present several other potential definitions of temptation spending that are common in the literature, including the alcohol & tobacco definition, and also a combined index across all the definitions. The correlations between these measures are not very high: spending on alcohol & tobacco correlates with spending on self-designated temptation goods at just 0.07:


This is the result of people picking very different goods than policymakers or researchers might select as “temptation goods”. For example, people commonly listed clothes as a temptation good, whereas alcohol was fairly uncommon.
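To make the matching step concrete, here is a minimal sketch of how a person-specific list of tempting goods can be matched against a record of purchases, alongside a fixed alcohol & tobacco definition. All names and numbers below are hypothetical and chosen only for illustration; they are not data from the study, and the paper's actual procedure may differ in its details.

```python
# ASSUMPTION: hypothetical workers, goods, and amounts, for illustration only.
tempting_goods = {
    "worker_a": {"clothes", "snacks"},   # goods this person says tempt them
    "worker_b": {"alcohol"},
}

# Each purchase is (person, item, amount spent).
purchases = [
    ("worker_a", "clothes", 500),
    ("worker_a", "maize", 1200),
    ("worker_a", "alcohol", 100),
    ("worker_b", "alcohol", 300),
    ("worker_b", "clothes", 400),
]

ALCOHOL_TOBACCO = {"alcohol", "tobacco"}  # a fixed, researcher-chosen definition


def self_designated_spending(person, purchases, tempting_goods):
    """Sum spending on the goods that this particular person listed as tempting."""
    own_list = tempting_goods.get(person, set())
    return sum(amount for who, item, amount in purchases
               if who == person and item in own_list)


def fixed_list_spending(person, purchases, goods=ALCOHOL_TOBACCO):
    """Sum spending on a fixed list of goods, the same for everyone."""
    return sum(amount for who, item, amount in purchases
               if who == person and item in goods)


comparison = {
    person: (self_designated_spending(person, purchases, tempting_goods),
             fixed_list_spending(person, purchases))
    for person in tempting_goods
}
for person, (own, fixed) in comparison.items():
    print(f"{person}: self-designated={own}, alcohol & tobacco={fixed}")
```

In this toy example the two measures agree for one worker and diverge for the other, which is the kind of disagreement that produces a low correlation across definitions.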

We also show that direct exposure to a tempting environment does not significantly affect spending on temptation goods – let alone downstream outcomes. Our subjects were workers who received extra cash income during the agricultural offseason as part of our study. All workers received their pay at the largest local trading center, and some were randomly assigned to receive their pay during the market day (as opposed to the day before). This was the most-tempting environment commonly reported by the people in our study. Getting paid at the market didn’t move any of our measures of temptation spending and we can rule out meaningful effect sizes.

Why not? We go through a set of six possible explanations and find support for two of them. The first is substitution bias: the market where workers were paid was just one of several in the local area, some of which operated on the day the untreated workers were paid. It was feasible for them to go to the other markets to seek out temptation goods to buy, effectively undoing the treatment. This implies a very different model of temptation than we usually have in mind: it would mean that the purchases tempt you even if they are far away and you have to go seek them out.*

The second is pre-commitment to spending plans. If workers can find a way to mentally “tie their hands” by committing to spend their earnings on specific goods or services, they can mitigate the effects of temptation. We see some empirical evidence for this: the effects of the treatment are heterogeneous by whether workers have children enrolled in school. School fees are a common pre-planned expense in our setting; consistent with workers pre-committing to pay school fees, we see zero treatment effects for workers with children in school, and substantial positive effects for other workers.

Both of these explanations suggest that temptation spending is much less of a policy concern than we might have thought. The first story implies that specific exposure to a tempting environment may not matter at all – people will seek out tempting goods whether they are near them or not. The second suggests that people can use either mental accounting or actual financial agreements to shield themselves from the risk of temptation spending.

There is much more in the paper, “How Important is Temptation Spending? Maybe Less than We Thought” – check it out by clicking here. Feedback and suggestions are very welcome!

*I have personally experienced this sort of temptation for Icees, which aren’t good for me but which I will go out of my way to obtain.

Do Literacy Programs Boost Reading at the Expense of Math Skills?

We recently got access to preliminary data on math exam scores from the randomized evaluation of the NULP. There are no effects of the program on average math scores. Even though that’s a null result, it’s a pretty exciting finding – let me explain why.

Below is a preliminary graph of the math results by study arm and grade level. This is for the main treated cohort of students, so P1 (first grade) is from 2014, P2 is 2015, P3 is 2016, and P4 is 2017. Because the exam changes over time, I am just showing the percent correct. Also, the exam got harder at higher grade levels. Thus you don’t see progress from year to year here, even though fourth-graders can definitely answer harder questions than first-graders. There are potentially some subtasks where a comparison can be done but even the subtasks got harder.


There are clearly no treatment effects on math scores in any grade. A regression analysis confirms this pattern.

Why would we have expected any effects? My own prior was a combination of three factors. I’ll explain each, and then what I think now:

  1. Advocates of the “reading to learn” model argue that building reading skills helps you learn other things, so we should see positive spillovers from reading skills onto math.

However, it’s not clear how much reading is really going on in math classes in northern Uganda, so maybe this is not a concern.

  2. The “Heckman equation” model argues that soft skill investments early in life are critical for later-life gains. That might suggest a null effect here, since nothing the NULP does directly targets soft skills. If everything comes through the soft skill channel, other interventions will have limited positive spillovers and not persist.

The counterargument, of course, is that this model does not predict that other interventions will not help.

  3. If teachers are time-constrained, emphasizing reading more could lead to negative spillovers from the NULP onto non-targeted subjects. This is potentially a major concern – for example, Fryer and Holden (2012) find that incentivizing math tests leads to improvements in math ability, but declines in reading ability.

While this is a legitimate concern, it looks like the NULP did not suffer from this problem.

Now that we have the results, I think #1 is probably not a practical consideration in this context and at this grade level. #2 just doesn’t make strong predictions. So that leaves us with #3, the risk of negative spillovers onto non-targeted subjects, as the only theory that makes a clear prediction here.

This is great news, because we have evidence that the NULP does not cause significant declines in performance on other subjects. That addresses a common question people have about the huge reading gains from the program that are documented in Kerwin and Thornton (2018). Did they happen because teachers stopped teaching math, or put less effort into it? We now know the answer is “no”. That bodes very well for the potential benefits of scaling up the NULP approach across Uganda and beyond.


This post was originally published on the Northern Uganda Literacy Program blog, and is cross-posted here with permission.

Income timing and liquidity constraints: Evidence from a randomized field experiment

I’m very happy to announce that my paper with Lasse Brune, “Income timing and liquidity constraints: Evidence from a randomized field experiment”, is now forthcoming at the Journal of Development Economics. You can access the paper for free until March 29th, 2019 by following this link. The accepted draft of the manuscript is also permanently hosted on my website here.

Here’s the abstract of our paper:

People in developing countries sometimes desire deferred income streams, which replace more-frequent income flows with a single, later lump sum. We study the effects of short-term wage deferral using a randomized experiment with participants in a temporary cash-for-work program. Workers who are assigned to lump-sum payments are five percentage points more likely to purchase a high-return investment. We discuss the role of both barriers to saving and credit constraints in explaining our results. While stated preferences for deferred payments suggest a role for savings constraints, the evidence is also consistent with a simpler model of credit constraints alone.

One of the basic findings of our paper is that the timing of income matters a lot for the timing of expenditure. Here’s a graph of expenditure by weekend for workers who were paid the same amount of money in weekly installments or a single lump sum at the end of the month:

Expenditures by study arm


This clearly suggests that workers are liquidity-constrained. And our workers’ stated preferences suggest that the binding constraint is actually a savings constraint rather than a credit constraint. That is, they’d like to save their money but are unable to. Getting paid in a later lump sum relaxes that constraint. Those stated preferences are consistent with Casaburi and Macchiavello’s paper in the latest AER – they find that dairy farmers in Kenya are willing to forgo profits in order to get paid in deferred lump sums.

The core contribution of our paper is to show that this switch from smooth payments to deferred lump sums has effects on actual investment decisions. At the end of each month of the study, we sold workers a bond that paid a 33% return in two weeks with certainty. Paying workers in deferred lump sums substantially raises the uptake of the bond:

Effects of deferred lump-sum payments on bond purchases

Despite the workers’ stated preferences, we cannot conclusively show whether these effects are due to credit constraints or savings constraints. We develop a model to show how the lump sum payments would affect bond purchases under different kinds of liquidity constraints. First, we show that our workers are definitely credit-constrained at the point of the bond purchase. If they weren’t, everyone would buy it and we would see no treatment effects. Then we show that during the period before the bond sales, our workers must be either credit- or savings-constrained.
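To see why anyone who could borrow or had cash on hand would buy the bond, it helps to annualize its return. The sketch below takes the 33% two-week return from above and scales it to a year. The idea that the return could be earned again every two weeks for 26 periods is my assumption, made purely to convey scale; the study sold the bond only at the end of each month.

```python
TWO_WEEK_RETURN = 0.33   # from the post: 33% return in two weeks, with certainty
PERIODS_PER_YEAR = 26    # ASSUMPTION: 52 weeks / 2, with hypothetical reinvestment


def simple_annual_rate(rate_per_period: float, periods: int) -> float:
    """Annualized rate without compounding: the per-period rate times the number of periods."""
    return rate_per_period * periods


def compound_growth_factor(rate_per_period: float, periods: int) -> float:
    """Gross growth factor if the return were reinvested every period."""
    return (1.0 + rate_per_period) ** periods


simple_rate = simple_annual_rate(TWO_WEEK_RETURN, PERIODS_PER_YEAR)
annual_factor = compound_growth_factor(TWO_WEEK_RETURN, PERIODS_PER_YEAR)

print(f"Simple annualized rate: {simple_rate:.0%}")
print(f"Compounded growth factor over a year: {annual_factor:,.0f}x")
```

Even without compounding, the annualized rate is several hundred percent, presumably well above what a worker with access to credit would pay to borrow. That is why passing up the bond points to a binding liquidity constraint rather than a lack of interest.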

If they are credit-constrained, then the people who were induced to buy the bond by the treatment are those who would prefer a smooth income profile but cannot borrow against their future lump-sum earnings. They end up with excess liquidity once their wages are paid, and they spend some of that on the bond. If they are savings-constrained, then the people who the treatment induced to buy the bond are those who would prefer a lumpier income profile. They want to save money, and the deferred wage payments let them do that.

What this means is that if our workers were credit-constrained, then our results imply that deferred wages made them worse off. The opposite is true if they were savings-constrained – the deferred lump sum payout relieved the savings constraint and made them better off.

So we’ve shown that deferred wage payments matter for investment choices, and Casaburi and Macchiavello show that some people definitely want deferred payments. But can deferred payments make people better off? That is, can they help people save money, and move outcomes like actual asset purchases? Lasse and I have teamed up with Eric Chyn to study that question, by turning deferred wage payments into an actual savings product that we offered to a set of workers at a tea company in Malawi. Our preliminary results are available here, and our evidence suggests that the demand for this savings product is high and that it leads to increases in actual physical asset ownership. Keep watching this space for more results from that project.

One of the things that’s satisfying about this paper is that it’s my first publication in an economics journal (I’ve published in health journals before). It was a long time coming: Lasse and I first started designing this study over six years ago. We collected the data in 2013, and early drafts were part of his dissertation in 2014 and mine in 2015. Despite the long period between collecting the data and publishing the paper, I came away from the publication process very satisfied. We were treated fairly and got timely and constructive reviews.

The JDE submission process was particularly positive. We got our first-round R&R in about two months, and our second-round R&R in about four. Most of the delay between our submitting the paper and the paper acceptance was the time we spent working on revising it. Moreover, the peer review process substantially improved the paper: the writing and argument are much tighter, and the link to economic theory is far clearer, than they were when we first submitted the paper.

We can do better than just giving cash to poor people. Here’s why that matters.

Cash transfers are an enormously valuable, and increasingly widespread, development intervention. Their value and popularity have driven a vast literature studying how various kinds of cash transfers (conditional, unconditional, cash-for-work, remittances) affect all sorts of outcomes (finances, health, education, job choice). I work in one small corner of this literature myself: Lasse Brune and I just finished a revision of our paper on how the frequency of cash payouts affects savings behavior, and we are currently studying (along with Eric Chyn) how to use that approach as an actual savings product.

After all the excitement over their potential benefits, a couple of recent results have taken a bit of the luster off of cash transfers. First, the three-year followup of the GiveDirectly evaluation in Kenya showed evidence that many effects had faded out, although asset ownership was still higher. Then came a nine-year (!!) followup of a cash grant program in Uganda, where initial gains in earnings had disappeared (but again, asset ownership remained higher).

One question raised by these results is whether we can do any better than just giving people cash. A new paper by McIntosh and Zeitlin tackles this question head-on, with careful comparisons between a sanitation-focused intervention and a cost-equivalent cash transfer. They actually tried cash transfers of a range of sizes so that they could get the exact cost-equivalency through regression adjustment. In their study, there’s no clear rank ordering between cost-equivalent cash and the actual program; neither has big impacts, and they change different things (though providing a larger cash transfer does appear to dominate the program across all outcomes).
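The idea behind getting cost-equivalency through regression adjustment can be sketched as follows: estimate how the effect of cash varies with the size of the transfer, then read off the predicted effect of a transfer that costs exactly as much as the program. The sketch below fits a straight line by ordinary least squares. Every number in it is hypothetical, and the linear form is my simplifying assumption; this post does not describe the specification McIntosh and Zeitlin actually use.

```python
# ASSUMPTION: all numbers are hypothetical and chosen only to illustrate the method.
# Each cash arm is (cost of the transfer in USD, estimated effect on some outcome).
cash_arms = [(50.0, 0.05), (100.0, 0.10), (150.0, 0.15), (200.0, 0.20)]
PROGRAM_COST = 120.0     # hypothetical cost of the in-kind program
PROGRAM_EFFECT = 0.10    # hypothetical estimated effect of the in-kind program


def fit_line(points):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(cash_arms)

# Predicted effect of a cash transfer that costs exactly as much as the program.
cash_effect_at_program_cost = intercept + slope * PROGRAM_COST
gap = PROGRAM_EFFECT - cash_effect_at_program_cost

print(f"Cost-equivalent cash effect: {cash_effect_at_program_cost:.3f}")
print(f"Program minus cost-equivalent cash: {gap:+.3f}")
```

The advantage of this design is that no single cash arm has to match the program's cost exactly, which matters because the true cost of a program is often known only after the fact.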

This is just one program, though – can any program beat cash? It turns out that the answer is yes! At MIEDC this spring, I saw Dean Karlan present results from a “Graduation” program that provided a package of interventions (training, mentoring, cash, and a savings group) in several different countries. The Uganda results, available here, show that the program significantly improved a wide range of poverty metrics, while a cost-equivalent cash transfer “did not appear to have meaningful impacts on poverty outcomes”.

This is a huge deal. The basic neoclassical model predicts that a program can never beat giving people cash; the best you can do is tie.* People know what they need and can use money to buy it. If you spend the same amount of money, you could achieve the same benefits for them if you happen to hit on exactly what they want, but if you pick anything else you would have done better to just hand them money. (This is the logic behind the annual Christmas tradition of journalists trotting out some economist to explain to the world why giving gifts is inefficient. And economists wonder why no one likes us!)
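A toy version of that neoclassical logic can be worked through numerically. The sketch below assumes a consumer with Cobb-Douglas preferences over two goods, both priced at 1, and compares a cash transfer with an in-kind transfer of the same value that cannot be resold. The functional form, the prices, and the numbers are all my assumptions for illustration; none of them come from the studies discussed here.

```python
def utility(x: float, y: float, a: float) -> float:
    """Cobb-Douglas utility over goods x and y, with 0 < a < 1 the weight on x."""
    return (x ** a) * (y ** (1.0 - a))


def best_with_cash(income: float, transfer: float, a: float) -> float:
    """Utility when the transfer is cash: spend the share a of the total budget on x."""
    budget = income + transfer
    x = a * budget
    return utility(x, budget - x, a)


def best_with_in_kind(income: float, transfer: float, a: float) -> float:
    """Utility when the transfer arrives as units of good x that cannot be resold."""
    budget = income + transfer
    # The consumer cannot end up with less x than was handed over.
    x = max(a * budget, transfer)
    return utility(x, budget - x, a)


INCOME, TRANSFER = 100.0, 100.0   # ASSUMPTION: hypothetical amounts

for a in (0.25, 0.5, 0.75):
    cash = best_with_cash(INCOME, TRANSFER, a)
    in_kind = best_with_in_kind(INCOME, TRANSFER, a)
    print(f"a={a}: cash utility={cash:.2f}, in-kind utility={in_kind:.2f}")
```

When the consumer would have bought at least as much of the good anyway, the two transfers tie; otherwise cash does strictly better. In this frictionless setup the in-kind transfer never comes out ahead, which is why a program beating cost-equivalent cash is evidence against the model.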

The fact that we can do better than just handing out cash to people is a rejection of that model in favor of models with multiple interlocking market failures – some of which may be psychological or “behavioral” in nature. That’s a validation of our basic understanding of why poor places stay poor. In a standard model, a simple lack of funds, or even the failure of one market, is not enough to drive a permanent poverty trap. You need multiple markets failing at once to keep people from escaping from poverty. For example, a lack of access to credit is bad, and will hurt entrepreneurs’ ability to make investments. But even without credit, they could instead save money to eventually make the same investments. A behavioral or social constraint that keeps them from saving, in contrast, can keep them from making those investments at all.

McIntosh and Zeitlin refer to Das, Do, and Ozler, who point out that “in the absence of external market imperfections, intra-household bargaining concerns, or behavioral inconsistencies, the outcomes moved by cash transfers are by definition those that maximize welfare impacts.” While their study finds that neither cash nor the program was a clear winner, the Graduation intervention package, in contrast, clearly beats an equivalent amount of cash on a whole host of metrics. We can account for this in two ways. One view is that the cash group actually was better off – people would really prefer to spend a windfall quickly than make a set of investments that pay off with longer-term gains. The other, which I subscribe to, is that there are other constraints at work here. Under this model, the cash group just couldn’t make those investments – they didn’t have the access to savings markets, or there is a missing market in training/skill development, etc.

There is an important practical implication as well. The notion of “benchmarking” development interventions by comparing them to handing out cash is growing in popularity, and it’s an important movement. Indeed, the McIntosh and Zeitlin study makes major contributions by figuring out how to do this benchmarking correctly, and by pushing the envelope on getting development agencies to think about cash as a benchmark.** But what do we do when there is no obvious way to benchmark via cash? In particular, when we are studying education interventions, who should we be thinking about making the cash transfers to? McIntosh and Zeitlin talk about a default of targeting the cash to the people targeted by the in-kind program. In many education programs, the teachers are the people targeted directly. In others, it is the school boards that are the direct recipients of an intervention. Neither group of people is really the aim of an education program: we want students to learn. And, perhaps unsurprisingly, direct cash transfers to teachers and school boards don’t do much to improve learning. You could change the targeting in this case, and give the cash to the students, or to their parents, or maybe just to their mothers – there turn out to be many possible ways of doing this.

So it’s really important that we now have an example of a program that clearly did better than a direct cash transfer. From a theoretical perspective, this is akin to Jensen and Miller’s discovery of Giffen goods in their 2008 paper about rice and wheat in China: it validates the way we have been trying to model persistent poverty. From the practical side, it raises our confidence that the other interventions we are doing are worthwhile, in contexts where benchmarking to cash is impractical, overly complicated, or simply hasn’t been tried. Perhaps we haven’t proven that teacher training is better than a cash transfer, but we do at least know that high-quality programs can be more valuable than simply handing out money.

EDIT: Ben Meiselman pointed out a typo in the original version of this post (I was missing “best” in “the best you can do is tie”), which I have corrected.

*I am ignoring spillovers onto people who don’t get the cash here, which, as Berk Ozler has pointed out, can be a big deal – and are often negative.

**Doing this remains controversial in the development sector – so much so that many of the other projects that are trying cash benchmarking are doing it in “stealth mode”.

How to quickly convert Powerpoint slides to Beamer (and indent the code nicely too)

Like most economists, I like to present my research using Beamer. This is in part for costly signaling reasons – doing my slides via TeX proves that I am smart/diligent enough to do that. But it’s also for stylistic reasons: Beamer can automatically put a little index at the top of my slides so people know where I am going, and I like the default fonts and colors.

Moreover, Beamer forces me to obey the First Law of Slidemaking: get all those extra words off your slides. Powerpoint will happily rescale things and let you put tons of text on the screen at once. Beamer – unless you mess with it heavily – simply won’t, and so forces you to make short, parsimonious bullet points (and limit how many you use).

Not everyone is on the same page about which tool to use all the time, which in the past has occasionally meant I needed to take my coauthor’s Powerpoint slides and copy them into Beamer line-by-line. Fortunately, today I found a solution for automating that process.

StackExchange user Louis has a post where he shares VBA code that can quickly move your Powerpoint slides over to Beamer. His code is great but I wasn’t totally happy with the output so I made a couple of tweaks to simplify it a bit. You can view and download my code here; I provide it for free with no warranties, guarantees, or promises. Use at your own risk.

Here is how to use it:

  1. Convert your slides to .ppt format using “Save As”. (The code won’t work on .pptx files.)
  2. Put the file in its own folder that contains nothing else. WARNING: If files with the same names as those used by the code are in this folder they will be overwritten.
  3. Download the VBA code here (use at your own risk).
  4. Open up the Macros menu in Powerpoint (You can add it via “Customize the Ribbon”. Hit “New Group” on the right and rename it “Macros”, then select “Macros” on the left and hit “Add”.)
  5. Type “ConvertToBeamer” under “Macro name”, then hit “Create”
  6. Select all the text in the window that appears and delete it. Paste the VBA code in.
  7. Save, then close the Microsoft Visual Basic for Applications window.
  8. Hit the Macros button again, select “ConvertToBeamer” and run it.
  9. There will now be a .txt file with the Beamer code for your slides in it. (It won’t compile without an appropriate header.) If your file is called “MySlides.ppt” the text file will be “MySlides.txt”
  10. You need to manually fix a few special characters, as always when importing text into TeX. Look out for $, %, carriage returns, and all types of quotation marks and apostrophes. I also found that some tables came through fine while others needed manual tweaking.
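
Some of the cleanup in step 10 can be scripted. Here is a minimal Python sketch (not part of the VBA macro; the function name `clean_for_tex` is my own invention) that normalizes carriage returns, escapes `$` and `%`, and converts curly quotes to their TeX equivalents. It assumes your slides contain no genuine math or TeX comments, since it would escape those too, so check the result by eye.

```python
import re

# Curly quotes and apostrophes mapped to the usual TeX input conventions.
CURLY_QUOTES = {
    "\u201c": "``",  # left double quote
    "\u201d": "''",  # right double quote
    "\u2018": "`",   # left single quote
    "\u2019": "'",   # right single quote / apostrophe
}

def clean_for_tex(text):
    """Hypothetical helper: fix a few characters that trip up TeX."""
    # Normalize carriage returns to plain newlines.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Escape $ and % unless they are already preceded by a backslash.
    text = re.sub(r"(?<!\\)([$%])", r"\\\1", text)
    # Replace curly quotes with TeX-style quotes.
    for curly, tex in CURLY_QUOTES.items():
        text = text.replace(curly, tex)
    return text
```

To use it, read in “MySlides.txt”, pass the contents through `clean_for_tex`, and write the result back out. Straight quotation marks and tables still need to be checked by hand.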

One issue I had with the output was that it didn’t have any indentations, making it hard to recognize nested bullets. Fortunately I found this page that will indent TeX code automatically.

I found this to be a huge time saver. Even with figuring it out for the first time, tweaking the code, and writing this post, it still probably saved me several hours of work. Hopefully others find this useful as well.

Simon Heß has a brand-new Stata package for randomization inference

After I shared my recent blog post about randomization inference (or RI), I got a number of requests for the Stata code I’ve used for my own RI tests. This sounded like a good idea to me, but also like a hassle for me. And my code isn’t designed to be easily used by other folks, so it would be a hassle for them as well.

Fortunately, a new Stata Journal article – and Stata package – came out the day after my post that does much better than any of my own code I could have shared. The article, by Simon H. Heß, is “Randomization inference with Stata: A guide and software”. It addresses a key problem with how economists typically handle RI currently:

Whenever researchers use randomization inference, they regularly code individual program routines, risking inconsistencies and coding mistakes.

This is a major concern. Another advantage of his new package is that the existence of a simple Stata command to do RI means that more researchers are likely to actually use it.

You can run findit ritest in Stata to get Simon’s package.

I’ve started trying out ritest with the same dataset on the literacy program, and it handles everything we need it to do quite well. Our stratified lottery and clustered sampling are taken care of by basic options for the program. We have multiple treatment arms, which ritest can handle by permuting our multi-valued Study_Arm variable and then using Stata’s “i.” factor variable notation. We can then run a test for the difference between two different treatment effects by including “(_b[1.Study_Arm]-_b[2.Study_Arm])” in the list of expressions ritest computes. Highly recommended.

Randomization inference vs. bootstrapping for p-values

It’s a common conundrum in applied microeconomics. You ran an experiment on the universe of potential treatment schools in a given region, and you’re looking at school-level outcomes. Alternatively, you look at a policy that was idiosyncratically rolled out across US states, and you have the universe of state outcomes for your sample. What do the standard errors and p-values for your results even mean? After all, there’s no sampling error here, and the inference techniques we normally use in regression analyses are based on sampling error.

The answer is that the correct p-values to use are ones that capture uncertainty in terms of which units in your sample are assigned to the treatment group (instead of to the control group). As Athey and Imbens put it in their new handbook chapter on the econometrics of randomized experiments, “[W]e stress randomization-based inference as opposed to sampling-based inference. In randomization-based inference, uncertainty in estimates arises naturally from the random assignment of the treatments, rather than from hypothesized sampling from a large population.”

Athey and Imbens (2017) is part of an increasing push for economists to use randomization-based methods for doing causal inference. In particular, people looking at the results of field experiments are beginning to ask for p-values from randomization inference. As I have begun using this approach in my own work, and discussing it with my colleagues, I have encountered the common sentiment that “this is just bootstrapping”, or that it is extremely similar (indeed, it feels quite similar to me). While the randomization inference p-values are constructed similarly to bootstrapping-based p-values, there is a key difference that boils down to the distinction between the sampling-based and randomization-based approaches to inference:

Bootstrapped p-values are about uncertainty over the specific sample of the population you drew, while randomization inference p-values are about uncertainty over which units within your sample are assigned to the treatment.

When we bootstrap p-values, we appeal to the notion that we are working with a representative sample of the population to begin with. So we re-sample observations from our actual sample, with replacement, to simulate how sampling variation would affect our results.

In contrast, when we do randomization inference for p-values, this is based on the idea that the specific units in our sample that are treated are random. Thus there is some chance of a treatment-control difference in outcomes of any given magnitude simply based on which units are assigned to the treatment group – even if the treatment has no effect. So we re-assign “treatment” at random, to compute the probability of differences of various magnitudes under the null hypothesis that the treatment does nothing.

To be explicit about what this distinction means, below I lay out the procedure for computing p-values both ways, using my paper with Rebecca Thornton about a school-based literacy intervention in Uganda as an example data-generating process.

Randomization inference p-values

1. Randomly re-assign “treatment” in the same way that it was actually done. This was within strata of three schools (2 treatments and 1 control per cell). As we do this, the sample stays fixed.

2. Use the fake treatments to estimate our regression model:

y_{is}= \beta_0 +\beta_1 T1_s + \beta_2 T2_s + \textbf{L}^\prime_s\gamma +\eta y^{baseline}_{is} + \varepsilon_{is}

\textbf{L} are strata fixed effects.

The fake treatments have no effect (on average) by construction. There is some probability that they appear to have an effect by random chance. Our goal is to see where our point estimates lie within the distribution of “by random chance” point estimates from these simulations.

3. Store the estimates for \beta_1 and \beta_2.

4. Repeat 1000 times.

5. Look up the point estimates for our real data in the distribution of the 1000 fake treatment assignment simulations. Compute the share of the fake estimates that are larger in absolute value than our point estimates. This is our randomization inference p-value.
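
The five steps above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the code from our paper: the data are simulated (12 hypothetical strata of three schools), outcomes are school-level means, and the estimator is a simple treatment-minus-control difference in means rather than the full regression with strata fixed effects and baseline scores. I also count ties as exceeding the real estimate, which is a choice on my part. All function and variable names are made up for the example.

```python
import random

def assign_within_strata(strata, rng):
    """Step 1: re-draw arms within each cell of three schools (1 control, 2 treatments)."""
    assignment = {}
    for schools in strata.values():
        arms = ["control", "T1", "T2"]
        rng.shuffle(arms)
        for school, arm in zip(schools, arms):
            assignment[school] = arm
    return assignment

def effects(outcomes, assignment):
    """Step 2 (simplified): difference in mean outcomes, each treatment arm vs. control."""
    means = {}
    for arm in ("control", "T1", "T2"):
        values = [outcomes[s] for s, a in assignment.items() if a == arm]
        means[arm] = sum(values) / len(values)
    return means["T1"] - means["control"], means["T2"] - means["control"]

def ri_p_values(outcomes, strata, actual_assignment, reps=1000, seed=0):
    """Steps 3-5: share of fake estimates at least as large in absolute value as the real ones."""
    rng = random.Random(seed)
    real = effects(outcomes, actual_assignment)
    exceed = [0, 0]
    for _ in range(reps):
        # The sample (outcomes) stays fixed; only the treatment labels change.
        fake = effects(outcomes, assign_within_strata(strata, rng))
        for k in range(2):
            if abs(fake[k]) >= abs(real[k]):
                exceed[k] += 1
    return tuple(count / reps for count in exceed)

# Simulated example: T1 has a large true effect, T2 has none.
rng = random.Random(42)
strata = {c: [f"school_{c}_{j}" for j in range(3)] for c in range(12)}
actual = assign_within_strata(strata, rng)
true_effect = {"control": 0.0, "T1": 1.0, "T2": 0.0}
outcomes = {s: true_effect[a] + rng.gauss(0, 0.3) for s, a in actual.items()}

p_t1, p_t2 = ri_p_values(outcomes, strata, actual)
print(f"RI p-values: T1 = {p_t1:.3f}, T2 = {p_t2:.3f}")
```

Note that `outcomes` is never re-sampled: the only thing that changes across the 1000 repetitions is which schools are labeled as treated.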

Bootstrapped p-values

1. Randomly re-sample observations in the same way they were actually sampled. This was at the level of a school, which was our sampling unit. In every selected school we keep the original sample of kids.

This re-sampling is done with replacement, with a total sample equal to the number of schools in our actual dataset (38). Therefore almost all re-sampled datasets will have repeated copies of the same school. As we do this, the treatment status of any given school stays fixed.

2. Use the fake sample to estimate our regression model:

y_{is}= \beta_0 +\beta_1 T1_s + \beta_2 T2_s + \textbf{L}^\prime_s\gamma +\eta y^{baseline}_{is} + \varepsilon_{is}

\textbf{L} are strata fixed effects.

The treatments should in principle have the same average effect as they do in our real sample. Our goal is to see how much our point estimates vary as a result of sampling variation, using the re-sampled datasets as a simulation of the actual sampling variation in the population.

3. Store the estimates for \beta_1 and \beta_2.

4. Repeat 1000 times.

5. Compute the standard deviation of the estimates for \beta_1 and \beta_2 across the 1000 point estimates. This is our bootstrapped standard error. Use these, along with the point estimate from the real dataset, to do a two-sided t-test; the p-value from this test is our bootstrapped p-value.*
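
For comparison, here is a matching Python sketch of the bootstrap steps, again under simplifying assumptions rather than as our actual code: simulated data for 36 hypothetical schools, school-level mean outcomes, and a difference in means instead of the full regression. Two further choices are mine: I use a normal approximation in place of the t-test (to stay within the standard library), and I re-draw any re-sample that happens to contain no schools from one of the arms.

```python
import random
from statistics import NormalDist, stdev

def effects(schools):
    """Treatment-minus-control differences in means for a list of (arm, outcome) schools."""
    by_arm = {"control": [], "T1": [], "T2": []}
    for arm, outcome in schools:
        by_arm[arm].append(outcome)
    if not all(by_arm.values()):
        return None  # degenerate re-sample: some arm has no schools
    mean = {arm: sum(v) / len(v) for arm, v in by_arm.items()}
    return mean["T1"] - mean["control"], mean["T2"] - mean["control"]

def bootstrap_p_values(schools, reps=1000, seed=0):
    """Return (standard error, p-value) for each of the two treatment effects."""
    rng = random.Random(seed)
    real = effects(schools)
    draws = []
    while len(draws) < reps:
        # Step 1: re-sample schools with replacement; each school keeps its treatment status.
        resample = rng.choices(schools, k=len(schools))
        estimate = effects(resample)  # step 2 (simplified)
        if estimate is not None:
            draws.append(estimate)    # step 3
    results = []
    for k in range(2):
        # Step 5: the SD of the bootstrap estimates is the bootstrapped standard error.
        se = stdev([d[k] for d in draws])
        z = real[k] / se
        p = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided, normal approximation
        results.append((se, p))
    return results

# Simulated example: T1 has a large true effect, T2 has none.
rng = random.Random(7)
true_effect = {"control": 0.0, "T1": 1.0, "T2": 0.0}
schools = [(arm, true_effect[arm] + rng.gauss(0, 0.3))
           for arm in ("control", "T1", "T2") for _ in range(12)]

(se_t1, p_t1), (se_t2, p_t2) = bootstrap_p_values(schools)
print(f"T1: se = {se_t1:.3f}, p = {p_t1:.3f}; T2: se = {se_t2:.3f}, p = {p_t2:.3f}")
```

Here the contrast with the randomization inference procedure is in `rng.choices`: the set of schools changes from one repetition to the next, while each school's treatment label never does.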


I found Matthew Blackwell’s lecture notes to be a very helpful guide on how randomization inference works. Lasse Brune and Jeff Smith provided useful feedback and comments on the randomization inference algorithm, but any mistakes in this post are mine alone. If you do spot an error, please let me know so I can fix it!

EDIT: Guido Imbens shared a new version of his paper with Alberto Abadie, Susan Athey, and Jeffrey Wooldridge about the issue of what standard errors mean when your sample includes the entire population of interest (link). Reading an earlier version really helped with my own understanding of this issue, and I have often recommended it to friends who are struggling to understand why they even need standard errors for their estimates if they have all 50 states, every worker at a firm, etc.

*There are a few other methods of getting bootstrapped p-values but the spirit is the same.