It’s a common conundrum in applied microeconomics. You ran an experiment on the universe of potential treatment schools in a given region, and you’re looking at school-level outcomes. Alternatively, you look at a policy that was idiosyncratically rolled out across US states, and you have the universe of state outcomes for your sample. What do the standard errors and p-values for my results even mean? After all, there’s no sampling error here, and the inference techniques we normally use in regression analyses are based on sampling error.
The answer is that the correct p-values to use are ones that capture uncertainty in terms of which units in your sample are assigned to the treatment group (instead of to the control group). As Athey and Imbens put it in their new handbook chapter on the econometrics of randomized experiments, “[W]e stress randomization-based inference as opposed to sampling-based inference. In randomization-based inference, uncertainty in estimates arises naturally from the random assignment of the treatments, rather than from hypothesized sampling from a large population.”
Athey and Imbens (2017) is part of an increasing push for economists to use randomization-based methods for doing causal inference. In particular, people looking at the results of field experiments are beginning to ask for p-values from randomization inference. As I have begun using this approach in my own work, and discussing it with my colleagues, I have encountered the common sentiment that “this is just bootstrapping”, or that it is extremely similar (indeed, it feels quite similar to me). While the randomization inference p-values are constructed similarly to bootstrapping-based p-values, there is a key difference that boils down to the distinction between the sampling-based and randomization-based approaches to inference:
Bootstrapped p-values are about uncertainty over the specific sample of the population you drew, while randomization inference p-values are about uncertainty over which units within your sample are assigned to the treatment.
When we bootstrap p-values, we appeal to the notion that we are working with a representative sample of the population to begin with. So we re-sample observations from our actual sample, with replacement, to simulate how sampling variation would affect our results.
In contrast, when we do randomization inference for p-values, this is based on the idea that the specific units in our sample that are treated are random. Thus there is some chance of a treatment-control difference in outcomes of any given magnitude simply based on which units are assigned to the treatment group – even if the treatment has no effect. So we re-assign “treatment” at random, to compute the probability of differences of various magnitudes under the null hypothesis that the treatment does nothing.
To be explicit about what this distinction means, below I lay out the procedure for computing p-values both ways, using my paper with Rebecca Thornton about a school-based literacy intervention in Uganda as an example data-generating process.
Randomization inference p-values
1. Randomly re-assign “treatment” in the same way that it was actually done. This was within strata of three schools (2 treatments and 1 control per cell). As we do this, the sample stays fixed.
2. Use the fake treatments to estimate our regression model:
3. Store the estimates for and .
4. Repeat 1000 times.
5. Look up the point estimates for our real data in the distribution of the 1000 fake treatment assignment simulations. Compute the share of the fake #s that are higher in absolute value than our point estimates. This is our randomization inference p-value.
1. Randomly re-sample observations in the same way they were actually sampled. This was at the level of a school, which was our sampling unit. In every selected school we keep the original sample of kids.
This re-sampling is done with replacement, with a total sample equal to the number of schools in our actual dataset (38). Therefore almost all re-sampled datasets will have repeated copies of the same school. As we do this, the treatment status of any given school stays fixed.
2. Use the fake sample to estimate our regression model:
are strata fixed effects.
3. Store the estimates for and .
4. Repeat 1000 times.
5. Compute the standard deviation of the estimates for and across the 1000 point estimates. This is our bootstrapped standard error. Use these, along with the point estimate from the real dataset, to do a two-sided t-test; the p-value from this test is our bootstrapped p-value.*
I found Matthew Blackwell’s lecture notes to be a very helpful guide on how randomization inference works. Lasse Brune and Jeff Smith provided useful feedback and comments on the randomization inference algorithm, but any mistakes in this post are mine alone. If you do spot an error, please let me know so I can fix it!
EDIT: Guido Imbens shared a new version of his paper with Alberto Abadie, Susan Athey, and Jeffrey Wooldrige about the issue of what standard errors mean when your sample includes the entire population of interest (link). Reading an earlier version really helped with my own understanding of this issue, and I have often recommended it to friends who are struggling to understand why they even need standard errors for their estimates if they have all 50 states, every worker at a firm, etc.
4 thoughts on “Randomization inference vs. bootstrapping for p-values”
Good to see economists picking up on these issues. I do think there may be some confusion here re testing vs estimation though — see this paper by Kari Lock Morgan, who does an excellent job in clarifying the issues:
Thanks! I’ll take a look.
I’m currently working in something very similar that you might find useful…
A comment from David Reinstein that he was having trouble posting for some reason:
Thanks. I see the mechanical difference in procedures. Can you (maybe in a later post) give us some intuition for
– when to use one versus the other
– what the different implications are
– when these come out very different in practice?
Also, my thought on the Athey/Imbens paper was
… we nearly always want to make inferences about the population that the treatment and control groups are taken from (even thinking about a hypothetical super-population), not about the impact on the sampled groups themselves. So, with this in mind, when would I still want to use randomization inference.
And in practice, if the sampling group is large, maybe the results will tend to be similar?