Randomization inference vs. bootstrapping for p-values

It’s a common conundrum in applied microeconomics. You ran an experiment on the universe of potential treatment schools in a given region, and you’re looking at school-level outcomes. Alternatively, you look at a policy that was idiosyncratically rolled out across US states, and you have the universe of state outcomes for your sample. What do the standard errors and p-values for my results even mean? After all, there’s no sampling error here, and the inference techniques we normally use in regression analyses are based on sampling error.

The answer is that the correct p-values to use are ones that capture uncertainty in terms of which units in your sample are assigned to the treatment group (instead of to the control group). As Athey and Imbens put it in their new handbook chapter on the econometrics of randomized experiments, “[W]e stress randomization-based inference as opposed to sampling-based inference. In randomization-based inference, uncertainty in estimates arises naturally from the random assignment of the treatments, rather than from hypothesized sampling from a large population.”

Athey and Imbens (2017) is part of an increasing push for economists to use randomization-based methods for doing causal inference. In particular, people looking at the results of field experiments are beginning to ask for p-values from randomization inference. As I have begun using this approach in my own work, and discussing it with my colleagues, I have encountered the common sentiment that “this is just bootstrapping”, or that it is extremely similar (indeed, it feels quite similar to me). While the randomization inference p-values are constructed similarly to bootstrapping-based p-values, there is a key difference that boils down to the distinction between the sampling-based and randomization-based approaches to inference:

Bootstrapped p-values are about uncertainty over the specific sample of the population you drew, while randomization inference p-values are about uncertainty over which units within your sample are assigned to the treatment.

When we bootstrap p-values, we appeal to the notion that we are working with a representative sample of the population to begin with. So we re-sample observations from our actual sample, with replacement, to simulate how sampling variation would affect our results.

In contrast, when we do randomization inference for p-values, this is based on the idea that the specific units in our sample that are treated are random. Thus there is some chance of a treatment-control difference in outcomes of any given magnitude simply based on which units are assigned to the treatment group – even if the treatment has no effect. So we re-assign “treatment” at random, to compute the probability of differences of various magnitudes under the null hypothesis that the treatment does nothing.

To be explicit about what this distinction means, below I lay out the procedure for computing p-values both ways, using my paper with Rebecca Thornton about a school-based literacy intervention in Uganda as an example data-generating process.

Randomization inference p-values

1. Randomly re-assign “treatment” in the same way that it was actually done. This was within strata of three schools (2 treatments and 1 control per cell). As we do this, the sample stays fixed.

2. Use the fake treatments to estimate our regression model:

y_{is}= \beta_0 +\beta_1 T1_s + \beta_2 T2_s + \textbf{L}^\prime_s\gamma +\eta y^{baseline}_{is} + \varepsilon_{is}

\textbf{L} are strata fixed effects.
The fake treatments have no effect (on average) by construction. There is some probability that they appear to have an effect by random chance. Our goal is to see where our point estimates lie within the distribution of “by random chance” point estimates from these simulations.

3. Store the estimates for \beta_1 and \beta_2.

4. Repeat 1000 times.

5. Look up the point estimates for our real data in the distribution of the 1000 fake treatment assignment simulations. Compute the share of the fake #s that are higher in absolute value than our point estimates. This is our randomization inference p-value.

Bootstrapped p-values

1. Randomly re-sample observations in the same way they were actually sampled. This was at the level of a school, which was our sampling unit. In every selected school we keep the original sample of kids.

This re-sampling is done with replacement, with a total sample equal to the number of schools in our actual dataset (38). Therefore almost all re-sampled datasets will have repeated copies of the same school. As we do this, the treatment status of any given school stays fixed.

2. Use the fake sample to estimate our regression model:

y_{is}= \beta_0 +\beta_1 T1_s + \beta_2 T2_s + \textbf{L}^\prime_s\gamma +\eta y^{baseline}_{is} + \varepsilon_{is}

\textbf{L} are strata fixed effects.

The treatments should in principle have the same average effect as they do in our real sample. Our goal is to see how much our point estimates vary as a result of sampling variation, using the re-sampled datasets as a simulation of the actual sampling variation in the population.

3. Store the estimates for \beta_1 and \beta_2.

4. Repeat 1000 times.

5. Compute the standard deviation of the estimates for \beta_1 and \beta_2 across the 1000 point estimates. This is our bootstrapped standard error. Use these, along with the point estimate from the real dataset, to do a two-sided t-test; the p-value from this test is our bootstrapped p-value.*


I found Matthew Blackwell’s lecture notes to be a very helpful guide on how randomization inference works. Lasse Brune and Jeff Smith provided useful feedback and comments on the randomization inference algorithm, but any mistakes in this post are mine alone. If you do spot an error, please let me know so I can fix it!

EDIT: Guido Imbens shared a new version of his paper with Alberto Abadie, Susan Athey, and Jeffrey Wooldrige about the issue of what standard errors mean when your sample includes the entire population of interest (link). Reading an earlier version really helped with my own understanding of this issue, and I have often recommended it to friends who are struggling to understand why they even need standard errors for their estimates if they have all 50 states, every worker at a firm, etc.

*There are a few other methods of getting bootstrapped p-values but the spirit is the same.

4 thoughts on “Randomization inference vs. bootstrapping for p-values”

  1. A comment from David Reinstein that he was having trouble posting for some reason:

    Thanks. I see the mechanical difference in procedures. Can you (maybe in a later post) give us some intuition for
    – when to use one versus the other
    – what the different implications are
    – when these come out very different in practice?

    Also, my thought on the Athey/Imbens paper was
    … we nearly always want to make inferences about the population that the treatment and control groups are taken from (even thinking about a hypothetical super-population), not about the impact on the sampled groups themselves. So, with this in mind, when would I still want to use randomization inference.

    And in practice, if the sampling group is large, maybe the results will tend to be similar?

Leave a Reply

Your email address will not be published. Required fields are marked *