If you want to study the effect of a variable *x* on an outcome *y*, there are two broad strategies. One is to run a randomized experiment or one of its close cousins, like a regression discontinuity or a difference-in-differences. The other is to adjust for observable differences in the data that are related to *x* and *y* – a list of variables that I’ll denote as *Z*. For example, if you want to estimate the effect of education on wages, you typically want to include gender in *Z* (among many other things). Control for enough observable characteristics and you can sometimes claim that you have isolated the causal effect of *x* on *y* – you’ve distilled the causation out of the correlation.

This approach has led to no end of problems in data analysis, especially in social science. It relies on an assumption that many researchers seem to ignore: that there are no other factors, omitted from *Z*, that are related to both *x* and *y*. That assumption is often violated.

This post is motivated by another problem that I see all too often in empirical work. People seem to have little idea how to select variables for inclusion in *Z*, and, critically, don’t understand what *not* to include in *Z*. A key point in knowing what not to control for is the maxim in the title of this post:

*Don’t control for outcome variables.*

For example, if you want to know how a student’s grades are affected by their parents’ spending on their college education, you might control for race, high school GPA, age, and gender. What you certainly shouldn’t control for is student employment, which is a direct result of parental financial support.** Unfortunately, a prominent study does exactly that in most of its analyses (and has not, to my knowledge, been corrected or retracted).

Why is it bad to control for variables that are affected by the *x* you are studying? It leads to biased coefficient estimates – i.e., you get the wrong answer. There is a formal proof of this point in a 2005 paper by Wooldridge. But it’s easy to see the problem using a quick “proof-by-Stata”.*** I’m going to simulate fake data and show that including outcome variables as controls leads to very wrong answers.

Here is the code to build the fake dataset:

```
clear all
set obs 1000
set seed 346787
gen x = 2*runiform()
gen e = rnormal()
gen u = rnormal()
gen y = x + e
gen other_outcome = x^2 + ln(x) + u
gen codetermined_outcome = x^2 + ln(x) + e + u
```

A big advantage here is that I know exactly how *x* affects *y*: the correct coefficient is 1. With real data, we can always argue about what the true answer is.

A simple regression of *y* on *x* gives us the right answer:

```
reg y x

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |    1.08877   .0564726    19.28   0.000     .9779512    1.199588
       _cons |   -.096074   .0645102    -1.49   0.137    -.2226653    .0305172
------------------------------------------------------------------------------
```

If we control for a codetermined outcome variable then our answer is way off:

```
reg y x codetermined_outcome

-----------------------------------------------------------------------------------
                y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
------------------+----------------------------------------------------------------
                x |  -.6710671   .0712946    -9.41   0.000    -.8109717   -.5311624
codetermined_ou~e |   .4758925   .0157956    30.13   0.000     .4448961    .5068889
            _cons |   1.192013   .0633118    18.83   0.000     1.067773    1.316252
-----------------------------------------------------------------------------------
```

Controlling for the other outcome variable doesn’t bias our point estimate, but it widens the confidence interval:

```
reg y x other_outcome

-------------------------------------------------------------------------------
            y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
            x |   1.084171   .1226962     8.84   0.000     .8433988    1.324944
other_outcome |   .0012741   .0301765     0.04   0.966    -.0579426    .0604907
        _cons |  -.0927479   .1018421    -0.91   0.363    -.2925974    .1071017
-------------------------------------------------------------------------------
```

Both of these controls cause problems, but of different kinds. Codetermined outcomes – things that are driven by the same unobservable factors that also drive *y* – are the worst: they give us the wrong answer on average, and are terrible control variables. (For the same reason, you also shouldn’t control for things that are a direct result of *y*.) Other outcomes are bad too – they blow up our standard errors and confidence intervals, because they are highly collinear with *x* and add no new information that is not already in *x*. The safe move is simply to avoid controlling for outcomes entirely.
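For readers who don’t run Stata, here is a rough Python translation of the same simulation (a sketch, not the original code; numpy’s random numbers differ from Stata’s, so the estimates won’t match the output above exactly, but the pattern is the same):

```python
import numpy as np

# Rough Python translation of the Stata simulation above.
rng = np.random.default_rng(346787)
n = 1000

x = 2 * rng.uniform(size=n)
e = rng.normal(size=n)
u = rng.normal(size=n)
y = x + e                                  # true coefficient on x is 1
other_outcome = x**2 + np.log(x) + u
codetermined_outcome = x**2 + np.log(x) + e + u

def ols(y, *regressors):
    """OLS with a constant; returns (coefficients, classical standard errors)."""
    X = np.column_stack([np.ones_like(y), *regressors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, se

b_good, se_good = ols(y, x)                   # reg y x
b_bad, _ = ols(y, x, codetermined_outcome)    # reg y x codetermined_outcome
b_noisy, se_noisy = ols(y, x, other_outcome)  # reg y x other_outcome

print(b_good[1])                 # close to the true value of 1
print(b_bad[1])                  # far from 1 (negative in the Stata run)
print(se_noisy[1] / se_good[1])  # roughly doubled standard error on x
```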

Hey Jason,

The motivation was about a case where x is endogenous (like reg wage education), but x is random in your example. If x is endogenous, I think you may sometimes consider controlling for other outcomes. Try:

```
replace x = 2*runiform() + e   // x endogenous
replace y = x + e
replace codetermined_outcome = x^2 + ln(x) + e + u
reg y x
reg y x codetermined_outcome
```

Interesting point. I think it might be a function of the dropped observations due to logging negative values of x – when I make the following modification to the code, controlling for codetermined_outcome no longer fixes the bias in the coefficient on x:

```
replace x = 2*runiform() + e   // x endogenous
replace y = x + e
replace codetermined_outcome = x^2 + e + u
reg y x
reg y x codetermined_outcome
```

Can you get this to work without the missing-values issue? In retrospect I should never have arbitrarily chosen to include a log in my arbitrary functional form.
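A rough Python translation of the modified simulation (numpy in place of Stata; the seed and the larger sample size are my arbitrary choices) suggests the same conclusion: with the log term gone there are no missing values, and the coefficient on x stays biased both with and without the codetermined control:

```python
import numpy as np

# Sketch of the endogenous-x case without the log term.
# Seed and sample size are arbitrary; a large n makes the bias easy to see.
rng = np.random.default_rng(12345)
n = 100_000

e = rng.normal(size=n)
u = rng.normal(size=n)
x = 2 * rng.uniform(size=n) + e      # x endogenous: built from e
y = x + e                            # true coefficient on x is still 1
codetermined_outcome = x**2 + e + u  # no log, so no dropped observations

def ols_coef(y, *regressors):
    """OLS with a constant; returns the coefficient vector."""
    X = np.column_stack([np.ones_like(y), *regressors])
    return np.linalg.lstsq(X, y, rcond=None)[0]

print(ols_coef(y, x)[1])                        # biased well above 1
print(ols_coef(y, x, codetermined_outcome)[1])  # still biased away from 1
```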