# The Linear Education Model

To most people, statistics – especially the output of a statistics program like Stata – is just a mess of numbers that don’t mean a whole hell of a lot. This is a big problem, and I believe it emerges from the way statistics (or econometrics, or biostatistics, or whatever a given field’s flavor of stats is known as) is taught, and the way that experts are trained in it.  I have much to say about this broader topic, but today I want to rant about probits and logits. A warning to any non-stats-minded readers: this might get a little technical.

Probits and logits are techniques commonly taught in basic regression analysis courses for handling data with discrete outcomes, e.g. a person’s labor force participation. We used them in ECON 406 when I was teaching it last term, for example. Teaching them exacerbates all the worst problems with explaining statistics, because instead of the table of readily interpretable numbers you usually get with a linear regression (which tell you the slope of Y with respect to X), the default behavior of Stata and every other stats package I’ve used is to spit out a table of things called “index coefficients”. These are related to the slope we’re interested in, but it’s impossible for regular humans to interpret their magnitude directly, and in some cases their sign may even be different. You can force Stata to give you slopes, but it’s not the default behavior because of reasons. So people see a set of misleading numbers and often get fooled; I have seen really smart people misinterpret these.
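To make the gap between an index coefficient and a slope concrete, here is a minimal Python sketch for the logit case. The coefficient value of 0.8 is made up purely for illustration, and the function names are my own. The only fact it relies on is the chain rule: if P = logistic(index), then the slope of P with respect to X is the index coefficient times P(1 − P).

```python
import math

def logistic(z):
    """Logistic cdf: maps an index value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def logit_slope(beta, index):
    """Slope dP/dx of a logit model, evaluated at a given index value.

    With P = logistic(index), the chain rule gives beta * P * (1 - P).
    """
    p = logistic(index)
    return beta * p * (1.0 - p)

# Hypothetical index coefficient, chosen only for illustration.
beta = 0.8

# The same index coefficient implies a different slope at each point.
for index in (-2.0, 0.0, 2.0):
    p = logistic(index)
    slope = logit_slope(beta, index)
    print(f"index={index:+.1f}  P={p:.3f}  slope={slope:.3f}")
```

The slope is largest where P = 0.5, and even there it is only a quarter of the index coefficient. Anyone reading 0.8 off the table as "the effect" is off by a factor of at least four.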

Despite (or maybe because of) the level of inscrutability that probit and logit add to regression analysis, many experts really like them. This post by David Giles offers an unconventional defense of probit/logit over linear regression, which in the case of binary outcomes is known as the Linear Probability Model (LPM): that if the outcome is measured with error the parameters of the model are unidentified. Although I must admit I’m not sufficiently skilled in pure econometrics to figure out everything the authors are doing, it turns out that, in the underlying paper by Hausman et al., the authors assume that we know the cdf of the error term to be normal (probit) or logistic (logit). This strikes me as begging the question.

But it’s still odd that they get this result at all, since the properties of linear regression are robust to any cdf for the error term. Jorn-Steffen Pischke at Mostly Harmless Econometrics points out that my gut is not wrong: “The structural parameters of a binary choice model, just like the probit index coefficients, are not of particular interest to us. We care about the marginal effects” and the LPM does as good a job approximating them as a non-linear model. (A marginal effect is just a slope.) This is consistent with my general take on probits and logits, which is that they are better than the LPM only if we happen to know that the true distribution of the error term fits one of those cdfs and also the true model for the index coefficients, which is to say that they are better if we simulate fake data with those properties and in pretty much no other circumstances. My advice to novices running probits and logits is that if the marginal effects differ from the LPM results (or between probit and logit), you had better be pretty confident that the specific non-linearity you’re imposing is actually true.
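Here is a rough simulation sketch of Pischke's point, in plain Python. Everything in it is an assumption made for illustration: the "true" model is a logit with made-up parameters, and the single regressor is drawn from a normal distribution, which is a favorable case for the LPM. It generates fake binary outcomes, computes the average marginal effect implied by the true model, and compares it with the slope from an OLS regression of the outcome on X.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical "true" logit model, used only to generate fake data.
ALPHA, BETA = -0.5, 0.8
N = 20_000

def logistic(z):
    """Logistic cdf: maps an index value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Fake data: a normal regressor and a binary outcome drawn from the logit.
x = [random.gauss(0.0, 1.0) for _ in range(N)]
p = [logistic(ALPHA + BETA * xi) for xi in x]
y = [1.0 if random.random() < pi else 0.0 for pi in p]

# Average marginal effect implied by the true model:
# the sample mean of beta * P * (1 - P).
true_ame = sum(BETA * pi * (1.0 - pi) for pi in p) / N

# LPM: OLS of y on x. With one regressor the slope is cov(x, y) / var(x).
x_bar = sum(x) / N
y_bar = sum(y) / N
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
var_x = sum((xi - x_bar) ** 2 for xi in x)
lpm_slope = cov_xy / var_x

print(f"index coefficient:            {BETA:.3f}")
print(f"true average marginal effect: {true_ame:.3f}")
print(f"LPM slope:                    {lpm_slope:.3f}")
```

In this setup the LPM slope should land close to the true average marginal effect and nowhere near the index coefficient, even though the data were generated by a logit. Other designs (skewed regressors, probabilities very near 0 or 1) may not match as closely, so treat this as one illustration rather than a proof.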

Giles’s argument stays away from an even sillier defense of probits and logits, which is that they prevent the model from predicting values of the outcome variable that are more than 1 or less than 0. I’ve always disliked this, because economists and statisticians make a living pretending that discrete variables are continuous. For example, it’s not too uncommon to look at the impact of some intervention on how much schooling a student gets. Schooling is sort of continuous – you could get a fractional year of schooling – but on most surveys, including my own, it’s impossible to report anything but a whole number. It also has definite bounds – less than zero is impossible, and so is more than some top-code on the survey (typically set at the end of graduate school). I can run something I call a “linear education model” (LEM) to estimate that impact, but it’s sometimes going to predict negative schooling for some students. That’s fine – it’s a way of estimating the average change in schooling due to the intervention, not a perfect model of the world. It’s also usually going to return predicted values that fall between two whole years of schooling, e.g. 11.5. This is also fine – everyone knows what this means, namely that some share of the time a person will end up at each discrete value such that the average is 11.5.

Economists run models like the LEM all the time. Technically we could use a multinomial logit, but nearly everyone agrees that would be overkill. Linear regression works fine. Now imagine we recode schooling into a binary variable, with 0 being “primary or less” and 1 being “secondary or more”. Can we still run a linear regression and learn something of interest? Definitely.
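A small sketch of that recoding exercise, again in plain Python with fake data. All the numbers are hypothetical: the top-code of 18, the one-year treatment effect built into the simulation, and the rule that 9 or more years counts as “secondary or more”. It runs the same one-regressor OLS twice, once on whole-number years of schooling (the LEM) and once on the binary recode (the LPM).

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

TOP_CODE = 18        # hypothetical survey top-code for years of schooling
SECONDARY_MIN = 9    # hypothetical cutoff for "secondary or more"
N = 10_000

def simulate_schooling(treated):
    """Draw whole-number years of schooling, bounded at 0 and TOP_CODE."""
    latent = random.gauss(9.0 + (1.0 if treated else 0.0), 3.0)
    return min(TOP_CODE, max(0, round(latent)))

def mean(values):
    return sum(values) / len(values)

def ols(x, y):
    """Return (intercept, slope) from an OLS regression of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    var_x = sum((xi - x_bar) ** 2 for xi in x)
    slope = cov_xy / var_x
    return y_bar - slope * x_bar, slope

# Alternate treatment status so half the fake sample is treated.
treat = [i % 2 for i in range(N)]
years = [simulate_schooling(t) for t in treat]
secondary = [1 if s >= SECONDARY_MIN else 0 for s in years]

# LEM: years of schooling on treatment.
lem_intercept, lem_slope = ols(treat, years)
# LPM: the binary recode on treatment.
lpm_intercept, lpm_slope = ols(treat, secondary)

print(f"LEM: predicted years, control = {lem_intercept:.2f}, "
      f"treated = {lem_intercept + lem_slope:.2f}")
print(f"LPM: predicted share with secondary or more, "
      f"control = {lpm_intercept:.3f}, treated = {lpm_intercept + lpm_slope:.3f}")
```

With a single binary regressor, the OLS slope is just the treated-minus-control difference in means. The LEM predicts fractional years that nobody could report on the survey, and the LPM slope is the change in the share of students with secondary schooling or more, which is exactly the kind of quantity we wanted to learn.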