Bad coding in social science research: first in a series

A growing share of the work in quantitative social science is, in effect, programming. We have to write code to solve models numerically, to assemble datasets, to move them between data analysis packages, to clean them, and to analyze them. This is a problem – maybe a huge problem – because most of us (and I’m probably not an exception) are pretty bad coders. We weren’t trained to write code. Most of us have never taken a class in it. I’m passingly familiar with some of the best practices of coding from having lived with legitimate programmers for a stretch, but typical code written by social scientists breaks every rule. Need to do something similar ten times? Let’s copy-paste it ten times and edit each line slightly! That’ll work out great.
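To make that concrete, here’s a toy sketch in Python (the file names and the inflation adjustment are made up for illustration, not taken from any actual paper): the copy-paste version on top, then the same logic written once as a function.

```python
import pandas as pd

# --- The copy-paste version: near-identical lines, each edited by hand. ---
# One forgotten edit (wave3 silently re-reads wave2's file) and the bug
# produces plausible-looking, wrong results.
wave1 = pd.read_csv("wave1.csv"); wave1["income"] *= 1.02
wave2 = pd.read_csv("wave2.csv"); wave2["income"] *= 1.04
wave3 = pd.read_csv("wave2.csv"); wave3["income"] *= 1.06  # oops
# ...seven more lines like these...

# --- The loop version: the step exists in exactly one place. ---
def load_wave(n: int, inflation: float) -> pd.DataFrame:
    """Read one survey wave and apply the inflation adjustment."""
    df = pd.read_csv(f"wave{n}.csv")
    df["income"] *= inflation
    return df

waves = {n: load_wave(n, 1.0 + 0.02 * n) for n in range(1, 11)}
```

The point isn’t elegance. When the step is written once, a fix or a change happens in one place instead of ten, and the wave3-reads-wave2 mistake becomes impossible to type.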

It’s hard to know how much this has screwed up research results, because releasing code is a recent practice and still not widespread. But my UM colleague Joe Golden and I often gripe about the cases we see, and he sometimes emails me egregious ones he comes across. I’m going to start posting the ones we find here, so I have a quick list of the issues to point to. Maybe if I have time I’ll even dig up the famous economics papers where the central result was due to a confirmed or likely coding error.

The latest case (via Joe) comes from the US Census’s anonymization methods for its public data releases. The point of anonymization is to let researchers use the data without compromising individuals’ privacy, but somebody screwed up. As a result, some of the statistical properties of the data – which the anonymization process is supposed to preserve – were changed, potentially invalidating loads of research.
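To see what “preserving statistical properties” means, and how quietly it can fail, here’s a toy sketch – assuming a simple noise-injection scheme, which is not the Census’s actual procedure: zero-mean noise masks individual values but leaves the sample mean roughly intact, while swapping two arguments biases every statistic computed from the released data.

```python
import numpy as np

rng = np.random.default_rng(0)
income = rng.lognormal(mean=10.5, sigma=0.8, size=100_000)  # fake microdata

# Correct: add zero-mean noise (location 0, spread 5_000).
# Individual values are masked, but the mean is preserved in expectation.
anonymized_ok = income + rng.normal(0.0, 5_000.0, size=income.size)

# Buggy: the two arguments are swapped, so the "noise" is a constant 5_000
# added to everyone -- every mean, total, and regression run on the released
# data is now shifted, and nothing crashes to warn you.
anonymized_bad = income + rng.normal(5_000.0, 0.0, size=income.size)

print(income.mean(), anonymized_ok.mean(), anonymized_bad.mean())
```

The unsettling part is that both released datasets look equally plausible; only checking the moments the procedure was supposed to preserve catches the difference.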
