Two recent news stories show how sensitive social science is to issues of data quality. According to John Kennedy and Shi Yaojiang, a large share of the missing women in China aren't actually missing at all. Instead, their parents and local officials either never registered their births or registered them late. Vincent Geloso reports that Cuba's remarkable infant mortality rate is partly attributable to doctors re-coding deaths in the first 28 days of life as deaths in the last few weeks of gestation.
Both of these data problems affect important scientific debates. The cost-effectiveness of Cuban health care is the envy of the world, and it has prompted research into how the country achieves it as well as discussions of how we should trade off freedom against health. China's missing women are an even bigger issue. Amartya Sen's original work on the topic has over 1,000 citations, and there are probably dozens of lines of research studying the causes and consequences of missing women – many of whom may in fact not be missing at all.
I am not sure that either of these reports is totally correct. What I am sure about is that each of these patterns must be going on to some extent. If officials in China can hit a heavily-promoted population target by hiding births, of course some of them will do so. Likewise, if parents can avoid a fine by lying about their kids, they are going to do that. And in a patriarchal culture, registering boys and giving them the associated rights makes more sense than registering girls. The same set of incentives holds in Cuba: doctors can hit their infant mortality targets by improving health outcomes, by preventing less-healthy fetuses from coming to term, or by making some minor changes to paperwork. It stands to reason that people will choose the last option at least some of the time.
Morten Jerven points out a similar issue in his phenomenal work Poor Numbers. Macroeconomic data for Africa is based on very spotty primary sources, and the resulting public datasets have errors that are driven by various people's incentives – even the simple incentive to avoid reporting missing values. These errors have real consequences: there is an extensive literature that uses these datasets to estimate cross-country growth regressions, which have played an important role in policy debates.
At my first job after college, my boss, Grecia Marrufo, told me that variables are only recorded correctly if someone is getting paid to get them right. She was referring to the fact that in health insurance data, many fields don't matter for payments, and so they are full of mistakes. There is a stronger version of this claim, though: if someone is being coerced into getting the data wrong, the data will be wrong. And any time people's incentives aren't aligned with getting the right answers, you will get systematic mistakes. I've seen this myself while running surveys: due to various intrinsic and extrinsic motivations, enumerators try to finish surveys quickly and end up faking data.
I'm not sure there is anything we can do to fully prevent fake data from corrupting social-scientific research, but I have a couple of ideas that I think would help. First, always cross-check data against other sources when you can. Second, use primary data – and understand how it was collected, by whom, and for what reason – whenever possible. Neither of these can perfectly protect us from chasing fake results down rabbit holes, but they will help a lot. In empirical microeconomics, I have seen a lot of progress on both fronts: important results are debated vigorously and challenged with other data, and more people are collecting their own data. But we still have to stay vigilant and aware of the potential data-reporting biases that could be driving results we regard as well-established.
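To make the first suggestion concrete, here is a minimal sketch of what cross-checking can look like in practice, written in Python with pandas. Everything in it is hypothetical – the file names, column names, and the 20-percent tolerance are stand-ins rather than references to any real dataset. The point is simply to pair the same indicator from two independent sources and flag the country-years where they disagree enough to warrant a closer look.

```python
# Minimal sketch of cross-checking one data source against another.
# All file names, column names, and the tolerance below are hypothetical.
import pandas as pd

# The same indicator (infant mortality rate) as reported by two
# independent sources, one row per country-year.
official = pd.read_csv("official_imr.csv")    # columns: country, year, imr
survey = pd.read_csv("survey_based_imr.csv")  # columns: country, year, imr

# Merge on country-year so each row pairs the two reported values.
merged = official.merge(
    survey, on=["country", "year"], suffixes=("_official", "_survey")
)

# Flag observations where the sources disagree by more than an
# arbitrary tolerance (here, 20 percent of the survey-based value).
merged["rel_gap"] = (
    (merged["imr_official"] - merged["imr_survey"]).abs() / merged["imr_survey"]
)
suspect = merged[merged["rel_gap"] > 0.20]

# These country-years deserve a look by hand before any analysis
# gets built on top of either source.
print(suspect.sort_values("rel_gap", ascending=False).head(20))
```

Any threshold like this is arbitrary; the value of the exercise is that large gaps between sources get investigated before an analysis is built on top of either one.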