Administrative data in economic research

Timothy Taylor writes,

economic research often goes well beyond these extremely well-known data sources. One big shift has been to the use of “administrative” data, which is a catch-all term to describe data that was not collected for research purposes, but instead developed for administrative reasons. Examples would include tax data from the Internal Revenue Service, data on earnings from the Social Security Administration, data on details of health care spending from Medicare and Medicaid, and education data on teachers and students collected by school districts. There is also private-sector administrative data about issues from financial markets to cell-phone data, credit card data, and “scanner” data generated by cash registers when you, say, buy groceries.

Vilhuber writes: “In 1960, 76% of empirical AER [American Economic Review] articles used public-use data. By 2010, 60% used administrative data, presumably none of which is public use …”

The quote is from a paper by Lars Vilhuber.

My thoughts:

1. The Census Bureau has procedures in place for allowing researchers to use administrative data while making sure that no individual record can be identified. I know this because one of my daughters worked at Census in this area for a few years.

2. The private firms that collect data as part of their business are not going to waste resources making sure that the data is free from coding mistakes and other errors. Researchers who are used to assuming that they are working with clean data are going to be surprised.

3. The data surrounding COVID are particularly treacherous. I found this out first-hand back when I was trying to follow the virus crisis closely.

4. This course should be taught more widely. As a point of trivia, the professor, Robert McDonald, was an undergraduate roommate of Russ Roberts (now the host of EconTalk) and also shared an apartment with me our first year of graduate school.

4 thoughts on “Administrative data in economic research”

  1. Re: “Researchers who are used to assuming that they are working with clean data are going to be surprised.”
    My experience is that public government data is not free of errors, and researchers working on private and administrative data are well aware that the data have errors.

    Younger social scientists seem to be putting more and more effort into a good data workflow. I see more code sharing; more integration of the data analysis directly into research documents (knitr and R Markdown, and even Stata as of version 16); research notebooks (Jupyter and the MATLAB Live Editor); services like WRDS, Quandl, and FRED offering API and database interfaces to their data, which allow reproducible data requests; and some limited use of computer-science techniques such as version control (e.g., Git) and unit testing. My impression is that things are getting noticeably better in these regards, and a significant share of this learning is happening through formal instruction in undergraduate and graduate school.

    • I partially agree. Most governmental data is awful, even that provided to the public. I recently worked on a project for a city planning department which needed to know how many housing units the city had (you read that right). I could almost forgive their ignorance because all the available public data was a huge conflicting mess. My experiences with private company data are not much more comforting.

      I disagree that academics and other researchers do much about that. The most common phrase in this particular sausage factory seems to be “this is the best available data” before plowing on. That statement might be true as far as it goes, but if the best available eggs are rotten, you should just not make the omelette. Researchers seem indifferent to the quality of the data so long as they can get it to say what they want it to.

      • I agree there is only so much you can do with bad data. Nevertheless, most of the time reproducible research practices and a good data science workflow will raise the overall level of research output, as long as the underlying data aren’t completely unfit for purpose.
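[Editor's note: the unit-testing practice mentioned in the thread above can be sketched roughly as follows. The function name, record fields, and cleaning rules here are hypothetical illustrations, not drawn from any of the datasets discussed.]

```python
def clean_earnings(records):
    """Drop records with missing, non-numeric, or negative earnings;
    coerce surviving values to float. (Hypothetical cleaning step.)"""
    cleaned = []
    for r in records:
        value = r.get("earnings")
        if value is None:
            continue  # missing value, as is common in administrative files
        try:
            value = float(value)
        except (TypeError, ValueError):
            continue  # non-numeric entry such as "n/a"
        if value < 0:
            continue  # likely a coding error
        cleaned.append({**r, "earnings": value})
    return cleaned


def test_clean_earnings():
    # A unit test documents the cleaning rules and catches regressions
    # if the cleaning code is later changed.
    raw = [
        {"id": 1, "earnings": "52000"},  # valid, stored as a string
        {"id": 2, "earnings": None},     # missing value
        {"id": 3, "earnings": -100},     # coding error
        {"id": 4, "earnings": "n/a"},    # non-numeric entry
    ]
    result = clean_earnings(raw)
    assert [r["id"] for r in result] == [1]
    assert result[0]["earnings"] == 52000.0


test_clean_earnings()
```

Writing the cleaning rules as a tested function, rather than applying them ad hoc, is one small way the reproducibility practices described above make errors in administrative data visible instead of silently propagated.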
