Hal Varian on Big Data

The self-recommending paper is here.

When confronted with a prediction problem of this sort an economist would think immediately of a linear or logistic regression. However, there may be better choices, particularly if a lot of data is available. These include nonlinear methods such as 1) neural nets, 2) support vector machines, 3) classification and regression trees, 4) random forests, and 5) penalized regression such as lasso, lars, and elastic nets.

In one of his examples, he redoes the Boston Fed study that showed that race was a factor in mortgage denials. Using the classification tree method, he finds that a tree that omits race as a variable fits the data just as well as a tree that includes race, which implies that race was not an important factor.
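For readers who want to see the mechanics of that comparison, here is a minimal sketch in plain Python. It is not Varian's code and it does not use the Boston Fed data: the 200 rows are synthetic, and the names "ratio", "flag", build, and accuracy are all made up for illustration. By construction the outcome depends only on "ratio", while "flag" is merely correlated with the outcome. The sketch grows a small Gini-based classification tree three times (all features, "flag" omitted, "flag" alone) and compares in-sample fit.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, features):
    """Return the (feature, threshold) that most reduces weighted Gini, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for f in features:
        values = sorted({r[f] for r in rows})
        # Candidate thresholds are midpoints between adjacent distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build(rows, labels, features, depth):
    """Grow a tree greedily; a leaf is the majority label of its rows."""
    split = best_split(rows, labels, features) if depth > 0 else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [i for i, r in enumerate(rows) if r[f] <= t]
    right = [i for i, r in enumerate(rows) if r[f] > t]
    return (
        f, t,
        build([rows[i] for i in left], [labels[i] for i in left], features, depth - 1),
        build([rows[i] for i in right], [labels[i] for i in right], features, depth - 1),
    )

def predict(tree, row):
    """Walk from the root to a leaf and return the leaf's label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

def accuracy(rows, labels, features, depth=2):
    """In-sample share of rows the fitted tree classifies correctly."""
    tree = build(rows, labels, features, depth)
    return sum(predict(tree, r) == y for r, y in zip(rows, labels)) / len(rows)

# Synthetic data (an assumption, not real mortgage records): the outcome is
# fully determined by "ratio"; "flag" is only correlated with the outcome.
rows, labels = [], []
for i in range(200):
    denied = 1 if i >= 80 else 0
    flag = 1 if i % 5 < (4 if denied else 1) else 0
    rows.append({"ratio": i / 200, "flag": flag})
    labels.append(denied)

acc_full = accuracy(rows, labels, ["ratio", "flag"])
acc_no_flag = accuracy(rows, labels, ["ratio"])
acc_flag_only = accuracy(rows, labels, ["flag"])
print(acc_full, acc_no_flag, acc_flag_only)
```

In this toy setup the tree without "flag" fits exactly as well as the tree with it, even though "flag" on its own predicts better than a coin toss. That is the shape of the comparison described above. Whether the same inference is warranted on real data depends on what the remaining variables are picking up.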

Thanks to Mark Thoma for the pointer.

1 thought on “Hal Varian on Big Data”

  1. Nonsense jargon aside (support vector machines??), the different methods are fulfilling what Ed Leamer wrote about 35 years ago. You have to find a way to, first, compute all possible regressions given the data and plausible theory, and then present all possible regression results in a way that establishes whether or not there is a robust and meaningful finding. This is “avoid cherry-picking” taken to its logical conclusion.
