Machine Learning and Holdback Samples

Susan Athey writes,

One common feature of many ML methods is that they use cross-validation to select model complexity; that is, they repeatedly estimate a model on part of the data and then test it on another part, and they find the “complexity penalty term” that fits the data best in terms of mean-squared error of the prediction (the squared difference between the model prediction and the actual outcome).

Pointer from Tyler Cowen.
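
To make the quoted procedure concrete, here is a rough sketch in R of choosing a ridge-style complexity penalty by 5-fold cross-validation. The data, the grid of penalties, and the function names are all made up for illustration; this is just one version of the kind of procedure Athey describes.

    # Made-up data: 10 candidate predictors, only two of which matter
    set.seed(1)
    n = 200
    x = matrix(rnorm(n * 10), n, 10)
    y = 3 * x[, 1] - 2 * x[, 2] + rnorm(n)

    # Closed-form ridge coefficients: (X'X + lambda*I)^(-1) X'y
    ridge_fit = function(x, y, lambda) {
      solve(crossprod(x) + lambda * diag(ncol(x)), crossprod(x, y))
    }

    # Average prediction error across 5 folds for a given penalty
    cv_mse = function(x, y, lambda, k = 5) {
      fold = sample(rep(1:k, length.out = nrow(x)))
      mse = numeric(k)
      for (j in 1:k) {
        beta = ridge_fit(x[fold != j, ], y[fold != j], lambda)  # fit on k-1 folds
        pred = x[fold == j, ] %*% beta                          # predict the held-out fold
        mse[j] = mean((y[fold == j] - pred)^2)
      }
      mean(mse)
    }

    # The "complexity penalty term" is the lambda with the best held-out error
    lambdas = c(0.01, 0.1, 1, 10, 100)
    cv_scores = sapply(lambdas, function(l) cv_mse(x, y, l))
    best_lambda = lambdas[which.min(cv_scores)]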

In the early 1980s, Ed Leamer caused quite a ruckus when he pointed out that nearly all econometricians at that time engaged in specification searches. The statistical validity of multiple regression is based on the assumption that you confront the data only once. Instead, economists would try dozens of specifications until they found one that satisfied their desires for high R-squared and support for their prior beliefs. Because the same data has been re-used over and over, there is a good chance that the process of specification searching leads to spurious relationships.

One possible check on the Leamer problem is to use a holdback sample. That is, you take some observations out of your sample while you do your specification searches on the rest of the data. Then when you are done searching and have your preferred specification, you try it out on the holdback sample. If it still works, then you are fine. If the preferred specification falls apart on the holdback sample, then it indicates that your specification searching produced a spurious relationship.

Machine learning sounds a bit like trying this process over and over again until you get a good fit with the holdback sample. If the holdback sample is a fixed set of data, then this would again lead you to find spurious relationships. Instead, if you randomly select a different holdback sample each time you try a new specification, then I think that the process might be more robust.
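
Roughly, the basic check looks like this in R; the fake data and the 25-observation holdback are only for illustration, and re-drawing the holdback at random each time would simply mean repeating the split inside a loop.

    # Fake data with a known relationship
    set.seed(2)
    df = data.frame(x = rnorm(100))
    df$y = 2 * df$x + rnorm(100)

    holdback = sample(nrow(df), 25)           # set aside 25 observations
    fit = lm(y ~ x, data = df[-holdback, ])   # do the specification search on the rest

    in_sample_r2 = cor(fitted(fit), df$y[-holdback])^2
    holdback_r2  = cor(predict(fit, newdata = df[holdback, ]), df$y[holdback])^2
    # a large drop from in_sample_r2 to holdback_r2 suggests a spurious specification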

I don’t know how it is done in practice.

10 thoughts on “Machine Learning and Holdback Samples”

  1. In practice there’s often a third “validation” data set. You have a “training” sample, then you test your trained model on a “test” sample. When you’re all done checking every conceivable model and have come up with what you think is the best possible one, you check it against the “validation” set. In my work setting, the datasets I have are too small to break apart like this. But I’ve heard other professional modelers say they do this (or at least advocate the approach).
    I highly recommend doing Andrew Ng’s Coursera class on machine learning, if you want to pick up on some of these details. I didn’t bother with most of the homework sets, but the videos were extremely enlightening.

    • I took Ng’s Coursera class as well and it was very good. The standard recommended practice is to use 40% of the data to train, 30% to test, and 30% to cross-validate. If you don’t have enough data to divide it this way, there is k-fold cross-validation, where the data is divided into k groups (usually 5 or 10), the training is done on k-1 of them, and the model is tested on the remaining group. There is a little handwaving that goes on, but mostly the people I’ve talked to are at least aware of the possibility of getting spurious results in various ways.
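
    A rough sketch of that kind of 40/30/30 split in R; the data frame, the proportions, and the variable names are only illustrative.

    # Randomly assign each row to one of three roles (40/30/30)
    set.seed(3)
    df = data.frame(x = rnorm(1000), y = rnorm(1000))
    role = sample(c("train", "test", "validate"), nrow(df),
                  replace = TRUE, prob = c(0.4, 0.3, 0.3))

    train = df[role == "train", ]      # fit candidate models here
    test  = df[role == "test", ]       # compare candidate models here
    valid = df[role == "validate", ]   # touch once, for the final model only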

  2. There is a similar problem even in experimental sciences.

    When a research project begins, its authors record the things that they’re going to measure at the end. “Outcome switching” is when they measure different things instead, without telling anyone. …

    “For instance, imagine a randomised trial on alcohol to determine whether a glass of wine a day is better for your health than abstinence. ‘Better for your health’ is a broad outcome, so you might measure deaths, heart attacks, blood tests, headaches, employment, subjective wellbeing all kinds of things, in all kinds of ways.”

    But if you divide up the data lots of ways, and don’t specify in advance what you’re measuring, you’ll probably get some impressive-sounding result. There’s a lot of random noise in medical studies, meaning that sometimes you’ll get what looks like a real effect just by chance. …

    The group found that 58 out of 67 articles they examined in the five most famous medical journals in the world had switched outcomes to one degree or another.

    From a nice article at http://www.buzzfeed.com/tomchivers/how-science-journals-are-hiding-bad-results?utm_term=.eqkbEJp0Y#.ymzk5VA7Q

  3. Hey Arnold,
    In practice, ML techniques are geared toward micro problems (think Netflix recommendations) rather than research/inference problems. With ML techniques you still have the prior of the relative sizes and the sampling method to contend with. Netflix solves this problem by A/B testing (randomized trials) of hundreds to thousands of models. In Econ research there is no method to validate priors.

  4. Athey mentions “cross-validation,” which does not use just one set of data to measure prediction accuracy. For instance, 5-fold cross-validation (cv) will divide the training data into 5 equal-sized portions and rotate through them, using 4 of those portions to fit models and predicting on the 5th (a bare-bones version of this rotation is sketched below). This gives 5 measures of loss (or accuracy, or whatever the target metric is). The final model is often chosen based on the average score, or the average score minus 1 standard deviation. Then there is yet another hold-out set, used to estimate the generalization error of the final model.

    If the model has hyper-parameters, then nested cross-validation can be used. Hyper-parameters are things like, how many decision trees does my RandomForest have? What’s the maximum depth I’ll allow for my decision trees? Etc.

    Now, even with cross-validation, your point still stands: if that final hold-out set is used over and over, the benefit of cross-validation will be defeated. In fact, there’s a prediction-competition website called Kaggle where this commonly happens. Kaggle competitions have contestants submit predictions for a sample for which they do not know the target values (after training on data where the target values are provided). Part of those predictions are scored immediately, and those scores determine the “public leaderboard,” which gives contestants immediate feedback; the rest are saved for the final results. At the end of the competition, contestants are asked to choose their final submissions (say, the top 3 out of potentially dozens of submissions). Often, contestants who choose their final submissions based solely on performance on the public leaderboard do poorly on the final results. In that case, the modelers don’t even have the target values in the holdout dataset, and yet they are still able to overfit it based on the feedback from the public leaderboard.

    • +1. My understanding is also that the very best modelers (the ones that consistently win or place in competitions where they are graded on completely out-of-sample prediction performance) follow this approach. They divide the data into _both_ a set of cross-validation folds and a holdout set. Within the cross-validation folds, they nest for hyperparameter tuning (and some other reasons as well). They whittle down candidate models using performance on cross-validation. It is only when they are ready to select “final answers” from among the best candidates that they rank based on holdout-set performance. They also do “meta machine learning” on things like searching the hyperparameter space, to optimize for which strategies tend not to overfit and perform well on out-of-sample data across ML problems with similar characteristics.
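
    A bare-bones version of the 5-fold rotation described above, in R; the data and the candidate formulas are made up for illustration, and tuning hyper-parameters would mean repeating the same rotation inside each training split (nested cross-validation).

    # Fake data plus a few candidate specifications
    set.seed(4)
    df = data.frame(x1 = rnorm(200), x2 = rnorm(200))
    df$y = 2 * df$x1 + rnorm(200)
    candidates = c("y ~ x1", "y ~ x1 + x2", "y ~ poly(x1, 5) + x2")

    # Assign every row to one of 5 folds
    fold = sample(rep(1:5, length.out = nrow(df)))

    # Fit on 4 folds, score on the 5th, and rotate; report mean and sd of the loss
    cv_score = function(formula_text) {
      mse = sapply(1:5, function(j) {
        fit = lm(as.formula(formula_text), data = df[fold != j, ])
        mean((df$y[fold == j] - predict(fit, newdata = df[fold == j, ]))^2)
      })
      c(mean = mean(mse), sd = sd(mse))
    }

    sapply(candidates, cv_score)   # compare candidates by average held-out loss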

  5. I do this by running a loop and recording the R2 on the holdout data and the fitted data, then comparing the distributions across different specifications. Here is some R code that does this, along with some fake data to see how it works. This function works with the lm and glm functions in R:

    validate = function(model, Data, n, cutoff)
    {
      dependant     = paste0('data$', 'y')       # response in the fitted subset
      dependantHout = paste0('dataHout$', 'y')   # response in the holdout subset
      out = data.frame(t(c(1, 1)))
      names(out) = c('FittedR2', 'HoldoutR2')
      for (i in 1:n)
      {
        # draw a fresh random holdout on every pass
        Data$rnum = runif(nrow(Data))
        data      = Data[Data$rnum > cutoff, ]
        dataHout  = Data[Data$rnum <= cutoff, ]
        mod = eval(parse(text = model))
        fittedR2  = cor(predict(mod), eval(parse(text = dependant)))^2
        holdoutR2 = cor(predict(mod, newdata = dataHout), eval(parse(text = dependantHout)))^2
        out[i, ] = c(fittedR2, holdoutR2)
      }
      out
    }

    x1 = rnorm(100, 0, 1)
    x2 = rnorm(100, 0, 1)
    y = 4*x1 + 10*x2 + 10 + rnorm(100, 0, 1)
    data = data.frame(cbind(y, x1, x2))

    dist = validate('lm(y ~ x1 + x2, data = data)', data, 100, .25)
    hist(dist$HoldoutR2)

    • I should have said: the arguments to the validate function are the model string, the data set, the number of model runs, and the holdout cutoff (e.g., 0.25 holds out roughly 25% of the data). The code is turnkey, so you can paste it into R and it will run.

  6. Another thing I really like to do is bootstrapping the model, which is just sampling from your data set with replacement, fitting a model, and recording the parameters. By doing this several times you get an empirical joint distribution of your parameters. By simply taking the mean or median of the parameters, you get a more stable model. It’s kind of like regularizing your model.

    Boot = function(data, n, model) {

      # Fit the model string on a data set; return its coefficients, or NULL on error
      runMod = function(data, model) {
        mod = tryCatch(
          eval(parse(text = model), envir = environment()),
          error = function(e) NULL
        )
        if (is.null(mod)) NULL else coef(mod)
      }

      rows = 1:nrow(data)   # row index of the data frame to sample from

      # First fit on the original data, so the output has the coefficient names
      print(paste0("Fitting model 1", "....."))
      output = data.frame(runMod(data, model))

      # Refit on data resampled with replacement, n - 1 more times
      for (i in 2:n) {
        print(paste0("Fitting model ", i, "....."))
        dataSample = data[sample(rows, size = length(rows), replace = TRUE), ]
        coefs = runMod(dataSample, model)

        # If the model ran, attach its coefficients as a new column
        if (!is.null(coefs))
          output = data.frame(output, coefs[match(row.names(output), names(coefs))])
      }

      # One row per bootstrap fit, one column per coefficient
      output = data.frame(t(output))
      row.names(output) = NULL
      output
    }
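
    For example, reusing the fake data from the previous comment, the call would look something like this (purely illustrative):

    boots = Boot(data, 200, 'lm(y ~ x1 + x2, data = data)')
    apply(boots, 2, median)   # median of each coefficient across the bootstrap fits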

  7. I work in the game space, and we are absolutely careful to segment our validation set and to make sure that data doesn’t get used multiple times. When dealing with smaller data sets, we may sample and return those samples back to the pool, but generally speaking we try to avoid overfitting. These are habits built out of our experience with monetization and with avoiding overfitting, and as we move toward using machine learning in various areas, they are serving us well.
