Lively discussion: how to cross-validate?

So today’s group meeting got a bit heated as Nafiz, Ashley, and Xiao touched on the finer points of how to cross-validate. Machine learning people, your comments are welcome.

 


One Response to “Lively discussion: how to cross-validate?”

  1. Leighton Pritchard says:

    It would be easier to work out what the argument was if people didn’t talk all over each other 😉

    Nafiz (I assume that’s him at the board) is correct.

    Given a dataset, and the scheme he’s describing on the board (k-fold CV), you would:

    i) divide the data into two groups in the proportion 1/k [test] and (k-1)/k [train] (your choice of k is how you decide where the line is)
    ii) train/fit the model on the larger [train] portion
    iii) evaluate the model/fit on the smaller [test] portion

    You would then do this (as he describes) on k mutually-exclusive (k-1)/k and 1/k portions*. You do not train/fit your model on the same data used to test it. The whole purpose of this procedure is to exclude the information used to evaluate the fit/model from the procedure used to train/fit the model, while still allowing the model/fit to be assessed on every point in the dataset. It provides an estimate of the performance of the fit/model, and that estimate is what should be reported as the expected performance of the final model/fit, where the final model is trained on the complete dataset.
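
    For illustration only, here is a minimal sketch of that scheme in Python with scikit-learn (the synthetic data and logistic-regression classifier are placeholders, not whatever model was on the board):

        # Minimal k-fold CV sketch; the data and classifier are placeholders.
        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import accuracy_score
        from sklearn.model_selection import KFold

        X, y = make_classification(n_samples=200, random_state=0)
        k = 5
        kf = KFold(n_splits=k, shuffle=True, random_state=0)      # shuffle, then mark k divisions

        fold_scores = []
        for train_idx, test_idx in kf.split(X):                   # k mutually exclusive 1/k test folds
            model = LogisticRegression(max_iter=1000)
            model.fit(X[train_idx], y[train_idx])                 # train/fit on the (k-1)/k portion
            preds = model.predict(X[test_idx])                    # evaluate on the held-back 1/k portion
            fold_scores.append(accuracy_score(y[test_idx], preds))

        # The mean over folds is the performance estimate to report...
        print("estimated accuracy: %.3f +/- %.3f" % (np.mean(fold_scores), np.std(fold_scores)))

        # ...for the final model/fit, which is trained on the complete dataset.
        final_model = LogisticRegression(max_iter=1000).fit(X, y)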

    However, there is one more subtlety… 😉

    If you are comparing models/fits (which includes parameter choices) to decide which is the best, you should divide your dataset into three groups, proportioned as e.g. 1/k [test], 1/k [hold-out], (k-2)/k [train]. The procedure runs as before with the training and test sets – used for all models/choices – and the estimate of performance is obtained as before: but this performance estimate only helps you decide between the models/fits/parameter sets. The final reported model/fit performance should instead be evaluated on the hold-out set. This is because the training process (both the model/fit process itself, and choosing between alternative models/fits) has effectively involved both the training and test sets – and you should not employ information that was used to train a model in its assessment (otherwise you risk overfitting).
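
    A simplified sketch of that idea (again Python/scikit-learn): carve off a single hold-out portion first, cross-validate the candidates on the remainder to choose between them, and only then score the chosen model on the untouched hold-out. The two candidate models below are arbitrary stand-ins for whatever alternatives you are comparing.

        # Model selection with a separate hold-out set; the candidates are arbitrary stand-ins.
        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score, train_test_split
        from sklearn.svm import SVC

        X, y = make_classification(n_samples=300, random_state=0)

        # Carve off the hold-out portion first; it plays no part in training or selection.
        X_rest, X_hold, y_rest, y_hold = train_test_split(X, y, test_size=0.2, random_state=0)

        candidates = {
            "logistic": LogisticRegression(max_iter=1000),
            "svm": SVC(),
        }

        # Cross-validate each candidate on the remaining data; these scores only
        # decide between the alternatives, they are not the reported performance.
        cv_scores = {name: cross_val_score(m, X_rest, y_rest, cv=5).mean()
                     for name, m in candidates.items()}
        best_name = max(cv_scores, key=cv_scores.get)

        # Fit the chosen model on all non-hold-out data, then report its
        # performance on the hold-out set it has never seen.
        best_model = candidates[best_name].fit(X_rest, y_rest)
        print(best_name, "hold-out accuracy:", best_model.score(X_hold, y_hold))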

    It does not matter if your model/fit accuracy is lower in any particular 1/k and (k-1)/k subsetting – you would expect it to be sometimes higher, sometimes lower, than the final model/fit estimate.

    Shameless self-promotion:

    We discuss CV methods briefly in this paper, but there are links out to more detail: https://www.ncbi.nlm.nih.gov/pubmed/24643551

    Slides 118-126 here should give a graphical explanation of where/how the line is decided: http://www.slideshare.net/leightonp/mining-plant-pathogen-genomes-for-effectors

    *constructed, for example (as he says), by shuffling the dataset before you start and pre-marking the set with k divisions**.

    **noting that, for the most appropriate results, you should aim not to truly randomise the dataset, but to have each 1/k block be representative of the dataset as a whole (e.g. in stratified CV, the class/group proportions in each block should be about the same as in the full dataset)
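
    A quick sketch of what that stratification looks like in practice (scikit-learn’s StratifiedKFold on deliberately imbalanced synthetic data, purely for illustration):

        # Stratified folds: each 1/k block keeps roughly the same class proportions
        # as the full dataset.
        from collections import Counter
        from sklearn.datasets import make_classification
        from sklearn.model_selection import StratifiedKFold

        X, y = make_classification(n_samples=200, weights=[0.8, 0.2], random_state=0)
        print("whole dataset:", Counter(y))

        skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
        for train_idx, test_idx in skf.split(X, y):
            # Class counts in each 1/k test block mirror the roughly 80/20 split above.
            print("test block:", Counter(y[test_idx]))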